The evolution of YOLO: Object detection algorithms

SuperAnnotate
6 min read · Oct 12, 2021

If you ever ask what the most innovative ideas to come out of the computer vision community are, YOLO will probably be in the top 10. The algorithm has come a long way, going through multiple versions with further improvements in each iteration. In this blog, we’ll take a deeper look into the evolution and current applications of the YOLO algorithm.

What is YOLO?

Standing for You Only Look Once, YOLO is a regression algorithm that falls under the class of real-time object detection methods with a multitude of computer vision applications. This algorithm uses a single bounding box regression to identify elements like height, width, center, and object classes. It cornered the market because of its accuracy, demonstrated speed, and ability to detect objects in a single run, surpassing Fast R-CNN, RetinaNet, and Single-Shot MultiBox Detector (SSD).

Why is YOLO useful?

The R-CNN family was too slow. It took a long time to propose regions for the bounding boxes, train a model, detect and classify the regions, and then refine the outputs, all in separate steps. In many tasks, extreme levels of accuracy (such as those provided by R-CNNs) are not imperative, so it is reasonable to rely on less accurate but faster methods. Hence YOLO’s emergence. First, it improves detection time, given that it predicts objects in real time. Second, YOLO provides accurate results with few background errors. And finally, the algorithm learns generalizable representations of objects and applies them in object detection tasks. Together, these characteristics explain YOLO’s widespread adoption.

YOLO history and milestones

Originally introduced by Joseph Redmon in Darknet, YOLO has come a long way. Here are a few things that made YOLO’s first version stand out against R-CNN and DPM:

  • Real-time frame processing at 45 fps
  • Fewer false positives on the background
  • Higher detection accuracy (although lower accuracy on localization)

The algorithm has continued to evolve ever since its initial release in 2016. Both YOLOv2 and YOLOv3 were written by Joseph Redmon. After YOLOv3, new authors took over and pursued their own goals in each subsequent YOLO release.

YOLOv2: Released in 2017, this version earned an honorable mention at CVPR 2017 because of significant improvements such as anchor boxes and higher-resolution input.

YOLOv3: The 2018 release added an objectness score to the bounding box prediction and connections to the backbone network layers. It also improved performance on tiny objects thanks to its ability to run predictions at three different levels of granularity.

YOLOv4: The April 2020 release became the first paper not authored by Joseph Redmon. Here Alexey Bochkovskiy introduced novel improvements, including Mish activation, improved feature aggregation, etc.

YOLOv5: Glenn Jocher continued to make further improvements in his June 2020 release, focusing on the architecture itself.

How it works

YOLO-based models do not cease to take over the space, and the way these models operate is based on three fundamental techniques:

1) Grid cells: At this stage, the model divides the incoming image into an S × S grid of equally sized cells, where each cell is responsible for detecting an object whose center falls inside it.

2) Bounding box regression: Objects in each cell are highlighted with a bounding box that has attributes such as width, height, class, and center. YOLO predicts these with a single bounding box regression, along with a confidence score representing the probability that an object appears in the bounding box.

3) Intersection over Union (IoU): IoU describes the overlap of bounding boxes. Each grid cell is responsible for predicting the bounding boxes and their confidence scores. The IoU is calculated by dividing the area of the overlap by the area of union. The IoU is equal to 1 if the predicted bounding box is the same as the ground-truth bounding box. This way, it becomes easier to eliminate the bounding boxes that are too different from the real box.
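The three techniques above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any YOLO release: the function names (`responsible_cell`, `center_to_corners`, `iou`), the 7 × 7 grid default, and the pixel-coordinate box format are assumptions made for the example.

```python
def responsible_cell(cx, cy, img_w, img_h, grid_size=7):
    """Return the (row, col) of the grid cell that contains a box center.

    grid_size=7 is an assumed default for illustration.
    """
    # Scale the center into grid units and clamp so that a center lying
    # exactly on the right or bottom edge still maps to the last cell.
    col = min(int(cx / img_w * grid_size), grid_size - 1)
    row = min(int(cy / img_h * grid_size), grid_size - 1)
    return row, col


def center_to_corners(cx, cy, w, h):
    """Convert (center, width, height) into (x_min, y_min, x_max, y_max)."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Negative extents mean the boxes do not overlap at all.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A predicted box compared against a hypothetical ground-truth box.
predicted = center_to_corners(cx=100, cy=100, w=80, h=60)
ground_truth = center_to_corners(cx=110, cy=105, w=80, h=60)
print(responsible_cell(100, 100, img_w=448, img_h=448))
print(round(iou(predicted, ground_truth), 3))
```

An IoU close to 1 means the prediction nearly coincides with the ground-truth box, while a value near 0 marks a box that can be discarded.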

After dividing the image into grid cells, each cell predicts bounding boxes with confidence scores as well as class probabilities for each object. Hypothetically, you can have three objects of different classes (say boy, tree, and ball), and still, all your predictions would be made simultaneously. The IoU helps keep the predictions in line with the ground truth so that the final detection ends up with unique bounding boxes that enclose the objects closely.
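The text does not name the step that reduces many overlapping predictions to unique boxes. A common choice, assumed here, is greedy non-maximum suppression (NMS): keep the highest-scoring box and drop same-class boxes that overlap it too much. The sketch below is illustrative only; the function names, the `(box, score, class_name)` tuple layout, and both thresholds are assumptions rather than values from any YOLO release.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5, score_threshold=0.25):
    """Greedy per-class NMS over (box, score, class_name) tuples.

    Both thresholds are assumed defaults chosen for illustration.
    """
    # Drop low-confidence predictions, then visit the rest best-first.
    candidates = sorted(
        (d for d in detections if d[1] >= score_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    kept = []
    for box, score, cls in candidates:
        # Keep a box unless it heavily overlaps an already kept box
        # of the same class.
        if all(
            cls != kept_cls or iou(box, kept_box) < iou_threshold
            for kept_box, _, kept_cls in kept
        ):
            kept.append((box, score, cls))
    return kept


# Hypothetical raw predictions for the boy, tree, and ball example.
raw = [
    ((10, 10, 50, 50), 0.90, "boy"),
    ((12, 12, 52, 52), 0.80, "boy"),      # near-duplicate of the first box
    ((100, 100, 150, 150), 0.70, "ball"),
    ((200, 20, 260, 180), 0.60, "tree"),
    ((0, 0, 5, 5), 0.10, "ball"),         # below the score threshold
]
for box, score, cls in non_max_suppression(raw):
    print(cls, score, box)
```

Suppressing per class means a boy standing in front of a tree is not removed just because the two boxes overlap.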

Applications of YOLO

As expected, YOLO’s applications extend as far as the use cases of object detection, including the following:

  • Autonomous driving: here object detection is required to avoid accidents on the road since there is no human at the wheel. In this case, YOLO helps detect people, cars, or any external hazards appearing on the road.
  • Wildlife detection: this applies to trees and plant biodiversity as much as to different species of animals, making it possible to track their growth and migration.
  • Robotics: depending on the industry the robot operates in, some robots require computer vision to detect objects in their path and carry out a particular instruction.
  • Retail: visual product search and reverse image search are becoming increasingly popular in retail, which wouldn’t be possible without object detection algorithms like YOLO.

As versatile and prevalent as YOLO can be, it is not the only object detection algorithm computer vision engineers rely on.

Other object detection algorithms

Now that you’re aware of YOLO’s applications, let’s quickly cover some of the ancestors and alternatives of the YOLO family:

HOG

Histogram of Oriented Gradients (HOG) is one of the traditional object detection methods, first introduced in 1986. It has evolved ever since, finding its way into a multitude of disciplines. HOG is a feature descriptor used to detect objects of interest. While revolutionary in the beginning, this method is time-consuming for complex computer vision tasks.

Fast R-CNN

Fast R-CNN is an improved version of the initial R-CNN, where most of the upgrades concern speed. Advantages include higher detection quality, as compared to R-CNN, single-stage training using a multi-task loss, no disk storage for feature caching, and more.

Faster R-CNN

What can be faster than the advanced version, Fast R-CNN? Spot on! The Faster R-CNN model is one of the most robust versions of the R-CNN family. The Faster R-CNN method replaces the selective search algorithm adopted by R-CNN and Fast R-CNN with a region proposal network (RPN). This means that region proposals are learned by the network itself, which generates them faster and yields more accurate results.

SSD

The Single Shot Detector, or SSD, is one of the fastest real-time object detection models. It can achieve an astonishing five- to tenfold increase in speed compared to R-CNNs by taking advantage of multi-scale features and default boxes.

RetinaNet

Introduced in 2017, RetinaNet became one of YOLO’s main competitors as a single-run object detection model. Indeed, when RetinaNet was first released, it surpassed YOLOv2 and also challenged R-CNN’s accuracy. These characteristics made the model widely applicable in satellite and aerial imagery, in particular.

Key insights

Ongoing innovation will continue to generate demand for computer vision models, and YOLO will keep its special place among them for a simple reason: it relies on a unified detection mechanism that consolidates the separate elements of object detection into a single neural network. Thanks to this design, the entire detection pipeline can be trained end to end as one network. It’s not surprising that the algorithm has found application in countless industries, eventually becoming the nexus for projects invested in object detection. Sounds like the right solution for your project? SuperAnnotate’s end-to-end solution will eliminate the headache of annotating, training, and automating your AI. Feel free to explore further opportunities in our marketplace.

SuperAnnotate

The fastest annotation platform and services for training AI. Learn more — https://superannotate.com/