Introduction to object detection with deep learning

8 min readOct 19, 2021

Applications of object detection have become embedded into our daily lives, covering security, automated vehicle systems, and more. Object detection models are fed input visuals (image or video) and output a labeled version of it with bounding boxes around each corresponding object. It is easier said than done since object detection models consider complex algorithms and datasets that are polished and developed as we speak.

Here’s everything we’ll cover today to give you a comprehensive introduction to object detection:

Basics of object detection

Before diving into object detection applications, use cases, and basic object detection methods, it’s crucial to establish a clear-cut understanding of object detection itself. The term is often used interchangeably with techniques, such as image classification, object recognition, segmentation, and more. However, it must be acknowledged that many of those mentioned above are separate tasks that commonly fall under object detection. It is inaccurate to use them synonymously with one another since each pertains to an equally vital task.

What is object detection?

Object detection is a profound computer vision technique that focuses on identifying and labeling objects within images, videos, and even live footage. Object detection models are trained with a surplus of annotated visuals in order to carry out this process with new data. It becomes as simple as feeding input visuals and receiving a fully marked-up output visual. We will discuss object detection models in more depth a little later. A key component is the object detection bounding box which identifies the edges of the object tagged with a clear-cut quadrilateral — typically either a square or rectangle. They are accompanied by a label of the object, whether it be a person, a car, or a dog to describe the target object. Bounding boxes can overlap to showcase multiple objects in a given shot as long as the model has prior knowledge of items it is tagging.

Object detection vs. other tasks

Let’s break down the other computer vision tasks individually for a greater understanding of each one:

Image classification — This is the prediction of the class of an item in an image. For example, when you carry out a reverse image search on Google, you are likely to receive a note that says “might include ‘x,” with the ‘x’ being anything that the technology detects as the primary object of the image. Image classification can show that a particular object exists in the image, but it involves one primary object and does not provide the location of the object within the visual.
Segmentation — Otherwise known as semantic segmentation, it is the task of grouping pixels with comparable properties together instead of bounding boxes to identify objects.
Object localization — The difference with object detection is very subtle yet evident. Object localization seeks to identify the location of one or more objects in an image, whereas object detection identifies all objects and their borders without much focus on placement.

Deep learning vs. machine learning

Now that you have a grasp of our basic introduction to object detection, it’s time to look at the two main models of object detection: deep learning and machine learning. Data analysts commonly rate deep learning approaches as the relatively state-of-the-art approach since it is considered to be much more intuitive and does not require as much human interference. In the end result, both approaches yield accurate results, but we will concentrate on object detection with deep learning this time around.

What is object detection with deep learning?

What sets object detection with deep learning apart from alternative approaches is the employment of convolutional neural networks (CNN). The neural networks mimic that of the complex neural architecture of the human mind. They primarily consist of an input layer, hidden inner layers, and an output layer. The learning for these neural networks can be supervised, semi-supervised, and unsupervised, referring to how much of the training data is annotated, if at all (unsupervised). Deep neural networks for object detection yield by far the quickest and most accurate results for single and multiple object detection since CNNs are capable of automated learning with less manual engineering involved. There is a world to unpack regarding deep learning and CNNs, but today we will only focus on key points that regard object detection algorithms and models.

Methods and algorithms

Object detection is not possible without models designed especially for handling that task. These object detection models are trained with hundreds of thousands of visual content to optimize the detection accuracy on an automatic basis later on. Training and refining models are made efficient through the help of readily available datasets like COCO (Common Objects in Context) to help give you a headstart in scaling your annotation pipeline.

Let’s take a closer look at a few of the most prominent types of object detection algorithms and methods.

R-CNN, Fast R-CNN, & Faster R-CNN

The first largely successful family of methods was R-CNN (Region-Based Convolutional Neural Network), which was proposed in 2014. It surpassed its predecessors by extracting merely 2,000 regions from the image, which were referred to as region proposals, instead of an exceedingly large number of regions prior to this. The flowchart of R-CNN is the following: the input image is selected, of which 2,000 region proposals are extracted. Next, the features would be extracted from each individual region, which would then go on to be classified as one of the known classes. The primary shortcoming of R-CNN lies in the fact that although it extracted 2,000 region proposals, it was nonetheless a lengthy process. That is what paved the way to the new and improved Fast R-CNN.

Not only is the process of object detection with a vast number of regions time consuming, but so is the training of a CNN with that many regions. Fast R-CNN significantly cut down on process times by inputting the image to the pre-trained CNN in order to generate a convolutional feature map, eliminating the process of breaking the image down into 2,000 region proposals. Instead, the region proposals may be easily identified from the feature map, sending them to the RoI pooling layer, which extracts the features from a given region. Then, the output from the previous layer is processed by a fully connected layer, where the model splits into two outputs: one for class prediction via a softmax layer and another for bounding box prediction via a linear output.

How significant is the jump from R-CNN to Fast R-CNN? Training hours for the CNN drop from ~84 hours to ~9 hours with Fast R-CNN. Additionally, the test time drops from ~50 seconds to ~2.5 seconds.

A third and even more upgraded model was later introduced that would be known as Faster R-CNN. The architecture was similar to that of Fast R-CNN with a few noteworthy tweaks. Faster R-CNN does not use Selective Search, which is based on hierarchical grouping among similar regions. The Region Proposal Network would come in its place to allow identifications of the region proposals at record speed. It brought down the ~2.5-second test speed from Fast R-CNN to an unparalleled ~0.2 seconds, making it the fastest of its predecessors and optimal for real-time object detection.

YOLO (You Only Look Once)

What if we said there was even a faster convolutional neural network than Faster R-CNN? Well, there is! In 2015, a family of neural networks was proposed with the abbreviation YOLO in reference to the notable phrase “You only live once”. That relies on the simple fact that the network only takes “one look” or one pass through the network before outputting the final image. This allows object detection with real-time footage, which is considerably preferable for surveillance-related applications. The accuracy of the detected objects is lower than the previously mentioned models due to its exceptional speed, but it nonetheless succeeds in being a top contender among the rest.

Use cases and applications

Object detection with deep learning is very much prevalent in our daily lives, as we’ve seen with a handful of examples. The extent of its significance in our modern world is far greater than many may initially assume.

Surveillance, security, and traffic

Data labeling aside, object detection in video and real-time footage is the cornerstone of state-of-the-art surveillance. Computer vision aims to constantly excel expectations and innovate theft detection, traffic violations, suspicious human activity, and so on. All of these processes are gradually being monitored more efficiently than ever before.

Automobile

For autonomous driving, object detection is mandatory for the automobile to determine whether to accelerate, brake, or turn within the very next moment. That necessitates object detection that recognizes a collection of things such as cars, pedestrians, traffic signals, road signs, bicycles, motorbikes, and so on.

Medical

Object detection is showing immaculate development in the field of medicine, especially in radiology. Although the technology will not entirely replace the need for expertise for radiologists and other specialists, it will considerably cut down on time spent analyzing hundreds to thousands of ultrasound scans, even x-rays, MRI-s, and CT scans each day.

Retail

Smart inventory management without the need for manual inventory checks, cashier-less shopping experiences, and more are in store for retailers who implement object detection computer vision to their stores.

Key takeaways

Object detection is where image classification and object localization meet to interpret and label a variety of visuals, from images to real-time footage. Models of object detection with deep learning have significantly cut down in process time and speed over the course of only the past decade, which wouldn’t be feasible without CNNs. We can clearly see that applications of object detection are prevalent, from the security features on our smartphones to the basis that next-generation smart automobiles rely on. After all, object detection models evolve, grow, and innovate every single day to become more accurate and solve more real-time problems in the modern world.

We hope this basic introduction to object detection with deep learning will act as a foundation to be built upon further.

Originally published at https://blog.superannotate.com.

Follow SuperAnnotate on LinkedIn, Twitter, Facebook

Introduction to object detection with deep learning

Basics of object detection

What is object detection?

Object detection vs. other tasks

Deep learning vs. machine learning

What is object detection with deep learning?

Methods and algorithms

R-CNN, Fast R-CNN, & Faster R-CNN

YOLO (You Only Look Once)

Use cases and applications

Surveillance, security, and traffic

Automobile

Medical

Retail

Key takeaways

Read more by SuperAnnotate:

The evolution of YOLO: Object detection algorithms

Find out how your computer vision project can benefit from YOLO. Explore further its evolution and applications in our…

Supervised learning and other machine learning tasks

Ever wondered how ML models operate? This comprehensive guide on supervised learning and other machine learning tasks…

Complete guide to semantic segmentation

Want to learn more about how to label specific regions of an image? Check out our comprehensive guide on semantic…

Written by SuperAnnotate