Why pixel precision is the future of image annotation

SuperAnnotate
May 5, 2019

Should the computer vision industry continue using bounding box annotations?

Author: Vahan Petrosyan, CTO at SuperAnnotate

In this post, I will share some ideas about image annotation that I accumulated during my PhD research. Specifically, I will discuss the current state-of-the-art annotation methods, their trends, and future directions. Finally, I will briefly talk about the annotation software we are building and give a short preview of our company, SuperAnnotate.

Outline

  • Introduction to Image Annotation
  • Mainstream Annotation Methods: Bounding Box
  • Pixel-Precision in Image Annotation
  • About SuperAnnotate

1. Introduction to Image Annotation

Image annotation is the process of selecting objects in images and labeling them by their names. It is the backbone of computer vision in AI: for example, for self-driving car software to accurately identify a pedestrian in an image, one needs hundreds of thousands to millions of annotated pedestrians. Other use cases include drone/satellite footage analytics, security and surveillance, medical imaging, e-commerce, online image/video analytics, AR/VR, and more.

The growth of image data and computer vision applications requires a huge amount of training data. Data preparation and engineering tasks represent over 80% of the time spent on AI and machine learning projects. Therefore, over the last few years, many data annotation services and tools have been created to serve this market. As a result, data labeling became a $1.5B market in 2018 and is expected to grow to $5B by 2023.

2. Mainstream Annotation Methods: Bounding Box

The most common annotation technique is the bounding box: fitting a tight rectangle around the target object. It is the most widely used approach because bounding boxes are relatively straightforward to draw and many object detection algorithms were developed with this format in mind (YOLO, Faster R-CNN, etc.). Consequently, virtually all annotation companies offer bounding box solutions (services or software). However, box annotation suffers from major drawbacks:

  1. One needs a relatively large number of bounding boxes (usually on the order of 100,000s) to reach over 95% detection accuracy. For example, the autonomous driving industry typically gathers millions of bounding boxes of cars, pedestrians, street lights, lanes, cones, etc.
  2. Bounding box annotation usually does not allow reaching superhuman detection accuracy no matter how much data you use, mainly because of the background noise around the object that is included in the box area.
  3. Detection becomes extremely complicated for occluded objects. In many cases, the target object covers less than 20% of the bounding box area, leaving the rest as noise that misleads the detection algorithm (see the example in the green box below).
Examples of how a bounding box may fail: green box, a highly occluded pedestrian; red box, a high-noise annotation.
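For concreteness, here is a minimal sketch of what a typical box annotation looks like and how the noise problem in point 3 can be quantified. The coordinates and pixel counts are hypothetical, and the dictionary layout loosely follows the COCO convention rather than any particular tool's format:

```python
# A typical axis-aligned bounding-box annotation, loosely following the
# COCO convention: bbox = [x_min, y_min, width, height] (all hypothetical).
annotation = {
    "image_id": 1,
    "category": "pedestrian",
    "bbox": [412, 188, 36, 92],
}

def box_noise_fraction(object_pixels, bbox):
    """Fraction of the box area that is background 'noise', not the object."""
    _, _, w, h = bbox
    return 1.0 - object_pixels / float(w * h)

# If an occluded pedestrian covers only 600 of the box's 36 * 92 = 3312
# pixels, roughly 82% of the box is background noise.
print(round(box_noise_fraction(600, annotation["bbox"]), 2))  # 0.82
```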

3. Pixel-Precision in Image Annotation

The above issues with bounding boxes can be solved with pixel-accurate annotation. Yet the most common tools for such annotation rely heavily on slow point-by-point object selection, where the annotator has to trace the edges of the object. This is not only extremely time-consuming and costly but also very sensitive to human error. For comparison, such annotation tasks usually cost around 10x more than bounding box annotation, and it can take 10x longer to annotate the same amount of data pixel-accurately. As a result, bounding boxes remain the most common annotation type across applications.

However, deep learning algorithms have progressed substantially over the last seven years. While in 2012 the state-of-the-art algorithm (AlexNet) could only categorize images, current algorithms can already identify objects accurately at the pixel level (see the image below). For such accurate object detection, pixel-perfect annotation is key.

Evolution of Deep Learning over the last 7 years.
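To make the "pixel level" claim concrete, here is a minimal sketch using torchvision's off-the-shelf Mask R-CNN; this is an illustration of what modern instance segmentation models output, not the specific model referenced in the figure. Alongside classic boxes, it returns a per-pixel mask for every detected object:

```python
import torch
import torchvision

# Off-the-shelf instance segmentation model (illustrative only, not the
# model from the figure above): it predicts boxes AND per-pixel masks.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)  # stand-in for a real RGB street photo
with torch.no_grad():
    prediction = model([image])[0]

print(prediction["boxes"].shape)  # (N, 4): classic bounding boxes
print(prediction["masks"].shape)  # (N, 1, 480, 640): one pixel mask per object
```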

3.1. AI/segmentation-based approaches

There have been approaches that use segmentation-based solutions (e.g., SLIC superpixels, GrabCut-based segmentation) for pixelwise annotation. However, these approaches segment based on pixel colors alone and often show poor performance and unsatisfactory results in real-life scenarios such as autonomous driving. Hence, they are not commonly used for such annotation tasks.
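For reference, both classical baselines are only a few lines with standard libraries. This sketch assumes scikit-image and OpenCV are installed and a local street.jpg test image exists; it also shows exactly why these methods are color-driven, since neither call sees anything beyond raw pixel values:

```python
import cv2
import numpy as np
from skimage import io, segmentation

img = io.imread("street.jpg")  # assumed local test image

# SLIC superpixels: clusters pixels purely by color and position.
superpixels = segmentation.slic(img, n_segments=500, compactness=10)

# GrabCut: color-model-based foreground extraction, seeded with a box.
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)
rect = (50, 50, 200, 300)  # hypothetical box around the target object
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)
foreground = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
```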

Over the last three years, NVIDIA has done extensive research with the University of Toronto on pixel-accurate annotation. Their work concentrates mainly on generating pixel-accurate polygons from a given bounding box and includes the papers Polygon-RNN, Polygon-RNN++, and Curve-GCN, published at CVPR in 2017, 2018, and 2019, respectively. In the best-case scenario, generating a polygon with these tools requires at least two precise clicks (to create the bounding box) and the hope that the predicted polygon captures the target object accurately. In practice, the proposed polygons are often inaccurate, and correcting them can take much more time than expected (see the example below).

An example of the Polygon-RNN++ tool on an occluded object (video at 2x speed)

Another problem with such polygon-based approaches is the difficulty of selecting "donut"-like objects (topologically speaking), where at least two polygons are needed to describe a single object.
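A minimal sketch of the problem, with hypothetical vertex coordinates: to describe a donut-shaped object with polygons, you must trace the outer ring and the inner hole separately and combine them, since a single simple polygon cannot contain a hole.

```python
import cv2
import numpy as np

mask = np.zeros((200, 200), np.uint8)

# Hypothetical outer and inner contours of a 'donut' object: two separate
# polygons are unavoidable, because a simple polygon cannot have a hole.
outer = np.array([[20, 20], [180, 20], [180, 180], [20, 180]], np.int32)
inner = np.array([[70, 70], [130, 70], [130, 130], [70, 130]], np.int32)

cv2.fillPoly(mask, [outer], 1)  # fill the outer polygon
cv2.fillPoly(mask, [inner], 0)  # carve out the hole with a second polygon

print(mask.sum())  # ring area only: outer area minus the hole
```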

3.2. A novel approach to pixelwise annotation

The easiest and fastest way to do pixelwise annotation would be to select objects with just one click. I worked specifically on this problem during my PhD research at KTH in Sweden. By the end of my PhD, in November 2018, we had prototyped a simple tool that allowed selecting objects with a single click. Our initial experiments showed that pixelwise annotation can be accelerated by 10-20x without compromising selection quality. Here is an example of how it works on the same image presented above.

SuperAnnotate annotation (video at 2x speed).

We also carefully analyzed the advantages of our solution compared to other AI- or segmentation-based approaches:

  1. The speed of our algorithm allows segmenting and annotating images of up to 10 megapixels in real time.
  2. Unlike SLIC superpixels, our segmentation accurately generates non-homogeneous regions, allowing users to select both large and small objects with a single click.
  3. Our software can change the number of segments instantly, which enables selecting even the smallest objects.
  4. The self-learning feature of our algorithm improves segmentation accuracy even further: even a few hundred annotations can produce dramatic gains in segmentation accuracy, which accelerates the annotation process further still.
  5. Compared to the box-to-polygon techniques discussed above, our software allows selecting donut-style objects with just one click.
  6. Most importantly, as the amount of annotated data grows, our software enables fully automatic pixel-accurate annotation.

Even compared to basic bounding box annotation, which requires at least two precise clicks per object, we need only one approximate click anywhere within the segment, making our approach even faster than drawing a bounding box, as the sketch below illustrates.
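Here is a minimal sketch of the one-click interaction pattern under simplified assumptions; the real algorithm, its segmentation model, and its self-learning component are not public, so this only illustrates the idea. Given a precomputed segment-label map, a single approximate click selects the entire segment containing it:

```python
import numpy as np

def select_with_one_click(segment_labels: np.ndarray, x: int, y: int) -> np.ndarray:
    """Return the binary mask of the segment under an approximate click.

    `segment_labels` is an (H, W) integer map produced by some upstream
    segmentation step (a stand-in for the real, non-public algorithm).
    """
    return segment_labels == segment_labels[y, x]

# Toy 4x4 label map with two segments; one click anywhere inside segment 2
# recovers its full pixel-accurate mask.
labels = np.array([
    [1, 1, 2, 2],
    [1, 2, 2, 2],
    [1, 1, 2, 2],
    [1, 1, 1, 2],
])
mask = select_with_one_click(labels, x=2, y=1)
print(mask.astype(int))
```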

In this way, the cost of pixelwise annotation drops to the level of bounding boxes, while allowing detection to reach superhuman accuracy levels that are otherwise unreachable with bounding boxes.

Furthermore, since pixel-precise annotation does not include background noise, one would need at least 10x less data to reach a given accuracy level compared to bounding box annotations.

Finishing Remarks

As our software hits the mainstream (launching in June 2019), we expect the demand for bounding boxes to eventually disappear. Pixel-accurate annotation will become the new norm.

4. About SuperAnnotate

We are a venture-backed team with investors including Berkeley SkyDeck, Plug and Play, and SmartGateVC (backed by Tim Draper). Our team consists of PhD researchers from top US, European, and Asian universities who came together to develop new approaches to image and video annotation and make human-in-the-loop tasks up to 100x more efficient at the highest level of accuracy.

If you are a company whose competitive advantage depends on accurate image annotation, you can reach me at vahan@superannotate.com or request a demo on our website.

Stay tuned and follow this page for more posts/updates.

SuperAnnotate

The fastest annotation platform and services for training AI. Learn more: https://superannotate.com/