Real-Time Object Detection with YOLO

Chetan Lohkare
6 min read · Jan 6, 2022

Introduction

The phrase “Object detection” has become a buzzword in recent years and the entire field has advanced rapidly. The diversity of applications of object detection is astounding, from things like tracking objects and counting people, to automating CCTV surveillance systems and image and video annotation.

Object Detection

There is an equal diversity in the approaches used to tackle object detection, which fall into two major categories: neural network approaches and non-neural approaches.

Non-neural approaches rely on handcrafted feature extraction techniques, such as the Scale-Invariant Feature Transform (SIFT) and Histograms of Oriented Gradients (HOG), followed by a classifier such as a support vector machine (SVM).

On the other end of the spectrum, neural networks are able to perform end-to-end object detection without any hand-defined features. These generally rely on convolutional neural networks (CNNs). Examples include R-CNN and its successors, the Single Shot MultiBox Detector (SSD), and You Only Look Once (YOLO).

Real-Time Object Detection and Its Advantages

Consider an object detection model that takes a few seconds per image to detect objects. If we deploy this model in a situation where low latency is crucial, such as a self-driving car, this latency is far too high to be of practical use. A delay of even a few hundred milliseconds could mean the difference between a fatal accident and a safe journey. Hence, for such scenarios, we need a model that gives us near real-time results: it should be able to detect objects and perform inference in a matter of milliseconds.

Slower models such as R-CNN and Faster R-CNN work really well when there is no need for real-time detection. However, their inference times make them impractical when low latency is key.

Real-time models need to sense the environment, understand what is happening in a scene, and react accordingly. The model should be able to both identify objects and locate them by drawing a bounding box around each one. In essence, we are performing two tasks at once: identifying what each object is (classification) and finding where it is with a bounding box (localization). Together, these two tasks make up object detection.

YOLO

YOLO, short for "You Only Look Once," was introduced in 2015 by Joseph Redmon and his co-authors. In contrast to most deep-learning-based object detectors at the time, which used a two-stage pipeline, YOLO used a one-stage detector strategy.

The algorithm frames object detection as a regression problem: given an image, it predicts bounding boxes and class probabilities simultaneously, in a single evaluation. YOLO uses features from the entire image to predict each bounding box. This approach makes it prone to more localization errors, but far less likely to predict false positives on background regions.

Why is the YOLO algorithm important?

The YOLO algorithm is important for the following reasons:

  • Speed: YOLO processes the whole image in a single forward pass, making it fast enough for real-time detection.
  • High accuracy: YOLO provides accurate results with minimal background errors.
  • Learning capabilities: The algorithm has excellent learning capabilities that enable it to learn generalizable representations of objects and apply them in object detection.

Working of YOLO

The main aim of YOLO is to predict the class of an object, and the bounding box that specifies the location of the object.

There are 4 main attributes that can be used to describe a bounding box:

◾ Center of the box (with the coordinates bx and by)

◾ Width of the box (bw)

◾ Height of the box (bh)

◾ The class of the object identified (c)

Along with these parameters, we also predict a confidence score: the probability that the bounding box actually contains an object.
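The attributes above can be collected into a simple record. This is a minimal sketch for illustration, not code from any YOLO implementation; the field names follow the article's notation, and `confidence` is the objectness probability mentioned above.

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    bx: float          # x-coordinate of the box center
    by: float          # y-coordinate of the box center
    bw: float          # box width
    bh: float          # box height
    confidence: float  # probability that the box contains an object
    class_id: int      # index of the predicted class (c)

# Example: a box centered in the image, covering a fifth of its width.
box = BoundingBox(bx=0.5, by=0.5, bw=0.2, bh=0.3, confidence=0.9, class_id=2)
print(box)
```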

In contrast to many other algorithms, YOLO doesn’t search for regions of interest in the input image that could contain an object. It instead splits the image into an S x S grid. Each cell is then responsible for predicting K bounding boxes.
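The grid structure fixes the shape of the network's output. As a sketch: each cell predicts K boxes with 5 values each (four coordinates plus a confidence score), along with C class probabilities. The numbers below are the values used in the original YOLO paper, assumed here for illustration.

```python
# Output tensor shape for an S x S grid with K boxes per cell and C classes.
# Each box carries 4 coordinates + 1 confidence score.
S, K, C = 7, 2, 20           # grid size, boxes per cell, classes (YOLOv1 paper)
values_per_cell = K * 5 + C  # 2*5 + 20 = 30 values predicted by each cell
output_shape = (S, S, values_per_cell)
print(output_shape)  # (7, 7, 30)
```

With these settings the network emits 7 × 7 × 30 = 1470 numbers for every image, all in one forward pass.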

We consider an object to lie in a particular cell only if the center of its bounding box lies in that cell. Because of this, the center coordinates are always calculated relative to the cell, whereas the height and width are calculated relative to the size of the whole image.
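The cell-relative encoding can be sketched with a small helper. `encode_center` is a hypothetical function written for this article, not part of any YOLO codebase; it maps a box center in pixels to the responsible grid cell and the offsets within it.

```python
def encode_center(cx, cy, img_w, img_h, S):
    """Express a box center (in pixels) relative to its grid cell.

    Returns the (row, col) of the responsible cell and the center
    offsets within that cell, each in [0, 1).
    """
    col = int(cx / img_w * S)      # which cell column the center falls in
    row = int(cy / img_h * S)      # which cell row the center falls in
    x_off = cx / img_w * S - col   # horizontal offset inside the cell
    y_off = cy / img_h * S - row   # vertical offset inside the cell
    return row, col, x_off, y_off

# A center at (224, 112) in a 448x448 image with S=7 falls in cell (1, 3),
# halfway across it and three quarters of the way down.
print(encode_center(224, 112, 448, 448, 7))  # (1, 3, 0.5, 0.75)
```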

YOLO predicts multiple bounding boxes per grid cell. However, at training time we want only one bounding box predictor to be responsible for each object, so the predictor with the highest IoU (intersection over union) with the ground truth is chosen. At inference time, duplicate detections are removed with non-max suppression, which eliminates bounding boxes that overlap heavily.

The IoU of every remaining bounding box is calculated with respect to the one with the highest class probability. Then, the bounding boxes with an IoU over a certain threshold are removed: a high overlap signifies that the two boxes cover the same object, so the one with the lower probability is eliminated.

This process then repeats for the bounding box with the next highest class probability, and so on until only distinct bounding boxes remain.
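The suppression procedure described above can be sketched in a few lines. This is a minimal greedy implementation written for illustration, with boxes given as `(x1, y1, x2, y2)` corners; real pipelines typically use a vectorized library routine instead.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop every box whose
    IoU with it exceeds the threshold, then repeat with the rest."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2] — box 1 overlaps box 0
```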

At this stage, almost all our work is done. The algorithm outputs a vector describing the bounding box and class of each detected object.

The loss function is one of the most important parts of any learning algorithm. YOLO uses a single loss function that trains the box coordinates, the confidence score, and the class probabilities simultaneously.
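As a flavor of what that loss looks like, here is a simplified sketch of just its coordinate term for one responsible predictor, following the sum-squared-error formulation of the original paper; the confidence and classification terms are omitted, and the full loss sums these terms over every cell and box.

```python
import math

def coord_loss(pred, target, lambda_coord=5.0):
    """Simplified coordinate term of the YOLO loss for one box.

    pred and target are (bx, by, bw, bh). As in the paper, width and
    height enter via their square roots so that errors on large boxes
    are not over-weighted relative to small ones. lambda_coord weights
    localization errors more heavily than the other loss terms.
    """
    bx, by, bw, bh = pred
    tx, ty, tw, th = target
    return lambda_coord * (
        (bx - tx) ** 2 + (by - ty) ** 2
        + (math.sqrt(bw) - math.sqrt(tw)) ** 2
        + (math.sqrt(bh) - math.sqrt(th)) ** 2
    )

# A perfect prediction incurs zero coordinate loss.
print(coord_loss((0.5, 0.5, 0.2, 0.3), (0.5, 0.5, 0.2, 0.3)))  # 0.0
```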

Applications of YOLO

Autonomous driving: The YOLO algorithm can be used in autonomous cars to detect objects around the car, such as other vehicles, people, and traffic signals. Object detection in autonomous cars helps avoid collisions, since no human driver is controlling the car.

Wildlife: This algorithm is used to detect various types of animals in forests. This type of detection is used by wildlife rangers and journalists to identify animals in videos (both recorded and real-time) and images. Some of the animals that can be detected include giraffes, elephants, and bears.

Security: YOLO can also be used in security systems to enforce security in an area. Let's assume that people have been restricted from passing through a certain area for security reasons. If someone passes through the restricted area, the YOLO algorithm will detect them, alerting security personnel to take further action.

Conclusion

This was a brief explanation of the vast and interesting algorithm that is YOLO. We covered many aspects, including what object detection actually is, the problems associated with real-time object detection, and the YOLO algorithm and its workings. We saw how earlier models failed to provide adequate real-time detection, and how YOLO was able to outperform them on that front.

Moreover, YOLO is constantly evolving. Multiple versions are available, including the original YOLO, YOLO9000, YOLOv3, YOLOv4, YOLOv5, and various scaled variants. The model has gotten faster and more accurate with each release, and it will hopefully advance even further in the future.

I hope we were able to give you a thorough grounding in the basic workings of the YOLO algorithm and the various concepts related to object detection.
