YOLOv3: Improving Object Detection with Darknet-53

Introduction to YOLOv3

YOLOv3, or You Only Look Once version 3, is an incremental improvement over its predecessor, YOLOv2. Its authors made three main changes to the architecture: a deeper backbone (Darknet-53), three detection heads that operate at different scales, and multi-label classification. In this article, we will explore each of these changes and discuss how to implement the model architecture from scratch using PyTorch.

What Makes YOLOv3 Better Than YOLOv2

The main modification made to YOLOv2 was the introduction of a new backbone model called Darknet-53. This model consists of 52 convolution layers and a single fully-connected layer at the end. The authors also replaced the maxpooling operations with convolutions of stride 2. Both operations halve the spatial resolution, but maxpooling applies a fixed rule (keep the largest value in each window), whereas a strided convolution downsamples with learned weights and can therefore preserve more spatial information.
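The difference between the two downsampling styles can be illustrated with a minimal PyTorch sketch. The 416x416 input size and the channel counts (32 in, 64 out) are assumptions chosen for illustration, not values taken from this article.

```python
import torch
import torch.nn as nn

# Assumed input: one feature map with 32 channels at 416x416 resolution.
x = torch.randn(1, 32, 416, 416)

# Max pooling: halves the resolution with a fixed, parameter-free rule.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Strided convolution: halves the resolution using learned weights.
conv = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1, bias=False)

pooled = pool(x)
strided = conv(x)

print(pooled.shape)   # torch.Size([1, 32, 208, 208])
print(strided.shape)  # torch.Size([1, 64, 208, 208])

# The pooling layer has nothing to learn; the convolution does.
pool_params = sum(p.numel() for p in pool.parameters())
conv_params = sum(p.numel() for p in conv.parameters())
print(pool_params, conv_params)  # 0 18432
```

Both layers produce a 208x208 output, but only the convolution has trainable parameters, so the network can learn how to downsample.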

The Vanilla Darknet-53

The Darknet-53 architecture improves upon the Darknet-19 used in YOLOv2. The model is built from residual blocks, an idea that originated in ResNet. Unlike the original ResNet block, the activation function within each residual block is placed after the weight layer, rather than after the element-wise summation.
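A residual block of this kind might look like the following sketch. The class names ConvBlock and ResidualBlock are hypothetical, and the details (a 1x1 convolution that halves the channels, a 3x3 convolution that restores them, batch normalization, and LeakyReLU with slope 0.1) follow common Darknet-53 implementations rather than anything stated in this article.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Convolution -> batch norm -> LeakyReLU (hypothetical helper)."""

    def __init__(self, in_ch, out_ch, kernel_size, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(
            in_ch, out_ch, kernel_size, stride,
            padding=kernel_size // 2, bias=False,
        )
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1)  # assumed slope

    def forward(self, x):
        # The activation comes right after the weight layer.
        return self.act(self.bn(self.conv(x)))


class ResidualBlock(nn.Module):
    """Two ConvBlocks with a skip connection around them."""

    def __init__(self, channels):
        super().__init__()
        self.reduce = ConvBlock(channels, channels // 2, kernel_size=1)
        self.expand = ConvBlock(channels // 2, channels, kernel_size=3)

    def forward(self, x):
        # No activation is applied after the element-wise summation.
        return x + self.expand(self.reduce(x))


block = ResidualBlock(64)
out = block(torch.randn(2, 64, 52, 52))
print(out.shape)  # torch.Size([2, 64, 52, 52])
```

Because the block preserves both the channel count and the spatial size, any number of these blocks can be stacked between the stride-2 downsampling convolutions.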

Darknet-53 With Detection Heads

The YOLOv3 architecture is designed for detection tasks, rather than classification. The model has three detection heads, each with a different specialization: the leftmost head detects large objects, the middle head detects medium-sized objects, and the rightmost head detects small objects. Each detection head has 255 output channels. This number follows from the number of prior boxes and object classes: every head predicts 3 prior boxes per grid cell, and each box needs 4 coordinates, 1 objectness score, and 80 class scores (for the COCO dataset), giving 3 x (4 + 1 + 80) = 255.
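The channel count can be checked with a short calculation. The sketch below assumes the COCO setting (80 classes, 3 prior boxes per head) and a 416x416 input with strides of 32, 16, and 8. The reshape at the end shows one common way to separate a head's raw output into per-box predictions; the anchor-then-attribute channel ordering is an assumption about the layout.

```python
import torch

num_anchors = 3    # prior boxes per detection head (assumed)
num_classes = 80   # COCO classes (assumed)
box_attrs = 5      # x, y, w, h, objectness

head_channels = num_anchors * (box_attrs + num_classes)
print(head_channels)  # 255

# Grid size of each head for an assumed 416x416 input.
# The coarsest grid handles large objects, the finest handles small ones.
grid_sizes = [416 // stride for stride in (32, 16, 8)]
print(grid_sizes)  # [13, 26, 52]

# Split a raw head output into (batch, anchor, row, col, attributes).
raw = torch.randn(1, head_channels, 13, 13)
pred = raw.view(1, num_anchors, box_attrs + num_classes, 13, 13)
pred = pred.permute(0, 1, 3, 4, 2)
print(pred.shape)  # torch.Size([1, 3, 13, 13, 85])
```

With a different dataset only num_classes changes; for example, 20 classes would give 3 x (5 + 20) = 75 channels per head.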

Multi-Label Classification

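YOLOv3 uses a multi-label classification paradigm, rather than a standard multiclass classification paradigm. Instead of a softmax over all classes, each class is scored by an independent logistic (sigmoid) classifier. This allows a single bounding box to carry more than one label, which is useful when categories overlap (for example, "person" and "woman").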
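The contrast between the two paradigms can be sketched in PyTorch as follows. The logits, the three class names, and the 0.5 threshold are made-up values for illustration only.

```python
import torch
import torch.nn.functional as F

# Hypothetical class scores for one box: person, woman, car.
logits = torch.tensor([[4.0, 3.5, -2.0]])

# Multiclass: softmax makes the classes compete, so the scores sum to 1.
softmax_probs = F.softmax(logits, dim=1)

# Multi-label: each class gets its own independent sigmoid score.
sigmoid_probs = torch.sigmoid(logits)

# Several classes can pass the (assumed) 0.5 threshold at once.
labels = sigmoid_probs > 0.5
print(labels.tolist())  # [[True, True, False]]

# Training uses binary cross-entropy per class, not categorical cross-entropy.
targets = torch.tensor([[1.0, 1.0, 0.0]])
loss = F.binary_cross_entropy_with_logits(logits, targets)
print(loss.item())
```

Under softmax, "person" and "woman" would have to share the probability mass; with independent sigmoids both can be predicted with high confidence for the same box.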