Object Detection Must Reads(Part 2): YOLO, YOLO9000, and RetinaNet

In previous article, we reviewed 2-stage state-of-art object detectors: Fast RCNN, Faster RCNN, R-FCN, and FPN. We’ll introduce 1-stage object detection models in this one.

(YOLO)You Only Look Once: Unified, Real Time Object Detection - Redmon et al. - CVPR 2016

Info

Title: You Only Look Once: Unified, Real Time Object Detection
Task: Object Detection
Author: J. Redmon, S. Divvala, R. Girshick, and A. Farhadi
Arxiv: https://arxiv.org/abs/1506.02640
Date: June. 2015
Published: CVPR 2016

Highlights & Drawbacks

Fast.
Global processing makes background errors relatively small compared to local (regional) based methods such as Fast RCNN.
Generalization performance is good, YOLO performs well when testing on art works.
The idea of YOLO meshing is still relatively rough, and the number of boxes generated by each mesh also limits its detection of small objects and similar objects.

Design

You Only Look Once: Unified, Real Time Object Detection

The loss function is divided into three parts: coordinate error, object error, and class error. In order to balance the effects of category imbalance and large and small objects, weights are added to the loss and the root length is taken.

Performance & Ablation Study

You Only Look Once: Unified, Real Time Object Detection

Compared to Fast-RCNN, YOLO’s background false detections account for a small proportion of errors, while position errors account for a large proportion (no log coding).

Code

Check full introduction at You Only Look Once: Unified, Real Time Object Detection - Redmon et al. - CVPR 2016.

YOLO9000: Better, Faster, Stronger - Redmon et al. - 2016

Info

Title: YOLO9000: Better, Faster, Stronger
Task: Object Detection
Author: J. Redmon and A. Farhadi
Arxiv: 1612.08242
Date: Dec. 2016

Highlights & Drawbacks

A significant improvement for YOLO).

Design

Add BN to the convolutional layer and discard Dropout
Higher size input
Use Anchor Boxes and replace the fully connected layer with convolution in the head
Use the clustering method to get a better a priori for generating Anchor Boxes
Refer to the Fast R-CNN method for log/exp transformation of position coordinates to keep the loss of coordinate regression at the appropriate order of magnitude.
Passthrough layer: Similar to ResNet’s skip-connection, stitching feature maps of different sizes together
Multi-scale training
More efficient network Darknet-19, a VGG-like network, achieves the same accuracy as the current best on ImageNet with fewer parameters.

After this improvement, YOLOv2 absorbs the advantages of a lot of work, achieving the accuracy and faster inference speed of SSD.

Performance & Ablation Study

YOLO9000: Better, Faster, Stronger

Code

Check full introduction at YOLO9000: Better, Faster, Stronger - Redmon et al. - 2016.

(RetinaNet)Focal loss for dense object detection - Lin - ICCV 2017

Info

Title: Focal loss for dense object detection
Task: Object Detection
Author: T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár
Date: Aug. 2017
Arxiv: 1708.02002
Published: ICCV 2017(Best Student Paper)

Highlights & Drawbacks

Loss function improvement
For Dense samples from single-stage models like SSD

Design

In single-stage models, a massive number of training samples are calculated in the loss function at the same time, because of the lack of proposing candidate regions. Based on findings that the loss of a single-stage model is dominated by easy samples(usually backgrounds), Focal Loss introduces a suppression factor on losses from these easy samples, in order to let hard cases play a bigger role in the training process.

(RetinaNet)Focal loss for dense object detection

Utilizing focal loss term, a dense detector called RetinaNet is designed based on ResNet and FPN:

(RetinaNet)Focal loss for dense object detection

Performance & Ablation Study

The author’s experiments show that a single-stage detector can achieve comparable accuracy like two-stage models thanks to the proposed loss function. However, Focal Loss function brings in two additional hyper-parameters. The authors use a grid search to optimize these two hyper-parameters, which is not inspiring at all since it provides little experience when using the proposed loss function on other datasets or scenarios. Focal Loss optimizes the weight between easy and hard training samples in the loss function from the perspective of sample imbalance.

(RetinaNet)Focal loss for dense object detection