Object Detection Must Reads (Part 3): SNIP, SNIPER, OHEM, and DSOD

 

In parts 1 and 2 of this object detection series, we reviewed one-stage and two-stage object detectors. In this post, we introduce tricks that make object detection faster and more accurate, including training strategies (SNIP & SNIPER), a sampling strategy (OHEM), and training from scratch (DSOD).

An analysis of scale invariance in object detection - SNIP - Singh - CVPR 2018

Info

  • Title: An analysis of scale invariance in object detection - SNIP
  • Task: Object Detection
  • Author: B. Singh and L. S. Davis
  • Date: Nov. 2017
  • Arxiv: 1711.08189
  • Published: CVPR 2018

Highlights & Drawbacks

  • A training-strategy optimization, ready to integrate with other tricks
  • Informative experiments on the multi-scale training trick

Design

The process of SNIP:

  1. Select 3 image resolutions: (480, 800) to train proposals in [120, ∞), (800, 1200) to train proposals in [40, 160], and (1400, 2000) to train proposals in [0, 80].

  2. For each resolution, backpropagation only passes gradients for the proposals that fall within the corresponding scale range.

  3. This ensures that a single network is trained while every training object appears at a size consistent with what the ImageNet-pre-trained backbone saw, which addresses the domain-shift problem. Training and test scales are also kept consistent, so that the ImageNet pre-training size, the object size, the network's receptive field, and the train/test scales all match each other.

  4. Because one network is trained on all objects, SNIP makes full use of the data, unlike scale-specific detectors that each see only a subset of it.

  5. At test time, the same detector is run once on each of the three resolutions; for each resolution only the detected boxes within the corresponding scale range are kept, and the results are merged with Soft-NMS.
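The per-resolution selection and the Soft-NMS merge in the steps above can be sketched in plain Python. The ranges follow the list above; `soft_nms` uses the Gaussian-decay variant, and all helper names, thresholds, and the box format are illustrative rather than the authors' implementation:

```python
import math

# Valid proposal sizes (sqrt of box area, in pixels) per training
# resolution, matching the list above. Exact conventions (e.g. whether
# sizes are measured in the original or resized image) are assumptions.
VALID_RANGES = {
    (480, 800): (120, float("inf")),
    (800, 1200): (40, 160),
    (1400, 2000): (0, 80),
}

def is_valid(box, resolution):
    """Keep a proposal only if its size falls in this resolution's range."""
    x1, y1, x2, y2 = box
    size = math.sqrt(max(0.0, (x2 - x1) * (y2 - y1)))
    lo, hi = VALID_RANGES[resolution]
    return lo <= size <= hi

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def soft_nms(dets, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead
    of discarding them. dets is a list of (box, score) pairs."""
    dets = sorted(dets, key=lambda d: d[1], reverse=True)
    keep = []
    while dets:
        box, score = dets.pop(0)          # highest-scoring box survives
        keep.append((box, score))
        # decay every remaining score by its overlap with the kept box
        dets = [(b, s * math.exp(-iou(box, b) ** 2 / sigma))
                for b, s in dets]
        dets = [(b, s) for b, s in dets if s > score_thresh]
        dets.sort(key=lambda d: d[1], reverse=True)
    return keep
```

In the full pipeline, `is_valid` would be applied per resolution before pooling all surviving boxes into a single `soft_nms` call.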

Performance & Ablation Study

The authors conducted experiments with R-FCN and Faster R-CNN, and SNIP improves performance for both meta-architectures.


See the full introduction in An analysis of scale invariance in object detection - SNIP - Singh - CVPR 2018.

SNIPER: efficient multi-scale training - Singh - NIPS 2018 - MXNet Code

Info

  • Title: SNIPER: efficient multi-scale training
  • Task: Object Detection
  • Author: B. Singh, M. Najibi, and L. S. Davis
  • Date: May 2018
  • Arxiv: 1805.09300
  • Published: NIPS 2018

Highlights & Drawbacks

  • Efficient version of SNIP training strategy for object detection
  • Selects only properly sized ROIs within a batch

Design


Following SNIP, the authors train on crops of an image that contain the objects to be detected (called chips) instead of on the entire image. This design also makes large-batch training possible, which accelerates the training process. Since chips keep the context around each object but skip unnecessary computation on plain background (such as the sky), the utilization rate of the training data is improved.


The core design of SNIPER is the selection strategy for ROIs from a chip (a crop of the entire image). The authors use several hyper-parameters to keep only boxes of a proper size within a batch, so that the detector network never has to learn features across extreme object scales.

Due to its memory-efficient design, SNIPER can benefit from Batch Normalization during training, and it makes larger batch sizes possible for instance-level recognition tasks on a single GPU. Hence, there is no need to synchronize batch-normalization statistics across GPUs.
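The ROI-selection idea can be sketched as a small filter over ground-truth boxes: keep only boxes that lie fully inside a chip and whose rescaled size falls in a valid range. The function name, the default `valid_range`, and the containment rule are illustrative assumptions, not the paper's exact criteria:

```python
def select_chip_rois(chip, scale, boxes, valid_range=(32, 160)):
    """Sketch of SNIPER-style ROI selection inside a chip.

    chip:  (x1, y1, x2, y2) crop in original-image coordinates.
    scale: resize factor applied to the chip before training.
    boxes: ground-truth boxes in original-image coordinates.
    valid_range: assumed size bounds (sqrt of area, pixels) after rescaling.
    """
    cx1, cy1, cx2, cy2 = chip
    selected = []
    for (x1, y1, x2, y2) in boxes:
        # keep only boxes fully contained in the chip
        if x1 < cx1 or y1 < cy1 or x2 > cx2 or y2 > cy2:
            continue
        size = ((x2 - x1) * (y2 - y1)) ** 0.5 * scale
        lo, hi = valid_range
        if lo <= size <= hi:
            # translate to chip-local, rescaled coordinates
            selected.append(((x1 - cx1) * scale, (y1 - cy1) * scale,
                             (x2 - cx1) * scale, (y2 - cy1) * scale))
    return selected
```

Because every surviving ROI is already at a moderate size, the batch can be packed with small fixed-size chips rather than full high-resolution images.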

Performance & Ablation Study

The authors report improved accuracy on small objects in their experiments.


Code

MXNet

(OHEM) Training Region-based Object Detectors with Online Hard Example Mining - Shrivastava et al. - CVPR 2016

Info

  • Title: Training Region-based Object Detectors with Online Hard Example Mining
  • Task: Object Detection
  • Author: A. Shrivastava, A. Gupta, and R. Girshick
  • Date: Apr. 2016
  • Arxiv: 1604.03540
  • Published: CVPR 2016

Highlights & Drawbacks

  • Learning-based design for balancing ROI examples in 2-stage detection networks
  • Plug-and-play trick, easy to integrate
  • Adds parameters to training

Motivation & Design

Faster R-CNN uses a 1:3 sampling strategy, which samples negative ROIs (backgrounds) to balance the ratio of positive and negative examples in a batch. It is empirical and hand-designed (it needs additional effort to tune the hyper-parameters).
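For contrast, the hand-designed sampling that OHEM replaces can be sketched as follows. The function name and defaults are illustrative; Fast R-CNN commonly uses a batch of 128 ROIs per image with a 0.25 positive fraction, i.e. the 1:3 ratio:

```python
import random

def sample_rois_1to3(pos_indices, neg_indices, batch_size=128, pos_fraction=0.25):
    """Hand-designed baseline: sample ROIs at a fixed 1:3
    positive:negative ratio per batch (Fast/Faster R-CNN heuristic)."""
    num_pos = min(len(pos_indices), int(batch_size * pos_fraction))
    num_neg = min(len(neg_indices), batch_size - num_pos)
    # negatives are drawn uniformly at random, regardless of difficulty
    return random.sample(pos_indices, num_pos) + random.sample(neg_indices, num_neg)
```

The key weakness is the uniform draw over negatives: most backgrounds are easy, so the batch is dominated by examples the network has already learned.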


The authors designed an additional sub-network to "learn" the sampling process for negative ROIs, forcing the network to focus on examples that resemble objects (the hard ones), such as background regions containing parts of objects.

The 'hard' examples are defined using the probability from the detection head, which means the sampling network is exactly the classification network itself. In practice, the selection range is set to [0.1, 0.5].
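A minimal sketch of the selection step, assuming per-ROI losses from a read-only forward pass (the paper additionally applies NMS among ROIs before selection to avoid picking near-duplicates, omitted here; the function name is illustrative):

```python
def ohem_select(roi_losses, num_hard=128):
    """Rank all ROIs by the loss from a read-only forward pass and
    return the indices of the `num_hard` hardest ones; only these
    receive gradients in the subsequent backward pass."""
    ranked = sorted(range(len(roi_losses)),
                    key=lambda i: roi_losses[i], reverse=True)
    return ranked[:num_hard]
```

In the full OHEM setup, this replaces the random 1:3 sampling entirely: every ROI is scored, and the batch is built from the highest-loss ones.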

Performance & Ablation Study


OHEM improves performance even after adding bells and whistles such as multi-scale training and iterative bbox regression.

Code

Caffe

DSOD: learning deeply supervised object detectors from scratch - Shen - ICCV 2017 - Caffe Code

Info

  • Title: DSOD: learning deeply supervised object detectors from scratch
  • Task: Object Detection
  • Author: Z. Shen, Z. Liu, J. Li, Y. Jiang, Y. Chen, and X. Xue
  • Date: Aug. 2017
  • Arxiv: 1708.01241
  • Published: ICCV 2017

Highlights & Drawbacks

  • Object Detection without pre-training
  • DenseNet-like network

Design

A common practice used in earlier works such as R-CNN is to pre-train a backbone network on a classification dataset like ImageNet, and then use the pre-trained weights to initialize the detection model. Although I once successfully trained a small detection network from random initialization on a large dataset, few models are trained from scratch when the number of instances in a dataset is limited, as in Pascal VOC and COCO. In fact, using better pre-trained weights is one of the standard tricks in detection challenges. DSOD attempts to train the detection network from scratch with the help of "deep supervision" from DenseNet.

The four principles the authors argue for in object detection networks:

  1. Proposal-free
  2. Deep supervision
  3. Stem block
  4. Dense prediction structure
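The stem block (principle 3) replaces the usual 7×7 stride-2 convolution with a stack of three 3×3 convolutions followed by 2×2 max pooling, which loses less information from the raw image. A sketch of the resulting spatial-size arithmetic, assuming the layer strides and paddings described in the paper (channel widths 64/64/128 are my reading of the stem design):

```python
def conv_out(size, kernel, stride, pad):
    """Output spatial size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

def stem_shape(h):
    """Spatial size after the DSOD-style stem block."""
    h = conv_out(h, 3, 2, 1)   # 3x3 conv, stride 2 (assumed 64 channels)
    h = conv_out(h, 3, 1, 1)   # 3x3 conv, stride 1 (assumed 64 channels)
    h = conv_out(h, 3, 1, 1)   # 3x3 conv, stride 1 (assumed 128 channels)
    h = conv_out(h, 2, 2, 0)   # 2x2 max pool, stride 2
    return h
```

Like the single 7×7 stride-2 conv plus pooling it replaces, the stem downsamples the input by an overall factor of 4 (e.g. a 300×300 input becomes 75×75), but with three nonlinearities instead of one.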


Performance & Ablation Study

DSOD outperforms detectors initialized with pre-trained weights.

Ablation study on the individual components:

Code

Caffe