Object Detection Must Reads (Part 3): SNIP, SNIPER, OHEM, and DSOD

 

In parts 1 and 2 of this object detection series, we reviewed one-stage and two-stage object detectors. In this post, we introduce tricks that make object detection faster and more accurate, including training strategies (SNIP & SNIPER), a sampling strategy (OHEM), and training from scratch (DSOD).

An analysis of scale invariance in object detection - SNIP - Singh - CVPR 2018

Info

  • Title: An analysis of scale invariance in object detection - SNIP
  • Task: Object Detection
  • Author: B. Singh and L. S. Davis
  • Date: Nov. 2017
  • Arxiv: 1711.08189
  • Published: CVPR 2018

Highlights & Drawbacks

  • A training-strategy optimization, ready to integrate with other tricks
  • Informative experiments on the multi-scale training trick

Design

The process of SNIP:

  1. Select 3 image resolutions: (480, 800) to train proposals in [120, ∞), (800, 1200) to train proposals in [40, 160], and (1400, 2000) to train proposals in [0, 80].

  2. For each resolution, backpropagation only passes gradients for the proposals that fall within the corresponding scale range.

  3. This ensures that a single network is trained while every training object appears at a size consistent with what the ImageNet-pre-trained backbone saw, which addresses the domain-shift problem. Training and test scales are also kept consistent, so that the ImageNet pre-training size, the object size, the network's receptive field, and the train/test scales all match each other.

  4. Because one network is trained on all objects, SNIP makes full use of the data, unlike scale-specific detectors that each see only a subset of it.

  5. At test time, the same detector is run once on each of the three resolutions; for each resolution only the detected boxes within the corresponding scale range are kept, and the results are merged with Soft-NMS.
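The per-resolution selection and the Soft-NMS merge in the steps above can be sketched in plain Python. The ranges follow the list above; `soft_nms` uses the Gaussian-decay variant, and all helper names, thresholds, and the box format are illustrative rather than the authors' implementation:

```python
import math

# Valid proposal sizes (sqrt of box area, in pixels) per training
# resolution, matching the list above. Exact conventions (e.g. whether
# sizes are measured in the original or resized image) are assumptions.
VALID_RANGES = {
    (480, 800): (120, float("inf")),
    (800, 1200): (40, 160),
    (1400, 2000): (0, 80),
}

def is_valid(box, resolution):
    """Keep a proposal only if its size falls in this resolution's range."""
    x1, y1, x2, y2 = box
    size = math.sqrt(max(0.0, (x2 - x1) * (y2 - y1)))
    lo, hi = VALID_RANGES[resolution]
    return lo <= size <= hi

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def soft_nms(dets, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead
    of discarding them. dets is a list of (box, score) pairs."""
    dets = sorted(dets, key=lambda d: d[1], reverse=True)
    keep = []
    while dets:
        box, score = dets.pop(0)          # highest-scoring box survives
        keep.append((box, score))
        # decay every remaining score by its overlap with the kept box
        dets = [(b, s * math.exp(-iou(box, b) ** 2 / sigma))
                for b, s in dets]
        dets = [(b, s) for b, s in dets if s > score_thresh]
        dets.sort(key=lambda d: d[1], reverse=True)
    return keep
```

In the full pipeline, `is_valid` would be applied per resolution before pooling all surviving boxes into a single `soft_nms` call.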

Performance & Ablation Study

The authors conducted experiments with R-FCN and Faster R-CNN, and SNIP improves performance for both meta-architectures.


See the full introduction in An analysis of scale invariance in object detection - SNIP - Singh - CVPR 2018.

SNIPER: efficient multi-scale training - Singh - NIPS 2018 - MXNet Code

Info

  • Title: SNIPER: efficient multi-scale training
  • Task: Object Detection
  • Author: B. Singh, M. Najibi, and L. S. Davis
  • Date: May 2018
  • Arxiv: 1805.09300
  • Published: NIPS 2018

Highlights & Drawbacks

  • Efficient version of SNIP training strategy for object detection
  • Selects only properly sized ROIs within a batch

Design


Following SNIP, the authors train on crops of an image that contain the objects to be detected (called chips) instead of on the entire image. This design also makes large-batch training possible, which accelerates the training process. Since chips keep the context around each object but skip unnecessary computation on plain background (such as the sky), the utilization rate of the training data is improved.


The core design of SNIPER is the selection strategy for ROIs from a chip (a crop of the entire image). The authors use several hyper-parameters to keep only boxes of a proper size within a batch, so that the detector network never has to learn features across extreme object scales.

Due to its memory-efficient design, SNIPER can benefit from Batch Normalization during training, and it makes larger batch sizes possible for instance-level recognition tasks on a single GPU. Hence, there is no need to synchronize batch-normalization statistics across GPUs.
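The ROI-selection idea can be sketched as a small filter over ground-truth boxes: keep only boxes that lie fully inside a chip and whose rescaled size falls in a valid range. The function name, the default `valid_range`, and the containment rule are illustrative assumptions, not the paper's exact criteria:

```python
def select_chip_rois(chip, scale, boxes, valid_range=(32, 160)):
    """Sketch of SNIPER-style ROI selection inside a chip.

    chip:  (x1, y1, x2, y2) crop in original-image coordinates.
    scale: resize factor applied to the chip before training.
    boxes: ground-truth boxes in original-image coordinates.
    valid_range: assumed size bounds (sqrt of area, pixels) after rescaling.
    """
    cx1, cy1, cx2, cy2 = chip
    selected = []
    for (x1, y1, x2, y2) in boxes:
        # keep only boxes fully contained in the chip
        if x1 < cx1 or y1 < cy1 or x2 > cx2 or y2 > cy2:
            continue
        size = ((x2 - x1) * (y2 - y1)) ** 0.5 * scale
        lo, hi = valid_range
        if lo <= size <= hi:
            # translate to chip-local, rescaled coordinates
            selected.append(((x1 - cx1) * scale, (y1 - cy1) * scale,
                             (x2 - cx1) * scale, (y2 - cy1) * scale))
    return selected
```

Because every surviving ROI is already at a moderate size, the batch can be packed with small fixed-size chips rather than full high-resolution images.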

Performance & Ablation Study

The authors report improved accuracy on small objects in their experiments.


Code

MXNet

(OHEM) Training Region-based Object Detectors with Online Hard Example Mining - Shrivastava et al. - CVPR 2016

Info

  • Title: Training Region-based Object Detectors with Online Hard Example Mining
  • Task: Object Detection
  • Author: A. Shrivastava, A. Gupta, and R. Girshick
  • Date: Apr. 2016
  • Arxiv: 1604.03540
  • Published: CVPR 2016

Highlights & Drawbacks

  • Learning-based design for balancing ROI examples in 2-stage detection networks
  • Plug-and-play trick, easy to integrate
  • Adds parameters to training

Motivation & Design

Faster R-CNN uses a 1:3 sampling strategy, which samples negative ROIs (backgrounds) to balance the ratio of positive and negative examples in a batch. It is empirical and hand-designed (it needs additional effort to tune the hyper-parameters).
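For contrast, the hand-designed sampling that OHEM replaces can be sketched as follows. The function name and defaults are illustrative; Fast R-CNN commonly uses a batch of 128 ROIs per image with a 0.25 positive fraction, i.e. the 1:3 ratio:

```python
import random

def sample_rois_1to3(pos_indices, neg_indices, batch_size=128, pos_fraction=0.25):
    """Hand-designed baseline: sample ROIs at a fixed 1:3
    positive:negative ratio per batch (Fast/Faster R-CNN heuristic)."""
    num_pos = min(len(pos_indices), int(batch_size * pos_fraction))
    num_neg = min(len(neg_indices), batch_size - num_pos)
    # negatives are drawn uniformly at random, regardless of difficulty
    return random.sample(pos_indices, num_pos) + random.sample(neg_indices, num_neg)
```

The key weakness is the uniform draw over negatives: most backgrounds are easy, so the batch is dominated by examples the network has already learned.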


The authors designed an additional sub-network to "learn" the sampling process for negative ROIs, forcing the network to focus on examples that resemble objects (the hard ones), such as background regions containing parts of objects.

The 'hard' examples are defined using the probability from the detection head, which means the sampling network is exactly the classification network itself. In practice, the selection range is set to [0.1, 0.5].
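A minimal sketch of the selection step, assuming per-ROI losses from a read-only forward pass (the paper additionally applies NMS among ROIs before selection to avoid picking near-duplicates, omitted here; the function name is illustrative):

```python
def ohem_select(roi_losses, num_hard=128):
    """Rank all ROIs by the loss from a read-only forward pass and
    return the indices of the `num_hard` hardest ones; only these
    receive gradients in the subsequent backward pass."""
    ranked = sorted(range(len(roi_losses)),
                    key=lambda i: roi_losses[i], reverse=True)
    return ranked[:num_hard]
```

In the full OHEM setup, this replaces the random 1:3 sampling entirely: every ROI is scored, and the batch is built from the highest-loss ones.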

Performance & Ablation Study


OHEM improves performance even after adding bells and whistles such as multi-scale training and iterative bbox regression.

Code

Caffe

DSOD: learning deeply supervised object detectors from scratch - Shen - ICCV 2017 - Caffe Code

Info

  • Title: DSOD: learning deeply supervised object detectors from scratch
  • Task: Object Detection
  • Author: Z. Shen, Z. Liu, J. Li, Y. Jiang, Y. Chen, and X. Xue
  • Date: Aug. 2017
  • Arxiv: 1708.01241
  • Published: ICCV 2017

Highlights & Drawbacks

  • Object Detection without pre-training
  • DenseNet-like network

Design

A common practice used in earlier works such as R-CNN is to pre-train a backbone network on a classification dataset like ImageNet, and then use the pre-trained weights to initialize the detection model. Although I once successfully trained a small detection network from random initialization on a large dataset, few models are trained from scratch when the number of instances in a dataset is limited, as in Pascal VOC and COCO. In fact, using better pre-trained weights is one of the standard tricks in detection challenges. DSOD attempts to train the detection network from scratch with the help of "deep supervision" from DenseNet.

The four principles the authors argue for in object detection networks:

  1. Proposal-free
  2. Deep supervision
  3. Stem block
  4. Dense prediction structure
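The stem block (principle 3) replaces the usual 7×7 stride-2 convolution with a stack of three 3×3 convolutions followed by 2×2 max pooling, which loses less information from the raw image. A sketch of the resulting spatial-size arithmetic, assuming the layer strides and paddings described in the paper (channel widths 64/64/128 are my reading of the stem design):

```python
def conv_out(size, kernel, stride, pad):
    """Output spatial size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

def stem_shape(h):
    """Spatial size after the DSOD-style stem block."""
    h = conv_out(h, 3, 2, 1)   # 3x3 conv, stride 2 (assumed 64 channels)
    h = conv_out(h, 3, 1, 1)   # 3x3 conv, stride 1 (assumed 64 channels)
    h = conv_out(h, 3, 1, 1)   # 3x3 conv, stride 1 (assumed 128 channels)
    h = conv_out(h, 2, 2, 0)   # 2x2 max pool, stride 2
    return h
```

Like the single 7×7 stride-2 conv plus pooling it replaces, the stem downsamples the input by an overall factor of 4 (e.g. a 300×300 input becomes 75×75), but with three nonlinearities instead of one.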


Performance & Ablation Study

DSOD outperforms detectors initialized with pre-trained weights.

Ablation study on the individual components:

Code

Caffe