Object Detection Must Reads (Part 1): Fast R-CNN, Faster R-CNN, R-FCN and FPN

Fast RCNN - Girshick - ICCV 2015 - Caffe Code

Info

Title: Fast R-CNN
Task: Object Detection
Author: Ross Girshick
Arxiv: 1504.08083
Date: April 2015
Published: ICCV 2015

Highlights

An improvement to [R-CNN](https://blog.ddlee.cn/posts/415f4992/), introducing the RoI Pooling design
The paper is clearly structured

R-CNN’s Drawbacks

Training is a multi-stage pipeline (Proposal, Classification, Regression)
Training is expensive in time and effort
Inference is slow

Inference is slow because the CNN is run separately on each proposal, with no shared computation.

Architecture

[Figure: Fast R-CNN architecture]

The figure above shows the architecture of Fast R-CNN. The whole image is passed through the feature extractor once to produce a feature map, and each RoI (Region of Interest) proposed by the Selective Search algorithm is mapped onto that feature map. RoI Pooling is then applied to each RoI to obtain a feature vector of the same fixed length, which feeds two outputs: classification and bounding-box (BBox) regression.
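The RoI Pooling step described above can be sketched as follows. This is a minimal illustration, not the paper's Caffe implementation: it assumes a single-channel feature map stored as a list of rows, RoI coordinates already expressed in feature-map cells, and a hypothetical helper name `roi_max_pool`.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a 2D feature map into a fixed out_h x out_w grid.

    feature_map: list of rows (single channel, for brevity).
    roi: (x0, y0, x1, y1) in feature-map cells; x1 and y1 are exclusive.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Floor for the bin start and ceil for the bin end, so the bins
        # always cover the whole RoI and are never empty.
        ys = y0 + (i * roi_h) // out_h
        ye = y0 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = x0 + -(-((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Two RoIs of different sizes both yield a 2 x 2 output.
feature_map = [[r * 6 + c for c in range(6)] for r in range(4)]
print(roi_max_pool(feature_map, (0, 0, 4, 4)))  # [[7, 9], [19, 21]]
print(roi_max_pool(feature_map, (1, 0, 6, 3)))  # [[9, 11], [15, 17]]
```

The point of the sketch is that RoIs of any size produce an output of the same shape, which is what lets a fixed-size classifier sit on top of the shared feature map.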

The structure of Fast R-CNN is the prototype of the meta-architecture used by mainstream 2-stage detectors. The whole system consists of several components: Proposal, Feature Extractor, and Object Recognition & Localization. In later work the Proposal component is replaced by an RPN (Faster R-CNN), the Feature Extractor uses a state-of-the-art classification CNN (ResNet, etc.), and the last component is often a parallel multi-task head (Mask R-CNN, etc.).

Performance & Ablation Study

[Figure: Fast R-CNN performance and ablation results]

Code

Caffe (Official)

Check the full introduction at Fast RCNN - Girshick - ICCV 2015 - Caffe Code

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks - Ren - NIPS 2015

Info

Title: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Task: Object Detection
Author: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
Date: June 2015
Arxiv: 1506.01497
Published: NIPS 2015

Highlights

Faster R-CNN is the mainstream 2-stage method. Its Region Proposal Network (RPN) replaces the Selective Search algorithm, so the whole detection task can be performed end-to-end by a neural network. Roughly speaking, Faster R-CNN = RPN + Fast R-CNN. Because the RPN shares convolutional computation with Fast R-CNN, the extra cost it introduces is very small, allowing Faster R-CNN to run at 5 fps on a single GPU while reaching state-of-the-art accuracy.

Region Proposal Networks

[Figure: Region Proposal Network]

The RPN models the proposal task as a binary classification problem.

The first step is to generate anchor boxes of different sizes and aspect ratios at each sliding-window position, choose IoU thresholds, and label each anchor box as positive or negative according to its overlap with the ground truth. The training samples fed to the RPN are therefore the anchor boxes together with a label saying whether each one contains an object. The RPN maps each sample to a probability and four coordinate values: the probability reflects how likely the anchor box is to contain an object, and the four coordinates regress the position of that object. Finally, the binary classification loss and the coordinate regression loss are combined into a single training objective for the RPN.
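A rough sketch of anchor generation and IoU-based labeling is given below. The function names are hypothetical, and the default scales, aspect ratios and thresholds (0.7 for positives, 0.3 for negatives) are the values commonly quoted for the paper's setup, so treat them as assumptions rather than fixed constants. The paper's extra rule that the best-overlapping anchor for each ground-truth box is also positive is left out for brevity.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchors (x0, y0, x1, y1) centred at (cx, cy); ratio = height / width."""
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area close to s * s while changing the aspect ratio.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (object), 0 (background) or -1 (ignored during training)."""
    best = max((iou(anchor, g) for g in gt_boxes), default=0.0)
    if best >= pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1


anchors = make_anchors(300, 300)          # 3 scales x 3 ratios = 9 anchors
ground_truth = [(172, 172, 428, 428)]     # made-up box for illustration
print([label_anchor(a, ground_truth) for a in anchors])
```

Anchors whose best IoU falls between the two thresholds are neither positive nor negative and do not contribute to the training objective.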

The RPN has a large number of hyperparameters: the sizes and aspect ratios of the anchor boxes, the IoU thresholds, and the ratio of positive to negative proposal samples on each image.
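To make the last hyperparameter concrete, here is a small sketch of sampling a training mini-batch of anchors with a capped share of positives. The helper name `sample_anchor_batch` is hypothetical, and the defaults (256 anchors per image, at most half of them positive) are assumed values in the spirit of the paper's setup, not settings taken from this post.

```python
import random


def sample_anchor_batch(labels, batch_size=256, pos_fraction=0.5, seed=0):
    """Pick anchor indices for one image.

    labels: per-anchor label, 1 = object, 0 = background, -1 = ignored.
    Returns (positive_indices, negative_indices).
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    # Cap the positives, then fill the rest of the batch with negatives.
    n_pos = min(len(pos), int(batch_size * pos_fraction))
    n_neg = min(len(neg), batch_size - n_pos)
    return rng.sample(pos, n_pos), rng.sample(neg, n_neg)


# Positives are usually scarce, so negatives pad out the batch.
labels = [1] * 10 + [0] * 1000 + [-1] * 50
pos_idx, neg_idx = sample_anchor_batch(labels)
print(len(pos_idx), len(neg_idx))  # 10 246
```

Without such a cap, the far more numerous background anchors would dominate the loss.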

Performance

[Figure: Faster R-CNN performance results]

Check the full introduction at Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks - Ren - NIPS 2015.

R-FCN: Object Detection via Region-based Fully Convolutional Networks - Dai - NIPS 2016 - MXNet Code

Info

Title: R-FCN: Object Detection via Region-based Fully Convolutional Networks
Task: Object Detection
Author: Jifeng Dai, Yi Li, Kaiming He, and Jian Sun
Arxiv: 1605.06409
Published: NIPS 2016

Highlights

Fully convolutional network, sharing computation across RoIs

Design

[Figure: R-FCN design]

The paper points out an unnatural design in earlier detection frameworks: a fully convolutional feature extractor followed by a fully connected classifier, even though the best-performing image classifiers (ResNet, etc.) are themselves fully convolutional. The authors attribute this to the conflict between the translation invariance wanted for classification and the translation sensitivity needed for detection. In other words, when a detection model reuses the feature extractor of a classification model, position information is lost. The paper proposes to solve this problem with “position-sensitive score maps”.
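The idea behind position-sensitive score maps can be illustrated with a toy example. This is a simplified sketch under stated assumptions: one class only, k x k score maps stored as nested lists, average pooling inside each bin followed by averaging the bins as the vote, and hypothetical helper names (`ps_roi_pool`, `toy_map`). It is not the released MXNet code.

```python
def ps_roi_pool(score_maps, roi, k=2):
    """Score one RoI for one class.

    score_maps[i][j] is a 2D map dedicated to bin (i, j) of the RoI,
    e.g. score_maps[0][0] responds to the top-left part of an object.
    roi: (x0, y0, x1, y1) in map cells; x1 and y1 are exclusive.
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    bin_scores = []
    for i in range(k):
        ys, ye = y0 + (i * h) // k, y0 + -(-((i + 1) * h) // k)
        for j in range(k):
            xs, xe = x0 + (j * w) // k, x0 + -(-((j + 1) * w) // k)
            # Bin (i, j) reads only from its own score map.
            cells = [score_maps[i][j][y][x]
                     for y in range(ys, ye) for x in range(xs, xe)]
            bin_scores.append(sum(cells) / len(cells))
    return sum(bin_scores) / len(bin_scores)  # average vote over k * k bins


def toy_map(i, j, size=6, obj=(1, 1, 5, 5), k=2):
    """Toy map that fires only on part (i, j) of a made-up object box."""
    x0, y0, x1, y1 = obj
    bh, bw = (y1 - y0) // k, (x1 - x0) // k
    return [[1.0 if (y0 + i * bh <= y < y0 + (i + 1) * bh
                     and x0 + j * bw <= x < x0 + (j + 1) * bw) else 0.0
             for x in range(size)] for y in range(size)]


maps = [[toy_map(i, j) for j in range(2)] for i in range(2)]
print(ps_roi_pool(maps, (1, 1, 5, 5)))  # 1.0, RoI aligned with the object
print(ps_roi_pool(maps, (0, 0, 4, 4)))  # 0.25, RoI shifted by one cell
```

Shifting the RoI by a single cell lowers the score sharply, which is the translation sensitivity that a plain classification backbone lacks, and all the per-RoI work is just pooling over maps computed once for the whole image.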

Performance & Ablation Study

The comparison with Faster R-CNN shows that R-FCN achieves better accuracy with shorter inference time.

[Figure: R-FCN compared with Faster R-CNN]

Code

MXNet

Check full introduction at R-FCN: Object Detection via Region-based Fully Convolutional Networks - Dai - NIPS 2016

(FPN) Feature Pyramid Networks for Object Detection - Lin - CVPR 2017

Info

Title: Feature Pyramid Networks for Object Detection
Task: Object Detection
Author: Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie
Date: December 2016
Arxiv: 1612.03144
Published: CVPR 2017

Highlights

Image pyramid to feature pyramid

Feature Pyramid Networks

[Figure: Feature Pyramid Network structure]

Starting from the input image, bottom-up feature extraction proceeds as usual, and a top-down path is added: starting from the highest-level feature map, each map is upsampled by nearest-neighbor interpolation to the same size as the next lower-level feature map. The lateral connection from that lower-level map is then merged with it by element-wise addition to form the feature at that level.

The idea behind this operation is that low-level feature maps carry more location information while high-level feature maps carry better classification information; combining the two aims to satisfy the detection task's dual requirement of localization and classification.
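One top-down merge step can be sketched in a few lines. This is a simplified illustration with hypothetical function names: it works on single-channel maps stored as lists of rows and leaves out the 1x1 convolution on the lateral branch and the 3x3 convolution applied after merging, both of which the paper uses.

```python
def upsample_nearest_2x(feature_map):
    """Double height and width by repeating every cell (nearest neighbor)."""
    out = []
    for row in feature_map:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge_top_down(coarse, lateral):
    """Upsample the coarser map and add the lateral map element-wise."""
    up = upsample_nearest_2x(coarse)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(up, lateral)]


coarse = [[1, 2],
          [3, 4]]                       # high-level, low-resolution map
lateral = [[10] * 4 for _ in range(4)]  # low-level, high-resolution map
for row in merge_top_down(coarse, lateral):
    print(row)
# [11, 11, 12, 12]
# [11, 11, 12, 12]
# [13, 13, 14, 14]
# [13, 13, 14, 14]
```

Repeating this step from the top level downward gives one merged map per level, each at the resolution of the corresponding bottom-up map.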

Performance & Ablation Study

The main experimental results of the paper are as follows:

[Figure: FPN main experimental results]

Comparing the different head designs shows that changing the input features does improve detection accuracy, and that the lateral and top-down connections are both indispensable.

Code

Caffe2 (FAIR’s Detectron)

Check the full introduction at (FPN) Feature Pyramid Networks for Object Detection - Lin - CVPR 2017.

