Fast R-CNN - Girshick - ICCV 2015 - Caffe Code
Info
- Title: Fast R-CNN
- Task: Object Detection
- Author: Ross Girshick
- Arxiv: 1504.08083
- Date: April 2015
- Published: ICCV 2015
Highlights
- An improvement to [R-CNN](https://blog.ddlee.cn/posts/415f4992/), built around the RoI Pooling design
- The paper is clearly structured
R-CNN’s Drawbacks
- Training is a multi-stage process (Proposal, Classification, Regression)
- Training takes a lot of time and effort
- Inference is slow
These costs arise because the CNN is run separately on every Proposal, with no computation shared between them.
Architecture

The picture above shows the architecture of Fast R-CNN. The image is passed through the feature extractor to produce a feature map, and each RoI (Region of Interest) proposed by the Selective Search algorithm is mapped onto that feature map. RoI Pooling is then applied to each RoI to obtain a feature vector of the same length, which feeds the Classification and BBox Regression branches.
This structure of Fast R-CNN is the prototype of the meta-architecture used by mainstream 2-stage detection methods. The entire system consists of several components: Proposal, Feature Extractor, and Object Recognition & Localization. In later work, the Proposal part is replaced by an RPN (Faster R-CNN), the Feature Extractor part uses a state-of-the-art classification CNN (ResNet, etc.), and the last part is often a parallel multi-task structure (Mask R-CNN, etc.).
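The RoI Pooling step described above can be sketched in plain Python. This is a minimal illustration, not the paper's Caffe implementation: the function name `roi_max_pool`, the single-channel nested-list feature map, and the integer bin-splitting rule are all assumptions made for the example.

```python
def roi_max_pool(feature_map, roi, output_size=2):
    """Max-pool one RoI of a single-channel feature map into a fixed grid.

    feature_map: list of rows (H x W) of numbers.
    roi: (x1, y1, x2, y2) in feature-map coordinates, inclusive.
    Returns an output_size x output_size list of rows.
    """
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    pooled = []
    for i in range(output_size):
        # Integer bin edges; every bin covers at least one cell.
        y_start = y1 + (i * roi_h) // output_size
        y_end = max(y1 + ((i + 1) * roi_h) // output_size, y_start + 1)
        row = []
        for j in range(output_size):
            x_start = x1 + (j * roi_w) // output_size
            x_end = max(x1 + ((j + 1) * roi_w) // output_size, x_start + 1)
            row.append(max(
                feature_map[y][x]
                for y in range(y_start, y_end)
                for x in range(x_start, x_end)
            ))
        pooled.append(row)
    return pooled
```

RoIs of different sizes all come out as the same `output_size x output_size` grid, which is what lets a single fully connected head process every proposal.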
Performance & Ablation Study

Code
Check full introduction at Fast R-CNN - Girshick - ICCV 2015 - Caffe Code
Faster R-CNN: Towards Real Time Object Detection with Region Proposal - Ren - NIPS 2015
Info
- Title: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
- Task: Object Detection
- Author: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
- Date: June 2015
- Arxiv: 1506.01497
- Published: NIPS 2015
Highlights
Faster R-CNN is the mainstream 2-stage detection method. Its RPN (Region Proposal Network) replaces the Selective Search algorithm, so that the detection task can be completed end-to-end by the neural network. Roughly speaking, Faster R-CNN = RPN + Fast R-CNN. Because the RPN shares its convolutional computation with Fast R-CNN, it adds very little extra cost, allowing Faster R-CNN to run at 5 fps on a single GPU while reaching state-of-the-art accuracy.
Region Proposal Networks

The RPN models the Proposal task as a binary classification problem (object vs. background).
The first step is to generate anchor boxes of different sizes and aspect ratios at each sliding-window position, choose IoU thresholds, and label each anchor box as positive or negative according to its overlap with the Ground Truth. The samples passed to the RPN are therefore the anchor boxes together with a label saying whether each one contains an object. The RPN maps each sample to a probability value and four coordinate values. The probability value reflects how likely the anchor box is to contain an object, and the four coordinate values are used to regress the position of the object. Finally, the classification loss and the coordinate regression loss are combined into a single training objective for the RPN.
The RPN has a large number of hyperparameters: the sizes and aspect ratios of the anchor boxes, the IoU thresholds, and the ratio of positive to negative Proposal samples on each image.
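The anchor generation and labeling steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names (`make_anchors`, `iou`, `label_anchor`) are made up for this example, the default scales, ratios, and 0.7 / 0.3 thresholds follow the values reported in the paper, and the paper's extra rule that the highest-IoU anchor for each Ground Truth box is also positive is left out.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred at (cx, cy).

    ratio is height / width; each anchor keeps an area of about scale**2.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """1 = object, 0 = background, -1 = ignored during training."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return -1
```

Anchors whose best IoU falls between the two thresholds get the label -1 and contribute nothing to the loss.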
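To make these knobs concrete, here is a hedged sketch that collects them in one place and shows how the positive/negative ratio is applied when sampling a mini-batch from one image. The dictionary `RPN_DEFAULTS` and the function `sample_anchors` are hypothetical names; the default values (mini-batch of 256 anchors, positives capped at half of it, shortfall padded with negatives) follow the settings reported in the paper.

```python
import random

# Illustrative defaults; the key names are assumptions for this sketch.
RPN_DEFAULTS = {
    "anchor_scales": (128, 256, 512),
    "anchor_ratios": (0.5, 1.0, 2.0),
    "pos_iou_thresh": 0.7,
    "neg_iou_thresh": 0.3,
    "batch_size": 256,
    "pos_fraction": 0.5,
}


def sample_anchors(labels, batch_size=256, pos_fraction=0.5, seed=0):
    """Pick anchor indices for one image's RPN mini-batch.

    labels: list with 1 (object), 0 (background), -1 (ignored).
    Returns (positive_indices, negative_indices).
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(labels) if label == 1]
    neg = [i for i, label in enumerate(labels) if label == 0]
    # Cap positives at pos_fraction of the batch, then fill with negatives.
    num_pos = min(len(pos), int(batch_size * pos_fraction))
    num_neg = min(len(neg), batch_size - num_pos)
    return rng.sample(pos, num_pos), rng.sample(neg, num_neg)
```

Because most anchors are background, capping the negatives this way keeps the classification loss from being dominated by easy negative samples.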
Performance

Check full introduction at Faster R-CNN: Towards Real Time Object Detection with Region Proposal - Ren - NIPS 2015.
R-FCN: Object Detection via Region-based Fully Convolutional Networks - Dai - NIPS 2016 - MXNet Code
Info
- Title: R-FCN: Object Detection via Region-based Fully Convolutional Networks
- Task: Object Detection
- Author: Jifeng Dai, Yi Li, Kaiming He, and Jian Sun
- Arxiv: 1605.06409
- Published: NIPS 2016
Highlights
- Fully convolutional network, sharing computation across RoIs
Design

The paper points out an unnatural design in earlier detection frameworks: a fully convolutional feature extractor followed by a fully connected classifier, even though the best-performing image classifiers (ResNet, etc.) are themselves fully convolutional. This mismatch is caused by the conflict between the translation invariance of the classification task and the translation sensitivity of the detection task. In other words, when a detection model reuses the feature extractor of a classification model, position information is lost. The paper proposes to solve this problem with a “position-sensitive score map” approach.
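The position-sensitive score map idea can be illustrated with a small sketch for a single class. It assumes k*k score maps stored as nested lists, average pooling inside each bin, and a final mean over bins as the vote; the function name `ps_roi_pool` and the integer bin-splitting rule are assumptions of this example, not the MXNet code.

```python
def ps_roi_pool(score_maps, roi, k=3):
    """Position-sensitive RoI pooling for a single class.

    score_maps: k*k maps (each an H x W list of rows); the map at index
        i*k + j is dedicated to the bin in row i, column j of the RoI.
    roi: (x1, y1, x2, y2) in feature-map coordinates, inclusive.
    Returns the class score: the mean of the k*k bin responses.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1
    bin_scores = []
    for i in range(k):
        y_start = y1 + (i * roi_h) // k
        y_end = max(y1 + ((i + 1) * roi_h) // k, y_start + 1)
        for j in range(k):
            x_start = x1 + (j * roi_w) // k
            x_end = max(x1 + ((j + 1) * roi_w) // k, x_start + 1)
            # Each bin reads only from its own dedicated score map.
            fmap = score_maps[i * k + j]
            cells = [fmap[y][x]
                     for y in range(y_start, y_end)
                     for x in range(x_start, x_end)]
            bin_scores.append(sum(cells) / len(cells))
    return sum(bin_scores) / len(bin_scores)
```

Since the top-left bin only looks at the "top-left" map, the bottom-right bin only at the "bottom-right" map, and so on, a shifted RoI produces a different score, which restores position sensitivity without any per-RoI fully connected layers.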
Performance & Ablation Study
The comparison with Faster R-CNN shows that R-FCN achieves better accuracy while maintaining shorter inference time.

Code
Check full introduction at R-FCN: Object Detection via Region-based Fully Convolutional Networks - Dai - NIPS 2016
(FPN)Feature Pyramid Networks for Object Detection - Lin - CVPR 2017
Info
- Title: Feature Pyramid Networks for Object Detection
- Task: Object Detection
- Author: Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie
- Date: December 2016
- Arxiv: 1612.03144
- Published: CVPR 2017
Highlights
- From image pyramid to feature pyramid
Feature Pyramid Networks

As the picture shows, bottom-up feature extraction is performed as usual, and a top-down path is added: starting from the highest-level feature map, each map is upsampled with nearest-neighbor interpolation to the size of the next lower-level feature map. A lateral connection then merges the two by element-wise addition to form the feature at that level.
The intuition behind this operation is that low-level feature maps contain more location information while high-level feature maps contain better classification information; combining the two aims to meet the detection task's dual requirement of localization and classification.
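The top-down merge described above can be sketched on single-channel nested lists. This is a simplified illustration: the function names are made up for this example, the lateral maps are assumed to be already projected to a common channel count (the paper uses a 1x1 convolution for that), each level is assumed to be exactly twice the size of the next coarser one, and the 3x3 smoothing convolution the paper applies after each merge is omitted.

```python
def upsample_nearest_2x(fmap):
    """Double height and width by repeating every cell (nearest neighbor)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge_top_down(lateral, top):
    """Element-wise sum of a lateral map and the upsampled coarser map."""
    up = upsample_nearest_2x(top)
    return [[l + u for l, u in zip(lrow, urow)]
            for lrow, urow in zip(lateral, up)]


def build_pyramid(laterals):
    """laterals: maps ordered fine to coarse, each half the size of the last.

    Returns the merged pyramid in the same fine-to-coarse order.
    """
    pyramid = [laterals[-1]]
    for lateral in reversed(laterals[:-1]):
        pyramid.insert(0, merge_top_down(lateral, pyramid[0]))
    return pyramid
```

Every level of the returned pyramid carries information from all coarser levels, so detection heads attached to the fine levels still see the high-level semantics.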
Performance & Ablation Study
The main experimental results of the article are as follows:

Comparing the different head parts, the changed input features do improve detection accuracy, and both the lateral and top-down connections are indispensable.
Code
Check full introduction at (FPN)Feature Pyramid Networks for Object Detection - Lin - CVPR 2017.
Related
- Object Detection Must Reads(2): YOLO, YOLO9000, and RetinaNet
- Object Detection Must Reads(3): SNIP, SNIPER, OHEM, and DSOD
- RoIPooling in Object Detection: PyTorch Implementation(with CUDA)
- Bounding Box(BBOX) IOU Calculation and Transformation in PyTorch
- Object Detection: Anchor Generator in PyTorch
- Assign Ground Truth to Anchors in Object Detection with Python
- (Soft)NMS in Object Detection: PyTorch Implementation(with CUDA)
- From Classification to Panoptic Segmentation: 7 years of Visual Understanding with Deep Learning
- Convolutional Neural Network Must Reads: Xception, ShuffleNet, ResNeXt and DenseNet
- Anchor-Free Object Detection(Part 1): CornerNet, CornerNet-Lite, ExtremeNet, CenterNet
- Anchor-Free Object Detection(Part 2): FSAF, FoveaBox, FCOS, RepPoints