Assume the output of a CNN layer has dimensions N × N × d, i.e., the responses of d filters at each of N × N spatial cells. Each spatial cell is computed from a receptive field in the input image.
The receptive fields of different spatial cells can overlap heavily in the input image. The size of a receptive field can be computed layer by layer through the CNN. In a convolution (pooling) layer with filter (pooling) size a × a and stride s, T × T cells in the output of the layer correspond to [s(T − 1) + a] × [s(T − 1) + a] cells in the input of the layer. For example, one cell in the CONV5 (5th convolutional) layer of the CNN model imagenet-vgg-m [40] corresponds to a 139 × 139 receptive field in the 224 × 224 input image (cf. Fig. 4).
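The layer-by-layer computation can be sketched as follows: start from T × T cells at the output of the last layer and apply the rule s(T − 1) + a backwards through each layer to the input. The function name `receptive_field` and the list `ASSUMED_VGG_M_LAYERS` are illustrative, and the (filter size, stride) values listed for imagenet-vgg-m are an assumption about that model's configuration rather than something stated in the text; padding is ignored, as in the formula above.

```python
def receptive_field(layers, t=1):
    """Side length of the input region covered by t x t output cells.

    layers: list of (a, s) pairs, ordered from input to output, where a is
            the filter (or pooling) size and s is the stride of that layer.
    """
    size = t
    # Walk from the last layer back to the first, applying s*(T - 1) + a.
    for a, s in reversed(layers):
        size = s * (size - 1) + a
    return size


# Assumed (filter size, stride) values up to CONV5; not taken from the text.
ASSUMED_VGG_M_LAYERS = [
    (7, 2),  # conv1
    (3, 2),  # pool1
    (5, 2),  # conv2
    (3, 2),  # pool2
    (3, 1),  # conv3
    (3, 1),  # conv4
    (3, 1),  # conv5
]

print(receptive_field(ASSUMED_VGG_M_LAYERS))  # prints 139
```

Under these assumed layer settings, the sketch reproduces the 139 × 139 receptive field quoted for one CONV5 cell; passing a larger `t` gives the input region covered by a T × T block of cells.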