The pooling operation involves sliding a two-dimensional filter over each channel of the feature map and summarising the features lying within the region covered by the filter.
For a feature map of dimensions nh x nw x nc, the output obtained after a pooling layer has dimensions
floor((nh - f)/s + 1) x floor((nw - f)/s + 1) x nc
where
nh - height of feature map
nw - width of feature map
nc - number of channels in the feature map
f - size of filter
s - stride length
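As a quick sanity check of this formula, it can be expressed as a small helper function (the name and signature here are illustrative, not from any library):

```python
def pooled_shape(nh, nw, nc, f, s):
    """Output shape of a pooling layer: floor((n - f)/s) + 1 per spatial dim."""
    return ((nh - f) // s + 1, (nw - f) // s + 1, nc)

# A 4x4 single-channel map with a 2x2 filter and stride 2 pools down to 2x2x1.
print(pooled_shape(4, 4, 1, f=2, s=2))  # (2, 2, 1)
```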
A common CNN model architecture is to have a number of convolution and pooling layers stacked one after the other.
Why Pooling Layers?
- Pooling layers are used to reduce the dimensions of the feature maps. Thus, it reduces the number of parameters to learn and the amount of computation performed in the network.
- The pooling layer summarises the features present in a region of the feature map generated by a convolution layer. So, further operations are performed on summarised features instead of precisely positioned features generated by the convolution layer. This makes the model more robust to variations in the position of the features in the input image.
Types of Pooling:
- MaxPooling
- Average Pooling
- Global Pooling
Max Pooling
Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after max-pooling layer would be a feature map containing the most prominent features of the previous feature map.
Average Pooling
Average pooling computes the average of the elements present in the region of feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of features present in a patch.
Global pooling reduces each channel in the feature map to a single value. Thus, an nh x nw x nc feature map is reduced to a 1 x 1 x nc feature map. This is equivalent to using a filter of dimensions nh x nw, i.e. the dimensions of the feature map itself.
Further, it can be either global max pooling or global average pooling.
MaxPooling is a down-sampling operation often used in Convolutional Neural Networks (CNNs) to reduce the spatial dimensions of the input volume. It is a form of pooling layer, and it helps in retaining the most important information while discarding less important details. MaxPooling is typically applied after convolutional layers in a CNN.
The basic idea behind MaxPooling is to divide the input image into non-overlapping rectangular regions and, for each region, output the maximum value. This operation is performed independently for each channel in the input.
Here’s a simple explanation of how MaxPooling works:
Input Region:
- The input image is divided into small regions (usually 2x2 or 3x3).
- For each region, the maximum value is computed.
Output Feature Map:
- The maximum value for each region is taken and forms the output of that region.
- The result is a down-sampled version of the input, with reduced spatial dimensions.
Mathematically, if we denote the input as X and the output as Y, the MaxPooling operation can be defined as:
Y[i,j,k]=max(X[2i:2i+2,2j:2j+2,k])
where i and j iterate over the height and width dimensions of the input, and k iterates over the channels.
Common choices for the size of the pooling window are 2x2 or 3x3, and the stride (the step size when moving the pooling window) is often set to be equal to the size of the window for non-overlapping pooling.
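The definition above can be sketched directly in NumPy; this is a minimal loop-based version for a single (H, W, C) array, written to mirror the formula rather than for speed:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W, C) array.

    Implements Y[i, j, k] = max(X[2i:2i+2, 2j:2j+2, k]).
    """
    h, w, c = x.shape
    y = np.empty((h // 2, w // 2, c), dtype=x.dtype)
    for i in range(h // 2):
        for j in range(w // 2):
            for k in range(c):
                y[i, j, k] = x[2*i:2*i+2, 2*j:2*j+2, k].max()
    return y

x = np.arange(16, dtype=float).reshape(4, 4, 1)
print(max_pool_2x2(x)[..., 0])
# [[ 5.  7.]
#  [13. 15.]]
```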
import numpy as np
from keras.models import Sequential
from keras.layers import MaxPooling2D

# define input image
image = np.array([[2, 2, 7, 3],
                  [9, 4, 6, 1],
                  [8, 5, 2, 4],
                  [3, 1, 2, 6]])
image = image.reshape(1, 4, 4, 1)

# define model containing just a single max pooling layer
model = Sequential(
    [MaxPooling2D(pool_size=2, strides=2)])
# generate pooled output
output = model.predict(image)
# print output image
output = np.squeeze(output)
print(output)
[[9. 7.]
[8. 6.]]
Let’s go through a simple example of MaxPooling with a 2x2 pooling window. Consider a small 4x4 input matrix X:

X = [ 1  5  2  7
      3  6  4  8
      9 13 11 15
     10 14 12 16 ]
Now, let’s apply 2x2 MaxPooling to this input matrix. The pooling operation involves moving a 2x2 window across the input and, for each window, taking the maximum value. The output matrix, Y, will have reduced spatial dimensions.
Y[i,j]=max(X[2i:2i+2,2j:2j+2])
Let’s calculate Y step by step:
- For i=0 and j=0:
Y[0,0] = max(X[0:2,0:2]) = max([1 5; 3 6]) = 6
- For i=0 and j=1:
Y[0,1] = max(X[0:2,2:4]) = max([2 7; 4 8]) = 8
- For i=1 and j=0:
Y[1,0] = max(X[2:4,0:2]) = max([9 13; 10 14]) = 14
- For i=1 and j=1:
Y[1,1] = max(X[2:4,2:4]) = max([11 15; 12 16]) = 16
The resulting output matrix Y is:
Y = [ 6  8
     14 16 ]
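The same result can be checked in NumPy. The values of X below are taken from the step-by-step calculation above; the reshape trick groups the matrix into 2x2 blocks and takes the block-wise maximum:

```python
import numpy as np

# The 4x4 input used in the step-by-step calculation above.
X = np.array([[ 1,  5,  2,  7],
              [ 3,  6,  4,  8],
              [ 9, 13, 11, 15],
              [10, 14, 12, 16]])

# Split into 2x2 blocks, then take the max within each block.
Y = X.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(Y)
# [[ 6  8]
#  [14 16]]
```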
Max pooling offers several benefits in the context of CNNs:
- Feature Invariance: Max pooling helps the model to become invariant to the location and orientation of features. This means that the network can recognize an object in an image no matter where it is located.
- Dimensionality Reduction: By downsampling the input, max pooling significantly reduces the number of parameters and computations in the network, thus speeding up the learning process and reducing the risk of overfitting.
- Noise Suppression: Max pooling helps to suppress noise in the input data. By taking the maximum value within the window, it emphasizes the presence of strong features and diminishes the weaker ones.
In practice, max pooling layers are placed after convolutional layers in a CNN. After a convolutional layer extracts features from the input image, the max pooling layer reduces the spatial size of the convolved feature map, keeping only the most salient information. This process is repeated for multiple convolutional and pooling layers, allowing the network to learn a hierarchy of features at various levels of abstraction.
Max pooling is a simple yet effective technique that has been instrumental in the success of CNNs in various applications, particularly in image and video recognition tasks. Its ability to reduce the computational burden while maintaining the essential features has made it a staple component in deep learning architectures.
Despite its benefits, max pooling is not without its challenges. One criticism is that it can sometimes be too aggressive, discarding potentially useful information that could be important for the classification task. Moreover, max pooling is a fixed operation and does not learn from the data, unlike convolutional layers that have learnable parameters.
As a result, some modern CNN architectures have started to move away from traditional max pooling layers, using alternatives like strided convolutions for downsampling or incorporating learnable pooling operations that can adapt to the data.
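As a sketch of that alternative (the input shape here is an assumption for illustration), a stride-2 convolution halves the spatial dimensions just like 2x2 pooling, but its kernel weights are learned from the data:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Downsampling with a strided convolution instead of max pooling:
# the stride-2 conv halves the spatial dimensions, but unlike pooling
# its kernel is learnable.
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(filters=64, kernel_size=3, strides=2, padding="same"),
])
print(model.output_shape)  # (None, 16, 16, 64)
```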
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(filters=64, kernel_size=3),  # activation is None
    layers.MaxPool2D(pool_size=2),
    # More layers follow
])
A MaxPool2D layer is much like a Conv2D layer, except that it uses a simple maximum function instead of a kernel, with the pool_size parameter analogous to kernel_size. Unlike a convolutional layer, however, a MaxPool2D layer has no trainable weights.
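We can confirm this by counting parameters. In the sketch below (the input shape is an assumption added so the weights get built), the Conv2D layer holds 3*3*3*64 kernel weights plus 64 biases, while MaxPool2D holds none:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),            # input shape is an assumption
    layers.Conv2D(filters=64, kernel_size=3),  # learns a kernel + bias
    layers.MaxPool2D(pool_size=2),             # fixed maximum, nothing to learn
])

# Conv2D: 3*3*3*64 + 64 = 1792 parameters; MaxPool2D: 0 parameters.
for layer in model.layers:
    print(layer.name, layer.count_params())
```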
Let’s take another look at the extraction figure from the last lesson. Remember that MaxPool2D is the Condense step.
Notice that after applying the ReLU function (Detect) the feature map ends up with a lot of “dead space,” that is, large areas containing only 0’s (the black areas in the image). Having to carry these 0 activations through the entire network would increase the size of the model without adding much useful information. Instead, we would like to condense the feature map to retain only the most useful part — the feature itself.
This in fact is what maximum pooling does. Max pooling takes a patch of activations in the original feature map and replaces them with the maximum activation in that patch.
When applied after the ReLU activation, it has the effect of “intensifying” features. The pooling step increases the proportion of active pixels to zero pixels.
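A toy example makes this concrete. Below is a made-up post-ReLU feature map, mostly zeros with two activations; pooling raises the fraction of nonzero pixels:

```python
import numpy as np

# A hypothetical post-ReLU feature map: mostly zeros with a few activations.
fmap = np.zeros((4, 4))
fmap[1, 1], fmap[2, 3] = 0.9, 0.4

# 2x2 max pooling via block-wise maximum.
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(np.mean(fmap > 0))    # 0.125 -> 2 of 16 pixels active
print(np.mean(pooled > 0))  # 0.5   -> 2 of 4 pixels active
```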
Translation Invariance
We called the zero-pixels “unimportant”. Does this mean they carry no information at all? In fact, the zero-pixels carry positional information. The blank space still positions the feature within the image. When MaxPool2D
removes some of these pixels, it removes some of the positional information in the feature map. This gives a convnet a property called translation invariance. This means that a convnet with maximum pooling will tend not to distinguish features by their location in the image. ("Translation" is the mathematical word for changing the position of something without rotating it or changing its shape or size.)
Watch what happens when we repeatedly apply maximum pooling to the following feature map.
The two dots in the original image became indistinguishable after repeated pooling. In other words, pooling destroyed some of their positional information. Since the network can no longer distinguish between them in the feature maps, it can’t distinguish them in the original image either: it has become invariant to that difference in position.
In fact, pooling only creates translation invariance in a network over small distances, as with the two dots in the image. Features that begin far apart will remain distinct after pooling; only some of the positional information was lost, but not all of it.
This invariance to small differences in the positions of features is a nice property for an image classifier to have. Just because of differences in perspective or framing, the same kind of feature might be positioned in various parts of the original image, but we would still like for the classifier to recognize that they are the same.
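A small NumPy demonstration of both points: two maps whose single "dot" differs by one pixel become identical after pooling, while a dot that starts far away stays distinct:

```python
import numpy as np

def pool2(x):
    """2x2 max pooling with stride 2 on a 2-D array."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))

a = np.zeros((8, 8)); a[2, 2] = 1.0   # a single bright "dot"
b = np.zeros((8, 8)); b[3, 3] = 1.0   # same dot, shifted by one pixel
c = np.zeros((8, 8)); c[6, 6] = 1.0   # same dot, far away

# The small shift is erased by pooling; the large one is not.
print(np.array_equal(pool2(a), pool2(b)))  # True
print(np.array_equal(pool2(a), pool2(c)))  # False
```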
Other Pooling Layers
import numpy as np
from keras.models import Sequential
from keras.layers import AveragePooling2D

# define input image
image = np.array([[2, 2, 7, 3],
                  [9, 4, 6, 1],
                  [8, 5, 2, 4],
                  [3, 1, 2, 6]])
image = image.reshape(1, 4, 4, 1)

# define model containing just a single average pooling layer
model = Sequential(
    [AveragePooling2D(pool_size=2, strides=2)])
# generate pooled output
output = model.predict(image)
# print output image
output = np.squeeze(output)
print(output)
[[4.25 4.25]
[4.25 3.5 ]]
import numpy as np
from keras.models import Sequential
from keras.layers import GlobalMaxPooling2D
from keras.layers import GlobalAveragePooling2D

# define input image
image = np.array([[2, 2, 7, 3],
                  [9, 4, 6, 1],
                  [8, 5, 2, 4],
                  [3, 1, 2, 6]])
image = image.reshape(1, 4, 4, 1)

# define gm_model containing just a single global-max pooling layer
gm_model = Sequential(
    [GlobalMaxPooling2D()])

# define ga_model containing just a single global-average pooling layer
ga_model = Sequential(
    [GlobalAveragePooling2D()])
# generate pooled output
gm_output = gm_model.predict(image)
ga_output = ga_model.predict(image)
# print output image
gm_output = np.squeeze(gm_output)
ga_output = np.squeeze(ga_output)
print("gm_output: ", gm_output)
print("ga_output: ", ga_output)
gm_output:  9.0
ga_output:  4.0625
This concludes our basic overview of the pooling layers used in CNN architectures.