Deep Learning: GoogLeNet Explained
Characteristics and features of the GoogLeNet configuration table (figure 1)
- The input layer of the GoogLeNet architecture takes in an image of dimensions 224×224.
- Type: This refers to the name of the current layer or component within the architecture.
- Patch Size: Refers to the size of the sweeping window utilised across conv and pooling layers. Sweeping windows have equal height and width.
- Stride: Defines the amount of shift the filter/sliding window takes over the input image.
- Output Size: The resulting output dimensions (height, width, number of feature maps) of the current architecture component after the input is passed through the layer.
- Depth: Refers to the number of levels/layers within an architecture component.
- #1×1 #3×3 #5×5: Refers to the various convolution filters used within the inception module (a code sketch of an inception module follows this list).
- #3×3 reduce #5×5 reduce: Refers to the number of 1×1 filters used before the 3×3 and 5×5 convolutions.
- Pool Proj: This is the number of 1×1 filters used after pooling within an inception module.
- Params: Refers to the number of weights within the current architecture component.
- Ops: Refers to the number of mathematical operations carried out within the component.
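To make these columns concrete, below is a minimal Keras sketch of a single inception module. The filter counts mirror those listed for inception (3a) in the configuration table; the function name and exact layer settings are illustrative assumptions, not an official implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1x1, f3x3_reduce, f3x3, f5x5_reduce, f5x5, pool_proj):
    # Branch 1: 1x1 convolution (#1x1)
    b1 = layers.Conv2D(f1x1, 1, padding="same", activation="relu")(x)
    # Branch 2: 1x1 reduction (#3x3 reduce) followed by a 3x3 convolution (#3x3)
    b2 = layers.Conv2D(f3x3_reduce, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3x3, 3, padding="same", activation="relu")(b2)
    # Branch 3: 1x1 reduction (#5x5 reduce) followed by a 5x5 convolution (#5x5)
    b3 = layers.Conv2D(f5x5_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5x5, 5, padding="same", activation="relu")(b3)
    # Branch 4: 3x3 max pooling followed by a 1x1 projection (pool proj)
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(pool_proj, 1, padding="same", activation="relu")(b4)
    # The four branches are concatenated along the channel axis
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])

inputs = tf.keras.Input(shape=(28, 28, 192))                 # input to inception (3a)
outputs = inception_module(inputs, 64, 96, 128, 16, 32, 32)
print(outputs.shape)                                         # (None, 28, 28, 256)
```

Because the branches are concatenated along the channel axis, the output depth (256 here) is simply the sum of the #1×1, #3×3, #5×5 and pool proj filter counts.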
At its inception, the GoogLeNet architecture was designed to be a powerhouse with increased computational efficiency compared to some of its predecessors or similar networks created at the time.
One way GoogLeNet achieves efficiency is by reducing the input image early in the network, whilst simultaneously retaining important spatial information.
The first conv layer in figure 2 uses a filter (patch) size of 7×7, which is relatively large compared to other patch sizes within the network. This layer’s primary purpose is to immediately reduce the input image without losing spatial information, which is why such a large filter size is utilised.
The input image size (height and width) is reduced by a factor of four at the second conv layer and a factor of eight before reaching the first inception module, but a larger number of feature maps are generated.
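A rough sketch of these early layers in Keras helps confirm the reductions. The layer settings (7×7/2 conv, 3×3/2 max pool, 1×1 and 3×3 convs, 3×3/2 max pool) follow the standard GoogLeNet configuration and are used here purely for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(224, 224, 3))
y = layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(x)  # 112x112x64
y = layers.MaxPooling2D(3, strides=2, padding="same")(y)                   # 56x56x64  (factor of 4)
y = layers.Conv2D(64, 1, padding="same", activation="relu")(y)             # 56x56x64  (1x1 reduction)
y = layers.Conv2D(192, 3, padding="same", activation="relu")(y)            # 56x56x192
y = layers.MaxPooling2D(3, strides=2, padding="same")(y)                   # 28x28x192 (factor of 8)
print(y.shape)  # (None, 28, 28, 192)
```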
The second conv layer has a depth of two and leverages the 1×1 conv block, which has the effect of dimensionality reduction. Dimensionality reduction through the 1×1 conv block decreases the computational load by lessening the layer’s number of operations.
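The saving is easy to see with a back-of-the-envelope count of multiply-accumulate operations. The numbers below (a 28×28×192 input, a 3×3 conv with 128 output filters, and a 1×1 reduction to 96 channels) are chosen purely for illustration.

```python
# Compare a 3x3 convolution applied directly to 192 channels with the same
# convolution applied after a 1x1 reduction to 96 channels.
H, W, C_in, C_out, reduce_to = 28, 28, 192, 128, 96

direct  = H * W * 3 * 3 * C_in * C_out                    # 3x3 conv on all 192 channels
reduced = (H * W * 1 * 1 * C_in * reduce_to               # 1x1 reduction to 96 channels
           + H * W * 3 * 3 * reduce_to * C_out)           # 3x3 conv on the reduced input

print(f"direct : {direct:,} ops")    # 173,408,256
print(f"reduced: {reduced:,} ops")   # 101,154,816 -> roughly 40% fewer operations
```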
The GoogLeNet architecture consists of nine inception modules, as depicted in figure 3.
Notably, there are two max-pooling layers between some inception modules.
The purpose of these max-pooling layers is to downsample the input as it’s fed forward through the network. This is achieved through the reduction of the height and width of the input data.
Reducing the input size between the inception modules is another effective method of lessening the network’s computational load.
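As a small sketch, assuming the usual 3×3 window with stride 2 and the 28×28×480 output of inception (3b) as input:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(28, 28, 480))                             # output of inception (3b)
y = layers.MaxPooling2D(pool_size=3, strides=2, padding="same")(x)
print(y.shape)                                                      # (None, 14, 14, 480): height and width halved
```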
The average pooling layer takes the mean of each feature map produced by the last inception module, reducing the input height and width to 1×1.
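For instance, assuming the 7×7×1024 output of the last inception module:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(7, 7, 1024))                  # output of the last inception module
y = layers.AveragePooling2D(pool_size=7, strides=1)(x)  # mean over each 7x7 feature map
print(y.shape)                                          # (None, 1, 1, 1024)
```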
A dropout layer (40%) is utilised just before the linear layer. The dropout layer is a regularisation technique that is used during training to prevent overfitting of the network.
The dropout technique works by randomly reducing the number of interconnecting neurons within a neural network. At every training step, each neuron has a chance of being left out, or rather, dropped out of the collated contribution from connected neurons.
The linear layer consists of 1000 hidden units, which corresponds to the 1000 classes present within the ImageNet dataset.
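A minimal sketch of the dropout and linear layers described above, assuming the 1×1×1024 output of the average pooling layer as input:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.keras.Input(shape=(1, 1, 1024))  # output of the average pooling layer
y = layers.Flatten()(x)
y = layers.Dropout(0.4)(y)              # 40% of units dropped at each training step
logits = layers.Dense(1000)(y)          # one unit per ImageNet class; softmax is applied next
```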
The final layer is the softmax layer; this layer uses the softmax function, an activation function utilised to derive the probability distribution of a set of numbers within an input vector.
The output of a softmax activation function is a vector whose values represent the probability of a class or event occurring. The values within the vector all add up to 1.
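A small numeric illustration of this behaviour:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # [0.659 0.242 0.099]
print(probs.sum())  # 1.0
```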
Before bringing the exploration of the GoogLeNet architecture to a close, there’s one more component that was implemented by the creators of the network to regularise and prevent overfitting. This additional component is known as an Auxiliary Classifier.
One main problem with extensive networks is that they suffer from vanishing gradients. Vanishing gradients occur when the weight updates arising from backpropagation are negligible within the bottom (earlier) layers as a result of relatively small gradient values. Simply put, the network stops learning during training.
Auxiliary classifiers are added to intermediate layers of the architecture, namely the third (Inception 4a) and sixth (Inception 4d) inception modules.
Auxiliary classifiers are only utilised during training and are removed during inference. The purpose of an auxiliary classifier is to perform a classification based on the inputs within the network’s midsection and to add the loss calculated during training back to the total loss of the network.
The loss of each auxiliary classifier is weighted, meaning the calculated loss is multiplied by 0.3 before it is added to the total loss.
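In other words, the total training loss is roughly assembled as follows (the loss values here are made-up numbers for illustration):

```python
main_loss = 1.20   # loss from the final softmax classifier
aux_loss_1 = 1.55  # loss from the classifier attached to Inception 4a
aux_loss_2 = 1.40  # loss from the classifier attached to Inception 4d

total_loss = main_loss + 0.3 * (aux_loss_1 + aux_loss_2)
print(total_loss)  # 2.085
```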
Below is an image that depicts the structure of an auxiliary classifier.
An auxiliary classifier consists of an average pool layer, a conv layer, a fully connected layer, a dropout layer (70%), and finally a linear layer with a softmax activation function.
Each of the included auxiliary classifiers receives as input the activations from the inception module that precedes it.
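A rough Keras sketch of one auxiliary classifier, assuming the layer settings reported in the original paper (5×5/3 average pool, 128-filter 1×1 conv, 1024-unit fully connected layer, 70% dropout, 1000-unit softmax output) and the 14×14×512 activations of Inception 4a as input:

```python
import tensorflow as tf
from tensorflow.keras import layers

def auxiliary_classifier(x, num_classes=1000):
    y = layers.AveragePooling2D(pool_size=5, strides=3)(x)           # 14x14 -> 4x4
    y = layers.Conv2D(128, 1, padding="same", activation="relu")(y)  # 1x1 conv, 128 filters
    y = layers.Flatten()(y)
    y = layers.Dense(1024, activation="relu")(y)                     # fully connected layer
    y = layers.Dropout(0.7)(y)                                       # 70% dropout
    return layers.Dense(num_classes, activation="softmax")(y)        # linear layer + softmax

x = tf.keras.Input(shape=(14, 14, 512))  # activations from Inception 4a
aux_out = auxiliary_classifier(x)
```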
The image below illustrates the complete GoogLeNet architecture, with all its bells and whistles.