Part 5: Optimising our CNN

In the previous section, we trained our network on a training set and evaluated it on a testing set: our accuracy on the training set (0.972) was higher than on the testing set (0.922). In an ideal design the two accuracies should be the same. Today, we will both close that gap and improve our overall accuracy.

  1. Introduction
  2. Getting Started
  3. Transforming Kaggle Data and Convolutional Neural Networks (CNNs)
  4. Training our Neural Network
  5. Optimising our CNN (Current)
  6. Converting and Freezing our CNN
  7. Quantising our CNN
  8. Compiling our CNN
  9. Running our code on the DPU
  10. Conclusion Part 1: Improving Convolutional Neural Networks: The weaknesses of the MNIST based datasets and tips for improving poor datasets
  11. Conclusion Part 2: Sign Language Recognition: Hand Object detection using R-CNN and YOLO

The Sign Language MNIST Github

If you would prefer not to change any code manually for this tutorial, you can switch to the branch that contains all the relevant changes:

git checkout optimal_nn

What is overfitting and underfitting?

Imagine we give an AI a word it has never seen before and ask it to predict what the word means. To solve this task, we provide a set of words and definitions for it to train with. The AI needs to find the patterns in those words so that it can make a good prediction about a word it has never seen before. For instance, it may learn to look for common Greek or Latin roots, so if a word has the suffix "ology" the AI can predict that the new word is some sort of field of study. Clearly, if we do not provide many examples of words ending in "ology", it will never learn that association. In fact, it can be problematic if we only provide a single example of such a word, such as biology: the AI may then interpret any word with the suffix "ology" as the study of life and living organisms. This is an example of overfitting. The AI has not learned to spot general features; instead it has become a dictionary, able to predict only words it has seen before.

The opposite of overfitting is underfitting. This is when the model has all the data it needs but still fails to reach the accuracy it is capable of. It can be caused by our AI being too simple to learn any associations, or by not being given enough training time to learn the patterns. The solutions to underfitting are quite simple: increase the complexity of the neural network or increase the training time.
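In practice, the easiest way to spot these two failure modes is to compare training accuracy against validation (or testing) accuracy. The sketch below is only a rough diagnostic with rule-of-thumb thresholds, and it assumes a compiled Keras model named model plus the training and validation arrays prepared in the earlier parts of this series:

history = model.fit(train_images, train_labels,
                    validation_data=(val_images, val_labels),
                    epochs=10)

# Keras stores per-epoch metrics in history.history
train_acc = history.history['acc'][-1]
val_acc = history.history['val_acc'][-1]

if train_acc < 0.9 and val_acc < 0.9:
    print("Both accuracies are low: the model is likely underfitting")
elif train_acc - val_acc > 0.05:
    print("Training is far ahead of validation: the model is likely overfitting")
else:
    print("Training and validation accuracies are close: the fit looks reasonable")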

The best way to fix overfitting is to provide enough data that the AI can start spotting those general features. If we can provide enough examples of words ending in "ology", our AI will eventually pick up on the suffix and associate it with fields of study. Unfortunately, it is often impractical to gather more data, meaning that we need to work within the confines of what we already have. There are two methods for dealing with overfitting:

  • Simplifying the model
  • Constraining the model through regularisation

Optimising our model

To optimise our model we are first going to increase its complexity to see if we can raise our testing accuracy. We need to be careful not to increase the complexity to the point of overfitting: one reason a neural network may turn dictionary-like is that the model has too much "capacity". This is when our AI has so many learnable parameters that, instead of learning general features, it learns the very specific patterns within the training data, which do not hold in the general case. Fortunately, in neural networks it is quite easy to tune the model capacity by changing the number of layers or the depth of each layer. We are now going to explore some of these options. Our model contains regularisation layers that we will deactivate until we actually need to deal with overfitting.

In nn_model.py comment out the batch normalization layers (layers.BatchNormalization()(net)) and dropout layers (layers.Dropout(0.4)(net)):

inputs = layers.Input(shape=(28, 28, 1))
net = layers.Conv2D(28, kernel_size=(3, 3), padding='same')(inputs)
net = layers.Activation('relu')(net)
#net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)

net = layers.Conv2D(256, kernel_size=(3, 3), padding='same')(net)
net = layers.Activation('relu')(net)
#net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)
#net = layers.Dropout(0.4)(net)

net = layers.Conv2D(128, kernel_size=(3, 3), padding='same')(net)
net = layers.Activation('relu')(net)
#net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)
#net = layers.Dropout(0.4)(net)

net = layers.Conv2D(64, kernel_size=(3, 3), padding='same')(net)
net = layers.Activation('relu')(net)
#net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)
#net = layers.Dropout(0.4)(net)

net = layers.Flatten(input_shape=(28, 28,1))(net)
net = layers.Dense(1024)(net)
net = layers.Activation('relu')(net)

net = layers.Dense(512)(net)
net = layers.Activation('relu')(net)

#net = layers.Dropout(0.4)(net)
net = layers.Dense(25)(net)

prediction = layers.Activation('softmax')(net)

Since we have removed batch normalization, which affects our learning rate, we also need to increase the number of epochs to 10, set the learning rate to 0.0002 and set the decay rate to 3e-6.
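As a rough sketch (the variable names below are placeholders; match them to your own training script from the earlier parts of this series), those hyperparameters might be set like this:

from keras import optimizers

# Hyperparameters used now that batch normalization is commented out
EPOCHS = 10
LEARNING_RATE = 0.0002
DECAY_RATE = 3e-6

# model, train_images, train_labels, val_images and val_labels are assumed
# to be defined as in the previous part of this series
optimizer = optimizers.Adam(lr=LEARNING_RATE, decay=DECAY_RATE)
model.compile(optimizer=optimizer,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_images, train_labels,
          validation_data=(val_images, val_labels),
          epochs=EPOCHS)

Running our model should give us results like this: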

Epoch 12/12
23455/23455 [==============================] - 11s 452us/step - loss: 4.4192e-04 - acc: 1.0000 - val_loss: 3.5192e-04 - val_acc: 1.0000
7172/7172 [==============================] - 0s 54us/step
Loss: 0.423
Accuracy: 0.913

Our training accuracy is 1.0 but our testing accuracy is 0.913, a difference of 0.087. Our goal is to both increase the testing accuracy and make this difference smaller. We can first try to raise the testing accuracy by increasing the complexity, and then narrow the gap between testing and training by incrementally simplifying the network.

Let us create a far deeper CNN:

inputs = layers.Input(shape=(28, 28, 1))
net = layers.Conv2D(28, kernel_size=(3, 3), padding='same')(inputs)
net = layers.Activation('relu')(net)
#net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)

net = layers.Conv2D(256, kernel_size=(3, 3), padding='same')(net)
net = layers.Activation('relu')(net)
#net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)
#net = layers.Dropout(0.4)(net)

net = layers.Conv2D(128, kernel_size=(3, 3), padding='same')(net)
net = layers.Activation('relu')(net)
#net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)
#net = layers.Dropout(0.4)(net)

net = layers.Conv2D(64, kernel_size=(3, 3), padding='same')(net)
net = layers.Activation('relu')(net)
#net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)
#net = layers.Dropout(0.4)(net)

net = layers.Flatten(input_shape=(28, 28,1))(net)
net = layers.Dense(1024)(net)
net = layers.Activation('relu')(net)

net = layers.Dense(512)(net)
net = layers.Activation('relu')(net)
#net = layers.Dropout(0.4)(net)

net = layers.Dense(25)(net)
prediction = layers.Activation('softmax')(net)

Running this network gives the following results:

Epoch 10/10
23455/23455 [==============================] - 16s 691us/step - loss: 0.0225 - acc: 0.9936 - val_loss: 5.1095e-04 - val_acc: 1.0000
7172/7172 [==============================] - 1s 154us/step
Loss: 0.240
Accuracy: 0.945

Our more complex network has increased our testing accuracy by 0.032, but there is still a significant difference between testing and training. Let us try simplifying our network to see if we can increase the accuracy further by removing one convolutional layer and reducing the depths of our other layers:

inputs = layers.Input(shape=(28, 28, 1))
net = layers.Conv2D(28, kernel_size=(3, 3), padding='same')(inputs)
net = layers.Activation('relu')(net)
#net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)

net = layers.Conv2D(64, kernel_size=(3, 3), padding='same')(net)
net = layers.Activation('relu')(net)
#net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)
#net = layers.Dropout(0.4)(net)

net = layers.Conv2D(64, kernel_size=(3, 3), padding='same')(net)
net = layers.Activation('relu')(net)
#net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)
#net = layers.Dropout(0.4)(net)

net = layers.Flatten(input_shape=(28, 28,1))(net)
net = layers.Dense(1024)(net)
net = layers.Activation('relu')(net)

net = layers.Dense(512)(net)
net = layers.Activation('relu')(net)
#net = layers.Dropout(0.4)(net)

net = layers.Dense(25)(net)
prediction = layers.Activation('softmax')(net)

Our results:

Epoch 10/10
23455/23455 [==============================] - 9s 391us/step - loss: 2.6152e-04 - acc: 1.0000 - val_loss: 2.4047e-04 - val_acc: 1.0000
7172/7172 [==============================] - 0s 66us/step
Loss: 0.207
Accuracy: 0.952

See if you can experiment more with these layers to push the testing accuracy up even further.

Constraining the model through Regularisation

Simplification is not our only solution to overfitting. We can also apply regularisation, which restrains the model to reduce the risk of overfitting. Here we will look at batch normalization layers and dropout.

Batch Normalization

Batch normalization was not originally intended to prevent overfitting. Its first purpose was to prevent the saturation of activation functions, which allows us to use larger learning rates and speeds up training. It also turns out to be an effective regularisation technique, which makes it a very popular choice of layer. We can place this layer before or after an activation layer. Batch normalization works by zero-centering and normalising each input, after which it scales and shifts it. Zero-centering and normalising requires an estimate of the mean and standard deviation, which it obtains by computing them across an entire batch, hence the name batch normalization.
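As a toy illustration (plain NumPy, not part of the tutorial code), this is the calculation a batch normalization layer performs for a single feature across one batch, where gamma and beta are the learned scale and shift:

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])   # one feature's activations across a batch
gamma, beta, eps = 1.0, 0.0, 1e-5    # learned scale and shift (initial values)

mean = x.mean()                          # batch mean
var = x.var()                            # batch variance
x_hat = (x - mean) / np.sqrt(var + eps)  # zero-centre and normalise
y = gamma * x_hat + beta                 # scale and shift
print(y)   # approximately [-1.34 -0.45  0.45  1.34]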

Dropout

Going back to our AI analogy: if we want to stop our AI from becoming a dictionary, every time it learns a word we could rip out some of the pathways that define that word. By ripping out pathways at random, we stop the AI from relying on those specific definitions and push it towards relying on generalisations instead. We can do this through dropout layers. At each training step, there is a probability (in our case 40%) that a neuron will drop out and not take part in that step. This stops the CNN from becoming too reliant on specific pathways, so it must look for generalised features. During the inference phase we need to compensate for the fact that every neuron is now live, so on average each neuron receives more inputs than it did during training. We can do this by scaling the weights down in proportion to the keep probability (here 0.6).
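A small NumPy sketch of the idea is shown below (note that Keras' Dropout layer actually uses "inverted dropout", scaling the surviving activations up during training, so no manual adjustment is needed at inference):

import numpy as np

rng = np.random.default_rng(0)
rate = 0.4                        # the dropout rate used in our model
activations = np.ones(10)         # toy activations from a layer

# Training step: each neuron is dropped with probability 0.4
mask = rng.random(activations.shape) >= rate
train_output = activations * mask

# Inference: all neurons are live, so scale by the keep probability (0.6)
# to keep the expected input to the next layer the same as during training
inference_output = activations * (1 - rate)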

Putting it all together

We can introduce regularisation to our network simply by uncommenting the relevant layers in our code.
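For reference, the final model (the three-convolution network from above with the batch normalization and dropout layers uncommented) looks like this:

inputs = layers.Input(shape=(28, 28, 1))
net = layers.Conv2D(28, kernel_size=(3, 3), padding='same')(inputs)
net = layers.Activation('relu')(net)
net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)

net = layers.Conv2D(64, kernel_size=(3, 3), padding='same')(net)
net = layers.Activation('relu')(net)
net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)
net = layers.Dropout(0.4)(net)

net = layers.Conv2D(64, kernel_size=(3, 3), padding='same')(net)
net = layers.Activation('relu')(net)
net = layers.BatchNormalization()(net)
net = layers.MaxPooling2D(pool_size=(2,2))(net)
net = layers.Dropout(0.4)(net)

net = layers.Flatten(input_shape=(28, 28,1))(net)
net = layers.Dense(1024)(net)
net = layers.Activation('relu')(net)

net = layers.Dense(512)(net)
net = layers.Activation('relu')(net)
net = layers.Dropout(0.4)(net)

net = layers.Dense(25)(net)
prediction = layers.Activation('softmax')(net)

Running this model gives us: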

Epoch 10/10
23455/23455 [==============================] - 13s 568us/step - loss: 0.0212 - acc: 0.9929 - val_loss: 0.0031 - val_acc: 0.9998
7172/7172 [==============================] - 1s 114us/step
Loss: 0.099
Accuracy: 0.972

We have managed to boost our testing accuracy from 0.913 to 0.972. We first increased the complexity of our network to stop any underfitting. We then reduced the overfitting by removing a convolutional layer and adjusting the depths of the other layers, and finally by introducing regularisation techniques. In the next tutorial we will start looking at converting our model for use on FPGAs by freezing it.
