Transfer Learning Ride

Santiago Velez Garcia
13 min read · Jun 28, 2020
Not the typical piggyback…but it gets the point across.

Abstract

This paper explores Transfer Learning (TL), a powerful technique in the Deep Learning realm. In Transfer Learning, network architectures and trained parameters obtained with a particular data set are borrowed and reused for a new task. We examine different TL approaches for image recognition on the CIFAR-10 data set using network architectures pre-trained on the ImageNet data set. Do not be misled by the academic tone; this is not an academic paper. There are TensorFlow Keras code snippets at the end.

Introduction

We, humans, can apply what we have learned while performing a particular task to another, loosely related one. From the day we are born (some argue even before that), we keep transferring knowledge acquired in one situation to the new cases we face in our lives. We often surprise ourselves with the speed with which an individual “gets” a subject. What we are not aware of is that she is probably transferring skills learned before in a different context.

As an example, we use the knowledge of our mother tongue to learn a new language, or the mastery of our first foreign language to learn a second one. The skills needed to ride a bicycle are of great help when riding a motorbike. The closer the tasks are, the more transferable the knowledge. Your knowledge of Spanish will help you with Portuguese, but it will be of considerably less advantage when trying to learn German. We rarely learn something from scratch. There is always some prior knowledge.

On the other hand, in academic Machine Learning, scientists and researchers consider problems in isolation. That is, they try a newly devised architecture and train the model starting from random values. For that, they make use of big curated data sets and massive computation budgets. It may take them one or several weeks of training on a multi-GPU infrastructure, but they will get there: a highly accurate model.

Let us now move to reality, or better said, to business production environments. Here, our two preconditions, namely large data sets and huge computation budgets, are the exception rather than the rule. Luckily, we can use Transfer Learning and leverage that hard-earned knowledge, in the form of pre-trained models, put at our disposal by the scientific community.

This paper is an exploration of some Transfer Learning techniques based on a hypothetical challenge, namely, to classify the CIFAR-10 data set. We will be talking about our starting point, our goal, how we got there, the obstacles we found along the way, and the workarounds and tweaks we found. Learning to ride does not come without a couple of scratches, but it is sure worth it. Let us ride!

Materials and Methods

Data set

We will be working with the CIFAR-10 data set. It consists of 60,000 32x32 color images in 10 classes or categories, with 6,000 images per class. There are 50,000 training images and 10,000 test images. The ten (10) classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.

To begin with, we will use Keras datasets to load the CIFAR-10 small images classification data set. Take a look at the attached code in the Appendices. It shows the loading process, what form the data has, and how it changes after the preprocessing. Notice the training set input shape (50000, 32, 32, 3), which means we have 50,000 32x32 images with three channels corresponding to RGB. Our test data has 10,000 images of the same size. We also have labels for both the training and test sets. They come as an array of integers; to see what they mean, we have to map them to the labels list. I always like to literally “see” what I am dealing with, so with the help of Matplotlib, let us visualize some of the images. I have no problem seeing the two horses and the truck, the bird is somehow more difficult, but I need to rub my eyes to make out the deer.

Sample of the CIFAR-10 data set

Usually, we would be ready to feed our data set to the model and run the whole thing. But in Transfer Learning, we want to mimic the training conditions of the base model as closely as possible. Keras comes with a built-in function to do just that; it has one for each of the available pre-trained models. The second part of the code shows what this preprocessing does. The original data set had numbers from 0 to 255, corresponding to the RGB range. After the preprocessing, they get converted to small negative numbers. It is a usual practice to normalize the values, that is, to divide by 255, but notice that this preprocessing does more than that: the documentation says it “normalizes” the data to the ImageNet data set. If we visualize this new array via Matplotlib, we will get an image with a lot of black in it. That is because Matplotlib takes either 0 to 1 for float input or 0 to 255 for integer. We have a lot of entries outside that range, and Matplotlib converts them to zero.

We cannot really see what the preprocessing does, but it makes a big difference for the training. Test it yourself.
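As a minimal sketch of that step (assuming the MobileNetV2 preprocessing we use later in this post), Keras exposes one preprocess_input function per pre-trained model:

```python
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.min(), x_train.max())  # 0 255, the raw RGB range

# MobileNetV2's variant scales pixels to roughly the [-1, 1] range,
# which is why Matplotlib renders the result mostly black.
x_batch = preprocess_input(x_train[:5].astype('float32'))
print(x_batch.min(), x_batch.max())  # about -1.0 1.0
```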

The Base Model

To solve the present problem, Keras presents us with Keras Applications, which are deep learning models made available alongside pre-trained weights. We can use these models for prediction, feature extraction, and fine-tuning. They vary in their depth (the topological depth of the network), number of trainable parameters, size, and accuracy. The models' weights result from training on a subset of ImageNet containing 1,000 categories and 1.2 million images. Take notice of how our task requires considerably less effort than that of the ImageNet challenge. We go from one thousand (1,000) distinct categories to only ten (10). But the data set available for training is also orders of magnitude smaller.

I won’t discuss or go into any depth about the virtues of every model architecture. It is not the purpose of this blog; if you are interested, take a glimpse at my post Krizhevsky, Sutskever, Hinton and ImageNet about AlexNet, read the corresponding paper, or another review like DenseNet — Dense Convolutional Network (Image Classification). That being said, and taking all of them as equally appropriate choices for our current challenge, let us move to the mechanics of their use.

The first decision is whether or not to include the last fully-connected layer at the top of the network (the include_top argument). This layer is responsible for performing the classification. In the original data set (ImageNet) there were 1,000 categories; we want only ten (10). Therefore, we do not include it, and we train our own instead. If our goal were to classify a new data set among the very same 1,000 categories, we would leave it, perform no training at all, and take the whole model as-is to make predictions. We could leave it as well if we wanted to further train the model on an expanded data set, always with 1,000 categories.

Now let us move to the weights. Here we are deciding what values our parameters are going to have. We will use 'imagenet' as the argument. It could also be None for random initialization, or the path to other parameters we previously found by training on a different data set.
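In code, those two decisions are just constructor arguments. A minimal sketch (the 96 x 96 input shape anticipates the resizing discussed below):

```python
from tensorflow.keras.applications import MobileNetV2

# include_top=False drops ImageNet's 1,000-way classifier;
# weights='imagenet' loads the pre-trained parameters.
base_model = MobileNetV2(input_shape=(96, 96, 3),
                         include_top=False,
                         weights='imagenet')
```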

MobileNetV2 bottom layers.

Next is to decide whether we will freeze all the layers of the base model, or train some or all of them. If we freeze them all, the base model acts as a feature extractor. Think of it as a function that maps our input tensor representing the images into feature maps. These feature maps are also images, on which we will perform the classification using the layers we add next. To get a better idea, let us visualize them using the attached code. We will choose MobileNetV2 as our model. Notice that we increased the input size from 32x32 to 96x96. The reason is that this is the minimum input size that MobileNetV2 takes. We did it only to one image, not to the whole data set, because this operation is very RAM intensive. If we were to do it to the complete data set, we would use a Lambda layer to do it in batches. To illustrate that, let us talk about upsampling. Upsampling is the inverse of pooling and also has the effect of increasing the output size. Keras has a built-in layer to do it as well.
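Here is a sketch of both ideas: the base model frozen into a pure feature extractor, and a Lambda layer that resizes the images batch by batch inside the network instead of enlarging the whole array in RAM up front:

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import MobileNetV2

base_model = MobileNetV2(input_shape=(96, 96, 3),
                         include_top=False, weights='imagenet')
base_model.trainable = False  # frozen: acts as a feature extractor

inputs = tf.keras.Input(shape=(32, 32, 3))
# Resize on the fly, one batch at a time.
x = layers.Lambda(lambda img: tf.image.resize(img, (96, 96)))(inputs)
features = base_model(x)  # -> (None, 3, 3, 1280) feature maps
```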

Upsampling illustration

Classification layers

In essence, we use the base model (MobileNet, DenseNet, etc.) as a feature mapper and train the final layers of our composed model to discriminate among those features. These are the classification layers, which are of the densely connected type. The model is complete once we add the final layers: a flattening layer followed by a dense layer with a softmax activation. We can add more dense layers with relu activation to improve the classification.
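Assembled end to end, a sketch of the composed model (the optional relu layer mentioned above is left commented out, matching the base trial):

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import MobileNetV2

base_model = MobileNetV2(input_shape=(96, 96, 3),
                         include_top=False, weights='imagenet')
base_model.trainable = False

inputs = tf.keras.Input(shape=(32, 32, 3))
x = layers.Lambda(lambda img: tf.image.resize(img, (96, 96)))(inputs)
x = base_model(x)        # -> (None, 3, 3, 1280)
x = layers.Flatten()(x)  # -> (None, 11520)
# x = layers.Dense(256, activation='relu')(x)  # optional extra layer
outputs = layers.Dense(10, activation='softmax')(x)  # ten categories
model = tf.keras.Model(inputs, outputs)
model.summary()  # 2,373,194 parameters, 115,210 of them trainable
```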

Original input shape, resized shape, feature map shape. Visualization of the first 16 feature maps.

Our model is complete. The first part is the base model (we can decide whether to train it or not); the important thing is that it carries the parameters obtained by training on the ImageNet data set. The second part, or top, is responsible for the classification; there we place at least one classification layer. We have our preprocessed input, and we can now compile the complete model. Before that, let us take a look at the model summary. We start with a 32 x 32 input with three (3) RGB channels and resize it to 96 x 96 x 3 (see the figure). Then we let it go through our frozen base model based upon MobileNetV2, which outputs 1,280 feature maps of size 3 x 3. We then flatten them to a one-dimensional array of size 11,520 and finally run it through our softmax layer to get our ten (10) classification categories. Notice that of the 2,373,194 parameters the model has, we are only training 115,210, corresponding to the softmax classification layer (11,520 weights plus a bias for each of the ten classes). That is under 5% of the total.

MobileNetV2 model top layers.

Optimization and Training

We will use an Adam optimizer with cross-entropy as the loss. We will use a learning rate scheduler, the TensorBoard callback with its corresponding logs directory to visualize the process later, early stopping with a patience argument of 5 epochs, and a model checkpoint to save our best model. You can find the code in the appendices. Learning rate schedules showed a significant impact on the validation accuracy.
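A sketch of that setup, applied to the model assembled above (the decay schedule, log directory, and checkpoint path are illustrative choices, not the exact originals):

```python
import tensorflow as tf

def schedule(epoch, lr):
    # Illustrative schedule: hold the rate, then decay it gently.
    return lr if epoch < 5 else lr * 0.9

callbacks = [
    tf.keras.callbacks.LearningRateScheduler(schedule),
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
    tf.keras.callbacks.EarlyStopping(patience=5),
    tf.keras.callbacks.ModelCheckpoint('best_model.h5',
                                       save_best_only=True),
]

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```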

Take a look at my posts Peek into ML Optimization Techniques and ML Regularization Techniques for further material on the subject.

Data Augmentation

Quality labeled data is usually scarce. In the present challenge, we have 50,000 training samples, which is already considerable. Data augmentation is a strategy where, by introducing small variations to the original data in the form of rotations, shearing, translations, zooming, etc., we artificially increase the size of our data set. It is not as good as new data, but it helps to prevent overfitting. The code to produce the images shown below can be found in the appendices.

Data augmentation example

Infrastructure

We will use Google Colab, which is a free Jupyter notebook environment that runs in the cloud and stores its notebooks on Google Drive. It comes with the option to run on a GPU, which significantly reduces the processing time.

Results and Discussion

Base model with MobileNetV2

MobileNetV2 results. [Test accuracy: pink, Validation accuracy: green]

Already in the first epoch, we get a validation accuracy above 0.77, and with only two (2) epochs of training, we climb above an accuracy of 0.8. The accuracy peaks in epoch 8, but there is no sensible improvement from the second epoch onwards. The fact that the test accuracy approaches values close to 1 very early on suggests that no more actual learning is taking place, and the slight worsening after the eighth epoch suggests overfitting. One strategy to counter this is data augmentation.

MobileNetV2 with data augmentation

MobileNetV2 with data augmentation results. [Test accuracy: orange, Validation accuracy: blue]

Both testing and validation accuracy metrics start at lower values (0.68 and 0.75). We now train on a four (4) fold augmented data set (20,000 samples). The test accuracy does not approach one as quickly as before, and together with the validation accuracy it keeps improving into the 14th epoch, where it plateaus. The final result for the validation accuracy is virtually the same as for the trial without data augmentation.

MobileNetV2 with data augmentation and additional dense layer

MobileNetV2 with additional dense layer and data augmentation results. [Test accuracy: orange, Validation accuracy: blue]

By introducing another dense layer, this time with relu activation, we now have 2,951,946 trainable parameters, compared with the 115,210 we had before. That is roughly 25 times more. In our previous runs, the validation accuracy crossed the test accuracy right after the first epoch; in this trial, it happens only in the eighth epoch. We enhanced the classification capabilities of our model, which shows in a new validation accuracy of 0.82850.

Frozen MobileNetV2 architecture with additional dense layer

Trainable MobileNetV2 with data augmentation and additional dense layer

Trainable MobileNetV2 architecture with one additional dense layer results. [Test accuracy: blue, Validation accuracy: red]

By unfreezing the base model layers, we will now train the complete model. The number of trainable parameters is 5,175,818, which is almost double what we had in the previous trial. Even though we are now also training the base model, it remains transfer learning because we used the pre-trained weights as the initialization. After five (5) epochs, we already achieve a validation accuracy above 0.9.

This model takes considerably longer to train. The time per epoch is around 30 minutes, while in the previous trial it was under 1 minute.

Trainable MobileNetV2 architecture with one additional dense layer.

DenseNet121 with additional dense layer and subsequent unfreezing of the top layer

DenseNet121 with additional dense layer. [Test accuracy: orange, Validation accuracy: blue]

The input is 32 x 32, on which we perform no resizing or upsampling, nor any data augmentation. It is no surprise that the training is a lot faster (less than a minute per epoch). The model with all layers frozen struggles to reach a final validation accuracy of 0.70210. It is safe to say the information is too compressed (32 x 32) and not enough for the model. Nevertheless, we will stay with it to see the effect of unfreezing the base model's last layers.

DenseNet121 with additional dense layer and the last convolutional layer unfrozen. [Test accuracy: orange, Validation accuracy: blue]

We go from 132,490 trainable parameters to 171,402; most of the difference, 36,864 parameters, corresponds to the conv5_block16_2_conv layer we unfroze. We improved to a 0.71490 validation accuracy. It is not a big improvement, but it shows we can optimize the feature extraction borrowed from the DenseNet pre-trained model. We are training only one of the 121 layers the model has.
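A sketch of this selective unfreezing, assuming the frozen DenseNet121 setup from the appendix; only the named layer is left trainable inside the base model:

```python
# Keep the base model's trainable flag on, then freeze every layer
# except the last convolutional one.
base_model.trainable = True
for layer in base_model.layers:
    layer.trainable = (layer.name == 'conv5_block16_2_conv')

# Recompile so the change in trainable weights takes effect.
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```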

DenseNet121 architecture with one convolutional layer unfrozen

Trainable DenseNet121 with upsampling, data augmentation, and additional dense layers

DenseNet121 with additional dense layers results. [Test accuracy: orange, Validation accuracy: blue]

DenseNet is a deeper model and consequently has more parameters. Besides, we are increasing the classification capabilities by adding one extra dense layer. We achieve a validation accuracy of 0.9226 for the problem data set.
Training times per epoch were in the range of two minutes using Google Colab. Contrary to the MobileNetV2 trials, we are using upsampling instead of resizing, and instead of 96 x 96, the input size is now 64 x 64, which is about half the size.

DenseNet121 with additional dense layers architecture.

… and there is more

It is now clear that we can keep experimenting, and I encourage you to do so. You can play with the input size; we tried 32, 64, and 96, but nothing stops you from going to 128 or 256 (it may take more time to process). We tried two architectures, MobileNet and DenseNet; there are many others (Keras Applications has 26 different ones), some shallower like VGG16 (with 23 layers) and some deeper like InceptionResNetV2 (with 512). We tried freezing the complete base model, unfreezing just the very last layer, and unfreezing it completely, but you can try everything in between, such as a quarter or half of the net, at the bottom or at the top. You can add more densely connected layers as well as convolutional ones. CIFAR-10 is only one data set; you can try your own with fewer or more categories. Keep experimenting!

Closing remarks

“nanos gigantum humeris insidentes”

We are standing on the shoulders of giants. The Machine Learning community is not only incredibly diverse and fast-paced but also open and generous with knowledge. That is a great treat for anyone venturing into this world, as it allows us to build on previous hard-earned knowledge. Transfer Learning is one of the many ways to do just that.

I hope the ride was pleasant and you are left with more joy than scratches.

Acknowledgments

Machine Learning relies upon deep mathematical concepts. Nevertheless, the approach to hyperparameter selection often comes down to trial and error. In this approach, successful trials from peers are invaluable input for the process. Therefore, I want to thank my peers at Holberton, Paulo Morillo, Edwar Ortiz, Emma Gachancipa, Carlos Molano, John Cook, Jorge Zafra, Juan Alberto Londoño, Pierre Beajuge, Hanh Nguyen, Juan Diego Arango, José Alvarez de Lugo and Christian Williams, for the valuable discussions we held.

Further Reading

You definitely want to take a look at Dipanjan (DJ) Sarkar's post, A Comprehensive Hands-on Guide to Transfer Learning with Real-World Applications in Deep Learning.

Andrew Ng’s deeplearning.ai specialization videos are another great source.

Finally, for a deep dive into the subject, read A Survey on Deep Transfer Learning by Tan et al. of Tsinghua University in Beijing.

Appendices

Dataset uploading and preprocessing
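The original gist is not embedded here, so the following is a minimal sketch of the loading, visualization, and preprocessing steps described in the text:

```python
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input

labels = ['airplane', 'automobile', 'bird', 'cat', 'deer',
          'dog', 'frog', 'horse', 'ship', 'truck']

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape)  # (50000, 32, 32, 3)

# "See" a few of the images with their mapped labels.
for i in range(4):
    plt.subplot(1, 4, i + 1)
    plt.imshow(x_train[i])
    plt.title(labels[int(y_train[i])])
    plt.axis('off')
plt.show()

# Model-specific preprocessing and one-hot encoded labels.
x_train_p = preprocess_input(x_train.astype('float32'))
x_test_p = preprocess_input(x_test.astype('float32'))
y_train_oh = tf.keras.utils.to_categorical(y_train, 10)
y_test_oh = tf.keras.utils.to_categorical(y_test, 10)
```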

Data augmentation
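A sketch using Keras' ImageDataGenerator; the specific ranges are illustrative, not the exact values used for the figures above:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Small random variations: rotations, shifts, shearing, zooming, flips.
datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             shear_range=0.1,
                             zoom_range=0.1,
                             horizontal_flip=True)

# Feed augmented batches to training, e.g.:
# model.fit(datagen.flow(x_train_p, y_train_oh, batch_size=128), ...)
```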

Upsampling
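A minimal sketch of the built-in upsampling layer mentioned in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(32, 32, 3))
# UpSampling2D repeats rows and columns: (32, 32, 3) -> (64, 64, 3).
x = layers.UpSampling2D(size=(2, 2))(inputs)
print(x.shape)  # (None, 64, 64, 3)
```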

Trainable MobileNetV2 (resizing and data augmentation)
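A sketch of the fully trainable MobileNetV2 pipeline, reusing the datagen and callbacks objects from the snippets above; the 256-unit dense layer is an illustrative width, not the exact original:

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import MobileNetV2

base_model = MobileNetV2(input_shape=(96, 96, 3),
                         include_top=False, weights='imagenet')
base_model.trainable = True  # fine-tune the whole base model

inputs = tf.keras.Input(shape=(32, 32, 3))
x = layers.Lambda(lambda img: tf.image.resize(img, (96, 96)))(inputs)
x = base_model(x)
x = layers.Flatten()(x)
x = layers.Dense(256, activation='relu')(x)  # extra dense layer
outputs = layers.Dense(10, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(datagen.flow(x_train_p, y_train_oh, batch_size=128),
          validation_data=(x_test_p, y_test_oh),
          epochs=20, callbacks=callbacks)
```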

DenseNet121 (frozen layers)
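A sketch of the frozen DenseNet121 variant; assuming a 128-unit hidden layer, which reproduces the 132,490 trainable parameters quoted above (1,024 x 128 weights plus biases, plus the 10-way softmax):

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import DenseNet121

base_model = DenseNet121(input_shape=(32, 32, 3),
                         include_top=False, weights='imagenet')
base_model.trainable = False  # feature extractor only

inputs = tf.keras.Input(shape=(32, 32, 3))
x = base_model(inputs)   # -> (None, 1, 1, 1024) for a 32 x 32 input
x = layers.Flatten()(x)
x = layers.Dense(128, activation='relu')(x)
outputs = layers.Dense(10, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)
model.summary()  # 132,490 trainable parameters
```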

Fully Trainable DenseNet121 (upsampling and data augmentation)
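A sketch of the fully trainable DenseNet121 with upsampling, reusing the datagen and callbacks sketched earlier; the two dense layer widths are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.applications import DenseNet121

base_model = DenseNet121(input_shape=(64, 64, 3),
                         include_top=False, weights='imagenet')
base_model.trainable = True  # train everything

inputs = tf.keras.Input(shape=(32, 32, 3))
x = layers.UpSampling2D(size=(2, 2))(inputs)  # 32 x 32 -> 64 x 64
x = base_model(x)
x = layers.Flatten()(x)
x = layers.Dense(128, activation='relu')(x)   # illustrative widths
x = layers.Dense(64, activation='relu')(x)
outputs = layers.Dense(10, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(datagen.flow(x_train_p, y_train_oh, batch_size=128),
          validation_data=(x_test_p, y_test_oh),
          epochs=20, callbacks=callbacks)
```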
