
In our previous post, we talked about Optimization Techniques. The mantra there was speed, in the sense of "take me down, loss function, but do it fast". In the present post, we will talk about Regularization Techniques, namely L1 and L2 regularization, Dropout, Data Augmentation, and Early Stopping. Here our enemy is overfitting, and our cure against it is called regularization.


What is this malicious ailment called overfitting? Well, overfitting is the result of learning too well. It happens when the model picks up not only general attributes or features but also the noise in the training data. While "almost perfectly" recognizing and classifying the training data, the model will fail to perform well on the test data. If a neural network were a student, then overfitting would be the case where she memorized the answers but forgot the underlying concepts. She can answer the questions she saw before (the ones learned by heart), but once those questions change, she is in trouble and will perform very poorly, failing the exam.

In a way, we will trick our dear student, the neural network, into not overstudying. Remember that in this learning process, the student starts with almost no knowledge, that is, with weight matrices W of small random numbers (by small, we mean close to zero). As the training or learning process takes place, the weights change and their values start to grow (they drift away from zero). Too much of a good thing is generally bad, and the same applies here. We need to figure out a way to prevent W from getting too heavy. Let's take a look at what the Machine Learning physicians came up with.

L1 and L2 Regularization

Sorry to disappoint the football fans, but L1 has nothing to do with the English Football League One, the second-highest division of the English Football League. L1 and L2 are instead ways to measure how heavy our W matrices got. Think of them as indexes, like the BMI (body mass index). For L1 we take the sum of the absolute values, and for L2 the sum of the squares of the values.

How does it work? Simply put: while training, we keep an eye on the norm and penalize the loss function as the norm grows. By doing this we are effectively imposing a penalty on complexity in the form of an increase in the loss function. The penalty will lower the weights of the nodes that caused the overfitting.

This form of regularization is the most common one. We work with a new loss function, formed by adding a regularization term to the old one. The regularization term accounts for the norm of the weight matrices, either L1 or L2, scaled by a new hyperparameter called lambda.

Numpy comes with a prebuilt norm function, you may want to take a look at numpy.linalg.norm. Keep in mind that the L2 norm is also known as the Frobenius norm, given by:

||A||_F = (\sum_{i,j} |a_{i,j}|^2)^{1/2}
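The norms and the regularized loss can be computed in a few lines of NumPy. This is a minimal sketch; the weight matrix, the base loss value, and the lambda of 0.01 are arbitrary choices for illustration:

```python
import numpy as np

# Hypothetical weight matrix for illustration
W = np.array([[1.0, -2.0],
              [3.0, -4.0]])

# L1 norm: sum of absolute values
l1 = np.abs(W).sum()

# L2 (Frobenius) norm: square root of the sum of squared entries;
# np.linalg.norm defaults to the Frobenius norm for matrices
l2 = np.linalg.norm(W)

# The regularized loss adds a lambda-scaled penalty to the original loss
lam = 0.01                         # regularization hyperparameter (lambda)
base_loss = 0.5                    # placeholder for the unregularized loss
loss_l1 = base_loss + lam * l1
loss_l2 = base_loss + lam * l2**2  # the L2 penalty is usually the squared norm
```

Note that the L2 penalty is typically added as the squared norm, which keeps the gradient simple (proportional to W itself).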

In a TensorFlow implementation, the kernel_regularizer parameter of the tf.layers.Dense builder is your best friend. You would pass tf.contrib.layers.l2_regularizer(lambda) to it to implement L2 regularization with lambda as your regularization parameter.
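The tf.contrib module was removed in TensorFlow 2.x; the equivalent there is tf.keras.regularizers.l2 passed to a layer's kernel_regularizer. A minimal sketch (the layer width and the 0.01 lambda are arbitrary illustrative values):

```python
import tensorflow as tf

# L2 regularizer with lambda = 0.01 (arbitrary illustrative value)
reg = tf.keras.regularizers.l2(0.01)

# Calling the regularizer on a tensor returns the penalty:
# lambda times the sum of squared entries (here 0.01 * 4 = 0.04)
penalty = reg(tf.ones((2, 2)))

# Attached to a layer, the penalty is added to the model's loss automatically
layer = tf.keras.layers.Dense(8, kernel_regularizer=reg)
```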

An overfit function in red, with the regularized one in blue.


Dropout

Dropout is a bit sneaky. If the nodes with their activation functions were our hypothetical student's notes, this would be the equivalent of temporarily hiding some of them as she studies. When a node is temporarily hidden (dropped out), all its incoming and outgoing connections disappear as well; these connections are the weights in the weight matrices. So the overall effect is to reduce the effective size of W. We prevent the network from over-relying on one particular node, and hence from overfitting.

With dropout, we introduce randomness by skipping some nodes, together with their incoming activations from the previous layer and the activations they would have produced. The network trains on a simplified version of itself that generalizes better instead of overfitting.

In a Python/NumPy implementation, numpy.random.binomial comes in handy to generate the random mask that selects the nodes to drop. If you are implementing Dropout in TensorFlow, stop by and take a look at tf.layers.Dropout.
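Here is a sketch of the NumPy route, using the "inverted dropout" variant so that no rescaling is needed at test time. The keep probability and the all-ones activation matrix are arbitrary illustrative values:

```python
import numpy as np

np.random.seed(0)  # for reproducibility of the random mask

def dropout_forward(a, keep_prob=0.8):
    """Inverted-dropout sketch: `a` is a layer's activation matrix."""
    # Binary mask from numpy.random.binomial: 1 = keep the unit, 0 = drop it,
    # each unit kept independently with probability keep_prob
    mask = np.random.binomial(1, keep_prob, size=a.shape)
    # Zero the dropped units, then rescale the survivors by 1/keep_prob
    # so the expected activation is unchanged
    return (a * mask) / keep_prob

a = np.ones((4, 5))                      # dummy activations for illustration
out = dropout_forward(a, keep_prob=0.8)  # entries are now 0 or 1/0.8 = 1.25
```

In modern TensorFlow the same behavior is available as the tf.keras.layers.Dropout layer, which is active only during training.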


Data Augmentation

In an ideal world, to make a master out of our student, we would give her unlimited material to study, that is, unlimited data. But we live in the real world, where unlimited data is not possible and its closest relative, abundant data, is very expensive. Data augmentation can be seen as mere trickery: we perform some changes (transformations) on the learning material so that the student thinks it is new material. Of course, this is not as good as genuinely new data, but it has the effect of reducing overfitting and hence achieving better performance.

The open-source neural-network library Keras comes with an arsenal of data augmentation utilities, so rotating, rescaling, or flipping your training data can be easily done.
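In Keras this is handled by utilities such as ImageDataGenerator (with parameters like rotation_range and rescale). The idea itself fits in a few lines of NumPy; the particular set of transformations below is a hypothetical minimal example:

```python
import numpy as np

def augment(image):
    """Return simple transformed copies of one training image (sketch)."""
    return [
        np.fliplr(image),  # horizontal flip
        np.rot90(image),   # 90-degree rotation
        image * 0.9,       # brightness rescaling
    ]

img = np.arange(9, dtype=float).reshape(3, 3)  # toy 3x3 "image"
extra_examples = augment(img)                  # 3 new samples from 1 original
```

Each transformed copy is a plausible new training example that leaves the label unchanged, which is what makes augmentation a regularizer rather than noise.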


Early Stopping

If we don't tell it to stop, or otherwise define a fixed number of iterations, our model will train forever. This student never gets tired. Training that long will probably produce a very good fit on the training data; even the noise will be learned. But it will also lead to bad performance on the test data. So early stopping is a strategy where, while monitoring both training and test performance, we tell the training to stop once the test performance no longer improves. Early stopping is a watchdog telling the student when to stop.

In this technique we introduce the concept of patience: the interval (in epochs or iterations) we are willing to wait, after the test metrics stop improving, before stopping the process.
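The patience logic fits in a few lines of plain Python. This is an illustrative sketch, with a made-up validation-loss history; the stopping rule counts consecutive epochs without a new best loss:

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch at which training stops (illustrative sketch).

    Stops once `patience` consecutive epochs pass without improving
    on the best validation loss seen so far.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0   # new best: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch       # patience exhausted: stop here
    return len(val_losses) - 1     # ran out of epochs without triggering

# Validation loss improves, then plateaus: we stop 3 epochs after the best one
history = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]
stop_epoch = early_stopping(history, patience=3)
```

In Keras the same behavior comes prebuilt as the tf.keras.callbacks.EarlyStopping callback, which takes a patience argument and can restore the best weights seen.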

Early stopping has the downside that it couples two separate problems. We normally focus first on minimizing the loss function and then focus on not overfitting. Early stopping puts the two together. Some practitioners prefer to keep the two strictly separated and favor techniques such as L2 regularization.



I hope you enjoyed the article, and I am happy to receive your comments on the matter. For regularization, the takeaway is that even in Machine Learning, too much of a good thing (training) turns out to be a bad thing (overfitting).

Further reading

The videos from Andrew Ng's Deep Learning Specialization, particularly course 2, are a good starting point to grasp the concepts. You may also want to take a look at:

L2 Regularization and Back-Propagation

Dropout: A Simple Way to Prevent Neural Networks from Overfitting
