Krizhevsky, Sutskever, Hinton and ImageNet

Santiago Velez Garcia
4 min readJun 12, 2020



In this post, we will cover and briefly summarize the article titled ImageNet Classification with Deep Convolutional Neural Networks authored by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton of the University of Toronto and published in Advances in Neural Information Processing Systems 25 (NIPS 2012). The official abbreviation for the conference changed from NIPS to NeurIPS in 2018. NeurIPS is a machine learning and computational neuroscience conference held every December since 1987.

Ilya Sutskever (left), Alex Krizhevsky (centre), Geoffrey Hinton (right) The man who revolutionized computer vision, machine translation, games and robotics

The article describes the architecture and training of a deep convolutional neural network to classify the 1.2 million high-resolution images of the ImageNet LSVRC-2010 contest into the 1000 different classes. ImageNet is an ongoing research effort to provide researchers an easily accessible image database. As an organization, it annually (2010–2017) held The ImageNet Large Scale Visual Recognition Challenge (ILSVRC). This research competition evaluates algorithms for large scale object detection and image classification.


The dataset used for the ILSVRC-2010 competition and hence the object of the article is a subset of ImageNet with close to 1000 images in each of 1000 categories. The dataset adds up to about 1.2 million training images, 50,000 validation images, and 150,000 testing images. The University of Toronto team trained the network on the (centered) raw RGB values of the images’ pixels.

The metrics used to assess the model performance were top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels considered most probable by the model.

The Deep Convolutional Neural Network contains eight learned layers. The first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. See the image below for a depiction of the architecture.

An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264– 4096–4096–1000. ImageNet Classification with Deep Convolutional Neural Networks


Besides the anecdotical value of the story about winning the competition, the article provides insight and explanations about the chosen architecture and its performance. They address four different engineering choices. Here is the list in decreasing relevance:

  • ReLU Nonlinearity
  • Training on Multiple GPUs
  • Local Response Normalization
  • Overlapping Pooling

The article shows how the use of the ReLu Nonlinearity greatly busts performance in comparison to the traditionally used sigmoid or tanh activation functions.

To address the Overfitting problematic the researchers used both Data Augmentation and at the time of the article newly introduced Dropout technique. They trained the models using stochastic gradient descent with a batch size of 128 examples, a momentum of 0.9, and a weight decay of 0.0005. The Network took between five and six days to train on two GTX 580 3GB GPUs.

The architecture choice and the use of separate GPUs led to a specialization in the learning accomplish by each unit. One GPU learned features were mainly color agnostic and the other color-specific.


The overall performance (top-1 and top-5) may not seem too impressive today after 10 years. But it is important to bear in mind that it represented a two-digit improvement in comparison to what it was considered state of the art at the time.


Berg Et al. (Sparse coding) 47.1%;

Sánchez Et al. (SIFT + FVs) 45.7%;

Krizhevsky Et al. 37.5%


Sparse coding 28.2%;

SIFT + FVs 25.7%

Krizhevsky Et al.17.0%

(Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model. The correct label is written under each image, and the probability assigned to the correct label is also shown with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column. The remaining columns show the six training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector for the test image. ImageNet Classification with Deep Convolutional Neural Networks

The researcher clearly stated:

“All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.”

The ML community did not have to wait long for these foreseeable improvements. Today’s flagship GPU from Nvidia is the GEFORCE GTX 1080 Ti 11 GB, which has almost 4 times more memory than its ancestor the GTX 580 3GB, used for the present study. Memory is only one aspect, but there are countless other advancements in both hardware and software. The size and availability of Datasets have also increased dramatically, social media and IoT devices have become a continuous source of data.

Personal Notes

Reviewing this kind of historical articles always give a perspective on the path that bought us to where we are now. It never stops to amaze me, how little time, only a decade, has elapsed since the introduction of concepts and practices such as ReLu, parallel processing, and Dropout that are considered standard today.

Lenet-5 Architecture. Gradient-Based Learning Applied to Document Recognition (LeNet-5)

Only a little more than a decade before Krizhevsky the LeNet-5 by LeCunn Et al. with its paper Gradient-Based Learning Applied to Document Recognition (LeNet-5) made its debut (see Figure). Today we are talking about the YOLO v5 released only weeks after its predecessor the YOLO v4.

YOLO — You only look once, real time object detection explained

This impressive pace leaves me with the somehow disturbing and exciting question:

Who knows where we will be 10 years from now?



Santiago Velez Garcia

ML engineer and languages enthusiast.