Sequential Model Specialization

Consider recognizing entities such as objects, people, scenes and activities in every frame of video footage of day-to-day life. Such footage may come, for instance, from the media, wearable cameras, movies, or surveillance cameras. In principle, these entities could be drawn from thousands of classes: many of us encounter hundreds to thousands of distinct people, objects, scenes and activities over our lifetimes. Recent advances in convolutional neural networks (CNNs) have opened up the possibility of using a single, pre-trained “oracle” classifier to recognize thousands of classes. However, these classifiers are relatively heavyweight, so applying them to every frame of video is costly.

In this paper, we show that day-to-day video exhibits highly skewed class distributions over short intervals. We demonstrate that when the class distribution is highly skewed toward a small set of classes, “specialized” CNNs trained to classify inputs from this distribution can be much simpler than the oracle classifier. We formulate the problem of detecting these short-term skews online and exploiting models specialized to them as a new sequential decision-making problem dubbed the Online Bandit Problem, and present a new algorithm to solve it. When applied to recognizing faces in TV shows and movies, we realize end-to-end classification speedups of 2.4-7.8x on GPU and 2.6-11.2x on CPU relative to a state-of-the-art convolutional neural network, at competitive accuracy.
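To make the notion of short-term skew concrete, here is a minimal sketch that measures how much of each window of frames is covered by its k most frequent classes. It is only an illustration, not the detection algorithm from the paper; the function names `top_k_share` and `windowed_skew` and the non-overlapping windowing are assumptions.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def top_k_share(labels: Sequence[Hashable], k: int) -> float:
    """Fraction of items in `labels` that belong to the k most frequent classes."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(labels)


def windowed_skew(labels: Sequence[Hashable], window: int, k: int) -> List[float]:
    """top_k_share for each consecutive, non-overlapping window of `window` labels.

    Hypothetical helper: the last window may be shorter than `window`.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        top_k_share(labels[start:start + window], k)
        for start in range(0, len(labels), window)
    ]
```

A value close to 1.0 for a small k means a handful of classes dominate that window, which is the situation in which a specialized model is expected to help.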


Due to copyright, we won’t release the videos or cropped faces. We only publish the label files for the videos that are used in the paper. Each line in the label file represents:

frame left top width height name

where frame is the frame index, (left, top, width, height) is the bounding box of the face in the frame, and name is the actor or actress’s name.
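As a rough illustration of the format above, the following sketch parses one label line into a record. It assumes fields are whitespace-separated, that the name may contain spaces (so only the first five fields are split off), and that box coordinates are numeric; `FaceLabel`, `parse_label_line` and `parse_labels` are hypothetical names, not part of any released tooling.

```python
from typing import List, NamedTuple


class FaceLabel(NamedTuple):
    frame: int     # frame index
    left: float    # bounding box: x of the left edge
    top: float     # bounding box: y of the top edge
    width: float   # bounding box width
    height: float  # bounding box height
    name: str      # actor or actress's name


def parse_label_line(line: str) -> FaceLabel:
    # Split off the five numeric fields; the remainder is the name,
    # which is assumed to possibly contain spaces.
    parts = line.strip().split(maxsplit=5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    frame = int(parts[0])
    left, top, width, height = (float(p) for p in parts[1:5])
    return FaceLabel(frame, left, top, width, height, parts[5])


def parse_labels(text: str) -> List[FaceLabel]:
    # Blank lines are skipped (an assumption about the file layout).
    return [parse_label_line(l) for l in text.splitlines() if l.strip()]
```

For example, `parse_label_line("12 100 50 64 64 Some Actor")` (a made-up line) yields a record with `frame == 12` and `name == "Some Actor"`.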

Video Name        | Video Information  | Full Length | Clip              | Label
Friends S09E14    | 640x480, 29.97 fps | 24:22       | 0:00 - 24:22      | Download
Good Will Hunting | 1280x688, 24 fps   | 2:06:33     | 1:41:00 - 1:55:00 | Download
The Departed      | 1920x800, 24 fps   | 2:31:19     | 0:50:40 - 0:59:40 | Download
Ocean’s Eleven #1 | 1280x720, 24 fps   | 2:38        | 0:00 - 2:38       | Download
Ocean’s Eleven #2 | 1280x720, 24 fps   | 2:13        | 0:00 - 2:13       | Download
Ocean’s Twelve #1 | 1280x720, 24 fps   | 2:42        | 0:00 - 2:42       | Download


  • The Ellen Show video was downloaded from YouTube but is no longer available there, so its label file is not published.

  • We concatenate three clips from the movies Ocean’s Eleven and Ocean’s Twelve to increase the length of the video. The order is Ocean’s Eleven #1 -> Ocean’s Twelve #1 -> Ocean’s Eleven #2.

Trained Models

The following “compact” CNN models used in the paper are pre-trained on the full, unskewed datasets.

Model | Dataset   | Accuracy* | Prototxt | Weights  | Mean
O1    | ImageNet  | 48.9%     | Download | Download | Download
O2    |           | 47.0%     | Download | Download |
S1    | Places205 | 44.0%     | Download | Download | Download
S2    |           | 40.8%     | Download | Download |
F1    | VGG Face  | 84.8%     | Download | Download | (99.5503,
F2    |           | 80.9%     | Download | Download |

* Accuracy is the top-1 accuracy measured on the validation dataset without over-sampling.

Note that these compact models are significantly less accurate than the corresponding oracle models when trained and tested on the uniformly distributed dataset. However, if we train a compact model specialized for a few dominant classes and cascade it with an oracle model (we call this a specialized model), it can achieve accuracy comparable to the oracle models. The figure below shows the accuracies of specialized models trained on skewed datasets (in which 10 dominant classes constitute 50-70% of instances) and tested under various skews. The dashed lines show the accuracy of the oracle models (GoogLeNet for object recognition, VGGNet-16 for scene recognition, and VGGFace for face recognition).
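The cascade described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the released implementation: the compact model is assumed to emit a catch-all label (here the hypothetical constant `OTHER`) for inputs outside its dominant classes, and both models are represented as plain Python callables.

```python
from typing import Callable, Hashable, TypeVar

X = TypeVar("X")

# Hypothetical catch-all label: the compact model is assumed to return it
# when the input does not look like one of its dominant classes.
OTHER = "other"


def make_cascade(
    compact: Callable[[X], Hashable],
    oracle: Callable[[X], Hashable],
) -> Callable[[X], Hashable]:
    """Build a classifier that tries the cheap compact model first and
    falls back to the expensive oracle only when the compact model defers."""

    def classify(x: X) -> Hashable:
        label = compact(x)
        if label == OTHER:
            # Input falls outside the dominant classes: pay for the oracle.
            return oracle(x)
        return label

    return classify
```

Under this sketch, the more skewed the input is toward the dominant classes, the less often the oracle is invoked, which is where the speedup would come from.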

Accuracies of specialized models


Coming soon.