Consider recognizing entities such as objects, people, scenes and activities in every frame of video footage of day-to-day life. Such footage may come, for instance, from the media, wearable cameras, movies, or surveillance cameras. In principle, these entities could be drawn from thousands of classes: many of us encounter hundreds to thousands of distinct people, objects, scenes and activities over a lifetime. Recent advances in convolutional neural networks (CNNs) have opened up the possibility of using a single, pre-trained “oracle” classifier to recognize thousands of classes. However, these classifiers are relatively heavyweight, so applying them to every frame of video is costly.
In this paper, we show that day-to-day video exhibits highly skewed class distributions over short intervals. We demonstrate that when the class distribution is highly skewed toward a small set of classes, “specialized” CNNs trained to classify inputs from this distribution can be much simpler than the oracle classifier. We formulate the problem of detecting these short-term skews online and exploiting models based on them as a new sequential decision-making problem dubbed the Online Bandit Problem, and present a new algorithm to solve it. When applied to recognizing faces in TV shows and movies, we realize end-to-end classification speedups of 2.4-7.8x on GPU and 2.6-11.2x on CPU relative to a state-of-the-art convolutional neural network, at competitive accuracy.
Due to copyright restrictions, we do not release the videos or the cropped faces. We publish only the label files for the videos used in the paper. Each line in a label file has the form:
frame left top width height name
frame is the frame index,
(left, top, width, height) is the bounding box of the face in the frame, and
name is the actor's or actress's name.
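The line format above can be read with a few lines of Python. This is only a minimal sketch under stated assumptions: fields are assumed to be separated by whitespace, the frame index is assumed to be an integer, the bounding-box values are parsed as floats in case they are not integral, and the name is assumed to be everything after the fifth field (so it may contain spaces). The names `FaceLabel`, `parse_label_line`, and `parse_labels` are hypothetical helpers, not part of the released files.

```python
from typing import List, NamedTuple


class FaceLabel(NamedTuple):
    """One labeled face: frame index, bounding box, and person name."""
    frame: int
    left: float
    top: float
    width: float
    height: float
    name: str


def parse_label_line(line: str) -> FaceLabel:
    """Parse one 'frame left top width height name' line.

    Assumption: whitespace-separated fields; the name is the remainder
    of the line after the first five fields and may contain spaces.
    """
    parts = line.strip().split(maxsplit=5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    frame, left, top, width, height, name = parts
    return FaceLabel(int(frame), float(left), float(top),
                     float(width), float(height), name)


def parse_labels(text: str) -> List[FaceLabel]:
    """Parse the full contents of a label file, skipping blank lines."""
    return [parse_label_line(l) for l in text.splitlines() if l.strip()]
```

If the actual files use a different delimiter (for example tabs or commas), only the `split` call would need to change.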
| Video Name | Video Information | Full Length | Clip | Label |
|---|---|---|---|---|
| Friends S09E14 | 640x480, 29.97 fps | 24:22 | 0:00 - 24:22 | Download |
| Good Will Hunting | 1280x688, 24 fps | 2:06:33 | 1:41:00 - 1:55:00 | Download |
| The Departed | 1920x800, 24 fps | 2:31:19 | 0:50:40 - 0:59:40 | Download |
| Ocean’s Eleven #1 | 1280x720, 24 fps | 2:38 | 0:00 - 2:38 | Download |
| Ocean’s Eleven #2 | 1280x720, 24 fps | 2:13 | 0:00 - 2:13 | Download |
| Ocean’s Twelve #1 | 1280x720, 24 fps | 2:42 | 0:00 - 2:42 | Download |
The Ellen Show video was downloaded from YouTube, but it is no longer available there. The label file for the Ellen Show video is therefore not published.
We concatenate three clips from the movies Ocean’s Eleven and Ocean’s Twelve to increase the length of the video. The order is Ocean’s Eleven #1 -> Ocean’s Twelve #1 -> Ocean’s Eleven #2.
The following “compact” CNN models used in the paper are pre-trained on the full, unskewed datasets.
* Accuracy is the top-1 accuracy measured on the validation dataset without over-sampling.
Note that these compact models are significantly less accurate than the corresponding oracle models when trained and tested on the uniformly distributed dataset. However, if we train a compact model specialized for a few dominant classes and cascade it with an oracle model (we call the resulting cascade a specialized model), it can achieve accuracy comparable to that of the oracle models. The figure below shows the accuracy of specialized models trained on skewed datasets (10 dominant classes constitute 50-70% of instances) and tested under various skews. The dashed lines show the accuracy of the oracle models (GoogLeNet for object recognition, VGGNet-16 for scene recognition, and VGGFace for face recognition).
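The cascade of a compact model with an oracle can be illustrated with a short sketch. The text does not spell out the hand-off rule, so the one used here is an assumption: the compact model returns either one of the dominant classes or some other value, and any prediction outside the dominant set is passed to the oracle. The function `make_cascade` and the stand-in classifiers in the usage lines are hypothetical and are not the models from the paper.

```python
from typing import Callable, Hashable, Set

Classifier = Callable[[object], Hashable]


def make_cascade(compact: Classifier,
                 oracle: Classifier,
                 dominant: Set[Hashable]) -> Classifier:
    """Build a two-stage classifier from a compact model and an oracle.

    Assumed hand-off rule: trust the compact model only when it predicts
    one of the dominant classes; otherwise fall back to the oracle.
    """
    def classify(x: object) -> Hashable:
        label = compact(x)       # cheap model runs on every input
        if label in dominant:
            return label         # dominant class: skip the oracle
        return oracle(x)         # anything else: pay for the oracle
    return classify


# Usage with stand-in classifiers (strings play the role of inputs).
dominant_classes = {"a", "b"}
compact_stub = lambda x: x if x in dominant_classes else "other"
oracle_stub = lambda x: str(x).upper()
specialized = make_cascade(compact_stub, oracle_stub, dominant_classes)
print(specialized("a"), specialized("z"))
```

Under this rule the oracle runs only on inputs the compact model does not assign to a dominant class, so the cost saving grows with the share of inputs that fall in the dominant set.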