Computers may rival human visual analytical ability for the first time
Fans of the Where’s Wally picture books (known as Where’s Waldo in the United States and Canada) have for years searched for their hero with nothing but a keen eye and a herculean dose of patience. Readers are challenged to locate their man, along with his distinctive bobble hat, striped shirt, cane and glasses, all while being distracted by other similar objects. It is a headache-inducing and frustrating way to spend an afternoon.
Help may be on the way. The results of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), released on December 10th, show how machines may finally be better at image classification than humans.
The annual competition, championed by scientists from Stanford, Michigan and the University of North Carolina at Chapel Hill, has grown steadily since it was launched with six teams in 2010 (when Princeton and Columbia were participants). It attracts global interest and has become the benchmark for object detection and classification.
This year there were 70 teams, from Microsoft, Google, research laboratories, student groups and other companies and academic institutions. They were provided with a publicly available image dataset, which allowed them to develop categorical object recognition algorithms. The fully trained and beefed-up algorithms were then let loose in November on the two elements of the competition itself: detection and localisation.
To score a point for detection, the teams had to draw a bounding box around, and accurately label, objects from 200 categories in 51,294 images (each containing multiple objects). They were then allowed five guesses at the localisation and classification of objects from 150,000 images across 1,000 categories. The classification of the images in the first test needed only to be generic: fish, car, airplane and so on. In the second test the classification was much more stringent: there were 189 breeds of dog to choose from to earn a point, for example.
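The "five guesses" rule can be made concrete with a small sketch. It assumes, for illustration only, that an image counts as correct when its true label appears anywhere among the first five guesses; the function name `top5_error` and the example labels are hypothetical, and the bounding-box part of the localisation task is left out.

```python
def top5_error(predictions, truths, k=5):
    """Fraction of images whose true label is absent from the first k guesses.

    Illustrative assumption: a hit anywhere in the first k guesses counts
    as correct. This is not the competition's official scoring code.
    """
    if len(predictions) != len(truths):
        raise ValueError("need exactly one true label per list of guesses")
    misses = sum(
        1 for guesses, truth in zip(predictions, truths)
        if truth not in guesses[:k]
    )
    return misses / len(truths)


# Hypothetical guesses for four images, ordered from most to least confident.
guesses = [
    ["beagle", "basset", "foxhound", "harrier", "bloodhound"],
    ["tabby", "tiger cat", "lynx", "Egyptian cat", "Persian cat"],
    ["sports car", "convertible", "racer", "limousine", "minivan"],
    ["hoary marmot", "beaver", "otter", "mink", "weasel"],
]
labels = ["harrier", "Siamese cat", "sports car", "whistling marmot"]

error = top5_error(guesses, labels)            # two of four images missed
strict_error = top5_error(guesses, labels, k=1)  # only the first guess counts
print(f"top-5 error: {error:.0%}, top-1 error: {strict_error:.0%}")
```

Allowing five guesses is forgiving where categories are fine-grained: the harrier is found on the fourth guess, whereas a single-guess rule would have counted it as a miss.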
Every team used some variant of a deep neural network. These information-processing models, based on the principles of biological nervous systems, aim to derive or predict meaning from incomplete data (such as images, as in this case). Each network comprises layers of highly interconnected processing elements. In previous iterations of the competition, teams had never used more than 20 hidden layers in their algorithms. But this year, the winning team, Microsoft Research Asia (MSRA), used 152 layers, each one slightly transforming the representation of the layer before.
Generally these networks are arranged in layers of artificial neurons or nodes. Adding more layers to a network increases its ability to handle higher-order problems. For instance, a small number of layers may be able to recognise spheres, later layers may then be able to ascertain that these are green or orange spheres and further layers may decide that these are in fact apples and oranges. Then perhaps more layers could be added to work out that we are looking at a fruit bowl. As such there is a huge advantage in having more layers when complex tasks need to be performed. The trouble is that these ‘deeper’ networks become rapidly more difficult to train as the available permutations become so vast. A point is reached where the system accuracy degrades when additional layers are added.
What MSRA seems to have identified is that some parts of the image recognition task inherently require a different number of layers than others. If the network has successfully learnt a feature then adding more layers thereafter just dilutes the answer and gets in the way.
To get round this problem MSRA provided short-cuts: connections that can skip across layers that may be redundant for the particular image being analysed. This has allowed them to have a network where the depth is effectively changed dynamically. A side effect of this seems to be that they can greatly increase the number of layers before hitting the limit of the network’s ability to learn, which is when everything goes a bit bonkers. That’s how they managed to scale up to 152 layers.
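A toy sketch may help show why a short-cut lets a redundant layer get out of the way. It assumes one common form of skip connection, in which a block's input is added back to its output; this is an illustration, not MSRA's actual architecture, and the helper names (`plain_block`, `shortcut_block`) are hypothetical.

```python
def relu(values):
    """Clamp negative activations to zero."""
    return [max(0.0, x) for x in values]


def linear(weights, values):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, values)) for row in weights]


def plain_block(weights, values):
    """An ordinary layer: the output is only the transformed input."""
    return relu(linear(weights, values))


def shortcut_block(weights, values):
    """The same layer with a short-cut: the input is added back to the output.

    Assumption for illustration: the short-cut is a simple addition.
    """
    transformed = relu(linear(weights, values))
    return [x + t for x, t in zip(values, transformed)]


features = [1.0, 2.0, 3.0]
# All-zero weights stand in for layers that have nothing useful to add.
zero_weights = [[0.0] * 3 for _ in range(3)]

plain_out = features
shortcut_out = features
for _ in range(152):
    plain_out = plain_block(zero_weights, plain_out)
    shortcut_out = shortcut_block(zero_weights, shortcut_out)

print(plain_out)     # the learnt features have been wiped out
print(shortcut_out)  # the learnt features pass through all 152 blocks
```

In the plain stack a layer with nothing to contribute destroys what came before it; with the short-cut the same layer simply lets the signal through, so extra depth need not dilute features that have already been learnt.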
So, the trick seems to be not just increasing the number of layers, but also controlling the resultant computational load by using short-cuts. As Assistant Professor Alex Berg of UNC Chapel Hill says: “MSRA had to develop new techniques for managing the complexity of optimising so many layers”.
The results were unequivocal. In the detection test, MSRA won 194 of the 200 categories, with a mean average precision (AP) of 62%. This was a whopping 40% increase on the mean AP achieved by the winner in 2014.
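The size of that jump can be checked with a little arithmetic. The sketch below assumes the 40% figure is a relative increase rather than a gain of 40 percentage points, and works backwards to the 2014 score that assumption implies; the variable names are illustrative and the implied figure is a derived estimate, not a reported result.

```python
mean_ap_2015 = 0.62   # MSRA's mean average precision, as reported above
relative_gain = 0.40  # assumed to be a relative (not percentage-point) increase

# If 2015 = 2014 * (1 + gain), then 2014 = 2015 / (1 + gain).
implied_2014 = mean_ap_2015 / (1 + relative_gain)
point_gain = mean_ap_2015 - implied_2014

print(f"implied 2014 mean AP: {implied_2014:.1%}")
print(f"gain in percentage points: {point_gain * 100:.1f}")
```

On this reading the 2014 winner scored a mean AP of roughly 44%, so the improvement amounts to about 18 percentage points.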
To err is human
In the second test MSRA achieved a classification error rate of 3.5%. That is significant because after the 2014 competition, the human error rate, tested against 1,500 images, was estimated to be 5.1%. (At the time the best computer algorithm only managed 6.8% against the same test set of images.)
But computers are not unquestionably better than humanity at image recognition, at least for now. “It is hard to compare human accuracy,” explained Mr Berg, “as computers are not distracted by other things.” And while they may be better at differentiating between hoary and whistling marmots, they cannot, yet, understand context. For instance, a human would recognise a few barely visible feathers near a hand as very likely belonging to a mostly occluded quill; computers would probably miss such nuance.
The long-term goal of this research is to have computers understand the visual context of the world as humans do. ILSVRC is a step towards that future and more will be learned on December 17th when the winning teams reveal their full methodologies at a workshop in Chile. Whether the test set for next year’s competition will contain red and white bobble hats is not yet known.