Microsoft has developed a new image captioning algorithm that exceeds human accuracy in certain limited tests. The AI system has been used to update Seeing AI, the company’s assistant app for the visually impaired, and will soon be integrated into other Microsoft products such as Word, Outlook and PowerPoint. There it will be used for tasks such as creating alt text for images – a feature that is particularly important for accessibility.
“Ideally, everyone would include alt text for all images in documents, on the web, and on social media so that blind people can access the content and join the conversation.”
Apps that fill in missing alt text automatically include Microsoft’s own Seeing AI, which the company first released in 2017. Seeing AI uses computer vision to describe the world, as seen through a smartphone camera, for visually impaired users. It can identify household items, read and scan text, describe scenes, and even identify friends. It can also be used to describe pictures in other apps, including email clients, social media apps, and messaging apps like WhatsApp.
Microsoft is not disclosing user numbers for Seeing AI, but Eric Boyd, corporate vice president of Azure AI, told The Verge that the software is “one of the leading apps for people with blindness or low vision”. Seeing AI has been voted Best App or Best Assistive App for three years in a row by AppleVis, a community of blind and visually impaired iOS users.
Microsoft’s new captioning algorithm will greatly improve the performance of Seeing AI, because it can not only identify objects but also describe the relationships between them more accurately. The algorithm can look at an image and tell not only what objects it contains (e.g. “a person, a chair, an accordion”), but also how they interact (e.g. “a person is sitting on a chair and playing the accordion”). According to Microsoft, the algorithm is twice as good as its previous captioning system, which has been in use since 2015.
The algorithm, described in a preprint paper published in September, achieved the highest scores ever recorded on an image captioning benchmark known as “nocaps”. This is an industry-leading leaderboard for image captioning, though it has its own limitations.
The nocaps benchmark consists of more than 166,000 human-written captions that describe approximately 15,100 images from the Open Images dataset. These images span a range of scenarios, from sports to vacation photos to food photography and more. Algorithms are tested on their ability to create captions for these images that match those written by humans.
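To give a rough sense of what “matching” human captions can mean, the following Python sketch scores a candidate caption by its word overlap with a set of human references. This is a deliberately simplified illustration, not the scoring that nocaps uses: the benchmark reports the more elaborate CIDEr and SPICE metrics. The function names (`tokenize`, `overlap_f1`, `score_caption`) and the example captions are assumptions made up for this sketch.

```python
from collections import Counter


def tokenize(caption: str) -> list[str]:
    """Lowercase a caption and split it into word tokens, dropping punctuation."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in caption.lower())
    return cleaned.split()


def overlap_f1(candidate: str, reference: str) -> float:
    """Unigram F1 between one candidate caption and one reference caption."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared word.
    shared = sum((cand & ref).values())
    if shared == 0:
        return 0.0
    precision = shared / sum(cand.values())
    recall = shared / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def score_caption(candidate: str, references: list[str]) -> float:
    """Score a candidate by its best match among the human references."""
    return max((overlap_f1(candidate, ref) for ref in references), default=0.0)


if __name__ == "__main__":
    # Hypothetical captions, invented for illustration only.
    human_captions = [
        "A person is sitting on a chair and playing the accordion.",
        "Someone plays an accordion while seated.",
    ]
    print(score_caption("a person playing the accordion on a chair", human_captions))
```

A score like this rewards shared words rather than shared meaning, so a caption can score well while being wrong, or score poorly while being a perfectly good description in different words.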
It’s important to note, however, that the nocaps benchmark captures only a tiny fraction of the complexity of image captioning as a general task. Although Microsoft claims in a press release that its new algorithm “describes images the same way as humans”, this is true only for the very small subset of images contained in nocaps.
As Harsh Agrawal, one of the creators of the benchmark, told The Verge via email: “Outperforming human performance on nocaps is not an indicator that image captioning is a solved problem.” Agrawal noted that the metrics used to evaluate performance on nocaps “correlate only roughly with human preferences” and that the benchmark itself “only covers a small percentage of all possible visual concepts”.
“As with most benchmarks, [the] nocaps benchmark is only a rough indicator of how well the models are performing on this task,” said Agrawal. “Exceeding human performance on nocaps by no means shows that AI systems outperform humans in understanding images.”
This problem – assuming that performance on a particular benchmark can be broadly extrapolated to performance on the underlying task – is common when it comes to overstating the abilities of AI. In fact, Microsoft has been criticized by researchers in the past for making similar claims about the ability of its algorithms to understand the written word.
Even so, image captioning is a task that has seen tremendous improvements in the past few years thanks to artificial intelligence, and Microsoft’s algorithms are certainly state of the art. The captioning AI will not only be integrated into Word, Outlook and PowerPoint, but will also be available as a stand-alone model via Microsoft’s Azure cloud and AI platform.