Using AI to Improve Photo Descriptions for People Who Are Blind and Visually Impaired
When Facebook users scroll through their News Feed, they find all kinds of content — articles, friends’ comments, event invitations, and of course, photos. Most people are able to instantly see what’s in these images, whether it’s their new grandchild, a boat on a river, or a grainy picture of a band onstage. But many users who are blind or visually impaired (BVI) can also experience that imagery, provided it’s tagged properly with alternative text (or “alt text”). A screen reader can describe the contents of these images using a synthetic voice and enable people who are BVI to understand images in their Facebook feed.
Automatic alt text (AAT) — which was recognized in 2018 with the Helen Keller Achievement Award from the American Foundation for the Blind — uses object recognition to generate descriptions of photos on demand so that people who are blind or visually impaired can more fully enjoy their News Feed.
The latest iteration of AAT represents multiple technological advances that improve the photo experience for our users. First and foremost, we’ve expanded the number of concepts that AAT can reliably detect and identify in a photo by more than 10x, which in turn means fewer photos without a description.
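At a high level, generating a description from object recognition means keeping only the concepts the model is confident about and joining them into a screen-reader-friendly sentence. The sketch below illustrates that idea; the concept names, confidence scores, threshold, and `build_alt_text` function are illustrative assumptions, not Facebook's actual model output or API.

```python
# Hypothetical sketch: turning object-recognition output into an alt-text
# string. Concepts, scores, and the 0.8 threshold are illustrative only.

def build_alt_text(detections, threshold=0.8):
    """Keep only confidently detected concepts and join them into a
    short description a screen reader can speak."""
    confident = [name for name, score in detections if score >= threshold]
    if not confident:
        return "No description available."
    return "Image may contain: " + ", ".join(confident) + "."

# Example: model output as (concept, confidence) pairs.
print(build_alt_text([("tree", 0.95), ("mountain", 0.91),
                      ("outdoors", 0.88), ("boat", 0.42)]))
# -> Image may contain: tree, mountain, outdoors.
```

Thresholding on confidence is what keeps descriptions reliable: a concept the model is unsure about ("boat" above) is omitted rather than risked as a wrong description.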
Where We Started
The concept of alt text dates back to the early days of the internet, giving people on slow dial-up connections a text alternative to downloading bandwidth-intensive images. Of course, alt text also helped people who are blind or visually impaired navigate the internet, since screen reader software can use it to generate spoken image descriptions. Unfortunately, faster internet speeds made alt text less of a priority for many users. And since these descriptions had to be added manually by whoever uploaded an image, many photos came to have no alt text at all — with no recourse for the people who had relied on it.
Our original AAT model could recognize 100 common concepts, like "tree," "mountain," and "outdoors." And since people who use Facebook often share photos of friends and family, our AAT descriptions used facial recognition models to identify people (as long as those people gave explicit opt-in consent).
Seeing More of the World
But we knew there was more that AAT could do, and the next logical step was to expand the number of recognizable objects and refine how we described them. To achieve this, we moved away from fully supervised learning with human-labeled data. While this method delivers precision, the time and effort involved in labeling data are extremely high — and that’s why our original AAT model reliably recognized only 100 objects.
Having increased the number of objects recognized while maintaining a high level of accuracy, we turned our attention to figuring out how to best describe what we found in a photo.
We asked users who depend on screen readers how much information they wanted to hear and when they wanted to hear it. They wanted more information when an image is from friends or family, and less when it’s not.
Facebook Is for Everyone
Every day, our users share billions of photos. The ubiquity of inexpensive cameras in mobile phones, fast wireless connections, and social media products like Instagram and Facebook has made photography easy to capture and share, and has helped make it one of the most popular ways to communicate, including for people who are blind or visually impaired. While we wish everyone who uploaded a photo would include an alt text description, we recognize that this often doesn't happen.
Jan 29, 2021 at 23:13