The Covid-19 pandemic accelerated the mass adoption of technologies that would normally have been 5 to 10 years out. Every industry has seen a digital shift, from increased use of food delivery services, to digital signatures on contracts, to municipalities using Zoom to engage with citizens.
In our day-to-day work of making media more accessible, we have also seen some technologies accelerate during the pandemic. One of these, which was fringe and rough around the edges back in 2019, is synthetic voice (also known as text-to-speech) software.
Audio description (better known as described video in Canada) assists people who are blind or have low vision by providing a narrative description of visual elements that may be essential to understanding the plot of a story (facial expressions, settings, actions, costumes, etc.). Check out a few examples here. During the pandemic, society as a whole increased its consumption of video. According to eMarketing, “there was 27.7 million digital viewers in Canada in 2021, a level we previously did not expect to hit until 2024.” As video consumption grows, so does the need for more services and innovation to make video content accessible.
Text-to-speech is a technology that has improved during the pandemic. So the short answer to the question of whether it can be used for audio description is yes.
At a high level, the typical audio description process looks like this:
When a synthetic voice is used, the process is basically the same, except that the voice talent is not human:
There are advantages and disadvantages to using synthetic voices when doing audio description:
Advantages
Two benefits of using synthetic voices instead of human narrators are reductions in cost and time. Today, there are several mature synthetic voice software solutions on the market, which has reduced prices, while the cost of human narration continues to increase.
Production can also be faster, as you don’t have to source the voice talent, schedule a recording time, and so on. A few clicks and you are on your way. However, the jury is still out as to how much time can actually be saved. A human still needs to verify that the synthetic voice uses the correct words and inflections, and it’s not possible to know what anything sounds like until the entire audio mix is processed. So significant time can be spent reviewing and correcting after the fact, whereas in a human voice recording such issues are caught on the fly.
Another notable benefit is the variety of voice types. Most software solutions will provide hundreds of different voices to choose from, in multiple languages.
Disadvantages
Although faster production and lower costs sound great, the trade-off is quality.
Synthetic voice software has evolved over the last couple of years, and it can feel more human. However, if you are looking for a high-quality production, synthetic voices do have their downsides. Google Translate can help you get a quick and dirty translation, but because the algorithm can misunderstand context, you will often get a less than perfect translation. The same can be said for synthetic voices, which lack subjective judgement.
For example, when a human is describing the visual elements of a story, they may substitute a word that they perceive to be more appropriate than the original word in the script. The human can bring a better understanding of context and the nuances in the story. With synthetic voices, the software will simply read the words exactly as they appear in the description and will not be able to adapt on the fly.
Synthetic voices can lack tone and emotion because they do not understand context, and getting the right pronunciation can also be a challenge. They can be jarring in emotionally charged shows, and are likely better suited to academic videos, documentaries, and other content that doesn’t involve too much emotion. Like Google Translate, the technology is amazing, but it serves the need for speed and convenience, so buyer beware.
At the end of the day, it comes down to your expectations for quality, the type of user experience that you are looking for, your budget, and your timelines. Some vendors are using a hybrid model until the technology improves to the point where it clearly saves time and money.
If you are working with us, or another audio description / described video professional, start by having a conversation about the pluses and minuses of using synthetic voices versus human voice talent, so that together you can set the right expectations and outcomes.