Accuracy of Human-generated Captions vs. Adobe Speech-to-text

In the July 2021 release of Premiere Pro, Adobe introduced its artificial intelligence (AI) powered speech-to-text engine to help creators make their content more accessible to their audiences. Its extensive toolset lets users edit, stylize, and export captions in all supported formats straight out of the sequence timeline of a Premiere Pro project. A 3-step process of auto-transcribing, generating, and stylizing captions, all within a platform already familiar to its users, delivers a seamless experience from beginning to end. But how accurate is the final product?

This blog article was published in March 2022, and since then, ASR (Automatic Speech Recognition) technology has advanced significantly. While AI-powered ASR still does not outperform human writers, whom we firmly consider the gold standard, the advancements have been substantial and the technology continues to improve. This progress gives us the confidence to use ASR as a budget-friendly alternative for specific applications.

Learn more about our ASR services here:

Today, at their best, AI captions have an error rate of 5-10%, much improved over the 20% error rate (80% accuracy) we saw just a few years ago. High accuracy is crucial for the deaf and hard-of-hearing audience, as each error adds to the possibility of confusing the message. To protect all audiences that rely on captioning to understand television programming, in 2015 the Federal Communications Commission (FCC) set a detailed list of quality standards that all captions must meet to be acceptable for broadcast. Preceding those standards, the Described and Captioned Media Program (DCMP) published its Captioning Key manual over 20 years ago, and it has since been a valuable reference for captioning both entertainment and educational media targeted to audiences of all age groups. Simply having captions present on your content isn’t enough; they need to be accurate and replicate the experience as closely as possible for all audiences.

Adobe’s speech-to-text engine is one of the most impressive that our team has seen to date, so we decided to take a deeper look at it and run some tests. We tasked our most experienced Caption Editor with using Adobe’s auto-generated transcript to create and edit captions that meet the quality standards of the FCC and the deaf and hard-of-hearing community on two types of video clips: a single-speaker program and one with multiple speakers. Our editor used our Pop-on Plus+ caption product for these examples, which is our middle-tier caption product: it fulfills all quality standard requirements but is not always 100% free of errors.

Did using Adobe’s speech-to-text save time, or did it create more work in the editing process than needed? Here’s how it went…

In-depth comparison documents that evaluate the captions cell-by-cell are available for download here:

Download the Full Comparison

Single Speaker Clip

In this example, we used the perfect scenario for AI: clear audio, a single speaker at an optimal words-per-minute (WPM) speaking rate, and no sound effects or music.

The captions contained the following issues that would need to be corrected by the Caption Editor:

No speaker ID.
Stray leading spaces before the second caption line, which a Caption Editor can easily delete in their software.
Only 10 cells out of 95 matched our Caption Editor's formatting and would have required no further attention at all.
Several first words were not capitalized.
Some sentences had end-sentence punctuation in the middle of a sentence.
Many words were transcribed incorrectly.
Most cells were formatted poorly, resulting in cells with only 2-3 words or cells that were difficult to read.
- Many of these cells can be optimized and lines combined for ease of reading.
Several cells had a duration of less than 1 second.
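Some of the mechanical issues in this list (very short cells, stray leading spaces, cells with only a few words) can be flagged automatically before an editor opens the file. The following is a minimal sketch, assuming a well-formed SRT export; the function names, the sample captions, and the thresholds are illustrative assumptions rather than part of our workflow or Adobe's tooling. Speaker IDs, capitalization, and wrong words still need a human review.

```python
import re
from typing import List, NamedTuple, Tuple

# SRT timing line, e.g. "00:00:01,000 --> 00:00:04,200"
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


class Cell(NamedTuple):
    index: int
    start: float  # seconds
    end: float    # seconds
    lines: List[str]


def parse_srt(text: str) -> List[Cell]:
    """Parse SRT text into caption cells (minimal; assumes well-formed input)."""
    cells = []
    text = text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", text):
        rows = block.split("\n")
        if len(rows) < 3:
            continue
        match = TIMING.match(rows[1].strip())
        if not match:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        # Keep caption lines unstripped so leading spaces can be detected.
        cells.append(Cell(int(rows[0].strip()), start, end, rows[2:]))
    return cells


def find_issues(
    cells: List[Cell], min_duration: float = 1.0, min_words: int = 4
) -> List[Tuple[int, str]]:
    """Flag mechanical problems; both thresholds are assumptions to tune."""
    issues = []
    for cell in cells:
        if cell.end - cell.start < min_duration:
            issues.append((cell.index, "short duration"))
        if any(line != line.lstrip() for line in cell.lines):
            issues.append((cell.index, "leading whitespace"))
        if len(" ".join(cell.lines).split()) < min_words:
            issues.append((cell.index, "few words"))
    return issues


# Made-up sample captions, not taken from the test clips.
sample = "\n".join([
    "1",
    "00:00:01,000 --> 00:00:01,600",
    "Hello.",
    "",
    "2",
    "00:00:01,600 --> 00:00:04,200",
    "Thanks for joining us",
    " on the program today.",
])

for index, problem in find_issues(parse_srt(sample)):
    print(f"cell {index}: {problem}")
```

A report like this only narrows down where to look; it says nothing about whether the words themselves are right.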

Here’s the clip with Adobe’s speech-to-text captions overlaid on the top half of the video, and ours on the bottom half.

Multiple Speaker Clip

For the next clip, we went with a more realistic example of television programming with multiple speakers, an area where AI is known to struggle because it has difficulty identifying who is speaking. This clip also features someone with a pronounced accent, commentators speaking over one another, and proper names of athletes, all of which our editors take the time to research and understand.

The same errors detailed in the single-speaker example are present throughout, along with the other difficulties we expected the engine to have. In fact, there were so many errors that our editor was unable to use the transcript from Adobe and started from the beginning using our own workflow.

Here’s a sample of the first 9 cells of captions, with what Adobe transcribed in the first column, notes from our Caption Editor in the second, and how the captions should look in the third.

Adobe’s Automated SRT Caption File	Issue	Formatted by Aberdeen
something you are never seen in your life, correct?	No speaker ID.	(Pedro Martinez) It's something you have never seen in your life,
	“Correct” is spoken by new speaker.	(Matt Vasgersian) Correct!
So it's.	Missing text.	So it's--so it's MVP of the year!
So we're all watching something different. OK		(Pedro) We're all watching something different.
He gets the MVP.		Okay, he gets the MVP.
I'd be better off.	Completely misunderstood music lyrics.	♪ Happy birthday to you ♪
Oh, you, you guys.		(Matt) You guys.
Let me up here to dove into the opening night against the Hall of Fame.	Merged multiple sentences together.	Just left me up here to die.
		You left me up here to die against the hall of famer.
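One common way to put a number on word-level mistakes like these is word error rate (WER): the substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below is an illustration only; it is not how the cell-by-cell comparison was scored, and the function names and the normalization (lowercasing, dropping punctuation, leaving out speaker IDs) are assumptions. It uses the first exchange from the table above.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and drop punctuation (keeping apostrophes) before comparing."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion: reference word missing
                current[j - 1] + 1,      # insertion: extra word in transcript
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


# First exchange from the table, with speaker IDs left out.
reference = "It's something you have never seen in your life, Correct!"
hypothesis = "something you are never seen in your life, correct?"

print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
```

WER counts only words. It does not capture the missing speaker IDs, the line attributed to the wrong speaker, or poor cell formatting, which is why a low error rate alone does not make captions broadcast-ready.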

Take a look at the clip, again with Adobe's speech-to-text on the top and Aberdeen's captions on the bottom.

The Verdict

Overall, the quality of the auto-generated captions exceeded expectations, and we found them to be in the top tier of speech-recognition engines available. The timing and punctuation were particularly impressive. However, when doing a true comparison to the captioning work that we would consider acceptable, AI does not meet Aberdeen’s broadcast quality standard.

Aberdeen's post-production Caption Editors are detail-oriented and grammar-savvy, and they always strive to portray every element of the program with 100% accuracy so that the viewer misses nothing. Our most experienced Caption Editor needed a 5:1 ratio of editing time to running time to correct the single-speaker clip; that is, every minute of video took 5 minutes of cleanup on the transcript and captions. Assuming your team is educated in the proper timing of caption cells, line breaks, and grammar, a 30-minute program may take 2.5 hours or more to bring up to standards, and that is with a usable transcript. In the second example, the transcript was unusable and would have taken more time to clean up than it did to transcribe from scratch. For content like that, double the timeline.
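The arithmetic behind that estimate can be sketched in a few lines. The function name is hypothetical, and the 5:1 default is simply the ratio we measured on the single-speaker clip; expect it to vary with the content and the editor's experience.

```python
def estimate_edit_minutes(program_minutes: float, edit_ratio: float = 5.0) -> float:
    """Estimate cleanup time for auto-generated captions.

    edit_ratio is minutes of editing per minute of video. The default of
    5.0 is the ratio observed on the single-speaker clip (an assumption
    for any other content).
    """
    if program_minutes < 0 or edit_ratio < 0:
        raise ValueError("program_minutes and edit_ratio must be non-negative")
    return program_minutes * edit_ratio


# A 30-minute program at the 5:1 ratio.
minutes = estimate_edit_minutes(30)
print(f"{minutes:.0f} minutes ({minutes / 60:.1f} hours)")
```

Passing a higher `edit_ratio` models harder material such as the multiple-speaker clip, where no reliable ratio could be measured because the transcript was discarded.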

Consider all of the above when using this service. Do you have the time and resources to train your staff to know how to edit auto-generated captions and get them up to the appropriate standards? How challenging may your content be for the AI? Whenever and however you make the choice, make sure you deliver the best possible experience to your entire audience.
