
Adobe CS4 Video Transcription

Technology for automatically transcribing speech has been around for some time now. It’s even gotten usably accurate in constrained environments like doctors’ offices and legal depositions, where systems can be trained on people’s voices and the background noise and environment can be controlled. But the holy grail of speech transcription is unconstrained automatic transcription that can be employed on a wide variety of content. Adobe took a stab at this in their latest release, Adobe Premiere Pro CS4, and I was pleasantly surprised with the results.

The transcription engine is conveniently accessible in the new Adobe Media Encoder. Transcription takes about 3x real-time in the “high quality” mode; I didn’t test the medium quality mode. Using six videos, each 30 to 60 seconds long, I created manual transcriptions (ground truth) and then ran the same videos through the Media Encoder to get automatic transcriptions. The videos ranged from news and documentary segments, to sitcom excerpts, to advertisements. Three of them had minimal to no background noise, and three had variable amounts of music, laughter, and noise in the background. I compared the transcription output to the ground truth using a standard metric, the Word Error Rate (WER), which is essentially a word-level edit distance between the two transcripts, divided by the number of words in the ground truth. Here’s what I found:
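
For readers who want to see what the WER measures, here is a minimal sketch in Python. It is not the script used for the numbers in this post; the function name `word_error_rate`, the lowercasing, and the whitespace tokenization are all assumptions made for illustration. A real comparison would probably also want to strip punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase and split on whitespace; punctuation is left as-is.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


truth = "the quick brown fox jumps"
auto = "the quick brown box jumps over"
print(f"{word_error_rate(truth, auto):.2%}")  # one substitution + one insertion: 40.00%
```

Because insertions count as errors, the WER can go above 100% when the automatic transcript contains many more words than the ground truth.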

The average WER across all six videos was an unimpressive 38.82%. But among the “clean” videos, that is, those with little to no background noise or music, the average WER was 14.87%, which is actually pretty darn good for a fully automatic method. Studies have shown that a WER of about 25% is good enough to start being useful in interactive applications, so the 10-20% rate on the clean samples is really exciting: the technology is finally good enough to start being useful! Of course, add in music, some laughter, or the occasional neologism and everything’s out the window. But for that there’s still Audio Puzzler, which pulled an average WER of 1.24% on the same six test videos.
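
The two averages above also pin down how the three noisy videos did, even though that figure isn't quoted directly. The sketch below backs it out, assuming both averages are plain unweighted means over videos (not weighted by word count); the helper name `implied_subset_mean` is made up for this example.

```python
def implied_subset_mean(overall_mean: float, n_total: int,
                        known_mean: float, n_known: int) -> float:
    """Mean of the remaining items, given the overall mean and one subset's mean."""
    n_remaining = n_total - n_known
    if n_remaining <= 0:
        raise ValueError("the known subset must be smaller than the total")
    return (overall_mean * n_total - known_mean * n_known) / n_remaining


# Figures from the post: 38.82% over all six videos, 14.87% over the three clean ones.
noisy_mean = implied_subset_mean(38.82, 6, 14.87, 3)
print(f"{noisy_mean:.2f}%")  # 62.77%
```

Under that assumption, the videos with music, laughter, and background noise averaged a WER of roughly 63%, meaning well over half the words would need correcting.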