Video Transcription on Google

Yesterday Google announced that they were applying some of their speech transcription research to political videos on YouTube. The philosophy – pushing research into the market to see its value and how it's used – is great. The implementation, however, is rather shallow. While searching for keywords within a video may be valuable for some users, several other features (such as closed captioning) have been left out of the interface. Also, the feature has not been integrated into YouTube itself and only functions within the Google gadget, which makes it less likely to be seen and used by many people.

Speech recognition is a hard problem. In a recent test I did with the Sphinx-3 engine from CMU, I was lucky to get a 60% correct transcription for a YouTube video – and this was cleanly spoken audio. Studies at the University of Toronto by Cosmin Munteanu suggest that a word error rate (WER) of 25% is needed before the benefits of a transcribed video are realized. And there's a LONG way to go until automatically transcribed video achieves that WER on arbitrary internet content. The problems with automatic transcription are manifold, but they include (1) noisy audio, (2) different speakers with varying accents, (3) poor support for named entities, and (4) high error in audio-to-transcript alignment.
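For readers unfamiliar with the metric: WER is the word-level edit distance between the recognizer's output and a reference transcript, divided by the length of the reference. Here is a minimal sketch (the function name and example strings are my own, not from any particular toolkit):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard dynamic-programming edit distance, computed over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# A 25% WER means roughly one word in four is wrong:
print(wer("the quick brown fox", "the quick brown socks"))  # 0.25
```

Note that WER can exceed 100% when the recognizer inserts many spurious words, which is part of why raw accuracy figures for noisy internet audio can look so grim.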

It's hard to evaluate the Google transcription effort, but I will mention that in several keyword searches I have done, the markers on the timeline are off by several seconds from where the words are actually spoken in the video. This speaks to difficulty #4 above. To my knowledge there is no research on how this type of misalignment error affects the interactive experience, so it will be interesting to see whether Google users find it annoying.

I’ve been developing a new technology which addresses the video transcription problem. Check out my post on it here.