06 May 2010

Speech Unrecognition

It is very interesting that there is a huge difference between speech to text, and text to speech. [...]

It is rather amazing that the reverse process, speaking and having the computer record the words, basically doesn't work at all. Optical scanning works quite well, with error rates below 5%. But audio speech-to-text... 80%, tops, and even then you are better off typing it straight from voice, for most purposes.
Is this really that unexpected? How many times do you have to say "I'm sorry, what did you say?" when speaking with someone, especially someone unfamiliar to you? And how many times do you find yourself unable to read someone's handwriting? For me the former is much more common. OCR typically works from computer-generated glyphs (i.e. printed material), which makes it even easier than handwriting recognition.

Maybe I've just been in CS too long, but it would never occur to me that speech-to-text and text-to-speech would be anything but completely separate tasks with totally independent error rates.

Part of the difference, by the way, has to do with the dimensionality of the functions. There are many (but not that many) ways to pronounce each letter. It is easier to see a grapheme and determine which of its corresponding phonemes to use than it is to hear any phoneme in a language and determine which combination of glyphs should represent that sound. The mapping is not a 1:1, invertible function, which means one direction is bound to be harder than the other.
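
To make the "not invertible" point concrete, here is a rough Python sketch. The table is a toy I made up for illustration (a handful of English-ish spellings and very loose sound labels, not real phonetic data), and the names `GRAPHEME_TO_PHONEMES`, `invert`, and `average_fanout` are mine. All it shows is that a many-to-many mapping can have a small number of choices per input in one direction and a larger number in the other.

```python
from collections import defaultdict

# Toy, hand-picked table: spelling -> possible sounds.
# Illustrative assumption only; not real linguistic data.
GRAPHEME_TO_PHONEMES = {
    "c": ["k", "s"],
    "k": ["k"],
    "ck": ["k"],
    "q": ["k"],
    "s": ["s", "z"],
    "z": ["z"],
    "f": ["f"],
    "ph": ["f"],
    "gh": ["f", "g"],
    "g": ["g", "j"],
    "j": ["j"],
}


def invert(table):
    """Flip a one-to-many table: each value becomes a key listing its sources."""
    inverse = defaultdict(list)
    for source, targets in table.items():
        for target in targets:
            inverse[target].append(source)
    return dict(inverse)


def average_fanout(table):
    """Average number of candidates you must choose between per input."""
    return sum(len(candidates) for candidates in table.values()) / len(table)


PHONEME_TO_GRAPHEMES = invert(GRAPHEME_TO_PHONEMES)

if __name__ == "__main__":
    print("choices per grapheme:", average_fanout(GRAPHEME_TO_PHONEMES))
    print("choices per phoneme: ", average_fanout(PHONEME_TO_GRAPHEMES))
    print("spellings for 'k':   ", PHONEME_TO_GRAPHEMES["k"])
```

In this toy table the sound-to-spelling direction has more candidates per input than the spelling-to-sound direction, simply because many spellings collapse onto few sounds. Whether real languages show the same ratio is a separate question that this sketch does not answer.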

(This is my third (fourth?) post on speech recognition this week. Weird. I guess I just get all excited to see some computer science popping up in the blogosphere. I don't get much of that, especially in the regions of blogspace I hang out in.)

