sgsguru | Speech Recognition

This is a topic that I don't really know all that much about. Corrections welcome. This is very much a work in progress.

Speech recognition sucks. There's been a lot of research; why isn't it any better?

Speech consists of an assortment of hisses and buzzes that are interpreted by the brain. The current software approach is to try to go directly from the basic noises to words, using software brute force:

sound → words

Seems to me, if we break this process down, we can get a lot more accuracy. The tool that linguists use is the International Phonetic Alphabet (IPA), which expresses individual phonemes:

sound → IPA → words

This should give us, right at the start, speaker independence: the IPA layer records which sounds were made, not whose voice made them.

For that first step, we can use something I call "predictive filtering" (I'm sure there's a "real" name for it). We classify each incoming sound as one of a finite set of "base sounds", each being some small combination of buzzes and hisses that makes up a specific sound. To combine these base sounds into a phoneme, we look at the set of all possible sequences of base sounds that start with that base.
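Here's a rough sketch of what "classify into a base sound" could look like: pick whichever base sound's prototype is nearest to the incoming frame. Everything specific here is my assumption for illustration: the two-number feature (buzz energy, hiss energy), the prototype values, and the names `PROTOTYPES` and `classify_frame`. A real front end would use far richer features.

```python
import math

# Hypothetical prototypes: base sound -> (buzz energy, hiss energy), each in 0..1.
# The names and numbers are made up for illustration.
PROTOTYPES = {
    "buzz": (0.9, 0.1),
    "hiss": (0.1, 0.9),
    "buzzy_hiss": (0.7, 0.7),
    "silence": (0.0, 0.0),
}


def classify_frame(frame, prototypes=PROTOTYPES):
    """Return the name of the base sound whose prototype is closest to `frame`."""
    return min(prototypes, key=lambda name: math.dist(frame, prototypes[name]))


# Usage: turn a stream of feature frames into a stream of base sounds.
frames = [(0.8, 0.2), (0.6, 0.8), (0.05, 0.1)]
base_sounds = [classify_frame(f) for f in frames]
```

The point is only that the output alphabet is small and finite, so everything downstream can work with symbols instead of raw audio.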

1. Get initial base sound
2. Generate set of all possible sequences starting with that base
3. Cache the "most probable" sequence(s)
4. Get the next base sound
5. Prune off the sequences that don't have that sound as their second element
6. Cache "most probable"
7. Continue until we have a single sequence matching a single phoneme.
8. Output phoneme
9. Go to step 1
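The steps above can be sketched in a few lines of Python. This is a minimal sketch, not a working recognizer: the `INVENTORY` table, its base-sound names, and its probabilities are all made up, and I'm assuming no sequence is a prefix of another, so a fully matched sequence is always the only survivor.

```python
# Hypothetical inventory: phoneme -> (base-sound sequence, prior probability).
# The base-sound names and the numbers are invented for illustration.
INVENTORY = {
    "s": (("hiss_high", "hiss_high"), 0.4),
    "z": (("hiss_high", "buzz"), 0.2),
    "f": (("hiss_low", "hiss_low"), 0.3),
    "m": (("buzz", "hum", "buzz"), 0.1),
}


def predictive_filter(base_sounds, inventory=INVENTORY):
    """Decode a stream of base sounds into phonemes by pruning candidates.

    Assumes no sequence in the inventory is a prefix of another.
    Returns (phonemes, guesses), where guesses holds the cached
    most-probable candidate after each input sound.
    """
    phonemes, guesses = [], []
    candidates, position = dict(inventory), 0
    for sound in base_sounds:
        # Prune: keep only sequences that match the input at this position.
        candidates = {
            p: (seq, prior)
            for p, (seq, prior) in candidates.items()
            if position < len(seq) and seq[position] == sound
        }
        if not candidates:
            raise ValueError(f"no phoneme matches {sound!r} at position {position}")
        position += 1
        # Cache the most probable surviving candidate.
        guesses.append(max(candidates, key=lambda p: candidates[p][1]))
        # Output a phoneme once a surviving sequence is fully consumed,
        # then start over with the full inventory.
        done = [p for p, (seq, _) in candidates.items() if len(seq) == position]
        if done:
            phonemes.append(done[0])
            candidates, position = dict(inventory), 0
    return phonemes, guesses


stream = ["hiss_high", "buzz", "buzz", "hum", "buzz", "hiss_low", "hiss_low"]
phonemes, guesses = predictive_filter(stream)
```

Note that after the first sound the cached guess is "s" (the more probable of the two survivors), and the second sound corrects it to "z". That's the "verify against the expected value" behavior in miniature.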

Generating the set of sequences and the pruning process will parallelize like crazy. It's a classic map-reduce function. We're doing our pattern matching one small step at a time. Also, I expect the set of sequences to converge very rapidly; most steps will simply verify the input against the "expected value" in the cache.
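To make the map-reduce shape concrete, here's a sketch of a single pruning step: the map checks every candidate against the incoming sound independently, and the reduce folds the results down to the most probable survivor. The candidate tuples and the names `survives`, `more_probable`, and `prune_step` are hypothetical. The thread pool is only there to show the shape of the computation; I'm not claiming it would be fast as written.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical candidates: (phoneme, base-sound sequence, prior probability).
CANDIDATES = [
    ("s", ("hiss_high", "hiss_high"), 0.4),
    ("z", ("hiss_high", "buzz"), 0.2),
    ("m", ("buzz", "hum"), 0.1),
]


def survives(candidate, position, sound):
    """Map step: return the candidate if it matches the input here, else None."""
    _, seq, _ = candidate
    matches = position < len(seq) and seq[position] == sound
    return candidate if matches else None


def more_probable(a, b):
    """Reduce step: keep the more probable of two results (None means pruned)."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a[2] >= b[2] else b


def prune_step(candidates, position, sound, pool):
    """Run one map-reduce pruning step; return (survivors, most probable)."""
    results = list(pool.map(lambda c: survives(c, position, sound), candidates))
    survivors = [c for c in results if c is not None]
    best = reduce(more_probable, results, None)
    return survivors, best


with ThreadPoolExecutor(max_workers=4) as pool:
    survivors, best = prune_step(CANDIDATES, 0, "hiss_high", pool)
```

Each candidate is checked without looking at any other, which is what makes the map side easy to spread across workers.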

This filtering technique can also be used for going from IPA to words.
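As a sketch of the same idea one level up, the candidate set is now words and the input stream is phonemes. The four-word `LEXICON` and the function name `narrow_words` are assumptions for illustration; a real lexicon would also have to deal with words that are prefixes of other words, which this toy ignores.

```python
# Hypothetical mini-lexicon: word -> IPA phoneme sequence (broad transcription).
LEXICON = {
    "cat": ("k", "æ", "t"),
    "cap": ("k", "æ", "p"),
    "cut": ("k", "ʌ", "t"),
    "bat": ("b", "æ", "t"),
}


def narrow_words(phonemes, lexicon=LEXICON):
    """Return the surviving candidate words after each incoming phoneme."""
    candidates = set(lexicon)
    history = []
    for position, phoneme in enumerate(phonemes):
        # Prune words whose pronunciation doesn't match at this position.
        candidates = {
            word
            for word in candidates
            if position < len(lexicon[word]) and lexicon[word][position] == phoneme
        }
        history.append(sorted(candidates))
    return history


history = narrow_words(["k", "æ", "t"])
# history is [["cap", "cat", "cut"], ["cap", "cat"], ["cat"]]
```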
