Speech Recognition
This is a topic that I don't really know all that much about. Corrections welcome. This is very much a work in progress.
Speech recognition sucks. There's been a lot of research; why isn't it any better?
Speech consists of an assortment of hisses and buzzes that are interpreted by the brain. The current software approach is to try to go directly from the basic noises to words, using software brute force:
sound → words
It seems to me that if we break this process down, we can get a lot more accuracy. The tool that linguists use is the International Phonetic Alphabet (IPA), which expresses individual phonemes, so we can use it as an intermediate step:
sound → IPA → words
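This should give us speaker independence right at the start, since the sound → IPA step absorbs the differences between voices before any words are involved.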
For that first step, we can use something I call "predictive filtering" (I'm sure there's a "real" name for it). We classify each incoming sound as one of a finite set of "base sounds", each a small combination of buzzes and hisses that makes up a specific sound. To combine these base sounds into a phoneme, we start from the first base sound, look at the set of all possible sequences of base sounds that begin with it, and narrow that set down as more sounds arrive:
- Get initial base sound
- Generate set of all possible sequences starting with that base
- Cache the "most probable" sequence(s)
- Get the next base sound
- Prune off the sequences that don't have that sound as their next element
- Cache "most probable"
- Repeat the last three steps until only a single sequence, matching a single phoneme, is left
- Output phoneme
- Go back to the first step for the next phoneme
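The steps above can be sketched in a few lines of Python. This is only a rough illustration under assumptions of my own, not a tested recognizer: the `PHONEMES` table, its base-sound labels (`hiss_high`, `buzz`, and so on) and its probabilities are all made up, and the table is assumed to be prefix-free (no phoneme's sequence is the start of another's), so that "a single sequence is left" is a clear stopping point.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical table: phoneme -> (base-sound sequence, prior probability).
# Labels and numbers are invented purely for illustration.
Table = Dict[str, Tuple[Tuple[str, ...], float]]

PHONEMES: Table = {
    "s": (("hiss_high", "hiss_high"), 0.4),
    "z": (("hiss_high", "buzz"), 0.3),
    "sh": (("hiss_low", "hiss_low"), 0.2),
    "m": (("buzz",), 0.1),
}


def most_probable(candidates: Table) -> str:
    """Return the surviving phoneme with the highest prior probability."""
    return max(candidates, key=lambda phoneme: candidates[phoneme][1])


def filter_phonemes(base_sounds: Iterable[str]) -> Iterator[str]:
    """Yield phonemes by pruning candidate sequences one base sound at a time."""
    candidates = dict(PHONEMES)  # every sequence is possible at the start
    position = 0                 # index of the element we are matching next
    for sound in base_sounds:
        # Prune sequences whose next element is not the sound we just heard.
        candidates = {
            phoneme: (seq, prob)
            for phoneme, (seq, prob) in candidates.items()
            if position < len(seq) and seq[position] == sound
        }
        position += 1
        if not candidates:
            # Nothing matches (assumption: drop the sound and start over).
            candidates, position = dict(PHONEMES), 0
            continue
        best_guess = most_probable(candidates)  # the cached "most probable"
        seq, _ = candidates[best_guess]
        if len(candidates) == 1 and len(seq) == position:
            yield best_guess                    # a single complete sequence is left
            candidates, position = dict(PHONEMES), 0


print(list(filter_phonemes(
    ["hiss_high", "buzz", "buzz", "hiss_low", "hiss_low"]
)))
```

The cached best guess is only used here when the set has collapsed to one sequence; a real system could presumably emit it early as a provisional answer.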
The same filtering technique can also be used for the second step, going from IPA to words.
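As a rough sketch of what that might look like, here is the same pruning idea applied to a tiny, made-up lexicon. The `LEXICON` entries and their transcriptions are assumptions for illustration only, and the function just reports which words are still possible after the phonemes heard so far.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical mini-lexicon: word -> rough IPA transcription.
LEXICON: Dict[str, Tuple[str, ...]] = {
    "cat": ("k", "æ", "t"),
    "cap": ("k", "æ", "p"),
    "cats": ("k", "æ", "t", "s"),
    "dog": ("d", "ɒ", "g"),
}


def candidates_after(phonemes: Sequence[str]) -> List[str]:
    """Return the words whose transcription starts with the phonemes heard so far."""
    survivors = dict(LEXICON)
    for position, phoneme in enumerate(phonemes):
        # Prune words whose next phoneme is not the one we just heard.
        survivors = {
            word: seq
            for word, seq in survivors.items()
            if position < len(seq) and seq[position] == phoneme
        }
    return sorted(survivors)


print(candidates_after(["k", "æ"]))
print(candidates_after(["k", "æ", "t"]))
```

Note that "cat" and "cats" both survive after k, æ, t. Words are often prefixes of other words, so at this level the "single sequence left" rule would probably need some extra cue for where a word ends.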