I don't understand how this works. Help would be really appreciated.
Let's say we have a 10-second sound input and we compute a feature vector every 10 ms, so we have 1000 vectors: $\mathbf{o}=o_1,\cdots,o_{1000}$.
For simplicity, let's say that all we want to know is which sequence of phones $p=p_1,\cdots,p_n$ maximizes $P(\mathbf{o}|p)$, i.e. $\operatorname{argmax}_p P(\mathbf{o}|p)$, or at least a good estimate/close answer. For each phone, we have a separate left-to-right HMM model (I think that's what makes sense in this scenario).
How exactly is this done? Will similar enough vectors be "put together" and only treated as one? The input is probably not one phone, and probably not a thousand phones either - but we don't know that in advance... so for what $p$ will $P(\mathbf{o}|p)$ be computed?
Answer
> Will similar enough vectors be "put together" and only treated as one?
No, this is not how it is done.
> so for what $p$ will $P(\mathbf{o}|p)$ be computed?
Your assumption that recognition is performed by considering all possible phone or word sequences one by one and scoring them is wrong. For just a few seconds of speech, the number of possible word sequences is astronomically large! This approach - scoring each candidate model over the entire space and picking the one with the largest score - is actually used only for some small-vocabulary / single-word applications (say, recognizing a handful of options on a voice menu). I think this confusion stems from the fact that the classic Rabiner tutorial on HMMs makes a lot of sense for finite-vocabulary applications, but stays away from all the difficulties/technicalities of continuous applications.
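To make that small-vocabulary case concrete, here is a minimal sketch (my own illustration, not from the original answer, with invented model parameters and Gaussian emissions) of scoring a whole utterance against each candidate word's HMM with the forward algorithm and picking the best one:

```python
# Sketch of isolated-word recognition: one HMM per word, score the whole
# observation sequence under every model, return the arg-max word.
# All model parameters are hypothetical placeholders.
import numpy as np

def gaussian_loglik(obs, means, variances):
    """Per-frame log-likelihood of each frame under each state's diagonal
    Gaussian: obs (T, D), means/variances (S, D) -> (T, S)."""
    diff = obs[:, None, :] - means[None, :, :]                # (T, S, D)
    return -0.5 * np.sum(np.log(2 * np.pi * variances) + diff**2 / variances,
                         axis=-1)

def log_forward(log_A, log_pi, log_B):
    """Forward algorithm in the log domain; returns log P(o | model)."""
    T, _ = log_B.shape
    alpha = log_pi + log_B[0]
    for t in range(1, T):
        # logsumexp over previous states for each current state
        alpha = log_B[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)                         # sum over final states

def recognize_isolated(obs, word_models):
    """word_models: dict word -> (log_A, log_pi, means, variances).
    Scores every candidate model and returns the best-scoring word."""
    scores = {w: log_forward(A, pi, gaussian_loglik(obs, mu, var))
              for w, (A, pi, mu, var) in word_models.items()}
    return max(scores, key=scores.get)
```

This brute-force enumeration is only workable because the candidate set (a handful of menu words) is tiny; that is exactly what breaks down in continuous recognition.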
In continuous speech recognition there are several layers on top of the phone models. Phone models are joined together to form word models, and word models are joined together according to a language model. The recognition problem is formulated as searching for the path of least cost through this graph.
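As a toy illustration of that layering (the words, phones, and costs below are invented, not from the original answer), the lexicon and the language model can be seen as two tables - one expanding each word into its phone sequence, one putting a cost on each word-to-word transition - and the decoding graph is obtained by expanding these tables into connected phone HMMs:

```python
# Invented toy lexicon and bigram language model; recognition then amounts to
# a least-cost path search through the graph these two tables define.
LEXICON = {                      # word -> phone sequence
    "yes": ["y", "eh", "s"],
    "no":  ["n", "ow"],
}
BIGRAM_LM = {                    # (previous word, word) -> cost (negative log prob)
    ("<s>", "yes"): 0.7, ("<s>", "no"): 0.7,
    ("yes", "</s>"): 0.1, ("no", "</s>"): 0.1,
}
```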
One could do the same by piecing together HMMs, though... Let's take a simpler example. You have 5 phones and you want to recognize any sequence of them (no word models, no language model - let's consider meaningless speech for this example). You train one left-right HMM per phone, say 3 or 4 states each - perhaps on clean, isolated recordings of each phone. To perform continuous recognition, you build a big HMM by adding transitions between the last state of each phone and the first state of each phone. Performing recognition with this new model is done by finding the most likely sequence of states given the sequence of observation vectors - with the Viterbi algorithm. You will then know exactly which sequence of states (and thus phones) is most compatible with the observations.

Practically, things are never done this way, and this is why I downplayed in my previous reply the importance of Rabiner's "problem 2": in a true continuous application, the vocabulary is so large that this giant HMM made of all possible connections between phones to make words, and words to make sentences, would make naive use of the Viterbi algorithm impossible. We stop reasoning in terms of HMMs and start thinking in terms of FSTs.
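Here is a rough sketch of that phone-loop construction, again with invented parameters: the per-phone transition matrices are stacked into one big matrix, arcs are added from each phone's last state to every phone's first state, and Viterbi returns the frame-level state path, which is then collapsed into a phone sequence. Per-frame emission log-likelihoods are assumed to be precomputed, and transition probabilities are not carefully renormalized - it is a sketch, not a working decoder.

```python
# Sketch of "glue the per-phone left-right HMMs into one big HMM and run
# Viterbi". Phone models are assumed already trained; parameters are
# placeholders.
import numpy as np

def build_phone_loop(phone_As, next_phone_logprob):
    """phone_As: list of (S_p, S_p) log transition matrices, one per phone.
    Returns the big log transition matrix and each phone's state offset.
    next_phone_logprob stands for the phone exit/entry probability; proper
    normalization is omitted in this sketch."""
    sizes = [A.shape[0] for A in phone_As]
    total = sum(sizes)
    big_A = np.full((total, total), -np.inf)
    offsets = np.cumsum([0] + sizes[:-1])
    for A, off, S in zip(phone_As, offsets, sizes):
        big_A[off:off + S, off:off + S] = A                  # within-phone arcs
    for off_i, S_i in zip(offsets, sizes):
        for off_j in offsets:
            # last state of phone i -> first state of phone j
            big_A[off_i + S_i - 1, off_j] = next_phone_logprob
    return big_A, offsets

def viterbi(log_A, log_pi, log_B):
    """Most likely state sequence given per-frame state log-likelihoods
    log_B of shape (T, total_states)."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A                      # (prev, cur)
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(S)] + log_B[t]
    states = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        states.append(int(backptr[t, states[-1]]))
    return states[::-1]                                      # one state per frame

def states_to_phones(states, offsets, phone_names):
    """Map each frame's state index to its phone and collapse consecutive
    repeats (a simplification: two back-to-back occurrences of the same
    phone would be merged)."""
    phone_of = lambda s: int(np.searchsorted(offsets, s, side="right") - 1)
    seq = [phone_names[phone_of(s)] for s in states]
    return [p for i, p in enumerate(seq) if i == 0 or p != seq[i - 1]]
```

In real large-vocabulary systems this composition and search are typically carried out with weighted finite-state transducers (the FSTs mentioned above), e.g. in toolkits built on OpenFst, rather than by materializing one huge HMM.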