In this page we describe how HMM topologies are represented in Kaldi and how we model and train HMM transitions. We briefly mention how this interacts with decision trees; decision trees are covered more fully in How decision trees are used in Kaldi and Decision tree internals. For a list of classes and functions in this group, see Classes and functions related to HMM topology and transition modeling.
The class HmmTopology is the way the user specifies to the toolkit the topology of the HMMs of the phones. In the normal recipe, the scripts create a file containing the text form of the HmmTopology object, which is then given to the command-line programs. To give some idea of what this object contains, below is the text format for the HmmTopology object in the "normal" case (the 3-state Bakis model):
<Topology>
<TopologyEntry>
<ForPhones> 1 2 3 4 5 6 7 8 </ForPhones>
<State> 0 <PdfClass> 0
<Transition> 0 0.5
<Transition> 1 0.5
</State>
<State> 1 <PdfClass> 1
<Transition> 1 0.5
<Transition> 2 0.5
</State>
<State> 2 <PdfClass> 2
<Transition> 2 0.5
<Transition> 3 0.5
</State>
<State> 3
</State>
</TopologyEntry>
</Topology>
There is one TopologyEntry in this particular HmmTopology object, and it covers phones 1 through 8 (so in this example there are just eight phones and they all share the same topology). There are three emitting states (i.e. states that have pdfs associated with them and 'emit' feature vectors); each has a self-loop and a transition to the next state. There is also a fourth, non-emitting state, state 3 (there is no <PdfClass> entry for it), which has no transitions out of it (implicitly, it connects to the next phone in the sequence). This is a standard feature of these topology entries: Kaldi treats the first state (state zero) as the start state, and the last state, which should always be nonemitting and have no transitions out of it, has final-probability one. You can treat the transition-probability to the last state as equivalent to the "final-probability" in an HMM. All of the emitting states in this particular example can have different pdfs in them (since the <PdfClass> numbers are all distinct); we could enforce tying by making the <PdfClass> numbers the same.

The probabilities given in the HmmTopology object are those that are used to initialize training; the trained probabilities are specific to the context-dependent HMMs and are stored in the TransitionModel object. The TransitionModel stores the HmmTopology object as a class member, but be aware that the transition probabilities in the HmmTopology object are generally not used after the TransitionModel object has been initialized. There is an exception, however: for nonemitting states that are non-final (i.e. those that have transitions out of them but no <PdfClass> entry), Kaldi does not train the transition probabilities and instead uses the probabilities given in the HmmTopology object. The decision not to support trainable transition probabilities for non-emitting states simplifies our training mechanisms, and since it is not normal to have non-emitting states with transitions out of them, we felt that this was no great loss.
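As a concrete illustration (a minimal sketch, not taken from the Kaldi sources; it assumes the topology text above has been saved to a file called "topo"), the object can be read and queried from C++ roughly as follows:

#include <fstream>
#include <iostream>
#include "hmm/hmm-topology.h"

int main() {
  // Read the text-form topology shown above from the (assumed) file "topo".
  std::ifstream is("topo");
  kaldi::HmmTopology topo;
  topo.Read(is, false);  // false means text (non-binary) mode.

  // For phone 1 in the example this prints 3 (pdf-classes 0, 1 and 2).
  std::cout << "num pdf-classes for phone 1: "
            << topo.NumPdfClasses(1) << std::endl;

  // The prototype HMM for phone 1: 4 states, the last one nonemitting.
  const kaldi::HmmTopology::TopologyEntry &entry = topo.TopologyForPhone(1);
  std::cout << "num states: " << entry.size() << std::endl;
  return 0;
}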
The pdf-class is a concept that relates to the HmmTopology object. The HmmTopology object specifies a prototype HMM for each phone. Each numbered state of a prototype HMM has two variables, "forward_pdf_class" and "self_loop_pdf_class". The "self_loop_pdf_class" is the pdf-class associated with the self-loop transition. By default it is identical to "forward_pdf_class", but it can be used to define less-conventional HMM topologies in which the pdfs on the self-loop and forward transitions differ. The decision to allow the pdf-class on just the self-loop to be different, rather than embracing a fully "arc-based" representation where the pdfs on all transitions in the HMM are potentially independent, was a compromise: it allows compatibility with previous versions of Kaldi while supporting the topology used in our "chain models", a.k.a. lattice-free MMI.

If two states have the same pdf-class, they will always share the same probability distribution function (p.d.f.) when they are in the same phonetic context. This is because the decision-tree code does not get to "see" the HMM-state directly; it only gets to see the pdf-class. In the normal case the pdf-class is the same as the HMM-state index (e.g. 0, 1 or 2), but pdf-classes provide a way for the user to enforce sharing (an example is given below). This would mainly be useful if you wanted richer transition modeling but wanted to leave the acoustic model otherwise the same. Another function of the pdf-class is to specify nonemitting states: if the pdf-class for some HMM state is set to the constant kNoPdf = -1, the HMM state is nonemitting (it has no associated pdf). This is achieved simply by omitting the <PdfClass> tag and associated number in the text form of the object.
The set of pdf-classes for a particular prototype HMM is expected to start from zero and be contiguous (e.g. 0, 1, 2). This is for the convenience of the graph-building code, and does not lead to any loss of generality.
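For example (a hypothetical topology entry, not taken from any standard recipe; phones 9 and 10 are made-up indices), the following entry ties the pdfs of states 0 and 1 by giving them the same <PdfClass> value while keeping their transitions separate, and the single pdf-class 0 still satisfies the "contiguous from zero" requirement:

<TopologyEntry>
<ForPhones> 9 10 </ForPhones>
<State> 0 <PdfClass> 0
<Transition> 0 0.5
<Transition> 1 0.5
</State>
<State> 1 <PdfClass> 0
<Transition> 1 0.5
<Transition> 2 0.5
</State>
<State> 2
</State>
</TopologyEntry>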
The TransitionModel object stores the transition probabilities and information about HMM topologies (it contains a HmmTopology object). The graph-building code depends on the TransitionModel object to get the topology and transition probabilities (it also requires a ContextDependencyInterface object to get the pdf-ids associated with particular phonetic contexts).
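For instance (a minimal sketch, assuming a standard model file named "final.mdl" in the current directory), the TransitionModel can be read from the start of a model file like this:

#include <iostream>
#include "util/kaldi-io.h"
#include "hmm/transition-model.h"

int main() {
  // A Kaldi model file such as "final.mdl" (an assumed filename) starts
  // with the TransitionModel, followed by the acoustic model itself.
  bool binary;
  kaldi::Input ki("final.mdl", &binary);
  kaldi::TransitionModel trans_model;
  trans_model.Read(ki.Stream(), binary);
  std::cout << "number of transition-ids: "
            << trans_model.NumTransitionIds() << std::endl;
  std::cout << "number of pdfs: " << trans_model.NumPdfs() << std::endl;
  return 0;
}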
The decision that underlies a lot of the transition-modeling code is as follows: we have decided to make the transition probability of a context-dependent HMM state depend on the following five things (you could view them as a 5-tuple):

the phone
the source HMM-state (as interpreted by the HmmTopology object)
the forward-pdf-id (the index of the pdf associated with forward transitions out of the state)
the self-loop-pdf-id (the index of the pdf associated with the self-loop)
the index of the transition in the HmmTopology object

The last of these items could be viewed as encoding the destination HMM-state in the HmmTopology object. The reason for making the transition probabilities depend on these things is that this is the most fine-grained way we could model transitions without increasing the size of the compiled decoding graphs; it is also quite convenient for training the transition probabilities. In practice, with conventional setups, it probably does not make any difference to model the transitions as precisely as this, and the HTK approach of sharing the transitions at the monophone level would probably be sufficient.
The TransitionModel object sets up a number of integer mappings when it is initialized, and is used by other parts of the code to perform these mappings. Apart from the quantities mentioned above, there are quantities called transition identifiers (transition-ids), transition indexes (which are not the same thing as transition-ids), and transition states. The reason we have introduced these identifiers and the associated mappings is so that we can use a completely FST-based training recipe. The most "natural" FST-based setups would have what we call pdf-ids on the input labels. However, bearing in mind that given our tree-building algorithms it will not always be possible to map uniquely from a pdf-id to a phone, this would make it hard to map from an input-label sequence to a phone sequence, and this is inconvenient for a number of reasons; it would also make it hard in general to train the transition probabilities using the information in the FST alone. For this reason we put identifiers called transition-ids on the input labels of the FST, and these can be mapped to the pdf-id but also to the phone and to a particular transition in a prototype HMM (as given in the HmmTopology object).
The following types of identifiers are used in the TransitionModel interface. All of them are represented as type int32. Note that some of these quantities are one-based indices and some are zero-based. We have tried to avoid one-based indices as much as possible in the toolkit because they are not very compatible with C++ array indexing, but because OpenFst treats zero as a special case (meaning the special symbol epsilon), we have decided to allow one-based indexing for quantities that frequently appear as input symbols on FSTs. Most importantly, transition-ids are one-based. Since we do not imagine pdf-ids appearing very frequently as FST labels, and since we often use them as C++ array indexes, we have decided to make them zero-based, but if they appear as FST input symbols (which should be rare) we add one to them. When reading the TransitionModel code, be aware that when indexing arrays with one-based quantities there are cases where we subtract one and cases where we do not; this is documented where the member variables are declared. In any case, such code is not in the public interface, so it should not lead to too much confusion. A complete list of the various integer quantities used in TransitionModel is as follows:

phone (one-based): the integer index of a phone, as defined in the phone symbol table.
hmm-state (zero-based): an index into one of the prototype HMMs (TopologyEntry objects) in the HmmTopology object, e.g. 0, 1 or 2 in the example above.
pdf-id (zero-based): the index of a context-dependent p.d.f., as assigned by the decision-tree building code.
transition-state (one-based): an index assigned by the TransitionModel to each combination of (phone, hmm-state, pdf-ids) that can occur.
transition-index (zero-based): an index into the list of transitions out of a particular state in the HmmTopology object.
transition-id (one-based): an index corresponding to a (transition-state, transition-index) pair; there is one for each transition probability in the model, and these are what appear on the input labels of the graphs.
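The sketch below (the function name PrintTransitionInfo is ours, and trans_model is assumed to have been read in already, e.g. as in the earlier example) shows some of the mappings the TransitionModel provides between these quantities:

#include <iostream>
#include "hmm/transition-model.h"

// Print the quantities associated with a one-based transition-id "tid".
void PrintTransitionInfo(const kaldi::TransitionModel &trans_model,
                         kaldi::int32 tid) {
  std::cout << "phone (one-based): "
            << trans_model.TransitionIdToPhone(tid) << "\n"
            << "hmm-state (zero-based): "
            << trans_model.TransitionIdToHmmState(tid) << "\n"
            << "pdf-id (zero-based): "
            << trans_model.TransitionIdToPdf(tid) << "\n"
            << "transition-state (one-based): "
            << trans_model.TransitionIdToTransitionState(tid) << "\n"
            << "transition-index (zero-based): "
            << trans_model.TransitionIdToTransitionIndex(tid) << "\n"
            << "is self-loop: "
            << (trans_model.IsSelfLoop(tid) ? "yes" : "no") << std::endl;
}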
The transition-modeling code also makes reference to the following concepts: a tuple of (phone, hmm-state, pdf-ids), which is in one-to-one correspondence with a transition-state, and a pair of (transition-state, transition-index), which is in one-to-one correspondence with a transition-id.
The training procedure for the transition model is very simple. The FSTs that we create (for both training and test) have transition-ids as their input labels. In training we do a Viterbi decoding that gives us the input-label sequence, which is a sequence of transition-ids (one for each feature vector). The statistics we accumulate for training transitions are essentially counts of how many times each transition-id was seen in training (the code itself uses floating-point occupation counts, but these are just integers in our normal training setup). The function TransitionModel::MleUpdate() does the ML update for each transition-state; this works in the "obvious" way. There are also some minor issues related to probability flooring and what to do if a particular transition-state is unseen; for these details, see the code.
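A sketch of this procedure is below (assuming the names InitStats(), Accumulate(), MleUpdate() and MleTransitionUpdateConfig as declared in transition-model.h; check the header if your version differs):

#include <vector>
#include "matrix/kaldi-vector.h"
#include "hmm/transition-model.h"

// Accumulate a count for each transition-id seen in the alignments, then do
// the ML update.  "alignments" would normally cover the whole training set.
void TrainTransitionsSketch(
    const std::vector<std::vector<kaldi::int32> > &alignments,
    kaldi::TransitionModel *trans_model) {
  kaldi::Vector<double> stats;
  trans_model->InitStats(&stats);  // resized to NumTransitionIds() + 1.
  for (size_t i = 0; i < alignments.size(); i++)
    for (size_t j = 0; j < alignments[i].size(); j++)
      trans_model->Accumulate(1.0, alignments[i][j], &stats);  // one count per frame.
  kaldi::MleTransitionUpdateConfig cfg;  // flooring / min-count options.
  kaldi::BaseFloat objf_impr, count;
  trans_model->MleUpdate(stats, cfg, &objf_impr, &count);
}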
At this point we introduce the concept of an alignment. By "alignment", we generally mean something of type vector<int32>, which contains a sequence of transition-ids (c.f. Integer identifiers used by TransitionModel) whose length is the same as the utterance the alignment corresponds to. This sequence of transition-ids would generally be obtained from the decoder as the input-label sequence. Alignments are used in training time for Viterbi training, and in test time for adaptation. Because transition-ids encode the phone information, it is possible to work out the phonetic sequence from an alignment (c.f. SplitToPhones() and ali-to-phones.cc).
We often need to deal with collections of alignments indexed by utterance. To do this conveniently, we read and write alignments with tables; see I/O with alignments for more information.
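As an illustration (a minimal sketch; "ark:ali.1.ark" is an assumed archive name and trans_model an already-loaded TransitionModel), alignments can be read from a table and split into per-phone segments like this:

#include <iostream>
#include <vector>
#include "util/common-utils.h"
#include "hmm/transition-model.h"
#include "hmm/hmm-utils.h"

// Read alignments (vectors of transition-ids, indexed by utterance) from a
// table and print the phone sequence of each utterance, similarly to what
// the program ali-to-phones does.
void PrintPhoneSequences(const kaldi::TransitionModel &trans_model) {
  kaldi::SequentialInt32VectorReader ali_reader("ark:ali.1.ark");
  for (; !ali_reader.Done(); ali_reader.Next()) {
    const std::vector<kaldi::int32> &alignment = ali_reader.Value();
    std::vector<std::vector<kaldi::int32> > split;  // one inner vector per phone instance.
    kaldi::SplitToPhones(trans_model, alignment, &split);
    std::cout << ali_reader.Key() << ":";
    for (size_t i = 0; i < split.size(); i++)
      std::cout << " " << trans_model.TransitionIdToPhone(split[i][0]);
    std::cout << std::endl;
  }
}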
The function ConvertAlignment() (c.f. the command-line program convert-ali) converts alignments from one transition-model to another. The typical case is where you have alignments created using one transition-model (created from a particular decision tree) and want to convert them to be valid for another transition model with a different tree. It optionally takes a mapping from the original phones to a new phone set; this feature is not normally needed but we have used it when dealing with simplified models based on a reduced (clustered) phone set.
Programs that read in alignments generally have the suffix "-ali".
State-level posteriors are an extension of the "alignment" concept (previous section), except that instead of having a single transition-id per frame we have an arbitrary number of transition-ids per frame, each with a weight. This information is stored in the following type of structure:
typedef std::vector<std::vector<std::pair<int32, BaseFloat> > > Posterior;
where if we have an object "post" of type Posterior, post.size() will be equal to the length of the utterance in frames, and post[i] is a list of pairs (stored as a vector), where each pair consists of a (transition-id, posterior).
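For example (a minimal sketch; the alignment "ali" is assumed to come from somewhere such as the table-reading code above), an alignment can be converted into a trivial Posterior, with one (transition-id, 1.0) pair per frame, using AlignmentToPosterior(), which is declared in hmm/posterior.h in current versions of the code:

#include <vector>
#include "hmm/posterior.h"

// Convert an alignment (one transition-id per frame) into a Posterior in
// which each frame has a single (transition-id, 1.0) entry.
void AlignmentToTrivialPosterior(const std::vector<kaldi::int32> &ali) {
  kaldi::Posterior post;
  kaldi::AlignmentToPosterior(ali, &post);
  // post.size() == ali.size(); post[t][0].first == ali[t];
  // post[t][0].second == 1.0.
}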
In the current programs, there are only two ways to create posteriors.
In future, when lattice generation is added, we will add utilities to create posteriors from lattices.
Programs that read in posteriors don't have a suffix comparable to "-ali", which is the suffix for programs that read in alignments. This is for brevity; reading in state-level posteriors is considered the "default" behavior of programs that need this type of alignment information.
A set of Gaussian-level posteriors for an utterance may be stored using the following typedef:
typedef std::vector<std::vector<std::pair<int32, Vector<BaseFloat> > > > GauPost;
This is like the Posterior structure, except the floating-point value (which represents the posterior of the state) is now a vector of floating-point values, one for each Gaussian in the state. The size of the vector would be the same as the number of Gaussians in the pdf corresponding to the transition-id which is the first element of the pair.
The program post-to-gpost converts Posterior structures into GauPost structures; it uses the model and the features to compute the Gaussian-level posteriors. This is mainly useful in situations where we may need to compute Gaussian-level posteriors with a different model or features than the ones we need to accumulate statistics with. Programs that read in Gaussian-level posteriors have the suffix "-gpost".
A complete list of functions and classes involved in converting HMMs to FSTs may be found here.
The most important one is the function GetHTransducer().
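Its declaration in hmm-utils.h looks roughly like the following (treat this as a sketch and check the header for the exact parameter list in your version):

fst::VectorFst<fst::StdArc>*
GetHTransducer(const std::vector<std::vector<int32> > &ilabel_info,
               const ContextDependencyInterface &ctx_dep,
               const TransitionModel &trans_model,
               const HTransducerConfig &config,
               std::vector<int32> *disambig_syms_left);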
There are aspects of this function which will be hard to understand without having first understood the ilabel_info object, the ContextDependencyInterface interface, and at least the basics of how FSTs are used in speech recognition. This function returns an FST whose input labels are transition-ids and whose output labels represent context-dependent phones (they are indices into the ilabel_info object). The FST it returns has a state that is both initial and final, and all the transitions out of it have output-labels (for efficient composition with CLG). Each transition out of it will typically enter a structure representing a 3-state HMM, and then loop back to the initial state. The FST returned by GetHTransducer() will only be valid for the phonetic contexts represented in ilabel_info, which the caller can specify. This is useful because for wide-context systems there can be a large number of contexts, most of which are never used. The ilabel_info object can be obtained from the ContextFst object (which represents C) after composing it with something, and it contains just the contexts that have been used. We would then provide this same ilabel_info object to GetHTransducer() to get an H transducer that covers everything we need.
Note that the GetHTransducer() function does not include the self-loops. These must be added later by the function AddSelfLoops(); it is normally convenient to add the self-loops only after all stages of decoding-graph optimization.
The HTransducerConfig configuration class controls the behavior of GetHTransducer().
The function GetHmmAsFst() takes a phonetic context window and returns the corresponding finite state acceptor with transition-ids as the symbols. This is used in GetHTransducer(). A function GetHmmAsFstSimple() that takes fewer options is also provided as a form of documentation, in order to show in principle how the process works.
The AddSelfLoops() function adds self-loops to a graph that has been created without self-loops. A typical setup is to create the H transducer without self-loops, compose with CLG, do determinization and minimization, and then add the self-loops. This enables more efficient determinization and minimization. The AddSelfLoops() function has the option to reorder the transitions; see Reordering transitions below for more details on this. It also takes a transition-probability scale, "self_loop_scale", which does not have to be the same as the normal transition-probability scale; for more on this, see Scaling of transition and acoustic probabilities below.
The AddTransitionProbs() function adds transition probabilities to an FST. The reason this is useful is so that graphs can be created without transition probabilities on them (i.e. without the component of the weights that arises from the HMM transitions), and these can be added in later; this makes it possible to use the same graph on different iterations of training the model, and keep the transition-probabilities in the graph up to date. Creating the graph without transition-probabilities is accomplished by using a zero value for trans_prob_scale (command-line option: --transition-scale). In training time, our scripts tend to store the graphs on disk without the transition probabilities, and then each time we realign we add in the currently valid transition probabilities.
The AddSelfLoops() function takes a boolean option "reorder" which tells it to reorder the transitions so that the self-loop comes after the transition out of the state. Where applicable this becomes a boolean command-line option, e.g. you can do --reorder=true to enable reordering during graph creation. This option makes the "simple" and "faster" decoders more efficient (see Decoders used in the Kaldi toolkit), although it is not compatible with the "kaldi" decoder.
The idea of reordering is that we switch the order of the self-loop arcs with all the other arcs that come out of a state, so the self-loop is located at the destination state of each of the other arcs. For this to work, we have to ensure that the FST has certain properties, namely that all the arcs into a particular state must induce the same self-loop (also, a state with a self-loop cannot have input arcs with epsilon inputs, or be the start state). The AddSelfLoops() function modifies the graphs to ensure that they have this property. A similar property is required even if the "reorder" option is set to false. The graphs created with the "reorder" option are exactly equivalent to the non-reordered graphs in terms of the acoustic and transition-model probabilities you get when decoding an utterance. The transition-ids on the resulting alignment are in a different order, but this does not matter given the ways that we make use of these alignments.
There are three types of scaling that can be applied in Kaldi:
Name in code | Name in command-line arguments | Example value (train) | Example value (test)
acoustic_scale | --acoustic-scale=? | 0.1 | 0.08333
self_loop_scale | --self-loop-scale=? | 0.1 | 0.1
transition_scale | --transition-scale=? | 1.0 | 1.0
You may notice that there is no language-model scale on this list; everything is scaled relative to the language model. Also, we don't support a word insertion penalty in general (although the "kaldi" decoder does support this). The idea is that the language model represents "real" probabilities, so it makes sense to scale everything else relative to them. The scales used during training are the scales we use in decoding to get Viterbi alignments. In general, we use a figure of 0.1 whenever a parameter is not too critical and is expected to be small. The acoustic scale used during test is quite critical and is typically tuned to the task. We now explain what these three scales do. The acoustic scale is applied to the acoustic log-likelihoods (the log-probabilities from the acoustic model). The transition scale and self-loop scale are applied when transition probabilities are added to the graph: roughly speaking, the transition scale affects the relative probabilities of the different transitions out of an HMM state (excluding the self-loop), and the self-loop scale affects the probability of taking the self-loop versus leaving the state.
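In terms of the costs used in decoding (costs being negated log-probabilities; the notation here is ours, not taken from the Kaldi sources), the total cost of taking an arc on frame t works out roughly as

\text{cost}(\text{arc}) \;=\; \text{graph\_cost}(\text{arc}) \;-\; \text{acoustic\_scale} \cdot \log p(\mathbf{x}_t \mid \text{pdf-id}(\text{arc})),

where the graph cost already contains the language-model cost together with the scaled transition and self-loop log-probabilities that were baked into the graph.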
The reason we feel it might make sense to apply a different probability scale to the self-loops than to the normal transitions is that we think they could be dictating the probabilities of events at different timescales. A slightly more subtle argument is the following. All the transition probabilities can be regarded as "real" probabilities (comparable to LM probabilities), because the problem of correlation between acoustic probabilities does not occur for transitions. However, a problem arises because we use the Viterbi algorithm in testing (and, in our case, in training too). The transition probabilities would only represent real probabilities if they were summed over, as in the forward-backward algorithm. We expect this to be more of an issue for the self-loops than for the probabilities that dictate the weight given to entirely different paths through the HMM, since in the latter case the acoustic distributions will often be quite disjoint, and the difference between forward-backward and Viterbi will be small.