This page contains a glossary of terms that Kaldi users might want to know about.
The current content consists of just a few examples; more will be added shortly. The easiest way to search this page is to use the search function of your browser. For convenience, each term is preceded and followed by a colon, so, for instance, typing ctrl-f ":lattice:" would take you to the section for "lattice".
:acoustic scale: The acoustic scale used in decoding, written as --acoustic-scale in C++ programs and --acwt in scripts. This is a scale on the acoustic log-probabilities, and is a universally used kludge in HMM-GMM and HMM-DNN systems to account for the correlation between frames. It is usually set to 0.1, meaning the acoustic log-probs get a much lower weight than the language model log-probs. In scoring scripts you'll often see a range of language model weights being searched over (like the range 7 to 15). These can be interpreted as the inverse of an acoustic scale; it is only the ratio between the two weights that matters for Viterbi decoding.
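The equivalence between an acoustic scale and its inverse language model weight can be checked on a toy example. The sketch below is not Kaldi code: the hypotheses, their log-probabilities, and the function names are all invented for illustration. It only shows that scaling the acoustic score by 1/w picks the same best hypothesis as scaling the language model score by w.

```python
# Toy hypotheses: (words, acoustic log-prob, language-model log-prob).
# All numbers are made up for illustration; they do not come from a real system.
hyps = [
    ("the cat sat", -1200.0, -14.0),
    ("the cats at", -1185.0, -19.0),
    ("a cat sat", -1210.0, -12.5),
]


def best_with_acoustic_scale(hyps, acwt):
    """Pick the best hypothesis when the acoustic log-prob is scaled by acwt."""
    return max(hyps, key=lambda h: acwt * h[1] + h[2])[0]


def best_with_lm_weight(hyps, lmwt):
    """Pick the best hypothesis when the LM log-prob is scaled by lmwt."""
    return max(hyps, key=lambda h: h[1] + lmwt * h[2])[0]


# Sweep the LM weight as a scoring script might, and compare against the
# corresponding acoustic scale 1/lmwt.
for lmwt in range(7, 16):
    a = best_with_lm_weight(hyps, lmwt)
    b = best_with_acoustic_scale(hyps, 1.0 / lmwt)
    print(lmwt, a, a == b)
```

The two objectives differ only by a positive constant factor, so they rank hypotheses identically; with no scaling at all (a weight of 1.0) the acoustic scores dominate and a different hypothesis wins in this toy data.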
:alignment: A representation of the sequence of HMM states taken by the Viterbi (best-path) alignment of an utterance. In Kaldi an alignment is synonymous with a sequence of transition-ids. Most of the time an alignment is derived from aligning the reference transcript of an utterance, in which case it is called a forced alignment. Lattices also contain alignment information as sequences of transition-ids for each word sequence in the lattice. The program show-alignments shows alignments in a human-readable format.
:forced alignment: see alignment.
:lattice: A representation of alternative likely transcriptions of an utterance, together with associated alignment and cost information. See Lattices in Kaldi.
:transition-id: A one-based index that encodes the pdf-id (i.e. the clustered context-dependent HMM state), the phone identity, and information about whether we took the self-loop or forward transition in the HMM. Appears in lattices, decoding graphs and alignments. See Transition models (the TransitionModel object).
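To make the idea of a transition-id concrete, here is a minimal sketch of a lookup table from one-based ids to the information they encode, and of how an alignment (a sequence of such ids, one per frame) can be summarized. This is a toy model, not Kaldi's TransitionModel: the phones, pdf-ids, the ordering of self-loop and forward transitions, and the names `table` and `phone_durations` are all assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical toy model: one (phone, pdf_id) pair per HMM state.
# A real system would obtain this from its trained model; these are invented.
states = [("a", 0), ("a", 1), ("b", 2), ("b", 3)]

# Assign one-based ids: each state gets a self-loop and a forward transition.
table = {}
for phone, pdf_id in states:
    for is_self_loop in (True, False):
        table[len(table) + 1] = (phone, pdf_id, is_self_loop)


def phone_durations(alignment):
    """Collapse a per-frame alignment into (phone, number of frames) runs.

    Adjacent repeats of the same phone are merged, which is a simplification
    of this sketch.
    """
    phones = [table[tid][0] for tid in alignment]
    return [(p, len(list(g))) for p, g in groupby(phones)]


# A made-up alignment of ten frames.
alignment = [1, 1, 2, 3, 4, 5, 6, 7, 7, 8]
pdf_ids = [table[tid][1] for tid in alignment]
print(pdf_ids)
print(phone_durations(alignment))
```

The point of the sketch is only that a single integer per frame is enough to recover the pdf-id, the phone, and the kind of transition taken.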
:G.fst: The grammar FST, which lives in the "lang" directory in the scripts (see Data preparation-- the "lang" directory), represents the language model in a Finite State Transducer format (see www.openfst.org). For the most part it is an acceptor, meaning the input and output symbols on the arcs are the same, but for statistical language models with backoff, the backoff arcs have a disambiguation symbol on the input side only. For many purposes you'll want to get rid of the disambiguation symbols. They are needed during graph compilation to make the FST determinizable, but for things like language-model rescoring you don't want them.
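One way to see how the disambiguation symbols can be removed is to project the FST onto its output labels, since the symbols appear on the input side only. The sketch below does this on a hand-written list of arcs rather than a real FST. The arc tuple layout, the symbol name "#0", the epsilon output label on the backoff arc, and the function name `project_output` are assumptions of this example; for real FSTs, OpenFst provides an fstproject tool for this kind of operation, whose exact flags depend on the OpenFst version.

```python
EPS = "<eps>"
DISAMBIG = "#0"  # assumed name for the disambiguation symbol in this toy example

# Arcs as (source state, input label, output label, weight, destination state).
# The weights and topology are invented.
arcs = [
    (0, "the", "the", 0.5, 1),
    (1, "cat", "cat", 1.2, 2),
    (1, DISAMBIG, EPS, 0.7, 0),  # backoff arc: symbol on the input side only
]


def project_output(arcs):
    """Copy each arc's output label to its input side, giving an acceptor."""
    return [(src, olabel, olabel, w, dst) for (src, _ilabel, olabel, w, dst) in arcs]


projected = project_output(arcs)
for arc in projected:
    print(arc)
```

After projection every arc has identical input and output labels, and the backoff arc becomes an ordinary epsilon arc with its weight unchanged.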