This page contains a glossary of terms that Kaldi users might want to know about.
The current content here consists of just a few examples; more content will be added shortly. The easiest way to search this page is to use the search function of your browser. For convenience, each term is preceded and followed by a colon, so, for instance, typing ctrl-f ":lattice:" takes you to the entry for "lattice".
:acoustic scale: The acoustic scale used in decoding, written as --acoustic-scale in C++ programs and --acwt in scripts. This is a scale on the acoustic log-probabilities, and is a universally used kludge in HMM-GMM and HMM-DNN systems to account for the correlation between frames. It's usually set to 0.1, meaning the acoustic log-probs get a much lower weight than the language model log-probs. In scoring scripts you'll often see a range of language model weights being searched over (like the range 7 to 15). These can be interpreted as the inverse of an acoustic scale; it's the ratio between the two that matters for Viterbi decoding.
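The equivalence between an acoustic scale and an inverse language model weight can be checked with a small sketch. The hypotheses, their scores, and the helper function names below are made up for illustration and are not part of Kaldi; the point is only that scaling the acoustic log-probs by 0.1 picks the same best hypothesis as weighting the language model log-probs by 10.

```python
# Toy hypotheses: (label, acoustic log-prob, language-model log-prob).
# All numbers are invented for illustration.
hyps = [
    ("hyp_a", -1200.0, -20.0),
    ("hyp_b", -1190.0, -25.0),
    ("hyp_c", -1230.0, -15.0),
]


def best_with_acoustic_scale(hyps, acwt):
    """Pick the hypothesis maximizing acwt * acoustic + lm."""
    return max(hyps, key=lambda h: acwt * h[1] + h[2])[0]


def best_with_lm_weight(hyps, lmwt):
    """Pick the hypothesis maximizing acoustic + lmwt * lm."""
    return max(hyps, key=lambda h: h[1] + lmwt * h[2])[0]


acwt = 0.1
best_scaled = best_with_acoustic_scale(hyps, acwt)    # acoustic scale 0.1
best_lmwt = best_with_lm_weight(hyps, 1.0 / acwt)     # LM weight 10
best_unscaled = best_with_acoustic_scale(hyps, 1.0)   # no scaling at all

print(best_scaled, best_lmwt, best_unscaled)
```

The two scoring functions differ only by a constant positive factor, so they always rank hypotheses identically. Without any scaling, the acoustic scores dominate and a different hypothesis wins in this toy example.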
:alignment: A representation of the sequence of HMM states taken by the Viterbi (best-path) alignment of an utterance. In Kaldi an alignment is synonymous with a sequence of
transition-ids. Most of the time an alignment is derived from aligning the reference transcript of an utterance, in which case it is called a
forced alignment.
Lattices also contain alignment information as sequences of transition-ids for each word sequence in the lattice. The program
show-alignments shows alignments in a human-readable format.
:cost: Any quantity which is used as a 'cost' in a weighted FST algorithm (e.g. acoustic cost, language model cost; see
Lattices in Kaldi for more details). Costs are, generally speaking, interpretable as a negative log of a likelihood or probability, but there may be scaling factors involved.
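As a rough illustration of the negative-log interpretation, the sketch below converts probabilities to costs and back. The function names and the `scale` parameter are hypothetical, introduced only to show where a scaling factor could enter; they do not correspond to any Kaldi API.

```python
import math


def prob_to_cost(p, scale=1.0):
    """Cost as a (possibly scaled) negative log-probability."""
    return -scale * math.log(p)


def cost_to_prob(cost, scale=1.0):
    """Invert prob_to_cost for the same scale."""
    return math.exp(-cost / scale)


# Lower probability means higher cost.
likely_cost = prob_to_cost(0.5)
unlikely_cost = prob_to_cost(0.01)

# Costs add along a path where probabilities would multiply.
path_cost = prob_to_cost(0.5) + prob_to_cost(0.1)
path_prob = cost_to_prob(path_cost)  # close to 0.5 * 0.1

print(likely_cost, unlikely_cost, path_cost, path_prob)
```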
:forced alignment: see alignment.
:lattice: A representation of alternative likely transcriptions of an utterance, together with associated alignment and cost information. See
Lattices in Kaldi.
:likelihood: A mathematical concept meaning the value of a probability density function, i.e. a function representing the distribution of a continuous value. Unlike probabilities, likelihoods can be greater than one. They are often represented in log space (as log-likelihoods) because likelihood values of multi-dimensional features can be too small or too large to represent in standard floating-point precision. With standard cross-entropy trained neural net systems we obtain "pseudo-likelihoods" by dividing the network's output probabilities by the priors of the context-dependent states (equivalently, subtracting log-priors from log-probabilities).
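The following sketch illustrates three points from this entry using plain Python floats: a density value above one, underflow when multiplying per-dimension densities directly, and a pseudo-log-likelihood computed from a posterior and a prior. The Gaussian helper and all numeric values are assumptions chosen for illustration.

```python
import math


def gaussian_log_density(x, mean, var):
    """Log of a one-dimensional Gaussian density."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


# 1. A density can exceed one when the variance is small.
peak_density = math.exp(gaussian_log_density(0.0, 0.0, 0.01))

# 2. Multiplying many per-dimension densities underflows to zero,
#    while summing their logs stays representable.
per_dim_log_density = [-3.0] * 300  # invented values
product = 1.0
for log_d in per_dim_log_density:
    product *= math.exp(log_d)
total_log_likelihood = sum(per_dim_log_density)

# 3. Pseudo-log-likelihood: log posterior minus log prior
#    (i.e. posterior divided by prior, in log space).
state_posterior = 0.6  # invented network output for one state
state_prior = 0.02     # invented prior for that state
pseudo_log_likelihood = math.log(state_posterior) - math.log(state_prior)

print(peak_density, product, total_log_likelihood, pseudo_log_likelihood)
```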
:posterior: "Posterior" is shorthand for "posterior probability" which is a very general mathematical concept, generally meaning "the probability of some random
variable after seeing the relevant data". In general, posteriors will sum to one. In Kaldi terminology, if you encounter the term "posterior", abbreviated to "post", without further explanation it generally means the per-frame posterior probability of transition-ids. However, these posteriors may be very peaky (i.e. mostly ones and zeros) depending on how you obtained them, e.g. from a lattice or from an alignment. Alignments and lattices can be converted to posteriors over transition-ids (see
ali-to-post.cc and
lattice-to-post.cc), or over lattice arcs (see
lattice-arc-post.cc). Posteriors over transition-ids can be converted to posteriors over pdf-ids or over phones; see the tools
ali-to-post.cc,
post-to-pdf-post.cc and
post-to-phone-post.cc.
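To make the conversions concrete, here is a minimal sketch that represents per-frame posteriors as lists of (id, probability) pairs. The transition-id to pdf-id table is hypothetical (in Kaldi this mapping comes from the transition model), and the two functions only imitate the idea behind the tools named above, not their actual implementation.

```python
from collections import defaultdict

# Hypothetical mapping from transition-id to pdf-id, for illustration only.
tid_to_pdf = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}


def alignment_to_post(alignment):
    """An alignment gives one transition-id per frame, so each frame
    gets a single entry with posterior 1.0 (maximally peaky)."""
    return [[(tid, 1.0)] for tid in alignment]


def post_to_pdf_post(post, tid_to_pdf):
    """Sum transition-id posteriors that share the same pdf-id."""
    pdf_post = []
    for frame in post:
        acc = defaultdict(float)
        for tid, prob in frame:
            acc[tid_to_pdf[tid]] += prob
        pdf_post.append(sorted(acc.items()))
    return pdf_post


ali_post = alignment_to_post([1, 2, 2, 3])

# Softer posteriors, as might come from a lattice (invented values).
lattice_like_post = [
    [(1, 0.7), (2, 0.2), (3, 0.1)],
    [(3, 0.5), (4, 0.5)],
]
pdf_post = post_to_pdf_post(lattice_like_post, tid_to_pdf)

print(ali_post)
print(pdf_post)
```

In both representations each frame's posteriors still sum to one; merging transition-ids into pdf-ids only regroups the probability mass.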
:transition-id: A one-based index that encodes the pdf-id (i.e. the clustered context-dependent HMM state), the phone identity, and information about whether we took the self-loop or forward transition in the HMM. Appears in lattices, decoding graphs and alignments. See
Transition models (the TransitionModel object).
:G.fst: The grammar FST
G.fst
which lives in the
data/lang/
directory in the scripts (see
Data preparation-- the "lang" directory.) represents the language model in a Finite State Transducer format (see www.openfst.org). For the most part it is an acceptor, meaning the input and output symbols on the arcs are the same, but for statistical language models with backoff, the backoff arcs have the "disambiguation symbol"
#0
on the input side only. For many purposes you'll want to get rid of the disambiguation symbols using the command fstproject --project_output=true. The disambiguation symbols are needed during graph compilation to make the FST determinizable, but for things like language-model rescoring you don't want them.
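The effect of projecting onto the output side can be pictured with a toy arc list. This is a sketch under the assumption that a backoff arc carries #0 on its input and epsilon on its output; the tuple layout, the arcs, and the project_output function are invented for illustration and do not use the OpenFst API.

```python
EPS = "<eps>"

# Toy arcs: (source state, input label, output label, cost, destination state).
arcs = [
    (0, "hello", "hello", 1.2, 1),
    (1, "world", "world", 0.7, 2),
    (1, "#0", EPS, 2.3, 0),  # backoff arc: #0 on the input side only
]


def project_output(arcs):
    """Copy each arc's output label onto its input label."""
    return [(src, olabel, olabel, cost, dst)
            for (src, _ilabel, olabel, cost, dst) in arcs]


projected = project_output(arcs)
for arc in projected:
    print(arc)
```

After projection every arc has identical input and output labels, so the result is a true acceptor, and the #0 symbol no longer appears anywhere.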