Finite State Transducer algorithms

Here we describe the FST algorithms in the Kaldi toolkit that are new or different than the the ones in OpenFst (we use the OpenFst code itself for many algorithms).

These algorithms are in the directory fstext/, and the corresponding command-line programs, where they exist, are in fstbin/. This code uses the OpenFst library. We will only describe here the algorithms that are actually used in our current recipes.

We use a different determinization algorithm from the one in OpenFst. We distinguish this with the name DeterminizeStar(); the corresponding command-line program is named fstdeterminizestar. Our determinization algorithm is actually closer to the standard FST determinization algorithm than the one in OpenFst, in that it does epsilon removal along with determinization (thus, like many other FST algorithms, we do not consider epsilon to be a "real symbol".

Our determinization algorithm has a different way of handling what happens when a transition in the initial determinized output has more than one output symbol on it. The OpenFst determinization algorithm uses a function called FactorWeights that moves around the output symbols (encoded as weights) around while maintaining equivalence, in order to ensure that no arc has more than one (encoded) output symbol; it does not introduce new states with epsilon symbols on their input, which is the most "obvious" thing to do. However, the FactorWeights algorithm can fail for the output of DeterminizeStar, because there can be cycles with more outputs than states on the cycle (this is not possible for the output of the normal Determinize algorithm, because it does not do epsilon removal). Instead, whenever we encounter a link with more than one output symbols, we create a chain with a sufficient number of intermediate states to accommodate all the output symbols. The weight and input symbol go on the first link of this chain. The output of our DeterminizeStar algorithm is deterministic according to the definitions OpenFst uses, i.e. treating epsilons as a normal symbols. Its output does have epsilons on the input side, which is against the normal definition of determinism, but this is to be viewed as an encoding mechanism for allowing more than one output symbol on a link, and in any case it only happens in quite specific circumstances (an epsilon arc is always the only arc out of a state).

One other difference is that our program fstdeterminizestar does not require the input FST to have its output symbols encoded as weights.

In general debugging determinization is quite hard, because when a process takes a long time it's hard to tell whether ever terminate. The program fstdeterminizestar has the feature that if you send it a signal with "kill -SIGUSR1 \<its processid\>", it will stop and print out some information that can be useful for debugging the determinization.

We supply a function DeterminizeInLog() which casts an Fst in the normal (tropical) semiring to the log semiring before determinizing, and then converts back. This is the form of determinization used in our algorithms, as it preserves stochasticity (see Preserving stochasticity and testing it).

We supply an epsilon-removal algorithm called RemoveEpsLocal() that is guaranteed to never blow up the FST, but on the other hand is not guaranteed to remove all epsilons. Essentially it removes whatever epsilons it can easily remove without making the graph larger. There is no optimality guaranteed here, as this is a hard problem. The RemoveEpsLocal() function has slightly different behaviour from OpenFst's RemoveEps() function, because it will combine two arcs if one has an input epsilon and one an output epsilon. The function RemoveEpsLocal() preserves FST equivalence.

There is also a function RemoveEpsLocalSpecial() that preserves equivalence in the tropical semiring while preserving stochasticity in the log semiring (for more on stochasticity see next section). This appears to be a case where the usefulness of the semiring formalism breaks down a little bit, as we have to consider two semirings simultaneously.

We define a stochastic FST as an FST in which, in the FST's semiring the sum of the weights of the arcs out of any given state (plus the final-weight) equals one (in the semiring). This concept is most useful and natural in the log semiring; essentially, a stochastic FST is one where the sum of the weights out of a given arc is one (e.g. a properly normalized HMM would correspond to a stochastic FST).

The function IsStochasticFst() tests stochasticity. Optionally it can output minimum and maximum weights, to inform the user of the extent to which the FST fails to be stochastic. The command-line version of this is fstisstochastic. We aim for most of the FST algorithms we use to preserve stochasticity, in the sense that they will produce stochastic outputs given stochastic inputs. For non-stochastic inputs, we aim that the minimum and maximum range of the weights will not get larger. By default the program isstochasticfst will test stochasticity after casting to the log semiring, as this is the most useful case (you can stop this by supplying the option –test-in-log=false).

In order to preserve stochasticity, the FSTs that we compose with have to have certain properties. These should apply to L, C and H. Consider L, for example. We require that for any linear FST corresponding to a path through G, call this FST F, the product L o F must be stochastic. This basically means that L have properly normalized pronunciation probabilities. The actual property formally required may be a little stronger than this (this relates to ensuring that the probabilities appear at the "right time"). In practice we verify stochasticity by running the program isstochasticfst for each stage of graph creation.

We use the minimization algorithm supplied by OpenFst, but we apply a patch before compiling OpenFst so that minimization can be applied to non-deterministic FSTs. The reason for this is so that we can remove disambiguation symbols before minimizing, which is more optimal (it allows minimization to combine more states). The patch fixes data-structure related issues; fundamentally, OpenFst's minimization algorithm is applicable to non-deterministic FSTs. We supply a command-line program called fstminimizeencoded that minimizes FSTs after encoding weights and output symbols as part of the input symbols (so the FST becomes an acceptor). This is the same thing the fstminimize program does except that it does not do weight pushing. This is desirable for us because the way we ensure stochasticity entails avoiding any weight pushing.

For the most part we use OpenFst's own composition algorithms, but we do make use of a function TableCompose(), and a corresponding command-line program fsttablecompose, which is a more efficient composition algorithm for certain common cases. It uses the "Matcher" concept of OpenFst; a Matcher is a kind of helper class used during composition that performs lookup on the arcs out of a state to find any arcs with a particular input or output symbol. The normal matcher that OpenFst uses is SortedMatcher, which relies on arcs being sorted on the relevant label, and does a binary search. TableMatcher detects cases where it would be efficient to build a table indexed by label, and for those states it avoids the overhead of binary search. This leads to a speedup when composing with things like lexicons that have a very high out-degree.

Our FST recipes (like other transducer-based recipes) rely on disambiguation symbols. In the normal recipes, these are added to the input side of the lexicon FST (L) to make it determizable. We also add disambiguation symbols to G and C (see Disambiguation symbols). Whenever we do a composition and the FST on the right has disambiguation symbols on its input, we (in theory) add to each state in the left-hand FST a self-loop for each of the disambiguation symbols, which has that symbol on both its input and output. Note that the actual integer symbols id's for the disambiguation symbols on the left and right may not be the same. For instance, we have a special symbol #0 in G (where epsilon would normally be). The symbol-id for this would generally be the highest-numbered word plus one. But when we want to pass this symbol through L, we need a symbol in L's input symbol table (which mainly contains phones), to represent #0. We have a function AddSelfLoops() that takes a mutable FST and two vectors of labels (a label is an integer id for a symbol). The vectors are the same size, and represent corresponding input and output labels for the disambiguation symbols. This function adds self-loops to each final state and each state with non-epsilon output symbols on at least one arc out of it.

We remove disambiguation symbols with the function DeleteISymbols(), accessible on the command line with the program fstrmsymbols.