Introduction

We will start with a few words about the general philosophy of our modeling code, and why we chose this path. Our aim is for Kaldi to support conventional models (i.e. diagonal GMMs) and Subspace Gaussian Mixture Models (SGMMs), but also to be easily extensible to new kinds of model. In a previous iteration of designing this software, we used a virtual base class that both the GMM and SGMM classes inherited from, and wrote command-line tools that handled both types of model. Our experience was that a base class is not as useful as one might think, because there are too many differences in the models (e.g. they support different types of adaptation), and we were forced to constantly expand the base-class so that our supposedly "generic" code could access functionality specific to one model or the other. Eventually our command-line tools reached a state where they were almost impossible to modify.

When redesigning the code, we decided on a more "modern" software engineering approach that focuses less on using class hierarchies to capture commonalities, and more on creating simple, reusable components. For example, our decoder code (see Decoders used in the Kaldi toolkit) is generic because its requirements are very limited: it only requires an object inheriting from the simple base-class DecodableInterface, which behaves a lot like a matrix of acoustic likelihoods for an utterance. Individual command-line tools generally have simple and limited functionality (e.g. gmm-align produces state-level alignments of utterances given a diagonal GMM). The idea is that implementing a new technique will generally involve creating a new command-line program, rather than increasing the complexity of any of the existing command-line programs.
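To illustrate why such a narrow interface keeps the decoder generic, here is a minimal, self-contained sketch. It is not the actual Kaldi header: the class names DecodableSketch and DecodableMatrixSketch, the method names, and the helper BestIndexPerFrame are all hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a decoder-facing interface.
// It exposes an utterance as a (frames x indices) table of log-likelihoods.
class DecodableSketch {
 public:
  virtual ~DecodableSketch() {}
  // Log-likelihood of frame `frame` under the model with index `index`.
  virtual float LogLikelihood(std::size_t frame, std::size_t index) const = 0;
  virtual std::size_t NumFrames() const = 0;
  virtual std::size_t NumIndices() const = 0;
};

// Trivial implementation backed by a precomputed matrix of log-likelihoods.
class DecodableMatrixSketch : public DecodableSketch {
 public:
  explicit DecodableMatrixSketch(std::vector<std::vector<float> > loglikes)
      : loglikes_(std::move(loglikes)) {}
  float LogLikelihood(std::size_t frame, std::size_t index) const override {
    assert(frame < loglikes_.size() && index < loglikes_[frame].size());
    return loglikes_[frame][index];
  }
  std::size_t NumFrames() const override { return loglikes_.size(); }
  std::size_t NumIndices() const override {
    return loglikes_.empty() ? 0 : loglikes_[0].size();
  }

 private:
  std::vector<std::vector<float> > loglikes_;
};

// A "generic" consumer: it sees only the interface, never the model type.
// Here it simply picks the best-scoring index for each frame.
std::vector<std::size_t> BestIndexPerFrame(const DecodableSketch &d) {
  std::vector<std::size_t> best(d.NumFrames(), 0);
  for (std::size_t t = 0; t < d.NumFrames(); ++t)
    for (std::size_t j = 1; j < d.NumIndices(); ++j)
      if (d.LogLikelihood(t, j) > d.LogLikelihood(t, best[t])) best[t] = j;
  return best;
}
```

Any model type that can fill in LogLikelihood() could be passed to BestIndexPerFrame() without the consumer changing, which is the property the text attributes to the decoder.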

Diagonal GMMs

The class DiagGmm represents a single diagonal-covariance Gaussian Mixture Model. An acoustic model based on a collection of objects of type DiagGmm, indexed by zero-based "pdf-ids", is implemented as class AmDiagGmm. You can think of AmDiagGmm as a vector of type DiagGmm, although it has a slightly richer interface than that. Representing an acoustic model as a collection of individual models, one for each p.d.f., is not the way we imagine all models would be represented; for example, SGMMs cannot be represented that way, and if we implemented GMMs with tying of Gaussians among states we would not be able to represent the pdfs separately.

Individual GMMs

Class DiagGmm is, conceptually, a fairly simple and passive object that stores the parameters of a Gaussian Mixture Model and has member functions that compute likelihoods. It does not "know anything" about how it will be used; it just provides access to its members. It does not handle accumulation or update; for that, see below, or class MlEstimateDiagGmm. The class DiagGmm stores its parameters as: inverse variances, and (means times inverse variances). This means that likelihoods can be computed with simple dot products. The "gconsts" (i.e. the precomputed constant terms in the likelihood) are different from, say, HTK's gconsts because they depend on the mean also.
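The following sketch shows how storing inverse variances and (means times inverse variances) reduces the per-component log-likelihood to two dot products plus a constant. It is an illustration under stated assumptions rather than the DiagGmm implementation: the struct DiagGmmSketch and its member names are hypothetical, and the formula used for the constant (log weight, the usual Gaussian normalizer, and the mean-dependent term) is the standard one implied by the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const double kLog2Pi = 1.8378770664093453;  // log(2 * pi)

// Hypothetical container mirroring the storage scheme described in the text.
struct DiagGmmSketch {
  std::vector<double> gconsts;                      // one per component
  std::vector<std::vector<double> > inv_vars;       // 1 / var
  std::vector<std::vector<double> > means_invvars;  // mean / var

  // Adds a component given in the "obvious" form (weight, mean, variance).
  void AddComponent(double weight, const std::vector<double> &mean,
                    const std::vector<double> &var) {
    assert(weight > 0.0 && mean.size() == var.size());
    double gconst = std::log(weight) - 0.5 * kLog2Pi * mean.size();
    std::vector<double> iv(mean.size()), miv(mean.size());
    for (std::size_t d = 0; d < mean.size(); ++d) {
      assert(var[d] > 0.0);
      iv[d] = 1.0 / var[d];
      miv[d] = mean[d] * iv[d];
      // The -0.5 * mean^2 / var term is why the constant depends on the mean.
      gconst += 0.5 * std::log(iv[d]) - 0.5 * mean[d] * miv[d];
    }
    gconsts.push_back(gconst);
    inv_vars.push_back(iv);
    means_invvars.push_back(miv);
  }

  // gconst + (mean/var) . x - 0.5 * (1/var) . x^2
  double ComponentLogLike(std::size_t m, const std::vector<double> &x) const {
    assert(m < gconsts.size() && x.size() == inv_vars[m].size());
    double ll = gconsts[m];
    for (std::size_t d = 0; d < x.size(); ++d)
      ll += means_invvars[m][d] * x[d] - 0.5 * inv_vars[m][d] * x[d] * x[d];
    return ll;
  }

  // Log-sum-exp over the components.
  double LogLikelihood(const std::vector<double> &x) const {
    assert(!gconsts.empty());
    std::vector<double> ll(gconsts.size());
    for (std::size_t m = 0; m < ll.size(); ++m) ll[m] = ComponentLogLike(m, x);
    double max_ll = *std::max_element(ll.begin(), ll.end());
    double sum = 0.0;
    for (std::size_t m = 0; m < ll.size(); ++m) sum += std::exp(ll[m] - max_ll);
    return max_ll + std::log(sum);
  }
};
```

Expanding -0.5 (x - mean)^2 / var gives exactly the three pieces used here: a term quadratic in x, a term linear in x, and a term involving only the mean, which is absorbed into the constant.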

Since it is quite complicated to modify the Gaussian parameters in this form, we also provide a class DiagGmmNormal, which contains the parameters in a simpler and more obvious form, and we provide functions to convert back and forth between the DiagGmm and DiagGmmNormal representations. Most of the update code works with the DiagGmmNormal representation.
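As a rough sketch of why a second representation is convenient, the code below converts one Gaussian between the stored form and the "normal" (mean, variance) form, applies a change in the normal form, and converts back. The struct and function names (NaturalParams, NormalParams, ToNormal, ToNatural, ShiftMean) are hypothetical and do not correspond to the DiagGmmNormal interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stored form: inverse variances and means times inverse variances.
struct NaturalParams {
  std::vector<double> inv_vars;
  std::vector<double> means_invvars;
};

// "Normal" form: plain means and variances.
struct NormalParams {
  std::vector<double> means;
  std::vector<double> vars;
};

NormalParams ToNormal(const NaturalParams &nat) {
  assert(nat.inv_vars.size() == nat.means_invvars.size());
  NormalParams nrm;
  for (std::size_t d = 0; d < nat.inv_vars.size(); ++d) {
    assert(nat.inv_vars[d] > 0.0);
    double var = 1.0 / nat.inv_vars[d];
    nrm.vars.push_back(var);
    nrm.means.push_back(nat.means_invvars[d] * var);
  }
  return nrm;
}

NaturalParams ToNatural(const NormalParams &nrm) {
  assert(nrm.means.size() == nrm.vars.size());
  NaturalParams nat;
  for (std::size_t d = 0; d < nrm.means.size(); ++d) {
    assert(nrm.vars[d] > 0.0);
    double inv_var = 1.0 / nrm.vars[d];
    nat.inv_vars.push_back(inv_var);
    nat.means_invvars.push_back(nrm.means[d] * inv_var);
  }
  return nat;
}

// Example edit that is easy in the normal form: add an offset to the mean.
NaturalParams ShiftMean(const NaturalParams &nat,
                        const std::vector<double> &offset) {
  NormalParams nrm = ToNormal(nat);
  assert(offset.size() == nrm.means.size());
  for (std::size_t d = 0; d < offset.size(); ++d) nrm.means[d] += offset[d];
  return ToNatural(nrm);
}
```

In a full model, any constant terms that depend on the mean would also need to be recomputed after such a change; this sketch omits them.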

GMM-based acoustic model

Class AmDiagGmm represents a collection of DiagGmm objects, indexed by pdf-id. This class does not represent an HMM-GMM, just a collection of GMMs. Putting it together with the HMM structure is the responsibility of other code, principally the topology and transition-modeling code and the code responsible for compiling decoding graphs (see HMM topology and transition modeling). We mention at this point that we never write an object of type AmDiagGmm to disk on its own; instead we write an object of type TransitionModel and then an object of type AmDiagGmm. This is simply a convenience to avoid having to write too many separate files to disk, since normally we update the Gaussians and the transitions at the same time. The idea is that with other model types we would create a file containing a TransitionModel and then an object of some other model type. This way, programs that need to read only the transition model (e.g. for graph creation) can read the file without needing to know the type of the model.
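The file layout described above (transition model first, acoustic model second) can be sketched as follows. This is only an illustration of the idea: the types TransitionSketch and AmSketch, the tags such as "<Trans>", and the text format are all invented for this example and are not Kaldi's on-disk format.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>  // for in-memory streams when trying the sketch out
#include <string>
#include <vector>

// Writes: open_tag, count, values..., close_tag.
void WriteBlock(std::ostream &os, const std::string &open_tag,
                const std::string &close_tag,
                const std::vector<double> &values) {
  os << open_tag << ' ' << values.size();
  for (std::size_t i = 0; i < values.size(); ++i) os << ' ' << values[i];
  os << ' ' << close_tag << '\n';
}

// Reads a block written by WriteBlock; returns false on any mismatch.
bool ReadBlock(std::istream &is, const std::string &open_tag,
               const std::string &close_tag, std::vector<double> *values) {
  std::string token;
  std::size_t n = 0;
  if (!(is >> token) || token != open_tag) return false;
  if (!(is >> n)) return false;
  values->assign(n, 0.0);
  for (std::size_t i = 0; i < n; ++i)
    if (!(is >> (*values)[i])) return false;
  if (!(is >> token)) return false;
  return token == close_tag;
}

// Hypothetical stand-in for a transition model.
struct TransitionSketch {
  std::vector<double> probs;
  void Write(std::ostream &os) const {
    WriteBlock(os, "<Trans>", "</Trans>", probs);
  }
  bool Read(std::istream &is) {
    return ReadBlock(is, "<Trans>", "</Trans>", &probs);
  }
};

// Hypothetical stand-in for an acoustic model: one number per pdf-id.
struct AmSketch {
  std::vector<double> pdf_means;
  void Write(std::ostream &os) const {
    WriteBlock(os, "<Am>", "</Am>", pdf_means);
  }
  bool Read(std::istream &is) {
    return ReadBlock(is, "<Am>", "</Am>", &pdf_means);
  }
};

// The model file is the transition model followed by the acoustic model.
void WriteModel(std::ostream &os, const TransitionSketch &trans,
                const AmSketch &am) {
  trans.Write(os);
  am.Write(os);
}

// A graph-building tool can stop after the first object and never needs
// to know what kind of acoustic model follows.
bool ReadTransitionsOnly(std::istream &is, TransitionSketch *trans) {
  return trans->Read(is);
}
```

Writing both objects to a std::stringstream with WriteModel() and then calling ReadTransitionsOnly() leaves the stream positioned at the acoustic model, so a tool that does know the model type can continue with AmSketch::Read().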

Class AmDiagGmm is a fairly simple object and does not take responsibility for such things as model estimation (e.g. see AccumAmDiagGmm) or transform estimation (there are various pieces of code that do this; see Feature and model-space transforms in Kaldi).

Full-covariance GMMs

We have a class FullGmm for full-covariance GMMs, which has similar functionality to the DiagGmm class but with full covariances. This is mainly of use for training full-covariance Universal Background Models (UBMs) in the SGMM recipe (see below). The only command-line tools available for full GMMs are used to train global mixture models (i.e. UBMs); we have not implemented a full-covariance version of the AmDiagGmm class or the corresponding command-line tools, although doing so would be fairly easy.

Subspace Gaussian Mixture Models (SGMMs)

Subspace Gaussian Mixture Models (SGMMs) are implemented by class AmSgmm. This class essentially implements the approach described in "The Subspace Gaussian Mixture Model – a Structured Model for Speech Recognition", by D. Povey, Lukas Burget et al., Computer Speech and Language, 2011. The class AmSgmm represents a whole collection of pdfs; there is no class that represents a single pdf of the SGMM (as there is for GMMs). Estimation of SGMMs is handled (at the C++ level) by the classes MleAmSgmmAccs and MleAmSgmmUpdater.

For example scripts that demonstrate how to build an SGMM-based system, see egs/rm/s1/steps/train_ubma.sh, egs/rm/s1/steps/train_sgmma.sh, and egs/rm/s1/steps/decode_sgmma.sh.