This page contains answers to some miscellaneous frequently asked questions from the mailing lists. It should not be your primary way of finding such answers: the mailing lists and GitHub contain many more discussions, and a web search may be the easiest way to find them.
According to legend, Kaldi was the Ethiopian goatherder who discovered the coffee plant. The name was chosen by the sponsors of this project because they drank a lot of coffee at the time (in 2009, according to Ondrej Glembek). The logo thus symbolizes people working on a speech project (the microphone) while drinking coffee (the coffee bean). There are some dissenting opinions about the logo, suggesting we should use a more impressive one; generally, we would be happy to change the logo if someone comes up with a well-designed replacement.
Refer to version.h.
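For example, from a source checkout you can inspect the generated header directly (a minimal sketch; it assumes you have run the build, which generates src/base/version.h):

# print the version string compiled into the binaries
grep KALDI_VERSION src/base/version.h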
We notice that more and more beginners in speech recognition are using Kaldi as their first toolkit. We recommend that they first read these basic materials to get started:
For those who may want a "Kaldi Book" with tutorials on theory and implementation, like the HTK Book, we can generally only say sorry. As Dan explains in this post, the field of speech recognition is moving so fast that we have too many things to implement in Kaldi and no time to write such a book.
If you have not bought any LDC license, there are also some free datasets to get you started, namely LibriSpeech, TED-LIUM, and AMI.
Many people have asked questions about TIMIT on the mailing lists; as Dan says in this post, we generally suggest that you do not use TIMIT.
VoiceBridge is an ASR toolkit based on Kaldi and designed for Windows developers. Currently it only supports GMM-based ASR, but the author has stated here that it will be updated with more models. Of course, if anyone creates or knows of any other Windows ASR toolkit based on Kaldi, please feel free to let us know and we will add it to this section.
There are a few Python wrappers for Kaldi including:
People may wonder why TensorFlow or PyTorch isn't used in Kaldi's DNN setup. This is mainly for historical reasons, as Dan explained here. The good news is that a PyTorch-integrated version of Kaldi, which Dan mentioned here, is already in the planning stage; Dan may announce it when it's ready.
Kaldi offers two sets of Docker images, CPU-based and GPU-based; please see here.
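For instance (a sketch, not an official quick-start; the kaldiasr/kaldi image names and tags below are the ones published on Docker Hub at the time of writing and may change):

# CPU-based image
docker run -it kaldiasr/kaldi:latest bash
# GPU-based image (requires the NVIDIA container toolkit)
docker run -it --gpus all kaldiasr/kaldi:gpu-latest bash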
A guide for compiling Kaldi for Android, with the corresponding Dockerfile, can be found here. Note that this build is based on Ubuntu and is not kept up to date with new versions of the NDK, so if you build Kaldi for Android on a different platform or with a different toolchain (e.g. CMake instead of ndk-build, as in this post), please let us know.
Many tools in Kaldi follow simple and consistent naming conventions, and frequently used tools such as copy-feats, gmm-copy, and hmm-info have largely self-explanatory names. We strongly suggest searching your build output directory for the tool you need before seeking help on the mailing lists. Below are some example usages that have been asked about on the mailing lists.
copy-feats ark:data/raw_mfcc.ark ark,t:data/mfcc.txt # copy binary feature archive to text archive format
cat feats_with_range.scp
utt_id_1 raw_mfcc.1.ark:9[0:2,0:5]
utt_id_2 raw_mfcc.1.ark:16965[0:3]
# copy ranges of feature archive to stdout with text archive format
copy-feats scp:feats_with_range.scp ark,t:-
cat cmvn.scp
speaker_id_1 data/cmvn_test.ark:4
speaker_id_2 data/cmvn_test.ark:247
speaker_id_3 data/cmvn_test.ark:490
speaker_id_4 data/cmvn_test.ark:733
speaker_id_5 data/cmvn_test.ark:976
# copy a specific speaker's cmvn vector to stdout with text format
copy-feats --binary=false $(grep speaker_id_2 cmvn.scp | awk '{print $2}') -
# copy GMM model to text format
gmm-copy --binary=false final.mdl final_text.mdl
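The reverse direction works the same way; for example, to convert the text-format model back to binary:

# copy text-format GMM model back to binary format
gmm-copy --binary=true final_text.mdl final.mdl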
hmm-info final.mdl
number of phones 351
number of pdfs 3400
number of transition-ids 47952
number of transition-states 23916
nnet3-am-info final.mdl | head
input-dim: 40
ivector-dim: 100
num-pdfs: 2856
prior-dimension: 0
# Nnet info follows.
left-context: 29
right-context: 29
num-parameters: 8355408
modulus: 1
input-node name=ivector dim=100
input-node name=input dim=40
component-node name=idct component=idct input=input input-dim=40 output-dim=40
# write utterance length in frames to stdout with text archive format
feat-to-len scp:feats.scp ark,t:- | head
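Since the output is one "utterance-id number-of-frames" pair per line, it pipes naturally into standard text tools. For example, a small sketch (assuming the default 10 ms frame shift, i.e. 100 frames per second) that totals the corpus duration:

# sum the frame counts and convert to hours
feat-to-len scp:feats.scp ark,t:- | awk '{n += $2} END {print n / 100 / 3600, "hours"}'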
Besides the tools mentioned above, Kaldi also provides some useful scripts in the "steps" and "utils" directories. Here we list some frequently used scripts for data preparation and processing (a short usage sketch follows these lists), and leave other important scripts to be illustrated in the corresponding sections below.
steps/combine_ali_dirs.sh    # combine alignment directories
steps/combine_lat_dirs.sh    # combine lattice directories
# create lattices for the aug dirs by copying the lattices of the original train dir
steps/copy_lat_dir.sh
# create alignments for the aug dirs by copying the alignments of the original train dir
steps/copy_ali_dir.sh
steps/cleanup/split_long_utterance.sh    # truncate the long audio into smaller overlapping segments
# perform segmentation of the input data based on the transcription
# and output segmented data along with the corresponding aligned transcription
steps/cleanup/segment_long_utterances[_nnet3].sh
# copy train/test data directory to another directory,
# possibly adding a specified prefix or a suffix to the utterance and/or speaker names
utils/copy_data_dir.sh
# combine the data from multiple source directories into a single destination directory
utils/combine_data.sh
# split data-dir into multiple subsets according to num-to-split or speaker numbers
utils/split_data.sh
# split an scp file up with an approximately equal number of lines in each output file
utils/split_scp.pl
# create a subset of train/test data with different options,
# consisting of some specified number of utterances
utils/subset_data_dir.sh
utils/filter_scp.pl    # filter an scp file by a list of utterance-ids
utils/int2sym.pl    # map from integers to symbols (e.g. word-ids to transcript)
utils/sym2int.pl    # map from symbols to integers
# like sym2int.pl, but a bit more general in that it doesn't assume
# the things being mapped to are single tokens
utils/apply_map.pl
utils/utt2spk_to_spk2utt.pl    # convert an utt2spk file to a spk2utt file
utils/spk2utt_to_utt2spk.pl    # convert a spk2utt file to an utt2spk file
# get file 'utt2dur', which maps from utterance to the duration of the utterance in seconds
utils/data/get_utt2dur.sh
# get file 'reco2dur', which maps from recording to the duration of the recording in seconds
utils/data/get_reco2dur.sh
# copy the data directory and modify it to use the recording-id as the speaker
utils/data/modify_speaker_info_to_recording.sh
# remove excess utterances that appear more than a specified number of times with the same transcription
utils/data/remove_dup_utts.sh
# create a new subsegmented output directory from an existing data-dir with a 'segments' file
utils/data/subsegment_data_dir.sh
# do the standard 3-way speed perturbing of a data directory (it operates on the wav.scp)
utils/data/perturb_data_dir_speed_3way.sh
# generate the files which are used for perturbing the speed of the original data
utils/perturb_data_dir_speed.sh
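As a usage sketch, here is how a few of these scripts might be chained together (the data-directory names are hypothetical):

# combine two training sets into one data directory
utils/combine_data.sh data/train_all data/train_a data/train_b
# take a 10000-utterance subset for quick experiments
utils/subset_data_dir.sh data/train_all 10000 data/train_10k
# write data/train_10k/utt2dur (utterance -> duration in seconds)
utils/data/get_utt2dur.sh data/train_10k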
Users may notice a tiny difference when they run two rounds of feature extraction, whether MFCC, filterbank, or PLP. This is because of the random signal-level "dithering" used in the extraction process to prevent zeros in the filterbank energy computation; the corresponding code is the Dither function in the file feature-window.cc. If you want deterministic results, just set
--dither=0 and --energy-floor=1, or call srand(0) at the start of your program.
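For example, a quick determinism check with MFCCs (a minimal sketch, assuming a wav.scp in the current directory; the same flags apply to compute-fbank-feats and compute-plp-feats):

# with dithering disabled, two rounds of extraction give byte-identical archives
compute-mfcc-feats --dither=0 --energy-floor=1 scp:wav.scp ark:mfcc_1.ark
compute-mfcc-feats --dither=0 --energy-floor=1 scp:wav.scp ark:mfcc_2.ark
cmp mfcc_1.ark mfcc_2.ark && echo identical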
BTW, we really do NOT think the determinism is worth worrying about: users should measure WER instead of comparing the output of feature extraction. For more discussion, please refer to:
1. How to interpret the final.mdl
2. WFST
4. Lattice
5. CTM
9. Resume training
10. GOP (goodness of pronunciation) and confidence score in Kaldi
11. Decision tree
12. --transition-scale, --self-loop-scale, --acoustic-scale, lm-weight
13. Interaction between Kaldi and HTK
14. The effect of Beam
15. Nnet-align-compiled used too much memory?
17. Getting acoustic scores on state level in decoding
18. Mandarin: Pitch vs No Pitch
19. Is it possible to run Kaldi on AMD GPUs? Is an OpenCL port available?
20. Rescore
22. Thread safety in Kaldi
24. Lexicon-free text recognition
26. How to remove the silence modeling during training and testing
28. Examples for different tasks
29. Model (update) in Kaldi
30. The use of !SIL word in the lexicon
32. Why is there a disambiguation symbol in L_disambig.fst after optional silence?
35. RNNLM
36. Run nnet3 without ivectors
37. What is the best starting point to learn online decoding?
38. How to print partial results in online2-wav-nnet3-latgen-faster
39. Data preprocessing and augmentation
40. Speaker diarization
41. How to specify GPU for chain model training
42. How to do latency-controlled training in Kaldi?
43. What's the meaning of the contents of nnet3's config files?
44. Keyword spotting
47. Optimizing model load time?
49. Is there any trick to accelerate the nnet-compute?
50. Reading *.ark files from bash or Python
51. What is meant by WER and SER?
52. Training DNN over LDA+MLLT system
53. End-to-End SR
54. Kaldi already supports SVD. Can you give me an example of how to use SVD in LSTMP network?
55. Decoding a built graph without grammar
56. Why are MFCCs used in TDNNs, but not fbank?
57. What's the maximum amount of data used with Kaldi for training acoustic models?
58. Ivector
59. CMVN, VTLN, FMLLR adaptation
60. What causes too many word deletions?
61. Kaldi linear Model Combination or Model Merging
62. OCR
64. Python3 vs. Python2.7 in Kaldi scripts
67. Adapt speaker recognition model
68. Teacher-student model in Kaldi
69. Language model
70. Real-time decoding: forcing decoding of the last audio data
71. QR Decomposition within Kaldi
72. Is word_boundary.int necessary for online-audio-server-decode-faster?
73. Is WER a word error or a character error when training a Kaldi Mandarin speech recognition model?
74. What is word_boundary file and how can I create this?
75. Different results from lattice-align-words and lattice-mbr-decode