Frequently Asked Questions

Introduction

This page answers some miscellaneous frequently asked questions from the mailing lists. It should not be your primary way of finding such answers: the mailing lists and GitHub contain many more discussions, and a web search may be the easiest way to find them.

About the name and logo of Kaldi

According to legend, Kaldi was the Ethiopian goatherder who discovered the coffee plant. The name was chosen by the sponsors of this project because they drank a lot of coffee at the time (in 2009, according to Ondrej Glembek). The logo symbolizes people working on a speech project (the microphone in the logo) while drinking coffee (the coffee bean in the logo). There are some dissenting opinions about the logo, suggesting that we should use something more impressive. We would be happy to change the logo if someone comes up with a well-designed new one.

How to check the Kaldi version?

Refer to version.h.
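
As a minimal sketch, the version string can also be printed from the shell. This assumes that the header lives at src/base/version.h inside a built Kaldi checkout and defines a KALDI_VERSION macro; kaldi_version is a hypothetical helper name, not a Kaldi script.

```shell
# kaldi_version: hypothetical helper, not part of Kaldi.
# $1: root of a Kaldi checkout. Assumes src/base/version.h exists
# and contains a line of the form: #define KALDI_VERSION "..."
kaldi_version() {
  sed -n 's/^#define KALDI_VERSION "\(.*\)".*$/\1/p' "$1/src/base/version.h"
}
```

For example, kaldi_version /path/to/kaldi prints only the quoted version string. If the header is missing, sed reports an error, which usually means that Kaldi has not been built yet.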

Reading materials for beginners in speech recognition

More and more beginners in speech recognition are using Kaldi as their first toolkit. We recommend that they first read these basic materials to get started:

  • HTK Book (read at least the Tutorial Overview part)
  • The Application of Hidden Markov Models in Speech Recognition
  • Speech Recognition with Weighted Finite-State Transducers (for WFSTs)
  • A Bit of Progress in Language Modeling (Extended Version)

For those who want a "Kaldi Book" with a tutorial on theory and implementation, like the HTK Book, we can only say sorry. As Dan explains in this post, the field of speech recognition is moving so fast that there are too many things to implement in Kaldi, which leaves no time to write such a book.

Free datasets to get started

If you have not bought an LDC license, there are some free datasets you can start with, namely LibriSpeech, TED-LIUM and AMI.

About TIMIT

Many people have asked questions about TIMIT on the mailing lists. As Dan says in this post, we generally suggest that you do not use TIMIT.

Windows ASR toolkit based on Kaldi

VoiceBridge is an ASR toolkit based on Kaldi and designed for Windows developers. Currently it only supports GMM-based ASR, but the author has declared here that it will be updated with more models. If you create or know of any other Windows ASR toolkit based on Kaldi, please let us know and we will add it to this section.

Python wrapper for Kaldi

There are a few Python wrappers for Kaldi including:

People may wonder why TensorFlow or PyTorch is not used in the Kaldi DNN setup. The reason is mainly historical, as Dan explained here. The good news is that a PyTorch-integrated version of Kaldi, which Dan mentioned here, is already in the planning stage. Dan may announce it when it is ready.

Docker for Kaldi

Kaldi offers two sets of images: CPU-based images and GPU-based images. Please see here.

Kaldi for Android

A guide for compiling Kaldi for Android, with the corresponding Dockerfile, can be found here. Note that this build is based only on Ubuntu and is not kept up to date with new versions of the NDK, so if you build Kaldi for Android on a different computing platform or with a different toolchain (e.g. CMake instead of ndk-build, as in this post), please let us know.

Naming conventions of common tools in Kaldi

Many tools in Kaldi follow simple and consistent naming conventions. Three frequently used families of tools with self-explanatory names are:

  • copy-* (or *-copy): e.g. copy-matrix, copy-feats, copy-feats-to-htk, copy-tree, copy-transition-model, copy-posts, wav-copy, gmm-copy, sgmm2-copy, nnet3-copy, lattice-copy.
  • *-info: e.g. tree-info, hmm-info, gmm-info, am-info, nnet3-am-info, nnet3-info.
  • *-to*: e.g. feat-to-dim, feat-to-len, ali-to-phones, ali-to-post, lattice-to-nbest, lattice-to-post, nbest-to-lattice.
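
Because the names are this regular, tools can be located by name pattern. The following is a minimal sketch, assuming the compiled binaries sit somewhere under the directory you pass in (for example the src directory of a built checkout); find_kaldi_tools is a hypothetical helper, not part of Kaldi.

```shell
# find_kaldi_tools: hypothetical helper, not part of Kaldi.
# $1: directory to search (e.g. the src directory of a built Kaldi checkout)
# $2: shell glob for the tool name (e.g. 'copy-*', '*-info', '*-to-*')
# Prints the names of matching executable files, one per line, without paths.
find_kaldi_tools() {
  find "$1" -type f -name "$2" -perm -u+x | sed 's|.*/||' | sort -u
}
```

For example, find_kaldi_tools src '*-info' lists the tools of the *-info family. The executable-bit filter is there to skip source files such as copy-feats.cc that match the same pattern.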

We strongly suggest that you first search your build output directory for the tools you need before seeking help on the mailing lists. Below are some example usages that have been asked about on the mailing lists.

  # copy binary feature archive to text archive format
  copy-feats ark:data/raw_mfcc.ark ark,t:data/mfcc.txt

  cat feats_with_range.scp
  utt_id_1 raw_mfcc.1.ark:9[0:2,0:5]
  utt_id_2 raw_mfcc.1.ark:16965[0:3]

  # copy ranges of feature archive to stdout with text archive format
  copy-feats scp:feats_with_range.scp ark,t:-

  cat cmvn.scp
  speaker_id_1 data/cmvn_test.ark:4
  speaker_id_2 data/cmvn_test.ark:247
  speaker_id_3 data/cmvn_test.ark:490
  speaker_id_4 data/cmvn_test.ark:733
  speaker_id_5 data/cmvn_test.ark:976

  # copy specific speaker's cmvn vector to stdout with text format
  copy-feats --binary=false $(grep speaker_id_2 cmvn.scp | awk '{print $2}') -

  # copy GMM model to text format
  gmm-copy --binary=false final.mdl final_text.mdl

  hmm-info final.mdl
  number of phones 351
  number of pdfs 3400
  number of transition-ids 47952
  number of transition-states 23916

  nnet3-am-info final.mdl | head
  input-dim: 40
  ivector-dim: 100
  num-pdfs: 2856
  prior-dimension: 0
  # Nnet info follows.
  left-context: 29
  right-context: 29
  num-parameters: 8355408
  modulus: 1
  input-node name=ivector dim=100
  input-node name=input dim=40
  component-node name=idct component=idct input=input input-dim=40 output-dim=40

  # write utterance length in frames to stdout with text archive format
  feat-to-len scp:feats.scp ark,t:- | head
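
The scp values in the examples above pack several things into one field: an archive file, a byte offset after the colon, and an optional range in square brackets. As a rough illustration, the sketch below splits such a value into its parts. parse_scp_value is a hypothetical helper, not a Kaldi tool, and it only handles the simple forms shown on this page (it assumes no piped commands and no extra colons in the file name).

```shell
# parse_scp_value: hypothetical helper, not a Kaldi tool.
# $1: the second field of an scp line, e.g. "raw_mfcc.1.ark:9[0:2,0:5]"
parse_scp_value() {
  local value="$1" rows="" cols="" offset="" range=""
  # A trailing "[...]" selects a row range and, optionally, a column range.
  case "$value" in
    *\])
      range="${value##*\[}"   # text after the last "["
      range="${range%\]}"     # drop the closing "]"
      value="${value%\[*}"    # text before the last "["
      case "$range" in
        *,*) rows="${range%%,*}"; cols="${range#*,}" ;;
        *)   rows="$range" ;;
      esac
      ;;
  esac
  # What remains is "file:offset" or a plain file name.
  case "$value" in
    *:*) offset="${value##*:}"; value="${value%:*}" ;;
  esac
  printf 'file=%s offset=%s rows=%s cols=%s\n' \
    "$value" "${offset:-none}" "${rows:-all}" "${cols:-all}"
}
```

For instance, parse_scp_value 'raw_mfcc.1.ark:16965[0:3]' prints file=raw_mfcc.1.ark offset=16965 rows=0:3 cols=all, which makes it easier to see that the second entry of feats_with_range.scp restricts rows only.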

Some useful scripts for data preparation and processing

Besides the tools mentioned above, Kaldi also has some useful scripts in the "steps" and "utils" directories. Here we list some scripts frequently used in data preparation and processing, and leave other important scripts to be illustrated in the corresponding sections below.

  steps/combine_ali_dirs.sh              # combine alignment directories
  steps/combine_lat_dirs.sh              # combine lattice directories
  
  # create lattices for the aug dirs by copying the lattices of original train dir
  steps/copy_lat_dir.sh 
  # create alignments for the aug dirs by copying the alignments of original train dir  
  steps/copy_ali_dir.sh                 
  
  steps/cleanup/split_long_utterance.sh  # truncate the long audio into smaller overlapping segments
  
  # perform segmentation of the input data based on the transcription
  # and outputs segmented data along with the corresponding aligned transcription
  steps/cleanup/segment_long_utterances[_nnet3].sh                                                     
  # copy train/test data directory to another directory, 
  # possibly adding a specified prefix or a suffix to the utterance and/or speaker names
  utils/copy_data_dir.sh
  
  # combine the data from multiple source directories into a single destination directory
  utils/combine_data.sh 
  # split data-dir to multiple subsets according to num-to-split or speaker numbers
  utils/split_data.sh
  # split scp file up with an approximately equal number of lines in each output file 
  utils/split_scp.pl
  # create a subset of train/test data with different options, consisting of some specified number of utterances 
  utils/subset_data_dir.sh

  # filter a scp file by list of utterance-ids
  utils/filter_scp.pl
  
  utils/int2sym.pl   # map from integers to symbols (e.g. word-ids to transcript)
  utils/sym2int.pl   # map from symbols to integers
  # like ./sym2int.pl, but a bit more general in that it doesn't assume the things being mapped to are single tokens
  utils/apply_map.pl 
  
  utils/utt2spk_to_spk2utt.pl   # convert an utt2spk file to a spk2utt file
  utils/spk2utt_to_utt2spk.pl   # convert a spk2utt file to an utt2spk file
  # get file 'utt2dur' which maps from utterance to the duration of the utterance in seconds
  utils/data/get_utt2dur.sh
  # get file 'reco2dur' which maps from recording to the duration of the recording in seconds
  utils/data/get_reco2dur.sh
  
  # copy the data directory and modify it to use the recording-id as the speaker 
  utils/data/modify_speaker_info_to_recording.sh
  
  # remove excess utterances once they appear more than a specified number of times with the same transcription
  utils/data/remove_dup_utts.sh
  
  # create a new subsegmented output directory from an existing data-dir with 'segments' file
  utils/data/subsegment_data_dir.sh
  
  # do the standard 3-way speed perturbing of a data directory (it operates on the wav.scp).
  utils/data/perturb_data_dir_speed_3way.sh
  # generate the files which are used for perturbing the speed of the original data
  utils/perturb_data_dir_speed.sh
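
To make the utt2spk and spk2utt file formats concrete, here is a small awk sketch of the mapping that utils/utt2spk_to_spk2utt.pl performs. It is only an illustration under the assumption that each utt2spk line is "<utterance-id> <speaker-id>"; it is not the script's actual implementation, and in a real setup you should use the script itself.

```shell
# utt2spk_to_spk2utt_sketch: hypothetical helper for illustration only.
# Reads utt2spk lines ("<utterance-id> <speaker-id>") from stdin or files,
# writes spk2utt lines ("<speaker-id> <utt-id-1> <utt-id-2> ...") to stdout.
utt2spk_to_spk2utt_sketch() {
  awk '
    {
      if (!($2 in utts)) {        # first time this speaker is seen
        order[++n] = $2           # remember the order of first appearance
        utts[$2] = $1
      } else {
        utts[$2] = utts[$2] " " $1
      }
    }
    END { for (i = 1; i <= n; i++) print order[i], utts[order[i]] }
  ' "$@"
}
```

Typical usage would be utt2spk_to_spk2utt_sketch data/train/utt2spk, where data/train is just an example path. Each speaker appears once in the output, followed by all of that speaker's utterance-ids.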

Indeterminacy in feature extraction

Users may notice tiny differences between two runs of feature extraction (MFCC, Fbank or PLP). This is because of the random signal-level 'dithering' used in the extraction process to prevent zeros in the filterbank energy computation. The corresponding code is the 'Dither' function in feature-window.cc. For those who want deterministic results, just set

  --dither=0 and --energy-floor=1

or call srand(0) at the start of the program. That said, we really do NOT think determinism is worth pursuing here: users should measure WER rather than compare the output of feature extraction. More discussion of this topic can be found on the mailing lists.
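
To illustrate why two runs differ, here is a minimal awk sketch of dithering: it adds Gaussian noise, scaled by a dither value, to each sample. It is not Kaldi's code (that is the 'Dither' function mentioned above); the helper name dither_samples, its arguments and the Box-Muller noise generator are assumptions made for illustration.

```shell
# dither_samples: hypothetical helper, not Kaldi code.
# $1: dither value (0 disables the noise), $2: random seed.
# Reads one sample per line on stdin, writes the dithered samples to stdout.
dither_samples() {
  awk -v dither="$1" -v seed="$2" '
    BEGIN { srand(seed); two_pi = 2 * atan2(0, -1) }
    {
      u1 = 1 - rand()   # in (0, 1], so log(u1) is finite
      u2 = rand()
      gauss = sqrt(-2 * log(u1)) * cos(two_pi * u2)   # Box-Muller transform
      print $1 + dither * gauss
    }'
}
```

With a dither value of 0 the samples pass through unchanged, and with a fixed seed two runs produce the same noise. These are the two ways of getting deterministic output described above.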

Below are FAQ candidates (with some TODOs) from the mailing lists. We will update these candidates to make them more readable.

1. How to interpret the final.mdl

2. WFST

4. Lattice

5. CTM

9. Resume training

10. GOP (goodness of pronunciation) and confidence score in Kaldi

11. Decision tree

12. --transition-scale, --self-loop-scale, --acoustic-scale, lm-weight

13. Interaction between Kaldi and HTK

  • Feature level (copy-feats-to-htk, etc.)
  • Model level (the models are fundamentally different; further explanation to be added)
  • Related questions:

14. The effect of Beam

15. nnet-align-compiled uses too much memory?

17. Getting acoustic scores on state level in decoding

18. Mandarin: Pitch vs No Pitch

19. Is it possible to run Kaldi on an AMD GPU? Is an OpenCL port available?

20. Rescore

22. Thread safety in Kaldi

24. Lexicon-free text recognition

25. Decoding .wav files

26. How to remove the silence modeling during training and testing

28. Examples for different tasks

29. Model (update) in Kaldi

30. The use of !SIL word in the lexicon

31. Problems when doing alignment

32. Why is there a disambiguation symbol in L_disambig.fst after optional silence?

35. RNNLM

36. Run nnet3 without ivectors

37. What is the best starting point for learning online decoding?

38. How to print partial results in online2-wav-nnet3-latgen-faster

39. Data preprocessing and augmentation

40. Speaker diarization

41. How to specify GPU for chain model training

42. How to do latency control training in Kaldi?

43. What is the meaning of the contents of an nnet3 config?

44. Keyword spotting

45. GPUs supported by Kaldi

47. Optimizing model load time?

48. DNN input feature

49. Is there any trick to accelerate the nnet-compute?

50. Reading *.ark files from bash or python

51. What is meant by WER and SER?

52. Training DNN over LDA+MLLT system

53. End-to-End SR

54. Kaldi already supports SVD. Can you give me an example of how to use SVD in LSTMP network?

55. Decoding a built graph without grammar

56. Why is MFCC used in TDNN, but not Fbank?

57. What is the maximum amount of data used with Kaldi for training acoustic models?

58. Ivector

59. CMVN, VTLN, FMLLR adaptation

60. What causes too many word deletions?

61. Kaldi linear Model Combination or Model Merging

62. OCR

63. Robustness of ASR

64. Python3 vs. Python2.7 in Kaldi scripts

67. Adapt speaker recognition model

68. Teacher-student model in Kaldi

69. Language model

70. Real-time decoding: forcing the last audio data to be decoded

71. QR Decomposition within Kaldi

72. Is word_boundary.int necessary for online-audio-server-decode-faster?

73. Is WER a lexicon error or a character error when training a Kaldi Mandarin speech recognition model?

74. What is the word_boundary file and how can I create it?

75. Different results from lattice-align-words and lattice-mbr-decode