This page contains answers to some miscellaneous frequently asked questions from the mailing lists. It should not be your primary way of finding such answers: the mailing lists and GitHub contain many more discussions, and a web search may be the easiest way to find them.
According to legend, Kaldi was the Ethiopian goatherder who discovered the coffee plant. The name was chosen by the sponsors of this project because they drank a lot of coffee at the time (in 2009, according to Ondrej Glembek). The logo thus symbolizes people working on a speech project (the microphone) while drinking coffee (the coffee bean). There are some dissenting opinions about the logo, suggesting we should use a more impressive one; generally, we would be happy to change the logo if someone comes up with a well-designed replacement.
Refer to version.h.
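For example, from a source checkout you can inspect the generated header directly (a minimal sketch; it assumes you have run the build, which generates src/base/version.h):

# print the version string compiled into the binaries
grep KALDI_VERSION src/base/version.h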
We notice that more and more beginners in speech recognition are using Kaldi as their first toolkit. We recommend that they first read these basic materials to get started:
For those who may want a "Kaldi Book" with tutorials on theory and implementation, like the HTK Book, we can generally only say sorry. As Dan explains in this post, the field of speech recognition is moving so fast that we have too many things to implement in Kaldi and no time to write such a book.
If you have not bought any LDC license, there are also some free datasets to get you started, namely LibriSpeech, TED-LIUM, and AMI.
Many people have asked questions about TIMIT on the mailing lists; as Dan says in this post, we generally suggest that you do not use TIMIT.
VoiceBridge is an ASR toolkit based on Kaldi and designed for Windows developers. Currently it only supports GMM-based ASR, but the author has stated here that it will be updated with more models. Of course, if anyone creates or knows of any other Windows ASR toolkit based on Kaldi, please feel free to let us know and we will add it to this section.
There are a few Python wrappers for Kaldi including:
People may wonder why TensorFlow or PyTorch isn't used in Kaldi's DNN setup. This is mainly for historical reasons, as Dan explained here. The good news is that a PyTorch-integrated version of Kaldi, which Dan mentioned here, is already in the planning stage; Dan may announce it when it's ready.
Kaldi offers two sets of Docker images, CPU-based and GPU-based; please see here.
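For instance (a sketch, not an official quick-start; the kaldiasr/kaldi image names and tags below are the ones published on Docker Hub at the time of writing and may change):

# CPU-based image
docker run -it kaldiasr/kaldi:latest bash
# GPU-based image (requires the NVIDIA container toolkit)
docker run -it --gpus all kaldiasr/kaldi:gpu-latest bash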
A guide for compiling Kaldi for Android, with the corresponding Dockerfile, can be found here. Note that this build is based on Ubuntu and is not kept up to date with new versions of the NDK, so if you build Kaldi for Android on a different platform or with a different toolchain (e.g. CMake instead of ndk-build, as in this post), please let us know.
Many tools in Kaldi follow simple and consistent naming conventions, and frequently used tools such as copy-feats, gmm-copy, and hmm-info have largely self-explanatory names. We strongly suggest searching your build output directory for the tool you need before seeking help on the mailing lists. Below are some example usages that have been asked about on the mailing lists.
copy-feats ark:data/raw_mfcc.ark ark,t:data/mfcc.txt # copy binary feature archive to text archive format
cat feats_with_range.scp
utt_id_1 raw_mfcc.1.ark:9[0:2,0:5]
utt_id_2 raw_mfcc.1.ark:16965[0:3]
# copy ranges of feature archive to stdout with text archive format
copy-feats scp:feats_with_range.scp ark,t:-
cat cmvn.scp
speaker_id_1 data/cmvn_test.ark:4
speaker_id_2 data/cmvn_test.ark:247
speaker_id_3 data/cmvn_test.ark:490
speaker_id_4 data/cmvn_test.ark:733
speaker_id_5 data/cmvn_test.ark:976
# copy a specific speaker's cmvn vector to stdout with text format
copy-feats --binary=false $(grep speaker_id_2 cmvn.scp | awk '{print $2}') -
# copy GMM model to text format
gmm-copy --binary=false final.mdl final_text.mdl
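The reverse direction works the same way; for example, to convert the text-format model back to binary:

# copy text-format GMM model back to binary format
gmm-copy --binary=true final_text.mdl final.mdl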
hmm-info final.mdl
number of phones 351
number of pdfs 3400
number of transition-ids 47952
number of transition-states 23916
nnet3-am-info final.mdl | head
input-dim: 40
ivector-dim: 100
num-pdfs: 2856
prior-dimension: 0
# Nnet info follows.
left-context: 29
right-context: 29
num-parameters: 8355408
modulus: 1
input-node name=ivector dim=100
input-node name=input dim=40
component-node name=idct component=idct input=input input-dim=40 output-dim=40
# write utterance length in frames to stdout with text archive format
feat-to-len scp:feats.scp ark,t:- | head
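Since the output is one "utterance-id number-of-frames" pair per line, it pipes naturally into standard text tools. For example, a small sketch (assuming the default 10 ms frame shift, i.e. 100 frames per second) that totals the corpus duration:

# sum the frame counts and convert to hours
feat-to-len scp:feats.scp ark,t:- | awk '{n += $2} END {print n / 100 / 3600, "hours"}'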
Besides the tools mentioned above, Kaldi also provides some useful scripts in the "steps" and "utils" directories. Here we list some frequently used scripts for data preparation and processing (a short usage sketch follows these lists), and leave other important scripts to be illustrated in the corresponding sections below.
steps/combine_ali_dirs.sh    # combine alignment directories
steps/combine_lat_dirs.sh    # combine lattice directories
# create lattices for the aug dirs by copying the lattices of the original train dir
steps/copy_lat_dir.sh
# create alignments for the aug dirs by copying the alignments of the original train dir
steps/copy_ali_dir.sh
steps/cleanup/split_long_utterance.sh    # truncate the long audio into smaller overlapping segments
# perform segmentation of the input data based on the transcription
# and output segmented data along with the corresponding aligned transcription
steps/cleanup/segment_long_utterances[_nnet3].sh
# copy train/test data directory to another directory,
# possibly adding a specified prefix or a suffix to the utterance and/or speaker names
utils/copy_data_dir.sh
# combine the data from multiple source directories into a single destination directory
utils/combine_data.sh
# split data-dir into multiple subsets according to num-to-split or speaker numbers
utils/split_data.sh
# split an scp file up with an approximately equal number of lines in each output file
utils/split_scp.pl
# create a subset of train/test data with different options,
# consisting of some specified number of utterances
utils/subset_data_dir.sh
utils/filter_scp.pl    # filter an scp file by a list of utterance-ids
utils/int2sym.pl    # map from integers to symbols (e.g. word-ids to transcript)
utils/sym2int.pl    # map from symbols to integers
# like sym2int.pl, but a bit more general in that it doesn't assume
# the things being mapped to are single tokens
utils/apply_map.pl
utils/utt2spk_to_spk2utt.pl    # convert an utt2spk file to a spk2utt file
utils/spk2utt_to_utt2spk.pl    # convert a spk2utt file to an utt2spk file
# get file 'utt2dur', which maps from utterance to the duration of the utterance in seconds
utils/data/get_utt2dur.sh
# get file 'reco2dur', which maps from recording to the duration of the recording in seconds
utils/data/get_reco2dur.sh
# copy the data directory and modify it to use the recording-id as the speaker
utils/data/modify_speaker_info_to_recording.sh
# remove excess utterances that appear more than a specified number of times with the same transcription
utils/data/remove_dup_utts.sh
# create a new subsegmented output directory from an existing data-dir with a 'segments' file
utils/data/subsegment_data_dir.sh
# do the standard 3-way speed perturbing of a data directory (it operates on the wav.scp)
utils/data/perturb_data_dir_speed_3way.sh
# generate the files which are used for perturbing the speed of the original data
utils/perturb_data_dir_speed.sh
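As a usage sketch, here is how a few of these scripts might be chained together (the data-directory names are hypothetical):

# combine two training sets into one data directory
utils/combine_data.sh data/train_all data/train_a data/train_b
# take a 10000-utterance subset for quick experiments
utils/subset_data_dir.sh data/train_all 10000 data/train_10k
# write data/train_10k/utt2dur (utterance -> duration in seconds)
utils/data/get_utt2dur.sh data/train_10k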
Users may notice a tiny difference when they run two rounds of feature extraction, whether MFCC, filterbank, or PLP. This is because of the random signal-level "dithering" used in the extraction process to prevent zeros in the filterbank energy computation; the corresponding code is the Dither function in the file feature-window.cc. If you want deterministic results, just set
--dither=0 and --energy-floor=1, or call srand(0) at the start of your program.
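For example, a quick determinism check with MFCCs (a minimal sketch, assuming a wav.scp in the current directory; the same flags apply to compute-fbank-feats and compute-plp-feats):

# with dithering disabled, two rounds of extraction give byte-identical archives
compute-mfcc-feats --dither=0 --energy-floor=1 scp:wav.scp ark:mfcc_1.ark
compute-mfcc-feats --dither=0 --energy-floor=1 scp:wav.scp ark:mfcc_2.ark
cmp mfcc_1.ark mfcc_2.ark && echo identical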
BTW, we really do NOT think the determinism is worth worrying about: users should measure WER instead of comparing the output of feature extraction. For more discussion, please refer to:
1. How to interpret the final.mdl
2. WFST
4. Lattice
5. CTM
9. Resume training
10. GOP (goodness of pronunciation) and confidence score in Kaldi
11. Decision tree
12. --transition-scale, --self-loop-scale, --acoustic-scale, lm-weight
13. Interaction between Kaldi and HTK
14. The effect of Beam
15. Nnet-align-compiled used too much memory?
17. Getting acoustic scores on state level in decoding
18. Mandarin: Pitch vs No Pitch
19. Is it possible to run Kaldi on AMD GPUs? Is an OpenCL port available?
20. Rescore
22. Thread safety in Kaldi
24. Lexicon-free text recognition
26. How to remove the silence modeling during training and testing
28. Examples for different tasks
29. Model (update) in Kaldi
30. The use of !SIL word in the lexicon
32. Why is there a disambiguation symbol in L_disambig.fst after optional silence?
35. RNNLM
36. Run nnet3 without ivectors
37. What is the best starting point to learn online decoding?
38. How to print partial results in online2-wav-nnet3-latgen-faster
39. Data preprocessing and augmentation
40. Speaker diarization
41. How to specify GPU for chain model training
42. How to do latency-controlled training in Kaldi?
43. What's the meaning of the contents of nnet3's config files?
44. Keyword spotting
47. Optimizing model load time?
49. Is there any trick to accelerate the nnet-compute?
50. Reading *.ark files from bash or Python
51. What is meant by WER and SER?
52. Training DNN over LDA+MLLT system
53. End-to-End SR
54. Kaldi already supports SVD. Can you give me an example of how to use SVD in LSTMP network?
55. Decoding a built graph without grammar
56. Why are MFCCs used in TDNNs, but not fbank?
57. What's the maximum amount of data used with Kaldi for training acoustic models?
58. Ivector
59. CMVN, VTLN, FMLLR adaptation
60. What causes too many word deletions?
61. Kaldi linear Model Combination or Model Merging
62. OCR
64. Python3 vs. Python2.7 in Kaldi scripts
67. Adapt speaker recognition model
68. Teacher-student model in Kaldi
69. Language model
70. Real-time decoding: forcing decoding of the last audio data
71. QR Decomposition within Kaldi
72. Is word_boundary.int necessary for online-audio-server-decode-faster?
73. Is WER a word error or a character error when training a Kaldi Mandarin speech recognition model?
74. What is word_boundary file and how can I create this?
75. Different results from lattice-align-words and lattice-mbr-decode