This page shows how to prepare your own data for decoding with a pre-trained Kaldi acoustic model. Here we use a TDNN chain model trained on the Fisher corpus and, as an example, prepare the eval2000 dataset from scratch. This simulates the real-world scenario where the data is not already in the format Kaldi requires.
Download the TDNN model into the Kaldi ASpIRE recipe directory (not the IBAN one):
cd egs/aspire/s5
wget http://kaldi-asr.org/models/1/0001_aspire_chain_model.tar.gz
tar xzvf 0001_aspire_chain_model.tar.gz
The directories required are:
exp/chain/tdnn_7b # chain model
exp/nnet3/extractor # i-vector extractor
data/lang_pp_test # 3-gram LM lang directory
Create a new data directory data/eval2000 with the following files: wav.scp, reco2file_and_channel, segments, utt2spk, spk2utt. The instructions are as follows:
The audio files are in NIST SPHERE format (*.sph) under /export/corpora2/LDC/LDC2002S09/hub5e_00/english.
The two sides of the conversation are in separate channels and can be extracted on the fly by adding the following lines to wav.scp, one line for each channel of every file:
<file-id>-A sph2pipe -f wav -p -c 1 <filename> |
<file-id>-B sph2pipe -f wav -p -c 2 <filename> |
where <file-id> is an arbitrary unique id for the file (usually the basename).
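Both wav.scp and the reco2file_and_channel file introduced next can be generated with a small shell loop. This is a minimal sketch, not part of the recipe: the helper name make_wav_scp is made up, the corpus path is taken from above, and sph2pipe must be on your PATH when the pipe commands are eventually run.

```shell
# make_wav_scp <audio-dir> <data-dir>
# Hypothetical helper: writes wav.scp and reco2file_and_channel for a
# directory of two-channel .sph files, using the basename as <file-id>.
make_wav_scp() {
  audio_dir=$1; data_dir=$2
  mkdir -p "$data_dir"
  : > "$data_dir/wav.scp"
  : > "$data_dir/reco2file_and_channel"
  for f in "$audio_dir"/*.sph; do
    [ -e "$f" ] || continue           # skip if the glob matched nothing
    id=$(basename "$f" .sph)          # <file-id> = basename without .sph
    # One wav.scp entry per channel: A -> channel 1, B -> channel 2.
    echo "$id-A sph2pipe -f wav -p -c 1 $f |" >> "$data_dir/wav.scp"
    echo "$id-B sph2pipe -f wav -p -c 2 $f |" >> "$data_dir/wav.scp"
    # Mapping from recording id to file id and channel, for scoring.
    echo "$id-A $id A" >> "$data_dir/reco2file_and_channel"
    echo "$id-B $id B" >> "$data_dir/reco2file_and_channel"
  done
  # Kaldi expects data-directory files to be sorted.
  sort -o "$data_dir/wav.scp" "$data_dir/wav.scp"
  sort -o "$data_dir/reco2file_and_channel" "$data_dir/reco2file_and_channel"
}
make_wav_scp /export/corpora2/LDC/LDC2002S09/hub5e_00/english data/eval2000
```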
Also create a file `reco2file_and_channel`, which has a mapping from the
recording (one side of conversation) to the file and channel.
<file-id>-<A|B> <file-id> <A|B>
The segment information is in the PEM file /export/corpora2/LDC/LDC2002S09/hub5e_00/english/hub5e_00.pem, in the format:
<file-id> <side> <speaker> <start-time> <end-time>
where <side> is A or B for channel 1 and 2 respectively.
Use the PEM file to create segments and utt2spk file. The format of the segments file is:
<utterance-id> <file-id>-<A|B> <start-time> <end-time>
The format of the utt2spk file is:
<utterance-id> <speaker-id>
For this task, each side of the conversation has only one speaker, so an appropriate <speaker-id> is <file-id>-<A|B>.
The <utterance-id> must be unique and must contain the <speaker-id> as a prefix for correct sorting. This is typically done by appending the timing information or a segment count as a suffix:
<speaker-id>-<start-time*100>-<end-time*100>
<speaker-id>-<001|002|...>
The transcriptions are in /export/corpora2/LDC/LDC2002T43. We will use modified versions of these (stm and glm files) for scoring:
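The PEM-to-Kaldi conversion described above can be sketched with awk. This is a sketch under assumptions: pem_to_segments is a hypothetical helper, comment lines starting with ";;" are skipped, and the 6-digit zero padding of the times is an assumption made so that string sorting matches numeric order. The spk2utt file is then produced from utt2spk with the standard utils/utt2spk_to_spk2utt.pl script.

```shell
# pem_to_segments <pem-file> <data-dir>
# Hypothetical helper: converts PEM lines
#   <file-id> <side> <speaker> <start-time> <end-time>
# into Kaldi segments and utt2spk files, using <file-id>-<A|B> as the
# speaker id (one speaker per conversation side).
pem_to_segments() {
  pem=$1; dir=$2
  mkdir -p "$dir"
  awk -v dir="$dir" '!/^;;/ && NF >= 5 {
    spk = $1 "-" toupper($2)                     # <speaker-id> = <file-id>-<A|B>
    # Utterance id: speaker id prefix plus times*100, zero-padded so that
    # string order equals numeric order (padding width is an assumption).
    utt = sprintf("%s-%06d-%06d", spk, $4*100 + 0.5, $5*100 + 0.5)
    print utt, spk, $4, $5 > (dir "/segments")   # <utt-id> <recording-id> <start> <end>
    print utt, spk > (dir "/utt2spk")            # <utt-id> <speaker-id>
  }' "$pem"
  sort -o "$dir/segments" "$dir/segments"
  sort -o "$dir/utt2spk" "$dir/utt2spk"
}
# Run it on the real PEM file, then build spk2utt:
# pem_to_segments /export/corpora2/LDC/LDC2002S09/hub5e_00/english/hub5e_00.pem data/eval2000
# utils/utt2spk_to_spk2utt.pl data/eval2000/utt2spk > data/eval2000/spk2utt
```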
cp /export/a16/vmanoha1/jhu_tutorial/{stm,glm} data/eval2000
The scoring script score_sclite.sh will be used to score the decoded directory:
cp /export/a16/vmanoha1/jhu_tutorial/score_sclite.sh local/
Make the decoding graph from the 3-gram LM lang directory:
utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \
  exp/chain/tdnn_7b exp/chain/tdnn_7b/graph_pp
Extract high-resolution MFCC features and compute CMVN stats:
utils/copy_data_dir.sh data/eval2000 data/eval2000_hires
steps/make_mfcc.sh --mfcc-config conf/mfcc_hires.conf --nj 30 data/eval2000_hires
steps/compute_cmvn_stats.sh data/eval2000_hires
utils/fix_data_dir.sh data/eval2000_hires
Extract i-vectors for the test data:
steps/online/nnet2/extract_ivectors.sh --nj 30 --cmd "queue.pl --mem 2G" \
data/eval2000_hires data/lang_pp_test exp/nnet3/extractor exp/nnet3/ivectors_eval2000
Decode with the chain model (scoring is skipped here and done separately):
steps/nnet3/decode.sh --nj 30 --cmd 'queue.pl --mem 4G' --config conf/decode.config \
--acwt 1.0 --post-decode-acwt 10.0 \
--frames-per-chunk 50 --skip-scoring true \
--online-ivector-dir exp/nnet3/ivectors_eval2000 \
exp/chain/tdnn_7b/graph_pp data/eval2000_hires \
exp/chain/tdnn_7b/decode_eval2000_pp_tg
Score using the modified scoring script:
local/score_sclite.sh --cmd queue.pl --min-lmwt 8 --max-lmwt 12 \
data/eval2000_hires exp/chain/tdnn_7b/graph_pp exp/chain/tdnn_7b/decode_eval2000_pp_tg
grep Sum exp/chain/tdnn_7b/decode_eval2000_pp_tg/scor*/*.sys | utils/best_wer.sh
# %WER 10.7 | 1831 21395 | 90.7 6.4 2.9 1.4 10.7 47.8 | exp/chain/tdnn_7b/decode_eval2000_pp_tg/score_9_0.0/eval2000_hires.ctm.swbd.filt.sys