Preparing your own data for decoding using a pre-trained acoustic model

This page shows how to prepare your own data for decoding with a pre-trained Kaldi acoustic model. Here we will use a TDNN chain model trained on the Fisher corpus and, as an example, prepare the eval2000 dataset from scratch. This simulates the real-world scenario where the data is not already in the format Kaldi requires.

Download the pre-trained Kaldi TDNN chain model

Download the TDNN chain model into the Kaldi aspire recipe directory (not the IBAN one).

cd egs/aspire/s5
wget http://kaldi-asr.org/models/1/0001_aspire_chain_model.tar.gz
tar xzvf 0001_aspire_chain_model.tar.gz

The directories required are:

exp/chain/tdnn_7b     # chain model
exp/nnet3/extractor   # i-vector extractor
data/lang_pp_test     # 3-gram LM lang directory
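
Before proceeding, it can help to verify that the archive actually produced these directories. A trivial shell check (nothing Kaldi-specific; it assumes you ran tar from egs/aspire/s5) might be:

for d in exp/chain/tdnn_7b exp/nnet3/extractor data/lang_pp_test; do
  [ -d $d ] || echo "missing: $d"
done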

Prepare eval2000 data directory

Create a new data directory data/eval2000 containing the following files: wav.scp, reco2file_and_channel, segments, utt2spk and spk2utt. The steps are as follows:

  1. The audio files are NIST SPHERE format (*.sph) files in /export/corpora2/LDC/LDC2002S09/hub5e_00/english.
    The two sides of the conversation are stored in separate channels and can be extracted on the fly by adding the following lines to wav.scp, one for each channel of every file:

     <file-id>-A sph2pipe -f wav -p -c 1 <filename> |
     <file-id>-B sph2pipe -f wav -p -c 2 <filename> |

    where <file-id> is an arbitrary unique id for the file (usually the basename).

     Also create a file `reco2file_and_channel`, which maps each recording
     (one side of the conversation) to its file and channel:

     <file-id>-<A|B> <file-id> <A|B>

     A sketch that generates both wav.scp and reco2file_and_channel follows this list.
  2. The segment information is in the PEM file /export/corpora2/LDC/LDC2002S09/hub5e_00/english/hub5e_00.pem in the format:

     <file-id> <side> <speaker> <start-time> <end-time>

    where <side> is A or B for channels 1 and 2, respectively.

  3. Use the PEM file to create segments and utt2spk file. The format of the segments file is:

     <utterance-id> <file-id>-<A|B> <start-time> <end-time>

    The format of the utt2spk file is:

     <utterance-id> <speaker-id>

    For this task, each side of the conversation has only one speaker, so an appropriate <speaker-id> is <file-id>-<A|B>.
    Note that the <utterance-id> has to be unique and must contain the <speaker-id> as a prefix for correct sorting. This is typically done by adding the timing information or the segment count as a suffix, e.g.:

     <speaker-id>-<start-time>-<end-time>

    A sketch that builds segments, utt2spk and spk2utt from the PEM file follows this list.
  4. The transcriptions are in /export/corpora2/LDC/LDC2002T43. We will use modified versions of these (stm and glm files) for scoring:

     cp /export/a16/vmanoha1/jhu_tutorial/{stm,glm} data/eval2000
  5. The scoring script score_sclite.sh will be used to score the decode directory:

     cp /export/a16/vmanoha1/jhu_tutorial/score_sclite.sh local/
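
As promised in step 1, here is a minimal sketch that generates wav.scp and reco2file_and_channel. The paths come from the list above; that sph2pipe is on your PATH is an assumption, so adjust as needed.

# Sketch: build wav.scp and reco2file_and_channel from the SPHERE files.
# Assumes sph2pipe is on your PATH.
audio_dir=/export/corpora2/LDC/LDC2002S09/hub5e_00/english
dir=data/eval2000
mkdir -p $dir
rm -f $dir/wav.scp $dir/reco2file_and_channel
for f in $audio_dir/*.sph; do
  file_id=$(basename $f .sph)
  # Channel 1 is side A, channel 2 is side B.
  echo "${file_id}-A sph2pipe -f wav -p -c 1 $f |" >> $dir/wav.scp
  echo "${file_id}-B sph2pipe -f wav -p -c 2 $f |" >> $dir/wav.scp
  echo "${file_id}-A ${file_id} A" >> $dir/reco2file_and_channel
  echo "${file_id}-B ${file_id} B" >> $dir/reco2file_and_channel
done
sort -o $dir/wav.scp $dir/wav.scp
sort -o $dir/reco2file_and_channel $dir/reco2file_and_channel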
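
Similarly, for steps 2 and 3, here is a sketch that derives segments, utt2spk and spk2utt from the PEM file. The ';;' comment filter and the zero-padded centisecond suffix on the utterance ids are assumptions on top of the PEM format described above; utils/utt2spk_to_spk2utt.pl and utils/validate_data_dir.sh are standard Kaldi utilities.

pem=/export/corpora2/LDC/LDC2002S09/hub5e_00/english/hub5e_00.pem
dir=data/eval2000
# Assumption: comment lines in the PEM start with ';;'.
grep -v '^;;' $pem | awk -v dir=$dir '{
  spk = $1 "-" $2;                      # <speaker-id> = <file-id>-<A|B>
  # Zero-padded centisecond times keep string order consistent with time order.
  utt = spk "-" sprintf("%06d", $4 * 100 + 0.5) "-" sprintf("%06d", $5 * 100 + 0.5);
  print utt, spk, $4, $5 > (dir "/segments");   # <utt-id> <reco-id> <start> <end>
  print utt, spk         > (dir "/utt2spk");
}'
sort -u -o $dir/segments $dir/segments
sort -u -o $dir/utt2spk $dir/utt2spk
utils/utt2spk_to_spk2utt.pl $dir/utt2spk > $dir/spk2utt
utils/validate_data_dir.sh --no-feats --no-text $dir

If the validation reports sorting or consistency problems, utils/fix_data_dir.sh data/eval2000 can usually repair them.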

Prepare decoding graph

utils/mkgraph.sh --self-loop-scale 1.0 data/lang_pp_test \
  exp/chain/tdnn_7b exp/chain/tdnn_7b/graph_pp

Prepare MFCC features and i-vectors

utils/copy_data_dir.sh data/eval2000 data/eval2000_hires
steps/make_mfcc.sh --mfcc-config conf/mfcc_hires.conf --nj 30 data/eval2000_hires
steps/compute_cmvn_stats.sh data/eval2000_hires
utils/fix_data_dir.sh data/eval2000_hires

steps/online/nnet2/extract_ivectors.sh --nj 30 --cmd "queue.pl --mem 2G" \
  data/eval2000_hires data/lang_pp_test exp/nnet3/extractor exp/nnet3/ivectors_eval2000

Decoding and scoring

steps/nnet3/decode.sh --nj 30 --cmd 'queue.pl --mem 4G' --config conf/decode.config \
  --acwt 1.0 --post-decode-acwt 10.0 \
  --frames-per-chunk 50 --skip-scoring true \
  --online-ivector-dir exp/nnet3/ivectors_eval2000 \
  exp/chain/tdnn_7b/graph_pp data/eval2000_hires \
  exp/chain/tdnn_7b/decode_eval2000_pp_tg

local/score_sclite.sh --cmd queue.pl --min-lmwt 8 --max-lmwt 12 \
  data/eval2000_hires exp/chain/tdnn_7b/graph_pp exp/chain/tdnn_7b/decode_eval2000_pp_tg

grep Sum exp/chain/tdnn_7b/decode_eval2000_pp_tg/scor*/*.sys | utils/best_wer.sh
# %WER 10.7 | 1831 21395 | 90.7 6.4 2.9 1.4 10.7 47.8 | exp/chain/tdnn_7b/decode_eval2000_pp_tg/score_9_0.0/eval2000_hires.ctm.swbd.filt.sys