When you check out the Kaldi source tree (see Downloading and installing Kaldi), you will find many sets of example scripts in the egs/ directory.
This table summarizes some key facts about some of those example scripts; however, it is not an exhaustive list.
Name | BW | Lang | Train Domain | Train Hours | Train Speakers | License and Availability | Year Released | Speech Style | Test Domain | Kaldi Approx. Perf | Model Type | LM Data | Lexicon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AMI | 16k | English (+non-native) | Microphone: head-mike, single and multiple distance mikes | 100 | 123 M 66 F | Free / Download http://groups.inf.ed.ac.uk/ami/corpus/ | 2014 | Meeting room | Same as train no overlap(?) | ~25% WER head (T)DNN ~45% WER distant (B)LSTM | AMI + (opt) Fisher | 50K (CMU dict + kaldi sources) | |
Aspire | English | Conversational microphone developed on telephone | see Fisher | 2015 | 30.8% WER (dev or eval?) | ||||||||
WSJ | 16k | English | Clean close-mic read speech | 80 | LDC LDC93S6B (WSJ0) and LDC94S13B (WSJ1) | 1993 | Read speech | Same | 6-7% WER | same as train | 20k (CMU dict) | ||
RM | English | read transcript limited vocab and grammar | LDC LDC93S3A | 1987-1989 | read speech | same | 1-2% WER | predefined grammar | <1K RM dict | ||||
Timit | 16k | English | read transcript very limited grammar | 630 | 1986 | read speech | same | ~30-40% PER | none | ~47 phones | |||
fisher_english | 8k | English | Telephone speech Auto-transcribed (errorful transcriptions) | 1,600 | 5203 M 7198 F | LDC speech: LDC2004S13, LDC2005S13 transcript: LDC2004T19, LDC2005T19 | 2004/2005 | CTS | Fisher (may overlap with train) | ~22% WER (DNN) | LDC Fisher | CMU dict Size UNK | |
Switchboard 1 | 8k | English | CTS | 300 | LDC Train: LDC97S62 Mississippi State transcriptions Eval: LDC2002S09 and LDC2002T43 | 1993/1997/2000 | CTS | CTS eval2000 (hub5) | ~10% WER (LSTM) | Mississippi Trans + (opt) Fisher | 30K (CMU dict) | ||
Switchboard 1 + Fisher | 8k | English | CTS | see above | see above | see above | see above | CTS | eval2000 rt03 | ~12% eval2000 ~19% rt03 | see above | see above | |
Callhome Egyptian | Egyptian Colloquial Arabic | CTS | 120 conv | LDC Speech : LDC97S45 Transcripts : LDC97T19 Lexicon : LDC99L22 | 1997 | CTS | hub5 arabic LDC2002S22 LDC2002T39 | 50-60% WER | Train trans | LDC dict | |||
Corpus of Spontaneous Japanese | Japanese | Mixed style Close-talking mic | 650 hours<br>(240 hr train) | >1,400 | Unclear how to get this http://www.ninjal.ac.jp/english/products/csj/ http://pj.ninjal.ac.jp/corpus_center/csj/ | 2004 | Mixed | 9-10% WER | UNK | UNK | |||
Fisher Spanish Callhome Spanish | Caribbean Spanish | CTS | Fisher: 163 hrs Callhome: 60 hrs? 120 30min conv | Fisher: 136 Callhome: | LDC Fisher speech : LDC96S35 Fisher transcripts : LDC96T17 Callhome Speech : LDC96S35 Callhome Transcripts : LDC96T17 | Fisher: 2010 Callhome: 1996 | CTS | Kaldi subset of Fisher | 29-30% WER | Fisher trans | LDC96L16 | ||
Gale Arabic Phase 2 | 16K | Arabic | Broadcast Conversational/Report | 320 train 9.3 test | LDC2013S02 LDC2014S07 LDC2013S07 LDC2014T17 LDC2013T17 LDC2013T04 | Collected 2006/2007 | Broadcast Conversational and Report | Report: 13% WER (LSTM) Conver: 28% WER (LSTM) Comb: 24% WER (LSTM) | LDC2013T17 LDC2013T04 LDC2014T17 | http://alt.qcri.org/ | |||
Gale Mandarin | 16K | Mandarin Chinese | Broadcast | 126 | LDC2013S08 LDC2013T20 | 2006-2007 | Broadcast | Same as train | 17.5% WER [1] | LDC2013S08 LDC2013T20 | Same as HKUST below | ||
hkust EARS RT04F data dev and train [2] | 8K | Mandarin Chinese | Telephone Conversational | ~145 | ~873 | LDC2005S15 LDC2005T32 | 2004 | Conversational | Same as train | 33.5% CER | Acoustic trans<br>(very little) | Both Eng and Man. CMU dict used for Eng, mdbg dict used for Man http://www.mdbg.net | |
librispeech [3] | 16K | English | Read transcription | 100 - 960 (460 | F: 125-1128 M: 126-1167 | http://www.openslr.org/12/ | 2015 | Read trans | Librispeech | ~5% | Large (books) | CMU (with Sequitur G2P) | |
reverb | |||||||||||||
sprakbanken | Danish | Read transcript? | 350 | Free download http://www.nb.no/sprakbanken/#ticketsfrom?lang=en | 2012 | Read/Dictation | Same as train | 14% WER | NST Provided | NST Provided? | |||
vystadial_en [4] | 8Khz | English | Telephone, dialog system | 41 | unk | Free | 2014 | Dialog sys | Same as train | ~11% WER (GMM/HMM) | Train trans | CMU + 250 | |
vystadial_cz [4] | 8Khz | Czech | Telephone, dialog system | 15 | unk | Free | 2014 | Dialog sys | Same as train | ~50% WER (GMM/HMM) | Train trans | Rule derived | |
chime3 | 16Khz | English | Read trans, simulated and real noise | 18 | WSJ0 + 4 | Not clear (Chime performers) | 2015 | Read transcript | Same as train (same channels!) | ~12% WER real (4 spkrs) ~12% WER simu | Official WSJ0 5K trans | WSJ0 | |
voxforge | 16Khz | English | Read trans | >75hrs | unk | Free GPL | 2008? | Read trans | unk | unk | Train | cmu + g2p for oov | |
Tedlium | 16KHz | English | Presentation/talk | 118 | 666 | Free download | 2014? | Presentation | Same as train | ~10% WER | Cantab provided LM | Cantab provided dict |
[1] "Audio Augmentation for Speech Recognition" Tom Ko, Vijayaditya Peddinti, Daniel Povey, Sanjeev Khudanpur.
[2] There should be more Mandarin data from rt04f - 50 hours of dev data I believe (see LDC2004E67, LDC2004E68). There should also be eval data. See https://www.ldc.upenn.edu/collaborations/past-projects/gale/data/gale-pubs.
[3] See http://www.danielpovey.com/files/2015_icassp_librispeech.pdf for details. Acoustic and language models are available online.
[4] See http://www.lrec-conf.org/proceedings/lrec2014/pdf/535_Paper.pdf.
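The performance column in the table reports error rates (WER, CER, or PER). As a rough illustration of what those numbers mean, the sketch below computes an error rate as the edit distance between reference and hypothesis tokens, divided by the number of reference tokens. It is a minimal Python sketch, not Kaldi's own scoring code. The function names (`edit_distance`, `error_rate`, `wer`, `cer`) are hypothetical, and stripping spaces before computing CER is an assumption that may not match how a given recipe scores.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences
    (substitutions, insertions and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Error rate in percent: edit distance over reference length."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(ref, hyp):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref, hyp):
    """Character error rate: tokens are characters.
    Assumption: spaces are removed before scoring."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) against 6 reference words.
    print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))
```

A phone error rate (PER) follows the same pattern with phone symbols as tokens. Over a whole test set, the rate is normally computed by summing errors across all utterances and dividing by the total number of reference tokens, rather than averaging per-utterance rates.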