History of the Kaldi project

Kaldi began its existence at the 2009 Johns Hopkins University workshop, cumbersomely titled "Low Development Cost, High Quality Speech Recognition for New Languages and Domains" (see Acknowledgements).

The focus of that project was Subspace Gaussian Mixture Model (SGMM) based modeling and some investigations into lexicon learning. The software which is now Kaldi began to be developed there, but the recipe we developed at that time was still dependent on HTK. A list of participants in that workshop, official and unofficial, is (alphabetically by last name):

Mohit Agarwal, Pinar Akyazi, Lukas Burget, Arnab Ghoshal, Ondrej Glembek, Nagendra Goel, Martin Karafiat, Feng Kai, Daniel Povey, Ariya Rastrow, Richard C. Rose, Petr Schwarz, Samuel Thomas.

Some of the participants of that workshop agreed to meet again in the summer of 2010 in Brno, Czech Republic (hosted by the Brno University of Technology). The aim of that workshop was to create a recipe based on the work done in 2009 that was clean and releasable, and to create a general-purpose speech toolkit as a byproduct. The problem we were trying to solve was that our previous recipe was based on disparate scripts involving both HTK and our own early "Kaldi" code, and was not easy to encapsulate. We also felt that a well-engineered, modern, general-purpose speech toolkit with an open license would be an asset to the speech-recognition community. During August of 2010 the following group of people met in Brno to work on this (again alphabetically):

Pinar Akyazi, Lukas Burget, Gilles Boulianne, Ondrej Glembek, Arnab Ghoshal, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Daniel Povey, Yanmin Qian, Petr Schwarz, Jan Silowsky, Georg Stemmer, and Karel Vesely.

We also had some remote help around this time and shortly afterward from Sandeep Boda, Sandeep Reddy and Haihua Xu, who helped with coding, code cleanup and documentation. We were visited by Michael Riley, who helped us to understand OpenFst and gave some lectures on FSTs. We would also like to acknowledge the help of Honza Cernocky (for negotiating the venue and some support for the workshop from the Faculty of Information Technology of BUT, and for helping to organize it), Renata Kohlova (administration), and Tomas Kasparek (system administration). It is possible that this list of contributors contains oversights; any important omissions are unlikely to be intentional.

A lot of code was written during the summer of 2010, but we still did not have a complete working system. Some of the participants of the 2010 workshop continued working to complete the toolkit and get a working set of training scripts. The code was released on May 14th, 2011, and presented to the public at ICASSP 2011 in Prague (see the recordings).

Since the initial release, Kaldi has been maintained and developed to a large extent by Daniel Povey, who worked at Microsoft Research until early 2012 and has been at Johns Hopkins University since then. Others have also made major contributions: notably Karel Vesely, who developed the neural-net training framework, and Arnab Ghoshal, who coordinated the acoustic modeling work early on. There are other major contributors whom we do not name here because it is too hard to determine where to cut off the list, and a long tail of minor contributors. The total number of people who have contributed code, scripts or patches is about 70 so far.

Acknowledgements

The JHU 2009 workshop was supported by National Science Foundation Grant Number IIS-0833652, with supplemental funding from Google Research, DARPA's GALE program and the Johns Hopkins University Human Language Technology Center of Excellence. BUT researchers were partially supported during this time by Czech Ministry of Trade and Commerce project no. FR-TI1/034, Grant Agency of Czech Republic project no. 102/08/0707, and Czech Ministry of Education project no. MSM0021630528. Arnab Ghoshal was affiliated with Saarland University, supported by the European Community's Seventh Framework Programme grant number 213850 (SCALE), and with The University of Edinburgh, supported by the United Kingdom's Engineering and Physical Sciences Research Council grant number EP/I031022/1 (Natural Speech Technology).

The work of BUT researchers on Kaldi was supported by the Technology Agency of the Czech Republic under project No. TA01011328.

We would like to acknowledge the support of Geoffrey Zweig and Alex Acero at Microsoft Research, as well as the generosity of Henrique (Rico) Malvar in allowing the use of his FFT code. Thanks are also due to Patrick Nguyen for his help in organizing the JHU'09 workshop and with the Wall Street Journal recipe. We would also like to acknowledge the help of faculty and staff at Johns Hopkins University's Center for Language and Speech Processing during the JHU'09 workshop: particularly Sanjeev Khudanpur, Desiree Cleves and the late Fred Jelinek.

Since 2012, Kaldi development has received significant support from IARPA's BABEL program (IARPA-BAA-11-02) and from the Human Language Technology Center of Excellence (HLTCOE); and since 2015, from the NSF computing research infrastructure (CRI) award "CI-EN: Enhancements for the Kaldi Speech Recognition Toolkit".

Sanjeev Khudanpur deserves special mention for creating the conditions for the Kaldi project to succeed: first at the JHU'09 workshop, where in his role as workshop organizer he was instrumental in putting the team together (e.g. suggesting that Lukas Burget be added, without whom none of this would have happened); and since 2012 by making it possible for Daniel Povey to work at Johns Hopkins University in a position that allows him to devote much of his time to Kaldi development.