Warning, this page is deprecated as it refers to the older online-decoding setup.

The page for the new setup is Online decoding in Kaldi.

There are several programs in the Kaldi toolkit that can be used for online recognition. They are all located in the src/onlinebin folder and require the files from the src/online folder to be compiled as well (you can currently compile these with "make ext"). Many of these programs also require the PortAudio library to be present in the tools folder; it can be downloaded using the appropriate script found there. The programs are as follows:

online-gmm-decode-faster - records audio from the microphone and outputs the result on stdout
online-wav-gmm-decode-faster - reads a list of WAV files and outputs the result into special output format
online-server-gmm-decode-faster - reads vectors of MFCC features from a UDP socket and prints the result to stdout
online-net-client - records audio from the microphone, converts it to vectors of features and sends them over UDP to online-server-gmm-decode-faster
online-audio-server-decode-faster - reads RAW audio data from a TCP socket and outputs aligned words to the same socket
online-audio-client - reads a list of WAV files, sends their audio over TCP to online-audio-server-decode-faster, reads the output and saves the result to the chosen file format

There is also a Java equivalent of the online-audio-client which contains slightly more features and has a GUI.

In addition, there is a GStreamer 1.0 compatible plugin that acts as a filter, taking raw audio as input and producing recognized words as output. Like the other online recognition programs, the plugin is based on OnlineFasterDecoder.

Online Audio Server

The main difference between the online-server-gmm-decode-faster and online-audio-server-decode-faster programs is the input: the former accepts feature vectors, while the latter accepts RAW audio. The advantage of the latter is that it can be deployed directly as a back-end for any client, whether it is another computer on the Internet or a mobile device. The client doesn't need to know anything about the feature set used to train the models: provided it can record standard audio at the predetermined sampling frequency and bit depth, it will always be compatible with the server. The advantage of a server that accepts feature vectors instead of audio is a lower cost of data transfer between the client and the server, but this advantage can easily be offset by using a state-of-the-art audio codec (which is something that may be done in the future).

The communication between the online-audio-client and online-audio-server-decode-faster consists of two phases: first, the client sends packets of raw audio to the server; second, the server replies with the output of the recognition. The two phases may happen asynchronously, meaning that the decoder can output results online, as soon as it is certain of them, without waiting for the end of the data to arrive. This opens up more possibilities for creating applications in the future.

Audio Data

The audio data format is currently hardcoded to be RAW 16 kHz, 16-bit, signed, little-endian (server native), linear PCM. The protocol works by splitting data into chunks and prepending each chunk with a 4-byte code containing its exact size. The size is also (server native) little-endian and can contain any value as long as it is positive and even (because of 16-bit sampling). The last packet of size 0 is treated as the end of stream and forces the decoder to dump the rest of its results and finish the recognition process.
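The framing described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code of online-audio-client: the function names and the default chunk size are hypothetical, and the size prefix is packed as a signed 32-bit integer, which the text above does not specify (for positive sizes the signed and unsigned encodings are identical).

```python
import socket
import struct

CHUNK_BYTES = 4096  # hypothetical default; any positive, even size is allowed


def frame_chunk(pcm: bytes) -> bytes:
    """Prefix one chunk of 16-bit PCM with its size as 4 little-endian bytes."""
    if len(pcm) == 0 or len(pcm) % 2 != 0:
        raise ValueError("chunk size must be positive and even")
    # "<i" = little-endian signed 32-bit integer (signedness is an assumption)
    return struct.pack("<i", len(pcm)) + pcm


def end_of_stream() -> bytes:
    """A packet of size 0 marks the end of the stream."""
    return struct.pack("<i", 0)


def send_audio(sock: socket.socket, pcm: bytes,
               chunk_bytes: int = CHUNK_BYTES) -> None:
    """Send raw 16 kHz, 16-bit, little-endian PCM as size-prefixed chunks."""
    for start in range(0, len(pcm), chunk_bytes):
        sock.sendall(frame_chunk(pcm[start:start + chunk_bytes]))
    # Tell the server that no more audio will follow.
    sock.sendall(end_of_stream())
```

A client would call `send_audio()` on a TCP socket connected to the server port; `chunk_bytes` has to be even so that no 16-bit sample is split across two chunks.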

Recognition Results

There are two types of packets sent by the server: time-aligned results and partial results. Partial results are sent as soon as the decoder recognizes each word, and time alignment is performed every time the decoder recognizes the end of an utterance (which may or may not correspond to silences and/or sentence boundaries).

Each partial result packet starts with the characters "PARTIAL:" followed by exactly one word. Each word is sent in a separate partial packet.

Time-aligned results are preceded by a header starting with the characters "RESULT:". Following that is a comma-separated list of key=value parameters containing some useful information:

NUM - number of words to follow
FORMAT - the format of the result to follow; currently only WSE (word-start-end), but other formats may be added in the future
RECO-DUR - time it took to recognize the sequence (as a floating point value in seconds)
INPUT-DUR - length of the input audio sequence (as a floating point value in seconds)

You can divide INPUT-DUR by RECO-DUR to get the recognition speed of the decoder relative to real time (a value above 1 means the decoder runs faster than real time).

The header "RESULT:DONE" is sent when there are no more results that can be returned by the server. In this case, the server simply waits either for more data from the client or for a disconnect.

The data underneath the header consists of exactly NUM lines of words formatted in the way determined by the FORMAT parameter. In the case of the WSE format, each line is simply a comma-separated list containing 3 tokens: the word (as present in the dictionary), the start time and the end time (both as floating point values in seconds). Beware that the words are encoded exactly as they are in the dictionary provided to the server; therefore, the client must perform the appropriate character conversion if necessary. The online-audio-client, for example, doesn't perform any character conversion while generating WebVTT files (which require UTF-8), so you need to convert the resulting files to UTF-8 using iconv (or a similar program).

An example of the results of the server is as follows:

 RESULT:NUM=3,FORMAT=WSE,RECO-DUR=1.7,INPUT-DUR=3.22
 one,0.4,1.2
 two,1.4,1.9
 three,2.2,3.4
 RESULT:DONE
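A client could parse such output along the following lines. This is a minimal sketch rather than the logic of online-audio-client: it assumes that the server output has already been split into text lines, that each word line has the three fields shown in the example, and it accepts both "-" and "_" in header key names. The names `AlignedResult`, `parse_results` and `real_time_speed` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AlignedResult:
    params: Dict[str, str]                 # header parameters, e.g. NUM, FORMAT
    words: List[Tuple[str, float, float]]  # (word, start seconds, end seconds)


def parse_results(lines: List[str]) -> List[AlignedResult]:
    """Collect time-aligned results until RESULT:DONE is seen."""
    results = []
    it = iter(line.strip() for line in lines)
    for line in it:
        if line == "RESULT:DONE":
            break
        if not line.startswith("RESULT:"):
            continue  # other packets (e.g. PARTIAL:) are ignored in this sketch
        params = {}
        for pair in line[len("RESULT:"):].split(","):
            key, _, value = pair.partition("=")
            params[key.replace("_", "-")] = value  # accept either spelling
        words = []
        for _ in range(int(params["NUM"])):
            # rsplit keeps any comma inside the word itself intact
            word, start, end = next(it).rsplit(",", 2)
            words.append((word, float(start), float(end)))
        results.append(AlignedResult(params, words))
    return results


def real_time_speed(result: AlignedResult) -> float:
    """INPUT-DUR divided by RECO-DUR; above 1 means faster than real time."""
    return float(result.params["INPUT-DUR"]) / float(result.params["RECO-DUR"])
```

Applied to the example above, `parse_results()` yields a single result with three words, and `real_time_speed()` divides 3.22 by 1.7.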

Example Usage

Command line to start the server:

online-audio-server-decode-faster --verbose=1 --rt-min=0.5 --rt-max=3.0 --max-active=6000 --beam=72.0 --acoustic-scale=0.0769 \
final.mdl graph/HCLG.fst graph/words.txt '1:2:3:4:5' graph/word_boundary.int 5010 final.mat

Arguments are as follows:

final.mdl - the acoustic model
HCLG.fst - the complete FST
words.txt - word dictionary (mapping word ids to their textual representation)
'1:2:3:4:5' - list of silence phoneme ids
word_boundary.int - phoneme boundary information required for word alignment
5010 - port the server is listening on
final.mat - feature LDA matrix

Command line to start the client:

 online-audio-client --htk --vtt localhost 5010 scp:test.scp

Arguments are as follows:

--htk - save results as an HTK label file
--vtt - save results as a WebVTT file
localhost - server to connect to
5010 - port to connect to
scp:test.scp - list of WAV files to send

Command line to start the Java client:

java -jar online-audio-client.jar

Or simply double-click the JAR file in the graphical interface.

GStreamer plugin

The Kaldi toolkit comes with a plugin for the GStreamer media streaming framework (version 1.0 or compatible). The plugin acts as a filter that accepts raw audio as input and produces recognized words as output.

The main benefit of the plugin is the fact that it makes Kaldi's online speech recognition functionality available to all programming languages that support GStreamer 1.0 (that includes Python, Ruby, Java, Vala and many more). It also simplifies the integration of the Kaldi online decoder in applications since communicating with the decoder follows GStreamer standards.

Installation

The source of the GStreamer plugin is located in the `src/gst-plugin` directory. To compile the plugin, the rest of the Kaldi toolkit has to be compiled as shared libraries. To do this, invoke `configure` with the `--shared` flag. Also compile the online extensions (`make ext`).

Make sure the package that provides GStreamer 1.0 development headers is installed on your system. On Debian Jessie (current 'testing' distribution), the needed package is called `libgstreamer1.0-dev`. On Debian Wheezy (current stable), GStreamer 1.0 is available from the backports repository (http://backports.debian.org). Install the 'good' GStreamer plugins (package `gstreamer1.0-plugins-good`) and GStreamer 1.0 tools (package `gstreamer1.0-tools`). A demo program also requires the PulseAudio GStreamer plugins (package `gstreamer1.0-pulseaudio`).

Finally, run `make depend` and `make` in the `src/gst-plugin` directory. This should result in a file `src/gst-plugin/libgstonlinegmmdecodefaster.so` which contains the GStreamer plugin.

To make GStreamer able to find the Kaldi plugin, you have to add the `src/gst-plugin` directory to its plugin search path. To do this, add the directory to the GST_PLUGIN_PATH environment variable:

export GST_PLUGIN_PATH=$KALDI_ROOT/src/gst-plugin

Of course, replace `$KALDI_ROOT` with the actual location of the Kaldi root folder on your file system.

Now, running `gst-inspect-1.0 onlinegmmdecodefaster` should provide info about the plugin:

# gst-inspect-1.0 onlinegmmdecodefaster
Factory Details:
  Rank:     none (0)
  Long-name:        OnlineGmmDecodeFaster
  Klass:            Speech/Audio
  Description:      Convert speech to text
  Author:           Tanel Alumae <tanel.alumae@phon.ioc.ee>
[..]
Element Properties:
  name                : The name of the object
                        flags: readable, writable
                        String. Default: "onlinegmmdecodefaster0"
  parent              : The parent of the object
                        flags: readable, writable
                        Object of type "GstObject"
  silent              : Determines whether incoming audio is sent to the decoder or not
                        flags: readable, writable
                        Boolean. Default: false
  model               : Filename of the acoustic model
                        flags: readable, writable
                        String. Default: "final.mdl"
  fst                 : Filename of the HCLG FST
                        flags: readable, writable
                        String. Default: "HCLG.fst"
[..]
  min-cmn-window      : Minimum CMN window used at start of decoding (adds latency only at start)
                        flags: readable, writable
                        Integer. Range: -2147483648 - 2147483647 Default: 100 

Element Signals:
  "hyp-word" :  void user_function (GstElement* object,
                                    gchararray arg0,
                                    gpointer user_data);

Usage through the command-line

The simplest way to use the GStreamer plugin is via the command line. You have to specify the model files used for decoding when launching the plugin. To do this, set the `model`, `fst`, `word-syms`, `silence-phones` and optionally the `lda-mat` plugin properties (similarly to Kaldi's command-line online decoders). The decoder accepts only 16 kHz, 16-bit mono audio. Any audio stream can be automatically converted to the required format using GStreamer's `audioresample` and `audioconvert` plugins.

For example, to decode the file `test1.wav` using the model files in `tri2b_mmi`, and have the recognized stream of words printed to stdout, execute:

gst-launch-1.0 -q filesrc location=test1.wav \
    ! decodebin ! audioconvert ! audioresample \
    ! onlinegmmdecodefaster model=tri2b_mmi/model fst=tri2b_mmi/HCLG.fst \
                            word-syms=tri2b_mmi/words.txt silence-phones="1:2:3:4:5" lda-mat=tri2b_mmi/matrix \
    ! filesink location=/dev/stdout buffer-mode=2

Note that the audio stream is segmented on the fly, with "<#s>" denoting silence.

You can easily try live decoding of microphone input by replacing `filesrc location=test1.wav` with `pulsesrc` (given that your OS uses the PulseAudio framework).

An example script that uses the plugin via the command line to process a bunch of audio files is located in `egs/voxforge/gst_demo/run-simulated.sh`.

Usage through GStreamer bindings

An example of a Python GUI program that uses the plugin via the GStreamer bindings is located in `egs/voxforge/gst_demo/run-live.py`.

In its `init_gst(self)` method, the program constructs a pipeline of GStreamer elements similar to the one in the command-line example. The model files and some decoding parameters are communicated to the `onlinegmmdecodefaster` element through the standard `set_property()` method. More interesting is this part of the code:

        self.asr.connect('hyp-word', self._on_word)

This statement tells our decoding plugin to call the GUI's `_on_word` method whenever it produces a new recognized word. The `_on_word()` method looks like this:

    def _on_word(self, asr, word):
        Gdk.threads_enter()
        if word == "<#s>":
          self.textbuf.insert_at_cursor("\n")
        else:
          self.textbuf.insert_at_cursor(word)
        self.textbuf.insert_at_cursor(" ")
        Gdk.threads_leave()

Apart from some GUI-related housekeeping, the method inserts the recognized word into the text buffer that is connected to the GUI's main text box. If a segmentation symbol is recognized, it inserts a line break instead.

Recognition start and stop are controlled by setting the `silent` property of the decoder plugin to `False` or `True`. Setting the property to `True` tells the plugin not to process any incoming audio (although the audio that is already being processed might still produce some new recognized words).
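As a small illustration of how the pipeline from the command-line example could be assembled from a program, the following Python sketch builds the pipeline description as a string. The function `build_pipeline` and its parameters are hypothetical and are not part of the plugin or of `run-live.py`; only the element and property names are taken from the sections above.

```python
from typing import Optional


def build_pipeline(source: str, model: str, fst: str, word_syms: str,
                   silence_phones: str, lda_mat: Optional[str] = None,
                   sink: str = "filesink location=/dev/stdout buffer-mode=2") -> str:
    """Return a pipeline description in gst-launch syntax."""
    props = [("model", model), ("fst", fst), ("word-syms", word_syms),
             ("silence-phones", silence_phones)]
    if lda_mat is not None:  # the LDA matrix is optional
        props.append(("lda-mat", lda_mat))
    decoder = "onlinegmmdecodefaster " + " ".join(
        f'{key}="{value}"' for key, value in props)
    # audioconvert and audioresample bring any input to the required format
    elements = [source, "decodebin", "audioconvert", "audioresample",
                decoder, sink]
    return " ! ".join(elements)


description = build_pipeline(
    "filesrc location=test1.wav",
    model="tri2b_mmi/model", fst="tri2b_mmi/HCLG.fst",
    word_syms="tri2b_mmi/words.txt", silence_phones="1:2:3:4:5",
    lda_mat="tri2b_mmi/matrix")
```

The resulting string uses the same syntax as the `gst-launch-1.0` command line, which is also the syntax accepted by `Gst.parse_launch()` in the GStreamer Python bindings; passing `"pulsesrc"` as the source corresponds to the live microphone variant.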

Overview of the online decoders' code

The main components used in the implementation of the online decoders are as follows:

OnlineAudioSource - represents a source of raw audio samples. There are different implementations of this "interface", which can, for example, acquire samples from a microphone (e.g. OnlinePaSource, using PortAudio) or read them from an existing Vector (OnlineVectorSource). Technically, OnlineAudioSource is currently not a C++ interface with virtual methods; instead, the classes "implementing" it are passed as template parameters to their users.
OnlineFeatInputItf - used to read feature vectors that are computed in some manner. The computation in this case can mean feature extraction (OnlineFeInput), cepstral mean normalization (OnlineCmnInput), calculation of higher order features (OnlineDeltaInput) and so on.
OnlineDecodableDiagGmmScaled - an implementation of DecodableInterface, reading its input features from an implementation of OnlineFeatInputItf (through the OnlineFeatureMatrix proxy).
OnlineFasterDecoder - extends FasterDecoder (described in the Decoders used in the Kaldi toolkit section) to make it work with real-time input.

The process of feature extraction and transformation is implemented by forming a pipeline of several objects, each of which pulls data from the previous object in the pipeline, performs a specific type of computation and passes the result to the next one. Usually, at the start of the pipeline there is an OnlineAudioSource object, which is used to acquire the raw audio in some way. Then comes a component of type OnlineFeInput, which reads the raw samples and extracts features. OnlineFeInput can be used to produce either MFCC or PLP features, by passing a reference to the object doing the actual job (Mfcc or Plp) as a parameter to its constructor. After that, the pipeline continues with other OnlineFeatInputItf objects. For example, if we want the decoder to receive audio from a microphone using PortAudio and to use a vector of MFCC, delta and delta-delta features, the pipeline looks like:

OnlinePaSource -> OnlineFeInput<OnlinePaSource, Mfcc> -> OnlineCmnInput -> OnlineDeltaInput

NOTE: The pipeline shown above has nothing to do with the GStreamer pipeline described in the previous sections; the two structures are completely orthogonal to each other. The GStreamer plug-in makes it possible to grab audio samples from GStreamer's pipeline, by implementing a new OnlineAudioSource (GstBufferSource), and feed them into the Kaldi online decoder's pipeline.

OnlineFasterDecoder itself is completely oblivious to the details of feature computation. As is the case with the other Kaldi decoders, the decoder only "knows" that it can get scores for a given feature frame using a Decodable object. The actual type of this object, for the currently implemented example programs, is OnlineDecodableDiagGmmScaled. OnlineDecodableDiagGmmScaled doesn't directly access the final OnlineFeatInputItf either. Instead, the responsibility for bookkeeping and requesting batches of features from the pipeline, as well as coping with potential timeouts is delegated to an object of type OnlineFeatureMatrix.

The chain of function calls that results in pulling new features along the pipeline looks as follows:

in OnlineFasterDecoder::ProcessEmitting(), the decoder invokes OnlineDecodableDiagGmmScaled::LogLikelihood()
this triggers a call to OnlineFeatureMatrix::IsValidFrame(). If the requested feature frame is not already available inside the cache maintained by OnlineFeatureMatrix, a new batch of features is requested, by calling the OnlineFeatInputItf::Compute() method of the last object in the pipeline.
the Compute() method of the last object (e.g. OnlineDeltaInput in the example above) then calls the Compute() method of the previous object in the pipeline and so on to the first OnlineFeatInputItf object, which invokes OnlineAudioSource::Read().
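The pull-based call chain can be illustrated with a toy model in Python. None of the classes below exist in Kaldi: they are hypothetical stand-ins for an audio source, two feature stages and the caching proxy, and the "features" are deliberately trivial (frame energy and a running mean subtraction) so that the control flow stays visible.

```python
from typing import List


class VectorSource:
    """Stand-in for an audio source: hands out samples from a list."""
    def __init__(self, samples: List[float]):
        self.samples = list(samples)
        self.pos = 0

    def read(self, n: int) -> List[float]:
        chunk = self.samples[self.pos:self.pos + n]
        self.pos += len(chunk)
        return chunk


class EnergyInput:
    """First stage: pulls samples and emits one value (mean energy) per frame."""
    def __init__(self, source: VectorSource, frame_len: int):
        self.source, self.frame_len = source, frame_len

    def compute(self, max_frames: int) -> List[float]:
        feats = []
        for _ in range(max_frames):
            frame = self.source.read(self.frame_len)
            if len(frame) < self.frame_len:
                break  # source exhausted
            feats.append(sum(s * s for s in frame) / self.frame_len)
        return feats


class MeanNormInput:
    """Later stage: pulls from the previous stage, subtracts a running mean."""
    def __init__(self, prev):
        self.prev, self.total, self.count = prev, 0.0, 0

    def compute(self, max_frames: int) -> List[float]:
        out = []
        for f in self.prev.compute(max_frames):  # pull from upstream
            self.total += f
            self.count += 1
            out.append(f - self.total / self.count)
        return out


class FeatureCache:
    """Caching proxy: requests a new batch only when a frame is missing."""
    def __init__(self, last_stage, batch_size: int):
        self.last, self.batch_size, self.frames = last_stage, batch_size, []

    def is_valid_frame(self, index: int) -> bool:
        while index >= len(self.frames):
            batch = self.last.compute(self.batch_size)
            if not batch:
                return False  # nothing left in the pipeline
            self.frames.extend(batch)
        return True

    def get_frame(self, index: int) -> float:
        return self.frames[index]
```

A consumer only ever asks the cache whether a frame is valid; each such request that misses the cache propagates backwards through `compute()` calls until the source is read, which mirrors the order of calls listed above.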

OnlineFasterDecoder inherits most of its implementation from FasterDecoder, but its Decode() method processes the feature vectors in batches of configurable size and returns a code corresponding to the current decoder state. There are three different codes:

kEndFeats - there are no more features left in the pipeline.
kEndUtt - signals the end of the current utterance, which can be used, for example, to separate the utterances by newline characters, as in our demos, or for other application-dependent feedback. An utterance is considered to be finished when a sufficiently long silence is detected in the traceback. The number of silence frames that are interpreted as an utterance delimiter is controlled by OnlineDecoderOpts::inter_utt_sil.
kEndBatch - nothing extraordinary has happened; the decoder just finished processing the current batch of feature vectors.
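An application loop driven by these codes might look like the following Python sketch. It is not Kaldi's API: `DecodeState`, `FakeDecoder` and `transcribe` are hypothetical names, and for brevity the fake `decode()` returns the newly recognized words together with the state, whereas the text above only says that Decode() returns a state code.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class DecodeState(Enum):
    END_FEATS = 1  # no more features left in the pipeline
    END_UTT = 2    # end of the current utterance
    END_BATCH = 3  # the current batch was processed, nothing special happened


class FakeDecoder:
    """Hypothetical stand-in that replays scripted (state, new words) pairs."""
    def __init__(self, script: Iterable[Tuple[DecodeState, List[str]]]):
        self._script = iter(script)

    def decode(self) -> Tuple[DecodeState, List[str]]:
        return next(self._script)


def transcribe(decoder: FakeDecoder) -> str:
    """Collect words batch by batch, one output line per utterance."""
    lines: List[List[str]] = [[]]
    while True:
        state, words = decoder.decode()
        lines[-1].extend(words)
        if state is DecodeState.END_UTT:
            lines.append([])  # start a new line for the next utterance
        elif state is DecodeState.END_FEATS:
            break             # input exhausted, stop decoding
    return "\n".join(" ".join(line) for line in lines if line)
```

The loop keeps calling `decode()` after END_BATCH and END_UTT and only stops on END_FEATS, which is the behaviour the three codes suggest for a streaming application.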

OnlineFasterDecoder makes it possible to get recognition results as soon as possible by implementing partial traceback. The back-pointers of all active FasterDecoder::Token objects are followed back in time until a point is reached where there is only one active token. This (immortal) token is a common ancestor of all currently active tokens. Moreover, by keeping track of the previous such token (from the previously processed batches), the decoder can produce a partial result by taking the output symbols along the traceback from the current immortal token to the previous one.
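The idea of partial traceback can be sketched in Python as follows. This is a simplified model and not the OnlineFasterDecoder implementation: the `Token` class and both functions are hypothetical, and the sketch assumes that every active token is the same number of back-pointer steps away from the start, so that all of them can be stepped back in lockstep.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # eq=False keeps identity-based hashing for use in sets
class Token:
    prev: Optional["Token"]
    word: Optional[str] = None  # output symbol on the arc into this token


def find_immortal(active: List[Token]) -> Optional[Token]:
    """Follow back-pointers of all active tokens until only one token remains."""
    frontier = set(active)
    while len(frontier) > 1:
        stepped = {tok.prev for tok in frontier if tok.prev is not None}
        if not stepped:
            return None  # the paths never merge
        frontier = stepped
    return next(iter(frontier), None)


def words_between(current: Token, previous: Optional[Token]) -> List[str]:
    """Output symbols from the previous immortal token up to the current one."""
    words = []
    tok: Optional[Token] = current
    while tok is not None and tok is not previous:
        if tok.word is not None:
            words.append(tok.word)
        tok = tok.prev
    return words[::-1]  # collected backwards, so reverse into time order


# Two competing hypotheses that share the prefix "one".
root = Token(None)
a = Token(root, "one")
b = Token(a)
c1, c2 = Token(b, "two"), Token(b, "too")
d1, d2 = Token(c1), Token(c2)
immortal = find_immortal([d1, d2])     # the shared ancestor b
first_part = words_between(immortal, None)
```

Here only "one" can be emitted, because the two hypotheses still disagree about the following word; once a later batch leaves a single surviving path, the words between `b` and the new immortal token become the next partial result.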