The CUDA Matrix library

The CUDA matrix library provides access to GPU-based matrix operations with an interface similar to the Kaldi Matrix library.

The general principle is that if you want a particular part of the computation to run on the GPU, you declare the relevant quantities as type CuMatrix or CuVector instead of Matrix or Vector. Then, if you have configured Kaldi to use the GPU and the Kaldi program you are running has initialized access to the GPU, those operations will run on the GPU; otherwise, they will run on the CPU. Under the same two conditions (Kaldi configured for GPU, and the program has initialized the GPU device), CuMatrix and CuVector quantities store their contents in GPU memory.
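The fallback rule described here can be pictured with a small, self-contained sketch. It is not Kaldi's code: g_gpu_initialized, UseGpuPath and BackendName are hypothetical names made up for illustration. The sketch only shows that two conditions must both hold before an operation takes the GPU path: the build defines HAVE_CUDA=1, and the program has initialized the device at run time.

```cpp
#include <string>

// Assumption: a build without CUDA leaves HAVE_CUDA undefined or set to 0.
#ifndef HAVE_CUDA
#define HAVE_CUDA 0
#endif

// Hypothetical run-time flag. A real program would set it once it has
// successfully initialized the GPU device.
static bool g_gpu_initialized = false;

// Returns true only when both conditions hold: CUDA support was compiled
// in, and the program initialized the device at run time.
inline bool UseGpuPath() {
#if HAVE_CUDA == 1
  return g_gpu_initialized;
#else
  return false;
#endif
}

// Names the path an operation would take under the current settings.
inline std::string BackendName() {
  return UseGpuPath() ? "gpu" : "cpu";
}
```

In a build without CUDA, setting the run-time flag has no effect: the GPU branch is compiled out, so every operation takes the CPU path.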

You can't mix CuMatrix and CuVector with Matrix and Vector in matrix operations, because they live in different memory spaces, but you can copy from one to the other. Kaldi does not try to automatically decide which operations are best done on GPU: it is all under the control of the programmer.

If the configure script sees that the NVidia compilation tool nvcc is on the path when it is run, it assumes you want to compile for GPU, and will define HAVE_CUDA=1 and set other Makefile variables to enable GPU compilation. If you don't want this, you can disable it by calling configure with --use-cuda=no. If the script doesn't find the location where you installed the CUDA toolkit but you want to use it, you can use an option like --cudatk-dir=/opt/cuda-4.2. To tell whether Kaldi has been configured to use CUDA, you can grep for CUDATKDIR in the Makefile settings that configure generates; if the string appears, then it has been configured to use CUDA. In scripts, you can check the return status of the program cuda-compiled: it returns success (0) if you compiled for CUDA.

You can also tell from the logs whether a program is using the GPU. If it is using the GPU, you'll see lines like this near the top of the program's output:

LOG (nnet-train-simple:IsComputeExclusive() CUDA setup operating under Compute Exclusive Mode.
LOG (nnet-train-simple:FinalizeActiveGpu() The active GPU is [1]: Tesla K10.G2.8GB  \
    free:3519M, used:64M, total:3583M, free/total:0.982121 version 3.0

In addition to configuring CUDA at the Makefile level, any individual program that wants to use GPU operations needs to have code like the following:

#if HAVE_CUDA==1
  CuDevice::Instantiate().SelectGpuId(use_gpu);
#endif

where use_gpu is a string, typically a command-line option, that can take the following values:

  • "yes": use the GPU (or crash if one is not available).
  • "no": don't use the GPU.
  • "optional": use the GPU if the machine it's running on has GPUs attached.
  • "wait": like "yes" but if the GPUs are running other processes, the program will wait indefinitely until one becomes free.
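To make the four values concrete, here is a minimal sketch of how a program might validate the option string before acting on it. The names UseGpuMode, ParseUseGpu and GpuIsRequired are hypothetical and are not part of Kaldi; the sketch only encodes the meanings listed above, and treats "yes" and "wait" as the two values for which falling back to the CPU is not acceptable.

```cpp
#include <string>

// Hypothetical enum mirroring the four documented option values.
enum class UseGpuMode { kYes, kNo, kOptional, kWait };

// Maps the option string to a mode. Returns false, leaving *mode
// untouched, if the string is not one of the four documented values.
inline bool ParseUseGpu(const std::string &s, UseGpuMode *mode) {
  if (s == "yes") { *mode = UseGpuMode::kYes; return true; }
  if (s == "no") { *mode = UseGpuMode::kNo; return true; }
  if (s == "optional") { *mode = UseGpuMode::kOptional; return true; }
  if (s == "wait") { *mode = UseGpuMode::kWait; return true; }
  return false;
}

// True if the program must obtain a GPU ("yes" fails without one,
// "wait" blocks until one is free) rather than fall back to the CPU.
inline bool GpuIsRequired(UseGpuMode mode) {
  return mode == UseGpuMode::kYes || mode == UseGpuMode::kWait;
}
```

Rejecting unknown strings up front means a typo in the option is reported immediately instead of silently selecting the CPU.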

If a program doesn't take the --use-gpu command-line option, that generally means that it hasn't been programmed to support GPU operations, even if the code it runs contains the CuVector and CuMatrix types. Usually we only run specific tasks on the GPU, mainly neural net training.

NVidia GPUs (which are the only kind Kaldi supports) have various "compute modes": "default", "process exclusive", "thread exclusive". The compute mode controls whether or not the GPU is configured to run multiple processes at the same time. Kaldi is intended to be run in "exclusive mode"; whether it's process exclusive or thread exclusive doesn't matter. You can find out what mode your GPU is running in as follows:

# nvidia-smi  --query | grep 'Compute Mode'
    Compute Mode                    : Exclusive_Process

You can set the correct mode by typing nvidia-smi -c 3. You might want to do this in a startup script so it happens each time you reboot.
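If a tool needs to check the mode programmatically, one option is to parse the query output shown above. The following is only a sketch: ParseComputeMode and IsExclusiveMode are hypothetical helpers, and the check for an "Exclusive" prefix is an assumption based on the sample value Exclusive_Process rather than on a documented list of values.

```cpp
#include <cstddef>
#include <string>

// Extracts the value from the first "Compute Mode : <value>" line in
// text. Returns an empty string if no such line is found.
inline std::string ParseComputeMode(const std::string &text) {
  const std::string key = "Compute Mode";
  const std::size_t pos = text.find(key);
  if (pos == std::string::npos) return "";
  // End of the line holding the key (npos if it is the last line).
  const std::size_t end = text.find('\n', pos);
  const std::size_t colon = text.find(':', pos + key.size());
  if (colon == std::string::npos) return "";
  if (end != std::string::npos && colon > end) return "";
  std::string value = (end == std::string::npos)
                          ? text.substr(colon + 1)
                          : text.substr(colon + 1, end - colon - 1);
  // Trim surrounding spaces, tabs and carriage returns.
  const std::size_t first = value.find_first_not_of(" \t\r");
  if (first == std::string::npos) return "";
  const std::size_t last = value.find_last_not_of(" \t\r");
  return value.substr(first, last - first + 1);
}

// Assumption: exclusive modes are reported with an "Exclusive" prefix,
// as in the sample output "Exclusive_Process".
inline bool IsExclusiveMode(const std::string &mode) {
  return mode.rfind("Exclusive", 0) == 0;
}
```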

Rather than calling the malloc and free functions that NVidia provides each time, Kaldi caches previously released memory so that it doesn't incur the overhead of NVidia's malloc. This was done because at one point we were running in Amazon's cloud and found that NVidia's malloc was very slow. This was probably caused by the virtualization, and we're not sure whether that problem still exists. However, the memory caching can cause allocation failures if for some reason you run using the default (non-exclusive) compute mode. You can disable it at the code level by setting CuAllocatorOptions::cache_memory to false, if needed.
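The idea of caching released memory can be illustrated with a small sketch. This is not Kaldi's allocator: CachingAllocator is a hypothetical class, std::malloc and std::free stand in for the device allocation calls, and the caller passes the block size to Free only to keep the example short. The cache_memory constructor argument plays the role that the text assigns to the cache_memory option.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Hypothetical caching allocator: released blocks are kept, indexed by
// size, and handed out again instead of going back to the system.
class CachingAllocator {
 public:
  explicit CachingAllocator(bool cache_memory = true)
      : cache_memory_(cache_memory) {}
  CachingAllocator(const CachingAllocator &) = delete;
  CachingAllocator &operator=(const CachingAllocator &) = delete;

  // Returns all cached blocks to the system.
  ~CachingAllocator() {
    for (auto &entry : free_blocks_)
      for (void *p : entry.second) std::free(p);
  }

  // Reuses a cached block of exactly this size if one exists;
  // otherwise falls back to the (slow) system allocation.
  void *Malloc(std::size_t size) {
    auto it = free_blocks_.find(size);
    if (it != free_blocks_.end() && !it->second.empty()) {
      void *p = it->second.back();
      it->second.pop_back();
      ++num_cache_hits_;
      return p;
    }
    ++num_system_allocs_;
    return std::malloc(size);
  }

  // Caches the block for later reuse, or releases it immediately
  // when caching is disabled.
  void Free(void *p, std::size_t size) {
    if (p == nullptr) return;
    if (cache_memory_) {
      free_blocks_[size].push_back(p);
    } else {
      std::free(p);
    }
  }

  int num_system_allocs() const { return num_system_allocs_; }
  int num_cache_hits() const { return num_cache_hits_; }

 private:
  bool cache_memory_;
  int num_system_allocs_ = 0;
  int num_cache_hits_ = 0;
  std::map<std::size_t, std::vector<void *> > free_blocks_;
};
```

Note that cached blocks stay reserved by the process until the allocator is destroyed. On a GPU shared between processes, memory held this way would not be available to the others, which is one plausible reading of why caching and the non-exclusive mode interact badly.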