Parallelization in Kaldi


Kaldi is designed to work best with software such as Sun GridEngine or other software that works on a similar principle; and if multiple machines are to work together in a cluster then they need access to a shared file system such as one based on NFS. However, Kaldi can easily be configured to run on a single machine.

If you look at a top-level example script like egs/wsj/s5/, you'll see commands like

 steps/  --cmd "$train_cmd" \
   4200 40000 data/train_si284 data/lang exp/tri3b_ali_si284 exp/tri4a

At the top of the script you'll see it sourcing a file called

. ./

and in you'll see the following variable being set:

export train_cmd=" -l arch=*64"

You'll change this variable if you don't have GridEngine or if your queue is configured differently from CLSP@JHU. To run everything locally on a single machine you can set export

In steps/ the varible cmd is set to the argument to the –cmd option, i.e. to -l arch=*64 in this case, and in the script you'll see commands like the following:

      $cmd JOB=1:$nj $dir/log/fmllr.$x.JOB.log \
        ali-to-post "ark:gunzip -c $dir/ali.JOB.gz|" ark:-  \| \
        weight-silence-post $silence_weight $silphonelist $dir/$x.mdl ark:- ark:- \| \
        gmm-est-fmllr --fmllr-update-type=$fmllr_update_type \
        --spk2utt=ark:$sdata/JOB/spk2utt $dir/$x.mdl \
        "$feats" ark:- ark:$dir/tmp_trans.JOB || exit 1;

What's going on is that the command $cmd (e.g. or is being executed; it is responsible for spawning the jobs and waiting until they are done, and returning with nonzero status if something went wrong. The basic usage of these commands (and there are others called and is like this: <options> <log-file> <command>

and the simplest possible example of using one of these scripts is: foo.log echo hello world

(we're using for the example because it will run on any system, it doesn't require GridEngine). It's possible to run a one-dimensional array of jobs, and an example is: JOB=1:10 foo.JOB.log echo hello world number JOB

and these programs will replace any instance of JOB in the command line with a number within that range, so make sure that your working directory doesn't contain the string JOB, or bad things may happen. You can even submit jobs with pipes and redirection by suitable use of quoting or escaping: JOB=1:10 foo.JOB.log echo "hello world number JOB" \| head -n 1 \> output.JOB

In this case, the command that actually gets executed will be something like:

echo "hello world number JOB" | head -n 1 > output.JOB

If you want to see what's actually getting executed, you can look in a file like foo.1.log, where you'll see the following:

# echo "hello world number 1" | head -n 1 > output.1
# Started at Sat Jan  3 17:44:20 PST 2015
# Accounting: time=0 threads=1
# Ended (code 0) at Sat Jan  3 17:44:20 PST 2015, elapsed time 0 seconds

Common interface of parallelization tools

In this section we discuss the commonalities of the parallelization tools. They are designed to all be interchangeable on the command line, so that a script that is tested for one of these parallelization tools will work for any; you can switch over to using another by setting the $cmd variable to a different value.

The basic usage of these tools is, as we said, <options> <log-file> <command>

and what we are about to say also holds for, and

<options> may include some or all of the following:

  • A job range specifier (e.g. JOB=1:10). The name is uppercase by convention only, and may include underscores. The starting index must be 1 or more; this is a GridEngine limitation.
  • Anything that looks as if it would be accepted by GridEngine as an option to qsub. For example, -l arch=*64*, or -l mem_free=6G,ram_free=6G, or -pe smp 6. For compatibility, scripts other than will ignore such options.
  • New-style options like –mem 10G (see below).

<log-file> is just a filename, which for array jobs must contain the identifier of the array (e.g. exp/foo/log/process_data.JOB.log).

<command> can basically be anything, including symbols that would be interpreted by the shell, but of course can't process something if it gets interpreted by bash first. For instance, this is WRONG: test.log  echo foo | awk 's/f/F/';

because won't even see the arguments after the pipe symbol, but will get its standard output piped into the awk command. Instead you should write test.log  echo foo \| awk 's/f/F/';

You need to escape or quote the pipe symbol, and also things like ";" and ">". If one of the arguments in the <command> contains a space, then will assume you quoted it for a reason, and will quote it for you when it gets passed to bash. It quotes using single quotes by default, but if the string itself contains single quotes then it uses double quotes instead. This usually does what we want. The PATH variable from the shell that you executed from will be passed through to the scripts that get executed, and just to be certain you get everything you need, the file ./ will also be sourced. The commands will be executed with bash.

New-style options (unified interface)

When we originally wrote Kaldi, we made the example scripts pass in options like -l ram_free=6G,mem_free=6G to, when we needed to specify things like memory requirements. Because the scripts like steps/ can't make assumptions about how GridEngine is configured or whether we are using GridEngine at all, such options had to be passed in from the very outer level of the scripts, which is awkward. We have more recently defined a "new-style interface" to the parallelization scripts, such that they all accept the following types of options (examples shown):

   --config conf/queue_mod.conf
   --mem 10G
   --num-threads 6
   --max-jobs-run 10
   --gpu 1

The config file specifies how to convert new-style options into a form that GridEngine (or your grid software of choice) can interpret. Currently only actually interprets these options; the other scripts ignore them. Our plan is to gradually modify the scripts in steps/ to make use of the new-style options, where necessary, and to use for other varieties of grid software through use of the config files (where possible). will read conf/queue.conf if it exists; otherwise it will default to a particular config file that we define in the code. The config file specifies how to convert the "new-style" options into options that GridEngine or similar software can interpret. The following example show the behavior that the default config file specifies:

New-style option Converted form (for GridEngine) Comment
–mem 10G -l mem_free=10G,ram_free=10G
–max-jobs-run 10 -tc 10 (We use this for jobs that cause too much I/O).
–num-threads 6 -pe smp 6 (general case)
–num-threads 1 (no extra options) (special case)
–gpu 1 -q g.q -l gpu=1 (general case)
–gpu 0 (no extra options) (special case for gpu=0)

It's also possible to add extra options with this general format, i.e. options that look like –foo-bar and take one argument. The default configuration tabulated above works for the CLSP grid but may not work everywhere, because GridEngine is very configurable. Thefore you may have to create a config file conf/queue.conf and edit it to work with your grid. The following configuration file is the one that defaults to if conf/queue.conf does not exist and the –config option is not specified, and may be used as a starting point for your own config file:

# Default configuration
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0          # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1  # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
default gpu=0
option gpu=0
option gpu=* -l gpu=$0 -q g.q

The line beginning with command specifies the unchanging part of the command line, and you can modify this to get it to use grid software other than GridEngine, or to specify options that you always want. The lines beginning with option specify how to transform the input options such as –mem. Lines beginning with something like "option mem=*" handle the general case (the $0 gets replaced with the actual argument to the option), while lines like "option gpu=0" allow you to specify special behavior for special cases of the argument, so in this case the option –gpu 0 is configured to produce no extra options to qsub at all. The line "default gpu=0" specifies that if you don't give the –gpu option at all, should act like you specified –gpu 0. In this particular configuration we could have omitted the line default gpu=0, because in any case the effect is to produce no extra command line options. We previously had it configured with a line: "option gpu=0 -q all.q", so there was a time when the line "default gpu=0" used to make a difference.

(Note: most of the time if you are configuring GridEngine yourself you will want a queue.conf that is the same as the one above missing the "-q g.q" option).

The mapping from what the config-file specifies to what appears on the command-line of qsub sometimes has to be tweaked slightly in the perl code: for instance, we made it so that the –max-jobs-run option is ignored for non-array jobs.

Example of configuring grid software with new-style options

We'd like to give an example of how the config file can be used in a real situation. We had a problem where, due to a bug in an outdated version of the CUDA toolkit that we had installed on the grid, some of our neural net training runs were crashing on our K20 GPU cards but not the K10s. We created a config file conf/no_k20.conf which was as the configuration file above (search in the text above for # Default configuration, but with the following lines added:

default allow_k20=true
option allow_k20=true
option allow_k20=false -l 'hostname=!g01*&!g02*&!b06*'

We then set the relevant $cmd variable to the value -config conf/no_k20.conf –allow-k20 false. Note that a simpler way to have done this would have been to simply edit the command line in the config file to read

command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64* -l 'hostname=!g01*&!g02*&!b06*'

and if we had done that, it would not have been necessary to add –allow-k20 false.

Parallelization using specific scripts

In this section we explain the things that are specific to individual parallelization scripts.

Parallelization using is the normal, recommended way to parallelize. It was originally designed for use with GridEngine, but now that we have introduced the "new-style options" we believe it can be configured for use with other parallelization engines, such as Tork or slurm. If you develop config files that work for such engines, please contact the maintainers of Kaldi. It may also be necessary to make further changes to to properly support other engines, since some parts of the command line are currently constructed in in a way that is not configurable from the command line: e.g., adding -o foo/bar/q/jobname.log to direct output from qsub itself to a separate log-file; and for array jobs, adding options like -t 1:40 to the command line. The scripts that we ask qsub to run also make use of the variable $SGE_TASK_ID, which SGE sets to the job index for array jobs. Our plan is to extend the config-file mechanism as necessary to accommodate whatever changes are needed to support other grid software, within reason.

Since we have explained the behavior of at length above, we aren't going to provide many further details in this section, but please see below the section Setting up GridEngine for use with Kaldi.

Parallelization using is a fall-back option in case the user does not have GridEngine installed. This script is very simple; it runs all the jobs you request on the local machine, and it does so in parallel if you use a job range specifier like JOB=1:10. in parallel on the local machine. It doesn't try to keep track of how many CPUs are available or how much memory your machine has. Therefore if you use to run scripts that were designed to run with on a larger grid, you may end up exhausting the memory of your machine or overloading it with jobs. We recommend to study the script you are running, and being particularly careful with decoding scripts that run in the background (with &) and with scripts that use a large number of jobs, e.g. –nj 50. Generally speaking you can reduce the value of the –nj option without affecting the outcome, but there are some situations where the –nj options given to multiple scripts must match, or a later stage will crash. will ignore all options given to it except for the job-range specifier.

Parallelization using is a poor man's, for use in case you have a small cluster of several machines but don't want the trouble of setting up GridEngine. Like, it doesn't attempt to keep track of CPUs or memory; it works like except that it distributes the jobs across multiple machines. You have to create a file .queue/machines (where .queue is a subdirectory of the directory you are running the script from), where each line contains the name of a machine. It needs to be possible to ssh to each of these machines without a password, i.e. you have to set up your ssh keys.

Parallelization using was written to accomodate the slurm grid management tool, which operates on similar principles to GridEngine. It has not been tested very recently. Probably it is now possible to set up to use slurm using a suitable configuration file, which would make unnecessary.

Setting up GridEngine for use with Kaldi

Sun GridEngine (SGE) is the open-source grid management tool that we (the maintainers of Kaldi) have most experience with. Oracle now maintains SGE and has started calling it Oracle GridEngine. The version that is in use at CLSP@JHU is 6.2u5; SGE is old and fairly stable so the precise version number is not too critical. There are various open-source alternatives to SGE and various forks of it, but our instructions here relate to the main-line version which is currently maintained by Oracle.

In this section we explain how to install and configure GridEngine on an arbitrary cluster. If you have a cluster in Amazon's EC2 cloud and you want something that can take care of spinning up new nodes, you might want to look at MIT's StarCluster project, although we (the maintainers of Kaldi) have also created a project called "kluster" on Sourceforge that provides some scripts and documentation for the same purpose. StarCluster was not very stable at the time we developed kluster, but we believe it's improved since then.

Installing GridEngine

To start with, you probably want to get a basic installation of GridEngine working. In GridEngine, the queue management software runs on the "master", and a different set of programs runs on all the nodes. The master can also be a node in the queue. There's also a concept of a "shadow master" which is like a backup for the master, in case the master dies, but we won't address that here (probably it's just a question of installing the gridengine-master package on another node and setting the master to be another node, but we're not sure).

Last time we checked, installing GridEngine from source was a huge pain. Your life will be much easier if your distribution of Linux has a GridEngine package in its repositories, and choose your distribution wisely because not all distributions have such a package. We'll discuss how to do this with Debian, because that's what we're most experienced with.

To install GridEngine on the master, you'll run (on your chosen master node):

  sudo apt-get install gridengine-master gridengine-client

Select "yes" for automatic configuration. It will ask you for the "cell name", which you can leave as "default", and it will ask for the name of the "master", which you should set to the hostname of your chosen master. Typically this should be the fully qualified domain name (FQDN) of the master, but I believe anything that resolves via hostname lookup to the master node should work. Note that GridEngine is sometimes picky about about hostname lookup and reverse DNS lookup matching up, and GridEngine problems can sometimes be traced to this. Also be aware that doing "apt-get remove" of these packages and reinstalling them won't give you a blank slate because Debian sometimes remembers your selections; this can be a pain.

It will make your life easier if you add yourself as manager, so do:

 sudo qconf -am <your-user-id>

Here "am" means add manager; "dm" would mean delete manager and "sm" would mean show all managers. To see the available options, do qconf -help.

To install GridEngine on the normal nodes, you'll run

  sudo apt-get install gridengine-client gridengine-exec

The "cell name" should be left as "default", and the "master" should be the name of the master node that you previously installed. You can run this on the master too if the master is to run jobs also.

Typing qstat and qhost -q will let you know whether things are working. The following is what it looks like when things are working fine (we tested this in the Google cloud):

dpovey_gmail_com@instance-1:~$ qstat
dpovey_gmail_com@instance-1:~$ qhost -q
global                  -               -     -       -       -       -       -
instance-1.c.analytical-rig-638.internal lx26-amd64      1  0.07    3.6G  133.9M     0.0     0.0

We don't have a fully working setup yet, we still need to configure it; we're just checking that the client can reach the master. At this point, any errors likely relate to DNS lookup, reverse DNS lookup, or your /etc/hostname or /etc/hosts files; GridEngine doesn't like it when these things are inconsistent. If you need to change the name of the master from what you told the installer, you may be able to do so by editing the file


(at least, this is where it's located in Debian Wheezy).

Configuring GridEngine

First let's make sure that a queue is defined. GridEngine doesn't define any queues by default. We'll set up a queue called all.q. Make sure the shell variable EDITOR is set to your favorite shell (e.g. vim or emacs), and type as follows; and this should work from master or client.

 qconf -aq

This will bring up an editor. Edit the line

qname               template

so it says

qname               all.q

Also change the field shell from /bin/csh to /bin/bash; this is a better default, although it shouldn't affect Kaldi. Quitting the editor will save the changes, although if you made syntax errors, qconf will reject your edits. Later on we'll make more changes to this queue by typing qconf -mq all.q.

GridEngine stores some global configuration values, not connected with any queue, which can be viewed with qconf -sconf. We'll edit them using qconf -mconf. There is a line that reads

administrator_mail           root

and if you have sending emails working from your machine (i.e. you can type mail and the mail gets delivered), then you can change root to an email address where you want to receive notifications if things go wrong. Be advised that due to anti-spam measures, sending emails from the cloud is painful from EC2 and close to impossible from Google's cloud offering, so it may be best just to leave this field the way it is and make do without email notifications. You may also want to run qconf -msconf and edit it so that the schedule_interval is:

schedule_interval                 0:0:5

(the default is 00:00:15), which will give a slightly faster turnaround time for submitting jobs.

GridEngine has the concept of "resources" which can be requested or specified by your jobs, and these can be viewed using qconf -sc. Modify them using qconf -mc. Modify the mem_free line to change the default memory requirement from 0 to 1G, and to make it consumable, i.e.:

#name               shortcut    type        relop requestable consumable default  urgency
mem_free            mf         MEMORY      <=    YES         YES         1G        0

and also add the following two new lines; it doesn't matter where in the file you add them.

#name               shortcut    type        relop requestable consumable default  urgency
gpu                 g           INT         <=    YES         YES        0        10000
ram_free            ram_free    MEMORY      <=    YES         JOB        1G       0

You'll only need the "gpu" field if you add GPUs to your grid; the ram_free is a field that we find useful in managing the memory of the machines, as the inbuilt field mem_free doesn't seem to work quite right for our purposes. Later on when we add hosts to the grid, we'll use the command qconf -me <some-hostname> to edit the complex_values field to read something like:

 complex_values        ram_free=112G,gpu=2

(for a machine with 112G of physical memory and 2 GPUs). If we want to submit a job that needs 10G of memory, we'll specify -l mem_free=10G,ram_free=10G as an option to qsub; the mem_free requirement makes sure the machine has that much free memory at the time the job starts, and the ram_free requirement makes sure we don't submit a lot of jobs requiring a lot of memory, all to the same host. We tried, as an alternative to adding the ram_free resource, using qconf -mc to edit the consumable field of the inbuilt mem_free resource to say YES, to make GridEngine keep track of memory requests; but this did not seem to work as desired. Note that both ram_free and gpu are names that we chose ourselves; they have no special meaning to GridEngine, while some of the inbuilt resources such as mem_free do have special meanings. The string JOB in the consumable entry for ram_free means that the ram_free resource is specified per job rather than per thread; this is more convenient for Kaldi scripts.

We are not saying anything here about the queue "g.q" which is required for GPUs in the default behavior. That was introduced at Hopkins in order to avoid the situation where GPUs are blocked because all CPU slots are used. It's really not necessary in general; the easiest way to get things working without it is to create a file conf/queue.conf that's the same as the default one (search above) except without the "-q g.q" option.

Next you have to have to add the parallel environment called smp to GridEngine. This is a kind of tradition in GridEngine setups, but it's not built-in. It's a simple parallel environment where GridEngine doesn't really do anything, it just reserves you a certain number of slots, so if you do qsub -pe smp 10 <your-script> you will get 10 CPU slots reserved; this can be useful for multi-threaded or multi-process jobs. Do qconf -ap smp, and edit the slots field to say 9999, so it reads:

pe_name            smp
slots              9999

Then do qconf -mq all.q, and edit the pe_list field by adding smp, so it reads:

pe_list               make smp

This enables the smp parallelization method in the queue all.q.

Configuring GridEngine (advanced)

In this section we just want to make a note of some things that might be helpful, but which aren't necessary just to get things running. In the CLSP cluster, we edited the prolog field in qconf -mq all.q so that it says

prolog                /var/lib/gridengine/default/common/

(the default was NONE), and the script /var/lib/gridengine/default/common/, which we copied to that location on each individual node in the cluster, reads as follows. Its only purpose is to wait a short time if the job script can't be accessed, to give NFS some time to sync in case the scripts were written very recently and haven't yet propagated across the grid:


function test_ok {
  if [ ! -z "$JOB_SCRIPT" ] && [ "$JOB_SCRIPT" != QLOGIN ] && [ "$JOB_SCRIPT" != QRLOGIN ]; then
    if [ ! -f "$JOB_SCRIPT" ]; then
       echo "$0: warning: no such file $JOB_SCRIPT, will wait" 1>&2
       return 1;
  if [ ! -z "$SGE_STDERR_PATH" ]; then
    if [ ! -d "`dirname $SGE_STDERR_PATH`" ]; then
      echo "$0: warning: no such directory $JOB_SCRIPT, will wait." 1>&2
      return 1;
  return 0;

if ! test_ok; then
  sleep 2;
  if ! test_ok; then
     sleep 4;
     if ! test_ok; then
        sleep 8;

exit 0;

This script waits at most 14 seconds, which is enough because we configured acdirmax=8 in our NFS options as the maximum wait before refreshing a cached directory (see Keeping your grid stable (NFS) below).

We also edited the queue with qconf -mq all.q to change rerun from FALSE to TRUE, i.e. to say:

rerun                 TRUE

This means that when jobs fail, they get in a status that shows up in the output of qstat as Eqw, with the E indicating error, and you can ask the queue to reschedule them by clearing the error status with qmod -cj <numeric-job-id> (or if you don't want to rerun them, you can delete them with qmod -dj <numeric-job-id>). Setting the queue to allow reruns can avoid the hassle of rerunning scripts from the start when things break due to NFS problems.

Something else we did in the CLSP queue is to edit the following fields, which by default read:

rlogin_daemon                /usr/sbin/sshd -i
rlogin_command               /usr/bin/ssh
qlogin_daemon                /usr/sbin/sshd -i
qlogin_command               /usr/share/gridengine/qlogin-wrapper
rsh_daemon                   /usr/sbin/sshd -i
rsh_command                  /usr/bin/ssh

to read instead:

qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin

This was to solve a problem whose nature we can no longer recall, but it's something you might want to try it if commands like qlogin and qrsh don't work.

Configuring GridEngine (adding nodes)

In this section we address what you do when you add nodes to the queue. As mentioned above, you can install GridEngine on nodes by doing

  sudo apt-get install gridengine-client gridengine-exec

and you need to specify default as the cluster name, and the name of your master node as the master (probably using the FQDN of the master is safest here, but if you are on a local network, just the last part of the name may also work).

But that doesn't mean your machine is fully in the queue. GridEngine has separate notions of hosts being administrative hosts, execution hosts and submit hosts. All your machines should be all three. You can view which machines have these three roles using the commands qconf -sh, qconf -sel, and qconf -ss respectively. You can add your machine as an administrative or submit host with the commands:

 qconf -ah <your-fqdn>
 qconf -as <your-fqdn>

and you can add your host as an execution host with the command

 qconf -ae <your-fqdn>

which brings up an editor; you can put in the ram_free and possibly GPU fields in here, e.g.

complex_values    ram_free=112G,gpu=1

You'll notice is a slight asymmetry between the commands qconf -sh and qconf -ss on the one hand, and qconf -sel on the other. The "l" in the latter command means show the list. The difference is that administrative and submit host lists are just lists of hosts, whereas qconf stores a bunch of information about the execution hosts, so it views it as a different type of data structure. You can view the information about a particular host with qconf -se <some-hostname>, add a new host with qconf -ae <some-hostname, and modify with qconf -me <some-hostname>. This is a general pattern in GridEngine: for things like queues that have a bunch of information in them, you can show the full list by typing a command ending in "l" like qconf -sql, and the corresponding "add" ("a") and "modify" ("m") commands accept arguments.

It's not enough to tell GridEngine that a node is an execution host; you have to also add it to the queue, and tell the queue how many slots to allocate for that node. First figure out how many CPUs (or virtual CPUs) your machine has, by doing:

grep proc /proc/cpuinfo | wc -l

Suppose this is 48. You can choose a number a little smaller than this, say 40, and use that for the number of slots. Edit the queue using qconf -mq all.q, add your machine to the hostlist, and set the number of slots. It should look like this:

qname                 all.q
hostlist    ,
slots                 30,[],[]

In the slots field, the 30 at the beginning is a default value; for any nodes with that number of slots you can save yourself some time and avoid adding the node's name to the slots field. There is an alternative way to set up the hostlist field. GridEngine has the concept of host groups, so you could do qconf -ahgrp @allhosts to add a group of hosts, and edit it using qconf -mhgrp @allhosts to add your new nodes. The configuration of all.q could then just read:

hostlist            @allhosts

It's your choice. For simple queues, it's probably fine to just put the host list in the all.q configuration.

A useful command to list the hosts that GridEngine knows about, and what queues they are in, is qhost -q. For example:

# qhost -q
global                  -               -     -       -       -       -       -        lx26-amd64     24 12.46  126.2G   11.3G   86.6G  213.7M
   all.q                BIP   0/6/20        lx26-amd64     24 16.84  126.2G   12.4G   51.3G  164.5M
   all.q                BIP   0/18/20

If you see the letter "E" in the place where the example above shows "BIP", it means the node is in the error state. Other letters you don't want to see in that position are "a" for alarm (a generic indicator of badness) and "u" for unreachable. "d" means a node has been disabled by an administrator. Nodes sometimes get in the error ("E") state when GridEngine had trouble running a job, which is often due to NFS or automount problems. You can clear the error by doing something like

qmod -c all.q@a01

but of course if the node has serious problems, it might be wise to fix them first. It's sometimes also useful to enable and disable nodes in the queue by doing

qmod -d all.q@a01

to disable a node, and qmod -e all.q@a01 to enable it again.

A common symptom of GridEngine problems is jobs waiting when you think nodes are free. The easiest way to debug this is to look for the job-id in the output of qstat, and then to do qstat -j <job-id> and look for the reasons why the job is not running.

You can view all jobs from all users by running

qstat -u '*'

Keeping your grid stable

In this section we have some general notes on how to ensure stability in a compute cluster of the kind useful for Kaldi.

Keeping your grid stable (OOM)

One of the major causes of crashes in compute clusters is memory exhaustion. The default OOM-killer in Linux is not very good, so if you exhaust memory, it may end up killing an important system process, which tends to cause hard-to-diagnose instability. Even if nothing is killed, malloc() may start failing when called from processes on the system; and very few programs deal with this gracefully. In the CLSP grid we wrote our own version of an OOM killer, which we run as root, and we wrote the corresponding init scripts. When our OOM killer detects memory overload, it kills the largest process of whichever non-system user is using the most memory. This is usually the right thing to do. These scripts have been made public as part of the kluster project, and you can get them as shown below if you want to add them to your system. The following commands will only work as-is if you have LSB-style init scripts, which is the case in Debian wheezy. The next Debian release, jessie, won't have init scripts at all and will use systemd instead (the so-called "systemd controversy"). If someone can figure out how to do the following in systemd, please let us know. Type

sudo bash

and then as root, do:

apt-get install -y subversion
svn cat > /sbin/
chmod +x /sbin/
cd /etc/init.d
svn cat >mem-killer
chmod +x mem-killer
update-rc.d mem-killer defaults
service mem-killer start is capable of sending email to the administrators and to the person whose jobs was killed if something is wrong, but this will only work if your system can send mail, and if the user has put their email somewhere in the "office" field of their gecos information, using chfn (or ypchfn if using NIS). Again, if you're running in the cloud, it's best to forget about anything email-related, as it's too hard to get working.

Keeping your grid stable (NFS)

We aren't going to give a complete run-down of how to install NFS here, but we want to mention some potential issues, and explain some options that work well. Firstly, NFS is not the only option for a shared filesystem; there are some newer distributed file systems available too, but we don't have much experience with them.

NFS can perform quite badly if the options are wrong. Below are the options we use. We show it as if we're grepping it from /etc/fstab; this isn't actually how we do it (we actually use NIS and automount), but the following way is simpler:

# grep a05 /etc/fstab
a05:/mnt/data /export/a05 nfs rw,vers=3,rsize=8192,wsize=8192,acdirmin=5,acdirmax=8,hard,proto=tcp 0 0

The option "vers=3" means we use NFS version 3, which is stateless. We tried using version 4, a supposedly more advanced "stateful" protocol, but we got a lot of crashes.

The acdirmin=5 and acdirmin=8 options are the minimum and maximum times that NFS waits before re-reading cached directory information; the defaults are 30 and 60 seconds respectively. This is important for Kaldi scripts, because the files that we execute on GridEngine are written only shortly before we run the scripts, so with default NFS options they may not yet be visible on the execution host at the time they are needed. Above we showed our script /var/lib/gridengine/default/common/ which waits up to 14 seconds for the script to appear. It's significant that 14 > 8, i.e. that the number of seconds the prolog script will wait for is greater than the maximum directory caching period for NFS.

The hard option is also important; it means that if the server is busy, the client will wait for it to succeed rather than reporting an error (e.g. fopen returning error status). If you specify soft, Kaldi may crash. hard is the default so could be omitted.

The proto=tcp option is also the default on Debian currently; the alternative is proto=udp. The TCP protocol is important for stability when local networks may get congested; we have found this through experience.

The rsize=8192,wsize=8192 are packet sizes; they are supposedly important in the performance of NFS. Kaldi reads and writes are generally contiguous and don't tend to seek much within files, so large packet size is probably suitable.

Another thing you might want to tune that's NFS related is the number of threads in the server. You can see this as follows:

/$ head /etc/default/nfs-kernel-server
# Number of servers to start up

To change it you edit that file and then restart the service (service nfs-kernel-server restart, on Debian). Apparently it is not good for this to be less than the number of clients that might simultaneously access your server (although 64 is the upper limit of what I've seen people recommend to set this to). Apparently one way to tell if you have too few threads is too look at the retrans count in your NFS stats:

nfsstat -rc
Client rpc stats:
calls      retrans    authrefrsh
434831612   6807461    434979729

The stats above have a largish number of retrans, and apparently in the ideal case it should be zero. We had the number of NFS threads set to a lower number (24) for most of the time that machine was up, which was less than the potential number of clients, so it's not surprising that we had a high amount of retrans. What seems to happen is that if a large number of clients are actively using the server, especially writing large amounts of data simultaneously, they can tie up all the server threads and then when another client tries to do something, it fails and in the client logs you'll see something like nfs: server a02 not responding, still trying. This can sometimes be associated with crashed jobs, and if you use automount on your setup, it can sometimes cause jobs that are not even accessing that server to crash or be delayed (automount has a brain-dead, single-threaded design, so failure of one mount request can hang up all other automount reqeuests).

Keeping your grid stable (general issues)

In this section we have some general observations about how to configure and manage a compute grid.

In CLSP we use a lot of NFS hosts, not just one or two; in fact, most of our nodes also export data via NFS. If you do this you should use our or a similar script, or you will get instability due to memory exhaustion when users make mistakes. Having a large number of file servers is a particularly good idea for queues that are shared by many people, because it's inevitable that people will overload file servers, and if there are only one or two file servers, the queue will end up being in a degraded state for much of the time. The network bandwidth of each individual file server at CLSP is quite slow: for cost reasons we still use 1G ethernet, but all connected to each other with an expensive Cisco router so that there is no global bottleneck. This means that each file server gets overloaded quite easily; but because there are so many individual file servers, this generally only inconveniences the person who is overloading it and maybe one or two others.

When we do detect that a particular file server is loaded (generally either by noticing slowness, or by seeing errors like nfs: server a15 not responding in the output of dmesg), we try to track down why this is happening. Without exception this is due to bad user behavior, i.e. users running too many I/O-heavy jobs. Usually through a combination of the output of iftop and the output of qstat, we can figure out which user is causing the problem. In order to keep the queue stable it's necessary to get users to correct their behavior and to limit the number of I/O heavy jobs they run, by sending them emails and asking them to modify their setups. If we didn't contact users in this way the queue would be unusable, because users would persist in their bad habits.

Many other groups have a lot of file servers, like us, but still have all their traffic going to one file server because they buy the file servers one at a time and they always allocate space on the most recently bought one. This is just stupid. In CLSP we avoid the administrator hassle of having to allocate space, by having all the NFS servers world-writable at the top level, and by instructing people to make directories with their userid on them, on a server of their choice. We set up scripts to notify the administrators by email when any of the file servers gets 95% full, and to let us know which directories contain a lot of data. We can then ask the users concerned to delete some data, or we delete it ourselves if the users are gone and if we feel it's appropriate. We also set up scripts to work out, queue-wide, which users are using the most space, and to send them emails informing them where they are consuming the most space. The administrators will also ask the users to clean up, particularly in extreme cases (e.g. when a junior student is using a huge amount of space for no obvious reason). Space should never really be a problem, because it's almost always the case that the disk is 95% full of junk that no-one cares about or even remembers. It's simply a matter of finding a way to ask those responsible to clean up, and making their life easy by telling them where the bulk of their data is.

Another useful thing is to locate home directories on a server that is not also used for experiments; this ensures that users can always get a prompt, even when other users are being stupid. It also makes the backup policy easier: we back up the home directories but not the NFS volumes used for experimental work, and we make clear to users that those volumes are not backed up (of course, students will still fail to back up their important data, and will sometimes lose it as a result). The ban on running experiments in home directories needs to be enforced; we frequently have to tell users to stop parallel jobs with data in their home directories. This is the most frequent cause of grid-wide problems.