align-text.cc File Reference
#include "util/common-utils.h"
#include "util/parse-options.h"
#include "util/edit-distance.h"
#include <algorithm>

Functions

bool IsNotToken (const std::string &token)
 
int main (int argc, char *argv[])
 

Function Documentation

bool IsNotToken ( const std::string &  token)

Definition at line 25 of file align-text.cc.

References kaldi::IsToken().

Referenced by main().

25  {
26  return ! kaldi::IsToken(token);
27 }
bool IsToken(const std::string &token)
Returns true if "token" is nonempty, and all characters are printable and whitespace-free.
Definition: text-utils.cc:103
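The token check described above can be illustrated with a minimal sketch. It is written from the one-line description of IsToken() only, so it is an approximation: the real kaldi::IsToken() in text-utils.cc may treat non-ASCII bytes differently. The names IsTokenSketch, IsNotTokenSketch, HasBadToken and DemoTokenChecks are hypothetical and not part of Kaldi.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for kaldi::IsToken(), based only on its brief
// description: nonempty, every character printable, no whitespace.
// Assumption: no special handling of non-ASCII bytes.
bool IsTokenSketch(const std::string &token) {
  if (token.empty()) return false;
  for (unsigned char c : token) {
    if (!std::isprint(c) || std::isspace(c)) return false;
  }
  return true;
}

// Same shape as IsNotToken() in align-text.cc: the negated predicate,
// convenient as an argument to std::find_if.
bool IsNotTokenSketch(const std::string &token) {
  return !IsTokenSketch(token);
}

// Mirrors the std::find_if check in main(): true if any word fails the test.
bool HasBadToken(const std::vector<std::string> &words) {
  return std::find_if(words.begin(), words.end(), IsNotTokenSketch) !=
         words.end();
}

// Small usage checks for the helpers above.
void DemoTokenChecks() {
  assert(IsTokenSketch("hello"));
  assert(!IsTokenSketch(""));           // empty strings are not tokens
  assert(!IsTokenSketch("two words"));  // embedded space
  assert(!HasBadToken({"a", "b", "c"}));
  assert(HasBadToken({"a", "b\tc"}));   // embedded tab
}
```

Wrapping the negation in a named function, as IsNotToken() does, lets std::find_if locate the first offending word without a lambda or function adaptor.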
int main ( int  argc, char *  argv[] )

Definition at line 29 of file align-text.cc.

References SequentialTableReader< Holder >::Done(), ParseOptions::GetArg(), RandomAccessTableReader< Holder >::HasKey(), IsNotToken(), KALDI_ASSERT, KALDI_ERR, KALDI_LOG, KALDI_WARN, SequentialTableReader< Holder >::Key(), kaldi::LevenshteinAlignment(), SequentialTableReader< Holder >::Next(), ParseOptions::NumArgs(), ParseOptions::PrintUsage(), ParseOptions::Read(), ParseOptions::Register(), RandomAccessTableReader< Holder >::Value(), SequentialTableReader< Holder >::Value(), and TableWriter< Holder >::Write().

29  {
30  using namespace kaldi;
31  typedef kaldi::int32 int32;
32 
33  try {
34  const char *usage =
35  "Computes alignment between two sentences with the same key in the\n"
36  "two given input text-rspecifiers. The current implementation uses\n"
37  "Levenshtein distance as the distance metric.\n"
38  "\n"
39  "The input text file looks as follows:\n"
40  " key1 a b c\n"
41  " key2 d e\n"
42  "\n"
43  "The output alignment file looks as follows:\n"
44  " key1 a a ; b <eps> ; c c \n"
45  " key2 d f ; e e \n"
46  "where the aligned pairs are separated by \";\"\n"
47  "\n"
48  "Usage: align-text [options] <text1-rspecifier> <text2-rspecifier> \\\n"
49  " <alignment-wspecifier>\n"
50  " e.g.: align-text ark:text1.txt ark:text2.txt ark,t:alignment.txt\n"
51  "See also: compute-wer,\n"
52  "Example scoring script: egs/wsj/s5/steps/score_kaldi.sh\n";
53 
54  ParseOptions po(usage);
55 
56  std::string special_symbol = "<eps>";
57  std::string separator = ";";
58  po.Register("special-symbol", &special_symbol, "Special symbol to be "
59  "aligned with the inserted or deleted words. Your sentences "
60  "should not contain this symbol.");
61  po.Register("separator", &separator, "Separator for each aligned pair in "
62  "the output alignment file. Note: it should not be necessary "
63  "to change this even if your sentences contain ';', because "
64  "to parse the output of this program you can just split on "
65  "space and then assert that every third token is ';'.");
66 
67  po.Read(argc, argv);
68 
69  if (po.NumArgs() != 3) {
70  po.PrintUsage();
71  exit(1);
72  }
73 
74  std::string text1_rspecifier = po.GetArg(1),
75  text2_rspecifier = po.GetArg(2),
76  align_wspecifier = po.GetArg(3);
77 
78  SequentialTokenVectorReader text1_reader(text1_rspecifier);
79  RandomAccessTokenVectorReader text2_reader(text2_rspecifier);
80  TokenVectorWriter align_writer(align_wspecifier);
81 
82  int32 n_done = 0;
83  int32 n_fail = 0;
84  for (; !text1_reader.Done(); text1_reader.Next()) {
85  std::string key = text1_reader.Key();
86 
87  if (!text2_reader.HasKey(key)) {
88  KALDI_WARN << "Key " << key << " is in " << text1_rspecifier
89  << ", but not in " << text2_rspecifier;
90  n_fail++;
91  continue;
92  }
93  const std::vector<std::string> &text1 = text1_reader.Value();
94  const std::vector<std::string> &text2 = text2_reader.Value(key);
95 
96  // Checks if the special symbol is in the string.
97  KALDI_ASSERT(std::find(text1.begin(),
98  text1.end(), special_symbol) == text1.end());
99  KALDI_ASSERT(std::find(text2.begin(),
100  text2.end(), special_symbol) == text2.end());
101 
102  if (std::find_if(text1.begin(), text1.end(), IsNotToken) != text1.end()) {
103  KALDI_ERR << "In text1, the utterance " << key << " contains unprintable characters. " \
104  << "That means there is a problem with the text (such as incorrect encoding)." << std::endl;
105  return -1;
106  }
107  if (std::find_if(text2.begin(), text2.end(), IsNotToken) != text2.end()) {
108  KALDI_ERR << "In text2, the utterance " << key << " contains unprintable characters. " \
109  << "That means there is a problem with the text (such as incorrect encoding)." << std::endl;
110  return -1;
111  }
112 
113  std::vector<std::pair<std::string, std::string> > aligned;
114  LevenshteinAlignment(text1, text2, special_symbol, &aligned);
115 
116  std::vector<std::string> token_vec;
117  std::vector<std::pair<std::string, std::string> >::const_iterator iter;
118  for (iter = aligned.begin(); iter != aligned.end(); ++iter) {
119  token_vec.push_back(iter->first);
120  token_vec.push_back(iter->second);
121  if (aligned.end() - iter != 1)
122  token_vec.push_back(separator);
123  }
124  align_writer.Write(key, token_vec);
125 
126  n_done++;
127  }
128 
129  KALDI_LOG << "Done " << n_done << " sentences, failed for " << n_fail;
130  return (n_done != 0 ? 0 : 1);
131  } catch(const std::exception &e) {
132  std::cerr << e.what();
133  return -1;
134  }
135 }
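The loop at lines 118-123 flattens the aligned pairs into a single token vector, and the help text for --separator suggests reading the output back by splitting on space and checking every third token. The following is a minimal sketch of both directions, assuming the key has already been removed from the line and the remaining text has been split on spaces. PairVec, FlattenPairs and ParsePairs are hypothetical names, not part of Kaldi.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<std::pair<std::string, std::string> > PairVec;

// Flattens aligned pairs the way main() does: "x y ; x y ; x y",
// with no separator after the last pair.
std::vector<std::string> FlattenPairs(const PairVec &aligned,
                                      const std::string &separator) {
  std::vector<std::string> tokens;
  for (std::size_t i = 0; i < aligned.size(); ++i) {
    tokens.push_back(aligned[i].first);
    tokens.push_back(aligned[i].second);
    if (i + 1 != aligned.size()) tokens.push_back(separator);
  }
  return tokens;
}

// Hypothetical reader for one flattened alignment (key already removed).
// Returns false if the layout is not "x y ; x y ; ...".
bool ParsePairs(const std::vector<std::string> &tokens,
                const std::string &separator, PairVec *aligned) {
  assert(aligned != nullptr);
  aligned->clear();
  if (tokens.empty()) return true;           // empty alignment
  if (tokens.size() % 3 != 2) return false;  // n pairs occupy 3n - 1 tokens
  for (std::size_t i = 0; i < tokens.size(); i += 3) {
    // Position, not content, decides what counts as a separator, so a
    // word that happens to equal the separator is still read as a word.
    if (i + 2 < tokens.size() && tokens[i + 2] != separator) return false;
    aligned->push_back(std::make_pair(tokens[i], tokens[i + 1]));
  }
  return true;
}
```

Because the separator is identified by position, a sentence that itself contains ";" still round-trips, which is why the help text says changing --separator should rarely be necessary.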
int32 LevenshteinAlignment(const std::vector< T > &a, const std::vector< T > &b, T eps_symbol, std::vector< std::pair< T, T > > *output)
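To make the role of eps_symbol concrete, here is a minimal sketch of a Levenshtein alignment with the same argument order as the signature above. It is not the Kaldi implementation: AlignSketch is a hypothetical name, it uses a full (m+1) x (n+1) cost table, and where several alignments have the same cost it may choose a different one than kaldi::LevenshteinAlignment().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for kaldi::LevenshteinAlignment().
// Returns the edit distance and fills *output with aligned pairs; an item
// with no counterpart is paired with eps_symbol.
template <typename T>
int AlignSketch(const std::vector<T> &a, const std::vector<T> &b,
                T eps_symbol, std::vector<std::pair<T, T> > *output) {
  assert(output != nullptr);
  const std::size_t m = a.size(), n = b.size();
  // cost[i][j] = edit distance between the first i items of a
  // and the first j items of b.
  std::vector<std::vector<int> > cost(m + 1, std::vector<int>(n + 1, 0));
  for (std::size_t i = 0; i <= m; ++i) cost[i][0] = static_cast<int>(i);
  for (std::size_t j = 0; j <= n; ++j) cost[0][j] = static_cast<int>(j);
  for (std::size_t i = 1; i <= m; ++i) {
    for (std::size_t j = 1; j <= n; ++j) {
      int sub = cost[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
      int del = cost[i - 1][j] + 1;  // a[i-1] paired with eps_symbol
      int ins = cost[i][j - 1] + 1;  // eps_symbol paired with b[j-1]
      cost[i][j] = std::min(sub, std::min(del, ins));
    }
  }
  // Trace back from the bottom-right corner, then reverse the result.
  output->clear();
  std::size_t i = m, j = n;
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 &&
        cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)) {
      output->push_back(std::make_pair(a[i - 1], b[j - 1]));
      --i;
      --j;
    } else if (i > 0 && cost[i][j] == cost[i - 1][j] + 1) {
      output->push_back(std::make_pair(a[i - 1], eps_symbol));
      --i;
    } else {
      output->push_back(std::make_pair(eps_symbol, b[j - 1]));
      --j;
    }
  }
  std::reverse(output->begin(), output->end());
  return cost[m][n];
}
```

With a = {a, b, c}, b = {a, c} and eps_symbol "<eps>", the sketch returns distance 1 and the pairs (a, a), (b, <eps>), (c, c), matching the key1 line in the usage message.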
TableWriter: A templated class for writing objects to an archive or script file; see The Table concept...
Definition: kaldi-table.h:366
RandomAccessTableReader: Allows random access to a collection of objects in an archive or script file; see The Table concept...
Definition: kaldi-table.h:233
The class ParseOptions is for parsing command-line options; see Parsing command-line options for more...
Definition: parse-options.h:36
SequentialTableReader: A templated class for reading objects sequentially from an archive or script file; see The Table conc...
Definition: kaldi-table.h:287
#define KALDI_ERR
Definition: kaldi-error.h:127
#define KALDI_WARN
Definition: kaldi-error.h:130
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169
#define KALDI_LOG
Definition: kaldi-error.h:133