ArpaLmCompiler Class Reference

#include <arpa-lm-compiler.h>

Public Member Functions

 ArpaLmCompiler (const ArpaParseOptions &options, int sub_eps, fst::SymbolTable *symbols)
 
 ~ArpaLmCompiler ()
 
const fst::StdVectorFst & Fst () const
 
fst::StdVectorFst * MutableFst ()
 
Public Member Functions inherited from ArpaFileParser
 ArpaFileParser (const ArpaParseOptions &options, fst::SymbolTable *symbols)
 Constructs the parser with the given options and optional symbol table.
 
virtual ~ArpaFileParser ()
 
void Read (std::istream &is)
 Read ARPA LM file from a stream.
 
const ArpaParseOptions & Options () const
 Parser options.
 

Protected Member Functions

virtual void HeaderAvailable ()
 Override function called to signal that the ARPA header with the expected number of n-grams has been read, and NgramCounts() is now valid.
 
virtual void ConsumeNGram (const NGram &ngram)
 Pure override that must be implemented to process the current n-gram.
 
virtual void ReadComplete ()
 Override function called after the last n-gram has been consumed.
 
Protected Member Functions inherited from ArpaFileParser
virtual void ReadStarted ()
 Override called before reading starts.
 
const fst::SymbolTable * Symbols () const
 Read-only access to symbol table. Not owned, do not make public.
 
int32 LineNumber () const
 Inside ConsumeNGram(), provides the current line number.
 
std::string LineReference () const
 Inside ConsumeNGram(), returns a formatted reference to the line being compiled, to print out as part of diagnostics.
 
bool ShouldWarn ()
 Increments warning count, and returns true if a warning should be printed or false if the count has exceeded the set maximum.
 
const std::vector< int32 > & NgramCounts () const
 N-gram counts. Valid from the point when HeaderAvailable() is called.
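
The hooks listed above fire in a fixed order: ReadStarted() before reading, HeaderAvailable() once the header is known, ConsumeNGram() once per n-gram in file order, and ReadComplete() at the end. The sketch below mimics that order with a toy parser. ToyParser, RecordingParser and RecordCalls are hypothetical names, the "first line is the header" input format is an assumption made for brevity, and none of this is Kaldi's actual ArpaFileParser code.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the parser base class. It only mimics the order
// in which the hooks are invoked; it does not parse real ARPA files.
class ToyParser {
 public:
  virtual ~ToyParser() {}

  // Assumption: the first line is the "header", every later line one n-gram.
  void Read(std::istream &is) {
    ReadStarted();
    std::string line;
    if (std::getline(is, line)) HeaderAvailable();
    while (std::getline(is, line)) ConsumeNGram(line);
    ReadComplete();
  }

 protected:
  virtual void ReadStarted() {}
  virtual void HeaderAvailable() {}
  virtual void ConsumeNGram(const std::string &ngram) = 0;
  virtual void ReadComplete() {}
};

// Records every hook call so that the call order can be inspected.
class RecordingParser : public ToyParser {
 public:
  std::vector<std::string> calls;

 protected:
  void ReadStarted() override { calls.push_back("ReadStarted"); }
  void HeaderAvailable() override { calls.push_back("HeaderAvailable"); }
  void ConsumeNGram(const std::string &ngram) override {
    calls.push_back("ConsumeNGram:" + ngram);
  }
  void ReadComplete() override { calls.push_back("ReadComplete"); }
};

// Convenience wrapper: parse `text` and return the recorded hook calls.
std::vector<std::string> RecordCalls(const std::string &text) {
  std::istringstream is(text);
  RecordingParser parser;
  parser.Read(is);
  return parser.calls;
}
```

A subclass such as ArpaLmCompiler relies on this ordering: it allocates its implementation object in HeaderAvailable(), feeds it from ConsumeNGram(), and finalizes the FST in ReadComplete().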
 

Private Member Functions

void RemoveRedundantStates ()
 
void Check () const
 

Private Attributes

int sub_eps_
 
ArpaLmCompilerImplInterface * impl_
 
fst::StdVectorFst fst_
 

Friends

template<class HistKey >
class ArpaLmCompilerImpl
 

Detailed Description

Definition at line 32 of file arpa-lm-compiler.h.

Constructor & Destructor Documentation

◆ ArpaLmCompiler()

ArpaLmCompiler ( const ArpaParseOptions &  options,
int  sub_eps,
fst::SymbolTable *  symbols 
)
inline

Definition at line 34 of file arpa-lm-compiler.h.

References ArpaLmCompiler::~ArpaLmCompiler().

36       : ArpaFileParser(options, symbols),
37         sub_eps_(sub_eps), impl_(NULL) {
38   }

◆ ~ArpaLmCompiler()

Definition at line 279 of file arpa-lm-compiler.cc.

Referenced by ArpaLmCompiler::ArpaLmCompiler().

279  {
280   if (impl_ != NULL)
281     delete impl_;
282 }

Member Function Documentation

◆ Check()

void Check ( ) const
private

Definition at line 363 of file arpa-lm-compiler.cc.

References ArpaLmCompilerImpl< HistKey >::fst_, and KALDI_ERR.

Referenced by ArpaLmCompiler::MutableFst().

363  {
364   if (fst_.Start() == fst::kNoStateId) {
365     KALDI_ERR << "Arpa file did not contain the beginning-of-sentence symbol "
366               << Symbols()->Find(Options().bos_symbol) << ".";
367   }
368 }

◆ ConsumeNGram()

void ConsumeNGram ( const NGram &  ngram )
protected virtual

Pure override that must be implemented to process current n-gram.

The n-grams are sent in the file order, which guarantees that all (k-1)-grams are processed before the first k-gram is.

Implements ArpaFileParser.

Definition at line 306 of file arpa-lm-compiler.cc.

References rnnlm::i, KALDI_WARN, and NGram::words.

Referenced by ArpaLmCompiler::MutableFst().

306  {
307   // <s> is invalid in tails, </s> in heads of an n-gram.
308   for (int i = 0; i < ngram.words.size(); ++i) {
309     if ((i > 0 && ngram.words[i] == Options().bos_symbol) ||
310         (i + 1 < ngram.words.size()
311          && ngram.words[i] == Options().eos_symbol)) {
312       if (ShouldWarn())
313         KALDI_WARN << LineReference()
314                    << " skipped: n-gram has invalid BOS/EOS placement";
315       return;
316     }
317   }
318 
319   bool is_highest = ngram.words.size() == NgramCounts().size();
320   impl_->ConsumeNGram(ngram, is_highest);
321 }
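
The placement rule in the listing above can be isolated as a pure function. The following is a minimal sketch using only the standard library: HasInvalidBosEos and IsHighestOrder are hypothetical helper names, and the integer ids passed for <s> and </s> are chosen by the caller rather than taken from ArpaParseOptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if the n-gram should be skipped: <s> appears anywhere but the
// first position, or </s> appears anywhere but the last position.
// `bos` and `eos` are caller-chosen integer ids for <s> and </s>.
bool HasInvalidBosEos(const std::vector<int> &words, int bos, int eos) {
  for (std::size_t i = 0; i < words.size(); ++i) {
    if ((i > 0 && words[i] == bos) ||
        (i + 1 < words.size() && words[i] == eos))
      return true;
  }
  return false;
}

// An n-gram has the highest order when its length equals the number of
// orders listed in the header (one count per order).
bool IsHighestOrder(const std::vector<int> &words,
                    const std::vector<int> &ngram_counts) {
  return words.size() == ngram_counts.size();
}
```

With bos = 1 and eos = 2, the sequence {1, 5, 2} is accepted, while {5, 1} and {2, 5} are rejected because <s> sits in a tail and </s> sits in a head, respectively.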

◆ Fst()

const fst::StdVectorFst& Fst ( ) const
inline

Definition at line 41 of file arpa-lm-compiler.h.

References ArpaLmCompiler::fst_.

Referenced by kaldi::CoverageTest(), and kaldi::ScoringTest().

41 { return fst_; }

◆ HeaderAvailable()

void HeaderAvailable ( )
protected virtual

Override function called to signal that the ARPA header with the expected number of n-grams has been read, and NgramCounts() is now valid.

Reimplemented from ArpaFileParser.

Definition at line 284 of file arpa-lm-compiler.cc.

References ArpaLmCompilerImpl< HistKey >::fst_, ArpaParseOptions::kAddToSymbols, KALDI_ASSERT, KALDI_LOG, and ArpaLmCompilerImpl< HistKey >::sub_eps_.

Referenced by ArpaLmCompiler::MutableFst().

284  {
285   KALDI_ASSERT(impl_ == NULL);
286   // Use optimized implementation if the grammar is 4-gram or less, and the
287   // maximum attained symbol id will fit into the optimized range.
288   int64 max_symbol = 0;
289   if (Symbols() != NULL)
290     max_symbol = Symbols()->AvailableKey() - 1;
291   // If augmenting the symbol table, assume the worst case when all words in
292   // the model being read are novel.
293   if (Options().oov_handling == ArpaParseOptions::kAddToSymbols)
294     max_symbol += NgramCounts()[0];
295 
296   if (NgramCounts().size() <= 4 && max_symbol < OptimizedHistKey::kMaxData) {
297     impl_ = new ArpaLmCompilerImpl<OptimizedHistKey>(this, &fst_, sub_eps_);
298   } else {
299     impl_ = new ArpaLmCompilerImpl<GeneralHistKey>(this, &fst_, sub_eps_);
300     KALDI_LOG << "Reverting to slower state tracking because model is large: "
301               << NgramCounts().size() << "-gram with symbols up to "
302               << max_symbol;
303   }
304 }
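
The choice between the two history-key implementations depends only on the model order and a worst-case estimate of the largest symbol id. The sketch below restates that decision as a standalone function. ChooseKeyKind and KeyKind are hypothetical names, `available_key` stands in for the value of Symbols()->AvailableKey(), and `max_data` stands in for OptimizedHistKey::kMaxData, whose actual value is not stated on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class KeyKind { kOptimized, kGeneral };

// Mirrors the selection logic of HeaderAvailable() without any Kaldi types.
//   ngram_counts   : one count per n-gram order, as read from the header
//   has_symbols    : whether a symbol table was supplied
//   available_key  : next free id in the symbol table (ignored if none)
//   add_to_symbols : whether novel words are added to the symbol table
//   max_data       : assumed upper bound of the optimized key range
KeyKind ChooseKeyKind(const std::vector<int32_t> &ngram_counts,
                      bool has_symbols, int64_t available_key,
                      bool add_to_symbols, int64_t max_data) {
  int64_t max_symbol = 0;
  if (has_symbols) max_symbol = available_key - 1;
  // Worst case: every unigram in the model is a word not seen before.
  if (add_to_symbols && !ngram_counts.empty()) max_symbol += ngram_counts[0];
  if (ngram_counts.size() <= 4 && max_symbol < max_data)
    return KeyKind::kOptimized;
  return KeyKind::kGeneral;
}
```

The estimate is deliberately pessimistic: adding the full unigram count avoids choosing the compact key and later discovering that a newly added word id does not fit.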

◆ MutableFst()

◆ ReadComplete()

void ReadComplete ( )
protected virtual

Override function called after the last n-gram has been consumed.

Reimplemented from ArpaFileParser.

Definition at line 370 of file arpa-lm-compiler.cc.

References ArpaLmCompilerImpl< HistKey >::fst_.

Referenced by ArpaLmCompiler::MutableFst().

370  {
371   fst_.SetInputSymbols(Symbols());
372   fst_.SetOutputSymbols(Symbols());
373   RemoveRedundantStates();
374   Check();
375 }

◆ RemoveRedundantStates()

void RemoveRedundantStates ( )
private

Definition at line 323 of file arpa-lm-compiler.cc.

References ArpaLmCompilerImpl< HistKey >::fst_, KALDI_LOG, fst::RemoveEpsLocal(), and ArpaLmCompilerImpl< HistKey >::sub_eps_.

Referenced by ArpaLmCompiler::MutableFst().

323  {
324   fst::StdArc::Label backoff_symbol = sub_eps_;
325   if (backoff_symbol == 0) {
326     // The method of removing redundant states implemented in this function
327     // leads to slow determinization of L o G when people use the older style of
328     // usage of arpa2fst where the --disambig-symbol option was not specified.
329     // The issue seems to be that it creates a non-deterministic FST, while G is
330     // supposed to be deterministic. By 'return'ing below, we just disable this
331     // method if people were using an older script. This method isn't really
332     // that consequential anyway, and people will move to the newer-style
333     // scripts (see current utils/format_lm.sh), so this isn't much of a
334     // problem.
335     return;
336   }
337 
338   fst::StdArc::StateId num_states = fst_.NumStates();
339 
340 
341   // replace the #0 symbols on the input of arcs out of redundant states (states
342   // that are not final and have only a backoff arc leaving them), with <eps>.
343   for (fst::StdArc::StateId state = 0; state < num_states; state++) {
344     if (fst_.NumArcs(state) == 1 && fst_.Final(state) == fst::TropicalWeight::Zero()) {
345       fst::MutableArcIterator<fst::StdVectorFst> iter(&fst_, state);
346       fst::StdArc arc = iter.Value();
347       if (arc.ilabel == backoff_symbol) {
348         arc.ilabel = 0;
349         iter.SetValue(arc);
350       }
351     }
352   }
353 
354   // we could call fst::RemoveEps, and it would have the same effect in normal
355   // cases, where backoff_symbol != 0 and there are no epsilons in unexpected
356   // places, but RemoveEpsLocal is a bit safer in case something weird is going
357   // on; it guarantees not to blow up the FST.
358   fst::RemoveEpsLocal(&fst_);
359   KALDI_LOG << "Reduced num-states from " << num_states << " to "
360             << fst_.NumStates();
361 }
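
The relabeling pass in the listing above can be tried on a toy graph without OpenFst. In this sketch ToyArc, ToyState and RelabelRedundantBackoffArcs are hypothetical names, a boolean `is_final` replaces the comparison against TropicalWeight::Zero(), and the subsequent epsilon-removal step is not reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-ins for FST arcs and states (not OpenFst types).
struct ToyArc {
  int ilabel;
  int olabel;
  int nextstate;
};

struct ToyState {
  bool is_final;
  std::vector<ToyArc> arcs;
};

// For every non-final state whose only outgoing arc is a backoff arc, replace
// the backoff input label with 0 (epsilon). Returns the number of relabeled
// arcs. Does nothing when backoff_symbol is 0, matching the early return in
// the listing.
int RelabelRedundantBackoffArcs(std::vector<ToyState> *states,
                                int backoff_symbol) {
  if (backoff_symbol == 0) return 0;
  int changed = 0;
  for (std::size_t s = 0; s < states->size(); ++s) {
    ToyState &state = (*states)[s];
    if (state.arcs.size() == 1 && !state.is_final &&
        state.arcs[0].ilabel == backoff_symbol) {
      state.arcs[0].ilabel = 0;
      ++changed;
    }
  }
  return changed;
}
```

After this pass, the relabeled arcs are ordinary epsilon arcs on the input side, which is what allows a local epsilon-removal algorithm to drop the redundant states.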

Friends And Related Function Documentation

◆ ArpaLmCompilerImpl

friend class ArpaLmCompilerImpl
friend

Definition at line 60 of file arpa-lm-compiler.h.

Member Data Documentation

◆ fst_

fst::StdVectorFst fst_
private

Definition at line 59 of file arpa-lm-compiler.h.

Referenced by ArpaLmCompiler::Fst(), and ArpaLmCompiler::MutableFst().

◆ impl_

ArpaLmCompilerImplInterface * impl_
private

Definition at line 58 of file arpa-lm-compiler.h.

◆ sub_eps_

int sub_eps_
private

Definition at line 57 of file arpa-lm-compiler.h.


The documentation for this class was generated from the following files: