ArpaLmCompilerImpl< HistKey > Class Template Reference

Public Member Functions

 ArpaLmCompilerImpl (ArpaLmCompiler *parent, fst::StdVectorFst *fst, Symbol sub_eps)
 
virtual void ConsumeNGram (const NGram &ngram, bool is_highest)
 
Public Member Functions inherited from ArpaLmCompilerImplInterface
virtual ~ArpaLmCompilerImplInterface ()
 

Private Types

typedef unordered_map< HistKey, StateId, typename HistKey::HashType > HistoryMap
 

Private Member Functions

StateId AddStateWithBackoff (HistKey key, float backoff)
 
void CreateBackoff (HistKey key, StateId state, float weight)
 

Private Attributes

ArpaLmCompiler* parent_

fst::StdVectorFst* fst_
 
Symbol bos_symbol_
 
Symbol eos_symbol_
 
Symbol sub_eps_
 
StateId eos_state_
 
HistoryMap history_
 

Detailed Description

template<class HistKey>
class kaldi::ArpaLmCompilerImpl< HistKey >

Definition at line 114 of file arpa-lm-compiler.cc.

Member Typedef Documentation

◆ HistoryMap

typedef unordered_map<HistKey, StateId, typename HistKey::HashType> HistoryMap
private

Definition at line 133 of file arpa-lm-compiler.cc.
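
The typedef implies a small contract for HistKey: it needs a nested HashType functor and equality comparison (both required by unordered_map) and, judging by how keys are used in the listings on this page, a default constructor for the empty history, an iterator-range constructor, and a Tails() method that drops the oldest word. The following is a minimal sketch under those assumptions. ToyHistKey, ToyHistoryMap, and the hash multiplier are hypothetical and are not Kaldi's definitions; Symbol and StateId are stand-in integer aliases; and making Tails() of an empty key return an empty key is a choice of this sketch only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the Kaldi/OpenFst types; not the real definitions.
using Symbol = int32_t;
using StateId = int32_t;

// A toy history key: the words of a history, oldest first.
class ToyHistKey {
 public:
  ToyHistKey() {}  // Empty history (the 0-gram).

  template <class InputIt>
  ToyHistKey(InputIt begin, InputIt end) : words_(begin, end) {}

  // Drops the oldest word. In this sketch the empty history stays empty.
  ToyHistKey Tails() const {
    if (words_.empty()) return ToyHistKey();
    return ToyHistKey(words_.begin() + 1, words_.end());
  }

  bool operator==(const ToyHistKey& other) const {
    return words_ == other.words_;
  }

  // Hash functor, exposed under the name the HistoryMap typedef expects.
  struct HashType {
    std::size_t operator()(const ToyHistKey& key) const {
      std::size_t h = 0;
      for (Symbol w : key.words_) h = h * 31 + static_cast<std::size_t>(w);
      return h;
    }
  };

 private:
  std::vector<Symbol> words_;
};

// Same shape as the HistoryMap typedef, instantiated for the toy key.
using ToyHistoryMap =
    std::unordered_map<ToyHistKey, StateId, ToyHistKey::HashType>;
```

With such a key, looking up the backoff history of "A B C" is a plain map lookup on the key's Tails().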

Constructor & Destructor Documentation

◆ ArpaLmCompilerImpl()

ArpaLmCompilerImpl ( ArpaLmCompiler parent,
fst::StdVectorFst fst,
Symbol  sub_eps 
)

Definition at line 138 of file arpa-lm-compiler.cc.

References ArpaLmCompilerImpl< HistKey >::eos_state_, ArpaLmCompilerImpl< HistKey >::fst_, ArpaLmCompilerImpl< HistKey >::history_, and ArpaLmCompilerImpl< HistKey >::sub_eps_.

    : parent_(parent), fst_(fst), bos_symbol_(parent->Options().bos_symbol),
      eos_symbol_(parent->Options().eos_symbol), sub_eps_(sub_eps) {
  // The algorithm maintains state per history. The 0-gram is a special state
  // for empty history. All unigrams (including BOS) back off into this state.
  StateId zerogram = fst_->AddState();
  history_[HistKey()] = zerogram;

  // Also, if </s> is not treated as epsilon, create a common end state for
  // all transitions accepting the </s>, since they do not back off. This small
  // optimization saves about 2% states in an average grammar.
  if (sub_eps_ == 0) {
    eos_state_ = fst_->AddState();
    fst_->SetFinal(eos_state_, 0);
  }
}

Member Function Documentation

◆ AddStateWithBackoff()

StateId AddStateWithBackoff ( HistKey  key,
float  backoff 
)
private

Definition at line 245 of file arpa-lm-compiler.cc.

References backoff, ArpaLmCompilerImpl< HistKey >::CreateBackoff(), ArpaLmCompilerImpl< HistKey >::fst_, and ArpaLmCompilerImpl< HistKey >::history_.

Referenced by ArpaLmCompilerImpl< HistKey >::ConsumeNGram().

{
  typename HistoryMap::iterator dest_it = history_.find(key);
  if (dest_it != history_.end()) {
    // Found an existing state in the history map. Invariant: if the state is
    // in the map, then its backoff arc is in the FST. We are done.
    return dest_it->second;
  }
  // Otherwise create a new state and its backoff arc, and register in the map.
  StateId dest = fst_->AddState();
  history_[key] = dest;
  CreateBackoff(key.Tails(), dest, backoff);
  return dest;
}

◆ ConsumeNGram()

void ConsumeNGram ( const NGram & ngram,
bool  is_highest 
)
virtual

Implements ArpaLmCompilerImplInterface.

Definition at line 157 of file arpa-lm-compiler.cc.

References ArpaLmCompilerImpl< HistKey >::AddStateWithBackoff(), NGram::backoff, ArpaLmCompilerImpl< HistKey >::bos_symbol_, ArpaLmCompilerImpl< HistKey >::eos_state_, ArpaLmCompilerImpl< HistKey >::eos_symbol_, ArpaLmCompilerImpl< HistKey >::fst_, ArpaLmCompilerImpl< HistKey >::history_, KALDI_ERR, KALDI_WARN, ArpaFileParser::LineReference(), NGram::logprob, ArpaLmCompilerImpl< HistKey >::parent_, ArpaFileParser::ShouldWarn(), ArpaLmCompilerImpl< HistKey >::sub_eps_, and NGram::words.

{
  // Generally, we do the following. Suppose we are adding an n-gram "A B
  // C". Then find the node for "A B", add a new node for "A B C", and connect
  // them with the arc accepting "C" with the specified weight. Also, add a
  // backoff arc from the new "A B C" node to its backoff state "B C".
  //
  // Two notable exceptions are the highest order n-grams, and final n-grams.
  //
  // When adding a highest order n-gram (e.g., our "A B C" is in a 3-gram LM),
  // the following optimization is performed. There is no point adding a node
  // for "A B C" with a "C" arc from "A B", since there will be no other
  // arcs ingoing to this node, and an epsilon backoff arc into the backoff
  // model "B C", with the weight of \bar{1}. To save a node, create an arc
  // accepting "C" directly from "A B" to "B C". This saves as many nodes
  // as there are highest order n-grams, which is typically about half
  // the size of a large 3-gram model.
  //
  // Note that this does not apply to n-grams ending in EOS, since they do not
  // back off. These are special, as they do not have a back-off state, and
  // the node for "(..anything..) </s>" is always final. These are handled
  // in one of two possible ways. If symbols <s> and </s> are being
  // replaced by epsilons, neither node nor arc is created, and the logprob
  // of the n-gram is applied to its source node as final weight. If <s> and
  // </s> are preserved, then a special final node for </s> is allocated and
  // used as the destination of the "</s>" acceptor arc.
  HistKey heads(ngram.words.begin(), ngram.words.end() - 1);
  typename HistoryMap::iterator source_it = history_.find(heads);
  if (source_it == history_.end()) {
    // There was no "A B", therefore the probability of "A B C" is zero.
    // Print a warning and discard current n-gram.
    if (parent_->ShouldWarn())
      KALDI_WARN << parent_->LineReference()
                 << " skipped: no parent (n-1)-gram exists";
    return;
  }

  StateId source = source_it->second;
  StateId dest;
  Symbol sym = ngram.words.back();
  float weight = -ngram.logprob;
  if (sym == sub_eps_ || sym == 0) {
    KALDI_ERR << " <eps> or disambiguation symbol " << sym
              << " found in the ARPA file. ";
  }
  if (sym == eos_symbol_) {
    if (sub_eps_ == 0) {
      // Keep </s> as a real symbol when not substituting.
      dest = eos_state_;
    } else {
      // Treat </s> as if it was epsilon: mark source final, with the weight
      // of the n-gram.
      fst_->SetFinal(source, weight);
      return;
    }
  } else {
    // For the highest order n-gram, this may find an existing state; for
    // non-highest, it will create one (unless there are duplicate n-grams
    // in the grammar, which cannot be reliably detected if highest order,
    // so it is better not to attempt detection at all).
    dest = AddStateWithBackoff(
        HistKey(ngram.words.begin() + (is_highest ? 1 : 0),
                ngram.words.end()),
        -ngram.backoff);
  }

  if (sym == bos_symbol_) {
    weight = 0;  // Accepting <s> is always free.
    if (sub_eps_ == 0) {
      // <s> is a real symbol, only accepted in the start state.
      source = fst_->AddState();
      fst_->SetStart(source);
    } else {
      // The new state for <s> unigram history *is* the start state.
      fst_->SetStart(dest);
      return;
    }
  }

  // Add arc from source to dest, whichever way it was found.
  fst_->AddArc(source, fst::StdArc(sym, sym, weight, dest));
  return;
}
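
The choice of source and destination histories in ConsumeNGram() can be illustrated in isolation. The sketch below is an illustration only: Words, SourceHistory, and DestHistory are hypothetical names, and a plain vector of word ids stands in for HistKey. It mirrors the two iterator ranges used in the listing: the source history drops the last word, and the destination history drops the first word only for highest-order n-grams.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an n-gram's word ids, oldest word first.
using Words = std::vector<int>;

// History looked up for the source state: all words but the last ("A B").
// Assumes a non-empty n-gram.
Words SourceHistory(const Words& ngram) {
  return Words(ngram.begin(), ngram.end() - 1);
}

// History for the destination state. A highest-order n-gram gets no state of
// its own and points straight at its backoff history ("B C"); any other
// n-gram keeps its full word sequence ("A B C").
Words DestHistory(const Words& ngram, bool is_highest) {
  return Words(ngram.begin() + (is_highest ? 1 : 0), ngram.end());
}
```

For a unigram the source history is empty, which corresponds to the 0-gram state registered in the constructor.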

◆ CreateBackoff()

void CreateBackoff ( HistKey  key,
StateId  state,
float  weight 
)
inline private

Definition at line 265 of file arpa-lm-compiler.cc.

References ArpaLmCompilerImpl< HistKey >::fst_, ArpaLmCompilerImpl< HistKey >::history_, and ArpaLmCompilerImpl< HistKey >::sub_eps_.

Referenced by ArpaLmCompilerImpl< HistKey >::AddStateWithBackoff().

{
  typename HistoryMap::iterator dest_it = history_.find(key);
  while (dest_it == history_.end()) {
    key = key.Tails();
    dest_it = history_.find(key);
  }

  // The arc should transduce either <eos> or #0 to <eps>, depending on the
  // epsilon substitution mode. This is the only case when input and output
  // label may differ.
  fst_->AddArc(state, fst::StdArc(sub_eps_, 0, weight, dest_it->second));
}
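
Taken together, AddStateWithBackoff() and CreateBackoff() amount to a find-or-create step followed by a search for the longest registered suffix of the history. The following is a simplified sketch of that interplay, not the Kaldi implementation: ToyGraph, ToyArc, and Key are hypothetical, a std::map keyed by a vector of word ids is swapped in for the unordered_map keyed by HistKey, and arc labels are left out. The search loop terminates because the empty history is registered up front, as the constructor on this page does for the 0-gram.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical simplified types, chosen for brevity; not the Kaldi/OpenFst ones.
using Key = std::vector<int>;  // History words, oldest first.
struct ToyArc { int from; int to; float weight; };

struct ToyGraph {
  int num_states = 0;
  std::vector<ToyArc> backoff_arcs;
  std::map<Key, int> history;

  ToyGraph() { history[Key()] = AddState(); }  // State 0: empty history.

  int AddState() { return num_states++; }

  // Drops the oldest word; the empty key stays empty in this sketch.
  static Key Tails(const Key& key) {
    return key.empty() ? Key() : Key(key.begin() + 1, key.end());
  }

  // Links `state` to the longest registered suffix of `key`.
  void CreateBackoff(Key key, int state, float weight) {
    auto it = history.find(key);
    while (it == history.end()) {  // Terminates: the empty key is registered.
      key = Tails(key);
      it = history.find(key);
    }
    backoff_arcs.push_back({state, it->second, weight});
  }

  // Returns the state for `key`, creating it and its backoff arc on first use.
  int AddStateWithBackoff(const Key& key, float backoff) {
    auto it = history.find(key);
    if (it != history.end()) return it->second;  // Backoff arc already exists.
    int dest = AddState();
    history[key] = dest;
    CreateBackoff(Tails(key), dest, backoff);
    return dest;
  }
};
```

In this sketch, if the history {2, 3} was never registered, a new state for {1, 2, 3} backs off to the state for {3}, or to the empty history if {3} is missing as well. A repeated call with an already registered key returns the existing state and ignores the backoff argument.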

Member Data Documentation

◆ bos_symbol_

Symbol bos_symbol_
private

Definition at line 127 of file arpa-lm-compiler.cc.

Referenced by ArpaLmCompilerImpl< HistKey >::ConsumeNGram().

◆ eos_state_

◆ eos_symbol_

Symbol eos_symbol_
private

Definition at line 128 of file arpa-lm-compiler.cc.

Referenced by ArpaLmCompilerImpl< HistKey >::ConsumeNGram().

◆ fst_

◆ history_

◆ parent_

ArpaLmCompiler* parent_
private

Definition at line 125 of file arpa-lm-compiler.cc.

Referenced by ArpaLmCompilerImpl< HistKey >::ConsumeNGram().

◆ sub_eps_


The documentation for this class was generated from the following file:
arpa-lm-compiler.cc