pre-determinize-inl.h
Go to the documentation of this file.
1 // fstext/pre-determinize-inl.h
2 
3 // Copyright 2009-2011 Microsoft Corporation
4 
5 // See ../../COPYING for clarification regarding multiple authors
6 //
7 // Licensed under the Apache License, Version 2.0 (the "License");
8 // you may not use this file except in compliance with the License.
9 // You may obtain a copy of the License at
10 //
11 // http://www.apache.org/licenses/LICENSE-2.0
12 //
13 // THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14 // KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
15 // WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
16 // MERCHANTABLITY OR NON-INFRINGEMENT.
17 // See the Apache 2 License for the specific language governing permissions and
18 // limitations under the License.
19 
20 #ifndef KALDI_FSTEXT_PRE_DETERMINIZE_INL_H_
21 #define KALDI_FSTEXT_PRE_DETERMINIZE_INL_H_
22 
23 
24 /* Do not include this file directly. It is an implementation file included by PreDeterminize.h */
25 
26 /*
27  Predeterminization
28 
29  This is a function that makes an FST compactly determinizable by inserting symbols on the input
30  side as necessary for disambiguation. Note that we do not treat epsilon as a real symbol
31  when measuring determinizability in this sense. The extra symbols are added to the vocabulary,
32  on the input side; these are of the form (prefix)1, (prefix)2, and so on without limit, where
33  (prefix) is some prefix the user provides, e.g. '#' (the function checks that this will not
34  lead to conflicts with symbols already in the FST). The function tells us how many such
35  symbols it created.
36 
37  Note that there is a paper "Generalized optimization algorithm for speech recognition
38  transducers" by Allauzen and Mohri, that deals with a similar issue, but this is a very
39  different algorithm that only aims to ensure determinizability, but not *compact*
40  determinizability.
41 
42  Our algorithm is slightly heuristic, and probably not optimal, but does ensure that the
43  output is compactly determinizable, possibly at the expense of inserting unnecessary
44  symbols. We considered more sophisticated algorithms, but these were extremely
45  complicated and would give the same output for the kinds of inputs that we envisage.
46 
47  Suppose the input FST is T. We want to ensure that in det(T), if we consider the
48  states of det(T) as weighted subsets of states of T, each state of T only appears once
49  in any given subset. This ensures that det(T) is no larger than T in an appropriate
50  sense. The way we do this is as follows. We identify all states in T that have
51  multiple input transitions (counting "being an initial state" as an input transition).
52  Let's call these "problematic" states. For a problematic state p we stipulate that it
53  can never appear in any state of det(T) unless that state equals (p, \bar{1}) [i.e. p,
54  unweighted]. In order to ensure this, we insert input symbols on the transitions to these
55  problematic states (this may necessitate adding extra states).
56  We also stipulate that the path through det(T) should always be sufficient to tell us
57  the path through T (and we insert extra symbols sufficient to make this so). This is to
58  simplify the algorithm, so that we don't have to consider the output symbols or weights
59  when predeterminizing.
60 
61  The algorithm is as follows.
62 
63  (A) Definitions
64 
65  (i) Define a *problematic state* as a state that either has multiple input transitions,
66  or is an initial state and has at least one input transition.
67 
68  (ii) For an arc a, define:
69  i[a] = input symbol on a
70  o[a] = output symbol on a
71  n[a] = dest-state of a
72  p[a] = origin-state of a
73 
74  For a state q, define
75  E[q] = set of transitions leaving q.
76  For a set of states Q, define
77  E[Q] = set of transitions leaving some q in Q
78 
79  (iii) For a state s, define Closure(s) as the union of state s, and all states t
80  that are reachable via sequences of arcs a such that i[a]=epsilon and n[a] is
81  not problematic.
82 
83  For a set of states S, define Closure(S) as the union of the closures of
84  states s in S.
85 
86  (B) Inputs and outputs.
87 
88  (i) Inputs and preconditions. Input is an FST, which should have a symbol table compiled into
89  it, and a prefix (e.g. #) for symbols to be added. We check that the input FST is trim,
90  and that it does not have any symbols that appear on its arcs, that are equal to the prefix
91  followed by digits.
92 
93  (ii) Outputs: The algorithm modifies the FST that is given to it, and returns the number of
94  the highest numbered "extra symbol" inserted. The extra symbols are numbered #1, #2 and
95  so on without limit (as integers). They are inserted into the symbol table in a sequential
96  way by calling AvailableKey()
97  for each in turn (this is stipulated in case we need to keep other symbol tables in sync).
98 
99  (C) Sub-algorithm: Closure(S). This requires the array p(s), defined below, which is true
100  if s is problematic. This also requires, for efficiency, that the arcs be sorted on input
101  label.
102  Input: a set of states S. [plus, the fst and the array p].
103  Output: a set of states T.
104  Algorithm:
105  set T <-- S, Q <-- S.
106  while Q is nonempty:
107  pop a state s from Q.
108  for each transition a from state s with epsilon on the input label [we can
109  find these efficiently using the sorting on arcs]:
110  If p(n[a]) is false and n[a] is not in T:
111  Insert n[a] into T.
112  Add n[a] to Q.
113  return T.
114 
115 
116  (D) Main algorithm.
117 
118 
119  (i) (a) Check preconditions (FST is trim)
120  (b) Make sure there is just one final state (insert epsilon transitions as necessary).
121  (c) Sort arcs on input label (so epsilon arcs are at the start of arc lists).
122 
123 
124  (ii) Work out the set of problematic states by constructing a boolean array indexed by
125  states, i.e.
126  p(s)
127  which is true if the state is problematic. We can do this by constructing an array
128  t(s) to store the number of transitions into each state [adding one for the initial state],
129  and then setting p(s) = true if t(s) > 1.
130 
131  Also create a boolean array d(s), defined for states, and set d(s) = false.
132  This array is purely for sanity-checking that we are processing each state exactly once.
133 
134  (iii) Set up an array of integers m(a), indexed by arcs (how exactly we store these is
135  implementation-dependent, but this will probably be a hash from (state, arc-index) to
136  integers. m(a) will store the extra symbol, if any, to be added to that arc (or -1
137  if no such symbol; we can also simply have the arc not present in the hash). The
138  initial value of m(a) is -1 (if array), or undefined (if hash).
139 
140  (iv) Initialize a set of sets-of-states S, and a queue of pairs Q, as follows.
141  The pairs in Q are a pair of (set-of-states, integer), where the integer
142  is the number of "special symbols" already used up for that state.
143 
144  Note that we use a special indexing for the sets in both S and Q, rather than
145  using std::set. We use a sorted vector of StateId's. And in S, we index them
146  by the lowest-numbered state-id. Because each state is supposed to only ever
147  be a member of one set, if there is an attempt to add another, different set
148  with the same lowest-numbered state-id, we detect an error.
149 
150  Let I be the single initial state (OpenFST only supports one).
151  We set:
152  S = { Closure(I) }
153  Push (Closure(I), 0) onto Q.
154  Then for each state s such that p(s) = true, and s is not an initial state:
155  S <-- S u { Closure(s) }
156  Push (Closure(s), 0) onto Q.
157 
158  (v) While Q is nonempty:
159 
160  (a) Pop pair (A, n) from Q (queue discipline is arbitrary).
161 
162  (b) For each state s in A, check that d(s) is false, and set d(s) to true.
163  This is for sanity checking only.
164 
165  (c)
166  Let S_\eps be the set of epsilon-transitions from members of A to problematic
167  states (i.e. S_\eps = \{ a \in E[A]: i[a]=\epsilon, p(n[a]) = true \}).
168 
169  Next, we will define, for each t \neq \epsilon, S_t as the set of
170  transitions from some state s in S with t as the input label, i.e.:
171  S_t = \{ a \in E[A]: i[a] = t \}
172  We further define T_t and U_t as the subsets of S where the destination
173  state is problematic and non-problematic respectively, i.e:
174  T_t = \{ a \in E[A]: i[a] = t, p(n[a]) = true \}
175  U_t = \{ a \in E[A]: i[a] = t, p(n[a]) = false \}
176 
177  The easiest way to obtain these sets is probably to have a hash indexed by
178  t that maps to a list of pairs (state, arc-offset) that stores S_t.
179  From this we can work out the sizes of T_t and U_t on the fly.
180 
181  (d)
182  for each transition a in S_\eps:
183  m(a) <-- n # Will put symbol n on this transition.
184  n <-- n+1 # Note, same n as in pair (A, n)
185 
186  (e)
187  next,
188  for each t\neq epsilon s.t. S_t is nonempty,
189 
190  if |S_t| > 1 #if-statement is because if |S_t|=|T_t|=1, no need for prefix.
191  k = 0
192  for each transition a in T_t:
193  set m(a) to k.
194  set k = k+1
195 
196  if |U_t| > 0
197  Let V_t be the set of destination-states of arcs in U_t.
198  if Closure(V_t) is not in S:
199  insert Closure(V_t) into S, and add the pair (Closure(V_t), k) to Q.
200 
201  (vi) Check that for each state in the FST, d(s) = true.
202 
203  (vii) Let n = max_a m(a). This is the highest-numbered extra symbol (extra symbols
204  start from zero, in this numbering which doesn't correspond to the symbol-table
205  numbering). Here we add n+1 extra symbols to the symbol table and store
206  the mappings from 0, 1, ... n to the symbol-id.
207 
208  (viii) Set up a hash h from (state, int) to (state-id) such that
209  t = h(s, k)
210  will be the state-id of a newly-created state that has a transition to state s
211  with input-label #k.
212 
213  (ix) For each arc a such that m(a) != 0:
214  If i[a] = epsilon (the input label is epsilon):
215  Change i[a] to #m(a). [i.e. prefix then digit m(a)]
216  Otherwise:
217  If t = h(n[a], m(a)) is not defined [where n[a] is the dest-state]:
218  create a new state t with a transition to n[a], with input-label #m(a) and
219  no output-label or weight. Set h(n[a], m(a)) = t.
220  Change n[a] to h(n[a], m(a)).
221 
222 
223 */
224 namespace fst {
225 
226 namespace pre_determinize_helpers {
227 
228 // make it inline to avoid having to put it in a .cc file which most functions here
229 // could not go in.
230 inline bool HasBannedPrefixPlusDigits(SymbolTable *symTable, std::string prefix, std::string *bad_sym) {
231  // returns true if the symbol table contains any string consisting of this
232  // (possibly empty) prefix followed by a nonempty sequence of digits (0 to 9).
233  // requires symTable to be non-NULL.
234  // if bad_sym != NULL, puts the first bad symbol it finds in *bad_sym.
235  assert(symTable != NULL);
236  const char *prefix_ptr = prefix.c_str();
237  size_t prefix_len = strlen(prefix_ptr); // allowed to be zero but not encouraged.
238  for (SymbolTableIterator siter(*symTable); !siter.Done(); siter.Next()) {
239  const char *sym = siter.Symbol().c_str();
240  if (!strncmp(prefix_ptr, sym, prefix_len)) { // has prefix.
241  if (isdigit(sym[prefix_len])) { // we don't allow prefix followed by a digit, as a symbol.
242  // Has at least one digit.
243  size_t pos;
244  for (pos = prefix_len;sym[pos] != '\0'; pos++)
245  if (!isdigit(sym[pos])) break;
246  if (sym[pos] == '\0') { // All remaining characters were digits.
247  if (bad_sym != NULL) *bad_sym = (std::string) sym;
248  return true;
249  }
250  } // else OK because prefix was followed by '\0' or a non-digit.
251  }
252  }
253  return false; // doesn't have banned symbol.
254 }
255 
256 template<class T> void CopySetToVector(const std::set<T> s, std::vector<T> *v) {
257  // adds members of s to v, in sorted order from lowest to highest
258  // (because the set was in sorted order).
259  assert(v != NULL);
260  v->resize(s.size());
261  typename std::set<T>::const_iterator siter = s.begin();
262  typename std::vector<T>::iterator viter = v->begin();
263  for (; siter != s.end(); ++siter, ++viter) {
264  assert(viter != v->end());
265  *viter = *siter;
266  }
267 }
268 
269 // Warning. This function calls 'new'.
270 template<class T>
271 std::vector<T>* InsertMember(const std::vector<T> m, std::vector<std::vector<T>*> *S) {
272  assert(m.size() > 0);
273  T idx = m[0];
274  assert(idx>=(T)0 && idx < (T)S->size());
275  if ( (*S)[idx] != NULL) {
276  assert( *((*S)[idx]) == m );
277  // The vectors should be the same. Otherwise this is a bug in the algorithm.
278  // It could either be a programming error or a deeper conceptual bug.
279  return NULL; // nothing was inserted.
280  } else {
281  std::vector<T> *ret = (*S)[idx] = new std::vector<T>(m); // New copy of m.
282  return ret; // was inserted.
283  }
284 }
285 
286 
287 // See definition of Closure(S) in item A(iii) in the comment above. it's the set of states
288 // that are reachable from S via sequences of arcs a such that i[a]=epsilon and n[a] is
289 // not problematic. We assume that the fst is sorted on input label (so epsilon arcs first)
290 // The algorithm is described in section (C) above. We use the same variable for S and T.
291 template<class Arc> void Closure(MutableFst<Arc> *fst, std::set<typename Arc::StateId> *S,
292  const std::vector<bool> &pVec) {
293  typedef typename Arc::StateId StateId;
294  std::vector<StateId> Q;
295  CopySetToVector(*S, &Q);
296  while (Q.size() != 0) {
297  StateId s = Q.back();
298  Q.pop_back();
299  for (ArcIterator<MutableFst<Arc> > aiter(*fst, s); ! aiter.Done(); aiter.Next()) {
300  const Arc &arc = aiter.Value();
301  if (arc.ilabel != 0) break; // Break from the loop: due to sorting there will be no
302  // more transitions with epsilons as input labels.
303  if (!pVec[arc.nextstate]) { // Next state is not problematic -> we can use this transition.
304  std::pair< typename std::set<StateId>::iterator, bool > p = S->insert(arc.nextstate);
305  if (p.second) { // True means: was inserted into S (wasn't already there).
306  Q.push_back(arc.nextstate);
307  }
308  }
309  }
310  }
311 } // end function Closure.
312 
313 } // end namespace pre_determinize_helpers.
314 
315 
316 template<class Arc, class Int>
317 void PreDeterminize(MutableFst<Arc> *fst,
318  typename Arc::Label first_new_sym,
319  std::vector<Int> *symsOut) {
320  typedef typename Arc::Label Label;
321  typedef typename Arc::StateId StateId;
322  typedef size_t ArcId; // Our own typedef, not standard OpenFst. Use size_t
323  // for compatibility with argument of ArcIterator::Seek().
324  typedef typename Arc::Weight Weight;
325  assert(first_new_sym > 0);
326  assert(fst != NULL);
327  if (fst->Start() == kNoStateId) return; // for empty FST, nothing to do.
328  assert(symsOut != NULL && symsOut->size() == 0); // we will output the symbols we add into this.
329 
330  { // (D)(i)(a): check is trim (i.e. connected, in OpenFST parlance).
331  KALDI_VLOG(2) << "PreDeterminize: Checking FST properties";
332  uint64 props = fst->Properties(kAccessible|kCoAccessible, true); // true-> computes properties if unknown at time when called.
333  if (props != (kAccessible|kCoAccessible)) { // All states are not both accessible and co-accessible...
334  KALDI_ERR << "PreDeterminize: FST is not trim";
335  }
336  }
337 
338  { // (D)(i)(b): make single final state.
339  KALDI_VLOG(2) << "PreDeterminize: creating single final state";
340  CreateSuperFinal(fst);
341  }
342 
343  { // (D)(i)(c): sort arcs on input.
344  KALDI_VLOG(2) << "PreDeterminize: sorting arcs on input";
345  ILabelCompare<Arc> icomp;
346  ArcSort(fst, icomp);
347  }
348 
349  StateId n_states = 0, max_state = 0; // Compute n_states, max_state = highest-numbered state.
350  { // compute nStates, maxStates.
351  for (StateIterator<MutableFst<Arc> > iter(*fst); ! iter.Done(); iter.Next()) {
352  StateId state = iter.Value();
353  assert(state>=0);
354  n_states++;
355  if (state > max_state) max_state = state;
356  }
357  KALDI_VLOG(2) << "PreDeterminize: n_states = "<<(n_states)<<", max_state ="<<(max_state);
358  }
359 
360  std::vector<bool> p_vec(max_state+1, false); // compute this next.
361  { // D(ii): computing the array p. ["problematic states, i.e. states with >1 input transition,
362  // counting being the initial state as an input transition"].
363  std::vector<bool> seen_vec(max_state+1, false); // rather than counting incoming transitions we just have a bool that says we saw at least one.
364 
365  seen_vec[fst->Start()] = true;
366  for (StateIterator<MutableFst<Arc> > siter(*fst); ! siter.Done(); siter.Next()) {
367  for (ArcIterator<MutableFst<Arc> > aiter(*fst, siter.Value()); ! aiter.Done(); aiter.Next()) {
368  const Arc &arc = aiter.Value();
369  assert(arc.nextstate>=0&&arc.nextstate<max_state+1);
370  if (seen_vec[arc.nextstate])
371  p_vec[arc.nextstate] = true; // now have >1 transition in, so problematic.
372  else
373  seen_vec[arc.nextstate] = true;
374  }
375  }
376  }
377  // D(iii): set up m(a)
378  std::map<std::pair<StateId, ArcId>, size_t> m_map;
379  // This is the array m, indexed by arcs. It maps to the index of the symbol we add.
380 
381 
382  // WARNING: we should be sure to clean up this memory before exiting. Do not return
383  // or throw an exception from this function, later than this point, without cleaning up!
384  // Note that the vectors are shared between Q and S (they "belong to" S.
385  std::vector<std::vector<StateId>* > S(max_state+1, (std::vector<StateId>*)(void*)0);
386  std::vector<std::pair<std::vector<StateId>*, size_t> > Q;
387 
388  // D(iv): initialize S and Q.
389  {
390  std::vector<StateId> all_seed_states; // all "problematic" states, plus initial state (if not problematic).
391  if (!p_vec[fst->Start()])
392  all_seed_states.push_back(fst->Start());
393  for (StateId s = 0;s<=max_state; s++)
394  if (p_vec[s]) all_seed_states.push_back(s);
395 
396  for (size_t idx = 0;idx < all_seed_states.size(); idx++) {
397  StateId s = all_seed_states[idx];
398  std::set<StateId> closure_s;
399  closure_s.insert(s); // insert "seed" state.
400  pre_determinize_helpers::Closure(fst, &closure_s, p_vec); // follow epsilons to non-problematic states.
401  // Closure in this case whis will usually not add anything, for typical topologies in speech
402  std::vector<StateId> closure_s_vec;
403  pre_determinize_helpers::CopySetToVector(closure_s, &closure_s_vec);
404  KALDI_ASSERT(closure_s_vec.size() != 0);
405  std::vector<StateId> *ptr = pre_determinize_helpers::InsertMember(closure_s_vec, &S);
406  KALDI_ASSERT(ptr != NULL); // Or conceptual bug or programming error.
407  Q.push_back(std::pair<std::vector<StateId>*, size_t>(ptr, 0));
408  }
409  }
410 
411  std::vector<bool> d_vec(max_state+1, false); // "done vector". Purely for debugging.
412 
413 
414  size_t num_extra_det_states = 0;
415 
416  // (D)(v)
417  while (Q.size() != 0) {
418 
419  // (D)(v)(a)
420  std::pair<std::vector<StateId>*, size_t> cur_pair(Q.back());
421  Q.pop_back();
422  const std::vector<StateId> &A(*cur_pair.first);
423  size_t n =cur_pair.second; // next special symbol to add.
424 
425  // (D)(v)(b)
426  for (size_t idx = 0;idx < A.size(); idx++) {
427  assert(d_vec[A[idx]] == false && "This state has been seen before. Algorithm error.");
428  d_vec[A[idx]] = true;
429  }
430 
431  // From here is (D)(v)(c). We work out S_\eps and S_t (for t\neq eps)
432  // simultaneously at first.
433  std::map<Label, std::set<std::pair<std::pair<StateId, ArcId>, StateId> > > arc_hash;
434  // arc_hash is a hash with info of all arcs from states in the set A to
435  // non-problematic states.
436  // It is a map from ilabel to pair(pair(start-state, arc-offset), end-state).
437  // Here, arc-offset reflects the order in which we accessed the arc using the
438  // ArcIterator (zero for the first arc).
439 
440 
441  { // This block sets up arc_hash
442  for (size_t idx = 0;idx < A.size(); idx++) {
443  StateId s = A[idx];
444  assert(s>=0 && s<=max_state);
445  ArcId arc_id = 0;
446  for (ArcIterator<MutableFst<Arc> > aiter(*fst, s); ! aiter.Done(); aiter.Next(), ++arc_id) {
447  const Arc &arc = aiter.Value();
448 
449  std::pair<std::pair<StateId, ArcId>, StateId>
450  this_pair(std::pair<StateId, ArcId>(s, arc_id), arc.nextstate);
451  bool inserted = (arc_hash[arc.ilabel].insert(this_pair)).second;
452  assert(inserted); // Otherwise we had a duplicate.
453  }
454  }
455  }
456 
457  // (D)(v)(d)
458  if (arc_hash.count(0) == 1) { // We have epsilon transitions out.
459  std::set<std::pair<std::pair<StateId, ArcId>, StateId> > &eps_set = arc_hash[0];
460  typedef typename std::set<std::pair<std::pair<StateId, ArcId>, StateId> >::iterator set_iter_t;
461  for (set_iter_t siter = eps_set.begin(); siter != eps_set.end(); ++siter) {
462  const std::pair<std::pair<StateId, ArcId>, StateId> &this_pr = *siter;
463  if (p_vec[this_pr.second]) { // Eps-transition to problematic state.
464  assert(m_map.count(this_pr.first) == 0);
465  m_map[this_pr.first] = n;
466  n++;
467  }
468  }
469  }
470 
471  // (D)(v)(e)
472  {
473  typedef typename std::map<Label, std::set<std::pair<std::pair<StateId, ArcId>, StateId> > >::iterator map_iter_t;
474  typedef typename std::set<std::pair<std::pair<StateId, ArcId>, StateId> >::iterator set_iter_t2;
475  for (map_iter_t miter = arc_hash.begin(); miter != arc_hash.end(); ++miter) {
476  Label t = miter->first;
477  std::set<std::pair<std::pair<StateId, ArcId>, StateId> > &S_t = miter->second;
478  if (t != 0) { // For t != epsilon,
479  std::set<StateId> V_t; // set of destination non-problem states. Will create this set now.
480 
481  // exists_noproblem is true iff |U_t| > 0.
482  size_t k = 0;
483 
484  // First loop "for each transition a in T_t" (i.e. transitions to problematic states)
485  // The if-statement if (|S_t|>1) is pushed inside the loop, as the loop also computes
486  // the set V_t.
487  for (set_iter_t2 siter = S_t.begin(); siter != S_t.end(); ++siter) {
488  const std::pair<std::pair<StateId, ArcId>, StateId> &this_pr = *siter;
489  if (p_vec[this_pr.second]) { // only consider problematic states (just set T_t)
490  if (S_t.size() > 1) { // This is where we pushed the if-statement in.
491  assert(m_map.count(this_pr.first) == 0);
492  m_map[this_pr.first] = k;
493  k++;
494  num_extra_det_states++;
495  }
496  } else { // Create the set V_t.
497  V_t.insert(this_pr.second);
498  }
499  }
500  if (V_t.size() != 0) {
501  pre_determinize_helpers::Closure(fst, &V_t, p_vec); // follow epsilons to non-problematic states.
502  std::vector<StateId> closure_V_t_vec;
503  pre_determinize_helpers::CopySetToVector(V_t, &closure_V_t_vec);
504  std::vector<StateId> *ptr = pre_determinize_helpers::InsertMember(closure_V_t_vec, &S);
505  if (ptr != NULL) { // was inserted.
506  Q.push_back(std::pair<std::vector<StateId>*, size_t>(ptr, k));
507  }
508  }
509  }
510  }
511  }
512  } // end while (Q.size() != 0)
513 
514 
515  { // (D)(vi): Check that for each state in the FST, d(s) = true.
516  for (StateIterator<MutableFst<Arc> > siter(*fst); ! siter.Done(); siter.Next()) {
517  StateId val = siter.Value();
518  assert(d_vec[val] == true);
519  }
520  }
521 
522  { // (D)(vii): compute symbol-table ID's.
523  // sets up symsOut array.
524  int64 n = -1;
525  for (typename std::map<std::pair<StateId, ArcId>, size_t>::iterator m_iter = m_map.begin();
526  m_iter != m_map.end();
527  ++m_iter) {
528  n = std::max(n, (int64) m_iter->second); // m_iter->second is of type size_t.
529  }
530  // At this point n is the highest symbol-id (type size_t) of symbols we must add.
531  n++; // This is now the number of symbols we must add.
532  for (size_t i = 0;static_cast<int64>(i)<n;i++) symsOut->push_back(first_new_sym + i);
533  }
534 
535  // (D)(viii): set up hash.
536  std::map<std::pair<StateId, size_t>, StateId> h_map;
537 
538  { // D(ix): add extra symbols! This is where the work gets done.
539 
540  // Core part of this is below, search for (*)
541  size_t n_states_added = 0;
542 
543  for (typename std::map<std::pair<StateId, ArcId>, size_t>::iterator m_iter = m_map.begin();
544  m_iter != m_map.end();
545  ++m_iter) {
546  StateId state = m_iter->first.first;
547  ArcId arcpos = m_iter->first.second;
548  size_t m_a = m_iter->second;
549 
550  MutableArcIterator<MutableFst<Arc> > aiter(fst, state);
551  aiter.Seek(arcpos);
552  Arc arc = aiter.Value();
553 
554  // (*) core part here.
555  if (arc.ilabel == 0)
556  arc.ilabel = (*symsOut)[m_a];
557  else {
558  std::pair<StateId, size_t> pr(arc.nextstate, m_a);
559  if (!h_map.count(pr)) {
560  n_states_added++;
561  StateId newstate = fst->AddState();
562  assert(newstate>=0);
563  Arc new_arc( (*symsOut)[m_a], (Label)0, Weight::One(), arc.nextstate);
564  fst->AddArc(newstate, new_arc);
565  h_map[pr] = newstate;
566  }
567  arc.nextstate = h_map[pr];
568  }
569  aiter.SetValue(arc);
570  }
571 
572  KALDI_VLOG(2) << "Added " <<(n_states_added)<<" new states and added/changed "<<(m_map.size())<<" arcs";
573 
574  }
575  // Now free up memory.
576  for (size_t i = 0;i < S.size();i++)
577  delete S[i];
578 } // end function PreDeterminize
579 
580 
581 template<class Label> void CreateNewSymbols(SymbolTable *input_sym_table, int nSym,
582  std::string prefix, std::vector<Label> *symsOut) {
583  // Creates nSym new symbols named (prefix)0, (prefix)1 and so on.
584  // Crashes if it cannot create them because one or more of them were in the symbol
585  // table already.
586  assert(symsOut && symsOut->size() == 0);
587  for (int i = 0;i < nSym;i++) {
588  std::stringstream ss; ss << prefix << i;
589  std::string str = ss.str();
590  if (input_sym_table->Find(str) != -1) { // should not be present.
591  }
592  assert(symsOut);
593  symsOut->push_back( (Label) input_sym_table->AddSymbol(str));
594  }
595 }
596 
597 
598 // see pre-determinize.h for documentation.
599 template<class Arc> void AddSelfLoops(MutableFst<Arc> *fst, std::vector<typename Arc::Label> &isyms,
600  std::vector<typename Arc::Label> &osyms) {
601  assert(fst != NULL);
602  assert(isyms.size() == osyms.size());
603  typedef typename Arc::Label Label;
604  typedef typename Arc::StateId StateId;
605  typedef typename Arc::Weight Weight;
606  size_t n = isyms.size();
607  if (n == 0) return; // Nothing to do.
608 
609  // {
610  // the following declarations and statements are for quick detection of these
611  // symbols, which is purely for debugging/checking purposes.
612  Label isyms_min = *std::min_element(isyms.begin(), isyms.end()),
613  isyms_max = *std::max_element(isyms.begin(), isyms.end()),
614  osyms_min = *std::min_element(osyms.begin(), osyms.end()),
615  osyms_max = *std::max_element(osyms.begin(), osyms.end());
616  std::set<Label> isyms_set, osyms_set;
617  for (size_t i = 0; i < isyms.size(); i++) {
618  assert(isyms[i] > 0 && osyms[i] > 0); // should not have epsilon or invalid symbols.
619  isyms_set.insert(isyms[i]);
620  osyms_set.insert(osyms[i]);
621  }
622  assert(isyms_set.size() == n && osyms_set.size() == n);
623  // } end block.
624 
625  for (StateIterator<MutableFst<Arc> > siter(*fst); ! siter.Done(); siter.Next()) {
626  StateId state = siter.Value();
627  bool this_state_needs_self_loops = (fst->Final(state) != Weight::Zero());
628  for (ArcIterator<MutableFst<Arc> > aiter(*fst, state); ! aiter.Done(); aiter.Next()) {
629  const Arc &arc = aiter.Value();
630  // If one of the following asserts fails, it means that the input FST already had the symbols
631  // we are inserting. This is contrary to the preconditions of this algorithm.
632  assert(!(arc.ilabel>=isyms_min && arc.ilabel<=isyms_max && isyms_set.count(arc.ilabel) != 0));
633  assert(!(arc.olabel>=osyms_min && arc.olabel<=osyms_max && osyms_set.count(arc.olabel) != 0));
634  if (arc.olabel != 0) // Has non-epsilon output label -> need self loops.
635  this_state_needs_self_loops = true;
636  }
637  if (this_state_needs_self_loops) {
638  for (size_t i = 0;i < n;i++) {
639  Arc arc;
640  arc.ilabel = isyms[i];
641  arc.olabel = osyms[i];
642  arc.weight = Weight::One();
643  arc.nextstate = state;
644  fst->AddArc(state, arc);
645  }
646  }
647  }
648 }
649 
650 template<class Arc>
651 int64 DeleteISymbols(MutableFst<Arc> *fst, std::vector<typename Arc::Label> isyms) {
652 
653  // We could do this using the Mapper concept, but this is much easier to understand.
654 
655  typedef typename Arc::Label Label;
656  typedef typename Arc::StateId StateId;
657 
658  int64 num_deleted = 0;
659 
660  if (isyms.size() == 0) return 0;
661  Label isyms_min = *std::min_element(isyms.begin(), isyms.end()),
662  isyms_max = *std::max_element(isyms.begin(), isyms.end());
663  bool isyms_consecutive = (isyms_max+1-isyms_min == static_cast<Label>(isyms.size()));
664  std::set<Label> isyms_set;
665  if (!isyms_consecutive)
666  for (size_t i = 0;i < isyms.size();i++)
667  isyms_set.insert(isyms[i]);
668 
669  for (StateIterator<MutableFst<Arc> > siter(*fst); ! siter.Done(); siter.Next()) {
670  StateId state = siter.Value();
671  for (MutableArcIterator<MutableFst<Arc> > aiter(fst, state); ! aiter.Done(); aiter.Next()) {
672  const Arc &arc = aiter.Value();
673  if (arc.ilabel >= isyms_min && arc.ilabel <= isyms_max) {
674  if (isyms_consecutive || isyms_set.count(arc.ilabel) != 0) {
675  num_deleted++;
676  Arc mod_arc (arc);
677  mod_arc.ilabel = 0; // change label to epsilon.
678  aiter.SetValue(mod_arc);
679  }
680  }
681  }
682  }
683  return num_deleted;
684 }
685 
686 template<class Arc>
687 typename Arc::StateId CreateSuperFinal(MutableFst<Arc> *fst) {
688  typedef typename Arc::StateId StateId;
689  typedef typename Arc::Weight Weight;
690  assert(fst != NULL);
691  StateId num_states = fst->NumStates();
692  StateId num_final = 0;
693  std::vector<StateId> final_states;
694  for (StateId s = 0; s < num_states; s++) {
695  if (fst->Final(s) != Weight::Zero()) {
696  num_final++;
697  final_states.push_back(s);
698  }
699  }
700  if (final_states.size() == 1) {
701  if (fst->Final(final_states[0]) == Weight::One()) {
702  ArcIterator<MutableFst<Arc> > iter(*fst, final_states[0]);
703  if (iter.Done()) {
704  // We already have a final state w/ no transitions out and unit weight.
705  // So we're done.
706  return final_states[0];
707  }
708  }
709  }
710 
711  StateId final_state = fst->AddState();
712  fst->SetFinal(final_state, Weight::One());
713  for (size_t idx = 0; idx < final_states.size(); idx++) {
714  StateId s = final_states[idx];
715  Weight weight = fst->Final(s);
716  fst->SetFinal(s, Weight::Zero());
717  Arc arc;
718  arc.ilabel = 0;
719  arc.olabel = 0;
720  arc.nextstate = final_state;
721  arc.weight = weight;
722  fst->AddArc(s, arc);
723  }
724  return final_state;
725 }
726 
727 
728 } // namespace fst
729 
730 #endif // KALDI_FSTEXT_PRE_DETERMINIZE_INL_H_
fst::StdArc::StateId StateId
void CopySetToVector(const std::set< T > s, std::vector< T > *v)
void PreDeterminize(MutableFst< Arc > *fst, typename Arc::Label first_new_sym, std::vector< Int > *symsOut)
For an extended explanation of the framework of which grammar-fsts are a part, please see Support for...
Definition: graph.dox:21
void CreateNewSymbols(SymbolTable *input_sym_table, int nSym, std::string prefix, std::vector< Label > *symsOut)
bool HasBannedPrefixPlusDigits(SymbolTable *symTable, std::string prefix, std::string *bad_sym)
void AddSelfLoops(MutableFst< Arc > *fst, std::vector< typename Arc::Label > &isyms, std::vector< typename Arc::Label > &osyms)
AddSelfLoops is a function you will probably want to use alongside PreDeterminize, to add self-loops to any FSTs that you compose on the left hand side of the one modified by PreDeterminize.
struct rnnlm::@11::@12 n
Arc::StateId CreateSuperFinal(MutableFst< Arc > *fst)
#define KALDI_ERR
Definition: kaldi-error.h:147
void Closure(MutableFst< Arc > *fst, std::set< typename Arc::StateId > *S, const std::vector< bool > &pVec)
fst::StdArc::Label Label
fst::StdArc::Weight Weight
std::vector< T > * InsertMember(const std::vector< T > m, std::vector< std::vector< T > *> *S)
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185
#define KALDI_VLOG(v)
Definition: kaldi-error.h:156
int64 DeleteISymbols(MutableFst< Arc > *fst, std::vector< typename Arc::Label > isyms)