determinize-lattice.h
Go to the documentation of this file.
1 // fstext/determinize-lattice.h
2 
3 // Copyright 2009-2011 Microsoft Corporation
4 
5 // See ../../COPYING for clarification regarding multiple authors
6 //
7 // Licensed under the Apache License, Version 2.0 (the "License");
8 // you may not use this file except in compliance with the License.
9 // You may obtain a copy of the License at
10 //
11 // http://www.apache.org/licenses/LICENSE-2.0
12 //
13 // THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14 // KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
15 // WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
16 // MERCHANTABLITY OR NON-INFRINGEMENT.
17 // See the Apache 2 License for the specific language governing permissions and
18 // limitations under the License.
19 
20 #ifndef KALDI_FSTEXT_DETERMINIZE_LATTICE_H_
21 #define KALDI_FSTEXT_DETERMINIZE_LATTICE_H_
22 #include <fst/fstlib.h>
23 #include <fst/fst-decl.h>
24 #include <algorithm>
25 #include <map>
26 #include <set>
27 #include <vector>
28 #include "fstext/lattice-weight.h"
29 
30 namespace fst {
31 
34 
35 
36 // For example of usage, see test-determinize-lattice.cc
37 
38 /*
39  DeterminizeLattice implements a special form of determinization
40  with epsilon removal, optimized for a phase of lattice generation.
41  Its input is an FST with weight-type BaseWeightType (usually a pair of floats,
42  with a lexicographical type of order, such as LatticeWeightTpl<float>).
43  Typically this would be a state-level lattice, with input symbols equal to
44  words, and output-symbols equal to p.d.f's (so like the inverse of HCLG). Imagine representing this as an
45  acceptor of type CompactLatticeWeightTpl<float>, in which the input/output
46  symbols are words, and the weights contain the original weights together with
47  strings (with zero or one symbol in them) containing the original output labels
48  (the p.d.f.'s). We determinize this using acceptor determinization with
49  epsilon removal. Remember (from lattice-weight.h) that
50  CompactLatticeWeightTpl has a special kind of semiring where we always take
51  the string corresponding to the best cost (of type BaseWeightType), and
52  discard the other. This corresponds to taking the best output-label sequence
53  (of p.d.f.'s) for each input-label sequence (of words). We couldn't use the
54  Gallic weight for this, or it would die as soon as it detected that the input
55  FST was non-functional. In our case, any acyclic FST (and many cyclic ones)
56  can be determinized.
57  We assume that there is a function
58  Compare(const BaseWeightType &a, const BaseWeightType &b)
59  that returns (-1, 0, 1) according to whether (a < b, a == b, a > b) in the
60  total order on the BaseWeightType... this information should be the
61  same as NaturalLess would give, but it's more efficient to do it this way.
62  You can define this for things like TropicalWeight if you need to instantiate
63  this class for that weight type.
64 
65  We implement this determinization in a special way to make it efficient for
66  the types of FSTs that we will apply it to. One issue is that if we
67  explicitly represent the strings (in CompactLatticeWeightTpl) as vectors of
68  type vector<IntType>, the algorithm takes time quadratic in the length of
69  words (in states), because propagating each arc involves copying a whole
70  vector (of integers representing p.d.f.'s). Instead we use a hash structure
71  where each string is a pointer (Entry*), and uses a hash from (Entry*,
72  IntType), to the successor string (and a way to get the latest IntType and the
73  ancestor Entry*). [this is the class LatticeStringRepository].
74 
75  Another issue is that rather than representing a determinized-state as a
76  collection of (state, weight), we represent it in a couple of reduced forms.
77  Suppose a determinized-state is a collection of (state, weight) pairs; call
78  this the "canonical representation". Note: these collections are always
79  normalized to remove any common weight and string part. Define end-states as
80  the subset of states that have an arc out of them with a label on, or are
81  final. If we represent a determinized-state a the set of just its (end-state,
82  weight) pairs, this will be a valid and more compact representation, and will
83  lead to a smaller set of determinized states (like early minimization). Call
84  this collection of (end-state, weight) pairs the "minimal representation". As
85  a mechanism to reduce compute, we can also consider another representation.
86  In the determinization algorithm, we start off with a set of (begin-state,
87  weight) pairs (where the "begin-states" are initial or have a label on the
88  transition into them), and the "canonical representation" consists of the
89  epsilon-closure of this set (i.e. follow epsilons). Call this set of
90  (begin-state, weight) pairs, appropriately normalized, the "initial
91  representation". If two initial representations are the same, the "canonical
92  representation" and hence the "minimal representation" will be the same. We
93  can use this to reduce compute. Note that if two initial representations are
94  different, this does not preclude the other representations from being the same.
95 
96 */
97 
99  float delta; // A small offset used to measure equality of weights.
100  int max_mem; // If >0, determinization will fail and return false
101  // when the algorithm's (approximate) memory consumption crosses this threshold.
102  int max_loop; // If >0, can be used to detect non-determinizable input
103  // (a case that wouldn't be caught by max_mem).
104  DeterminizeLatticeOptions(): delta(kDelta),
105  max_mem(-1),
106  max_loop(-1) { }
107 };
108 
119 template<class Weight, class IntType>
120 bool DeterminizeLattice(
121  const Fst<ArcTpl<Weight> > &ifst,
122  MutableFst<ArcTpl<Weight> > *ofst,
124  bool *debug_ptr = NULL);
125 
126 
127 /* This is a version of DeterminizeLattice with a slightly more "natural" output format,
128  where the output sequences are encoded using the CompactLatticeArcTpl template
129  (i.e. the sequences of output symbols are represented directly as strings)
130  More efficient if ifst is arc-sorted on input label.
131  If the #arcs gets more than max_arcs, it will throw std::runtime_error (otherwise
132  this code does not use exceptions). This is mainly useful for debug.
133 */
134 template<class Weight, class IntType>
135 bool DeterminizeLattice(
136  const Fst<ArcTpl<Weight> >&ifst,
137  MutableFst<ArcTpl<CompactLatticeWeightTpl<Weight, IntType> > > *ofst,
139  bool *debug_ptr = NULL);
140 
141 
142 
143 
145 
146 } // end namespace fst
147 
149 
150 #endif
For an extended explanation of the framework of which grammar-fsts are a part, please see Support for...
Definition: graph.dox:21
bool DeterminizeLattice(const Fst< ArcTpl< Weight > > &ifst, MutableFst< ArcTpl< Weight > > *ofst, DeterminizeLatticeOptions opts, bool *debug_ptr)
This function implements the normal version of DeterminizeLattice, in which the output strings are re...