Top-level tree-building functions

See Decision tree internals for context.

Classes

struct  AccumulateTreeStatsOptions
 

Functions

EventMap * BuildTree (Questions &qopts, const std::vector< std::vector< int32 > > &phone_sets, const std::vector< int32 > &phone2num_pdf_classes, const std::vector< bool > &share_roots, const std::vector< bool > &do_split, const BuildTreeStatsType &stats, BaseFloat thresh, int32 max_leaves, BaseFloat cluster_thresh, int32 P)
 BuildTree is the normal way to build a set of decision trees.
 
EventMap * BuildTreeTwoLevel (Questions &qopts, const std::vector< std::vector< int32 > > &phone_sets, const std::vector< int32 > &phone2num_pdf_classes, const std::vector< bool > &share_roots, const std::vector< bool > &do_split, const BuildTreeStatsType &stats, int32 max_leaves_first, int32 max_leaves_second, bool cluster_leaves, int32 P, std::vector< int32 > *leaf_map)
 BuildTreeTwoLevel builds a two-level tree, useful for example in building tied mixture systems with multiple codebooks.
 
void GenRandStats (int32 dim, int32 num_stats, int32 N, int32 P, const std::vector< int32 > &phone_ids, const std::vector< int32 > &hmm_lengths, const std::vector< bool > &is_ctx_dep, bool ensure_all_phones_covered, BuildTreeStatsType *stats_out)
 GenRandStats generates random statistics of the form used by BuildTree.
 
void ReadSymbolTableAsIntegers (std::string filename, bool include_eps, std::vector< int32 > *syms)
 Reads a symbol table as integers; included here because it's used in some tree-building calling code.
 
void AutomaticallyObtainQuestions (BuildTreeStatsType &stats, const std::vector< std::vector< int32 > > &phone_sets_in, const std::vector< int32 > &all_pdf_classes_in, int32 P, std::vector< std::vector< int32 > > *questions_out)
 Outputs sets of phones that are reasonable for questions to ask in the tree-building algorithm.
 
void KMeansClusterPhones (BuildTreeStatsType &stats, const std::vector< std::vector< int32 > > &phone_sets_in, const std::vector< int32 > &all_pdf_classes_in, int32 P, int32 num_classes, std::vector< std::vector< int32 > > *sets_out)
 This function clusters the phones (or some initially specified sets of phones) into sets of phones, using a k-means algorithm.
 
void ReadRootsFile (std::istream &is, std::vector< std::vector< int32 > > *phone_sets, std::vector< bool > *is_shared_root, std::vector< bool > *is_split_root)
 Reads the roots file (throws on error).
 

Detailed Description

See Decision tree internals for context.

Function Documentation

void AutomaticallyObtainQuestions ( BuildTreeStatsType &  stats,
const std::vector< std::vector< int32 > > &  phone_sets_in,
const std::vector< int32 > &  all_pdf_classes_in,
int32  P,
std::vector< std::vector< int32 > > *  questions_out 
)

Outputs sets of phones that are reasonable for questions to ask in the tree-building algorithm.

These are obtained by tree clustering of the phones; for each node in the tree, all the leaves accessible from that node form one of the sets of phones.

Parameters
stats[in] The statistics as used for normal tree-building.
phone_sets_in[in] All the phones, pre-partitioned into sets. The output sets will be various unions of these sets. These sets will normally correspond to "real phones", in cases where the phones have stress and position markings.
all_pdf_classes_in[in] All the pdf-classes that we consider for clustering. In the normal case this is the singleton set {1}, which means that we only consider the central hmm-position of the standard 3-state HMM, for clustering purposes.
P[in] The central position in the phone context window; normally 1 for triphone systems.
questions_out[out] The questions (sets of phones) are output to here.
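
The idea that every tree node contributes one question (the union of the phones at the leaves below it) can be sketched with standard-library code alone. This is an illustrative sketch, not Kaldi's ObtainSetsOfPhones: the helper name SetsFromClusterTree and the node layout (leaves numbered first, each parent index larger than its children, root last and its own parent) are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Kaldi): for each node of a cluster tree,
// collect the phones of every leaf reachable from that node.
//   leaf_phones[i] : phones assigned to leaf i (leaves are nodes 0..num_leaves-1)
//   parent[n]      : parent of node n; assumed parent[n] > n, except for the
//                    root, which is the last node and is its own parent.
std::vector<std::vector<int> > SetsFromClusterTree(
    const std::vector<std::vector<int> > &leaf_phones,
    const std::vector<int> &parent) {
  const std::size_t num_nodes = parent.size();
  assert(leaf_phones.size() <= num_nodes);
  std::vector<std::vector<int> > sets(num_nodes);
  for (std::size_t i = 0; i < leaf_phones.size(); i++)
    sets[i] = leaf_phones[i];
  // Children have smaller indices than their parents, so a single forward
  // pass pushes each node's already-complete set up into its parent.
  for (std::size_t n = 0; n + 1 < num_nodes; n++) {
    assert(parent[n] > static_cast<int>(n));
    std::vector<int> &up = sets[parent[n]];
    up.insert(up.end(), sets[n].begin(), sets[n].end());
  }
  // Sort each set so the output is in a canonical order.
  for (std::size_t n = 0; n < num_nodes; n++)
    std::sort(sets[n].begin(), sets[n].end());
  return sets;
}
```

With three leaf sets {1, 2}, {3}, {4} and parents {3, 3, 4, 4, 4}, node 3 yields the question {1, 2, 3} and the root yields all four phones.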

Definition at line 541 of file build-tree.cc.

References kaldi::DeletePointers(), kaldi::EnsureClusterableVectorNotNull(), kaldi::FilterStatsByKey(), rnnlm::i, kaldi::IsSortedAndUniq(), rnnlm::j, KALDI_ASSERT, KALDI_ERR, KALDI_WARN, TreeClusterOptions::kmeans_cfg, kaldi::kPdfClass, ClusterKMeansOptions::num_tries, kaldi::ObtainSetsOfPhones(), kaldi::SortAndUniq(), kaldi::SplitStatsByKey(), kaldi::SumStatsVec(), and kaldi::TreeCluster().

Referenced by main().

{
  std::vector<std::vector<int32> > phone_sets(phone_sets_in);
  std::vector<int32> phones;
  for (size_t i = 0; i < phone_sets.size(); i++) {
    std::sort(phone_sets[i].begin(), phone_sets[i].end());
    if (phone_sets[i].empty())
      KALDI_ERR << "Empty phone set in AutomaticallyObtainQuestions";
    if (!IsSortedAndUniq(phone_sets[i]))
      KALDI_ERR << "Phone set in AutomaticallyObtainQuestions contains duplicate phones";
    for (size_t j = 0; j < phone_sets[i].size(); j++)
      phones.push_back(phone_sets[i][j]);
  }
  std::sort(phones.begin(), phones.end());
  if (!IsSortedAndUniq(phones))
    KALDI_ERR << "Phones are present in more than one phone set.";
  if (phones.empty())
    KALDI_ERR << "No phones provided.";

  std::vector<int32> all_pdf_classes(all_pdf_classes_in);
  SortAndUniq(&all_pdf_classes);
  KALDI_ASSERT(!all_pdf_classes.empty());

  BuildTreeStatsType retained_stats;
  FilterStatsByKey(stats, kPdfClass, all_pdf_classes,
                   true,  // retain only the listed positions
                   &retained_stats);

  if (retained_stats.size() * 10 < stats.size()) {
    std::ostringstream ss;
    for (size_t i = 0; i < all_pdf_classes.size(); i++)
      ss << all_pdf_classes[i] << ' ';
    KALDI_WARN << "After filtering the tree statistics to retain only stats where "
               << "pdf-class is in the set { " << ss.str() << "}, most of your "
               << "stats disappeared: the size changed from " << stats.size()
               << " to " << retained_stats.size() << ". You might be using "
               << "a nonstandard topology but forgot to modify the "
               << "--pdf-class-list option (it defaults to { 1 } which is "
               << "the central state in a 3-state left-to-right topology)."
               << " E.g. a 1-state HMM topology would require the option "
               << "--pdf-class-list=0.";
  }

  std::vector<BuildTreeStatsType> split_stats;  // split by phone.
  SplitStatsByKey(retained_stats, P, &split_stats);

  std::vector<Clusterable*> summed_stats;  // summed up by phone.
  SumStatsVec(split_stats, &summed_stats);

  int32 max_phone = phones.back();
  if (static_cast<int32>(summed_stats.size()) < max_phone+1) {
    // this can happen if the last phone had no data.. if we are using
    // stress-marked, position-marked phones, this can happen. The later
    // code will assume that a summed_stats entry exists for all phones.
    summed_stats.resize(max_phone+1, NULL);
  }

  for (int32 i = 0; static_cast<size_t>(i) < summed_stats.size(); i++) {  // A check.
    if (summed_stats[i] != NULL &&
        !binary_search(phones.begin(), phones.end(), i)) {
      KALDI_WARN << "Phone "<< i << " is present in stats but is not in phone list [make sure you intended this].";
    }
  }

  EnsureClusterableVectorNotNull(&summed_stats);  // make sure no NULL pointers in summed_stats.
  // will replace them with pointers to empty stats.

  std::vector<Clusterable*> summed_stats_per_set(phone_sets.size(), NULL);  // summed up by set.
  for (size_t i = 0; i < phone_sets.size(); i++) {
    const std::vector<int32> &this_set = phone_sets[i];
    summed_stats_per_set[i] = summed_stats[this_set[0]]->Copy();
    for (size_t j = 1; j < this_set.size(); j++)
      summed_stats_per_set[i]->Add(*(summed_stats[this_set[j]]));
  }

  int32 num_no_data = 0;
  for (size_t i = 0; i < summed_stats_per_set.size(); i++) {  // A check.
    if (summed_stats_per_set[i]->Normalizer() == 0.0) {
      num_no_data++;
      std::ostringstream ss;
      ss << "AutomaticallyObtainQuestions: no stats available for phone set: ";
      for (size_t j = 0; j < phone_sets[i].size(); j++)
        ss << phone_sets[i][j] << ' ' ;
      KALDI_WARN << ss.str();
    }
  }
  if (num_no_data + 1 >= summed_stats_per_set.size()) {
    std::ostringstream ss;
    for (size_t i = 0; i < all_pdf_classes.size(); i++)
      ss << all_pdf_classes[i] << ' ';
    KALDI_WARN << "All or all but one of your classes of phones had no data. "
               << "Note that we only consider data where pdf-class is in the "
               << "set ( " << ss.str() << "). If you have an unusual HMM "
               << "topology this may not be what you want; use the "
               << "--pdf-class-list option to change this if needed. See "
               << "also any warnings above.";
  }

  TreeClusterOptions topts;
  topts.kmeans_cfg.num_tries = 10;  // This is a slow-but-accurate setting,
  // we do it this way since there are typically few phones.

  std::vector<int32> assignments;  // assignment of phones to clusters. dim == summed_stats.size().
  std::vector<int32> clust_assignments;  // Parent of each cluster. Dim == #clusters.
  int32 num_leaves;  // number of leaf-level clusters.
  TreeCluster(summed_stats_per_set,
              summed_stats_per_set.size(),  // max-#clust is all of the points.
              NULL,  // don't need the clusters out.
              &assignments,
              &clust_assignments,
              &num_leaves,
              topts);

  // process the information obtained by TreeCluster into the
  // form we want at output.
  ObtainSetsOfPhones(phone_sets,
                     assignments,
                     clust_assignments,
                     num_leaves,
                     questions_out);

  // The memory in summed_stats was newly allocated. [the other algorithms
  // used here do not allocate].
  DeletePointers(&summed_stats);
  DeletePointers(&summed_stats_per_set);
}
EventMap * BuildTree ( Questions &  qopts,
const std::vector< std::vector< int32 > > &  phone_sets,
const std::vector< int32 > &  phone2num_pdf_classes,
const std::vector< bool > &  share_roots,
const std::vector< bool > &  do_split,
const BuildTreeStatsType &  stats,
BaseFloat  thresh,
int32  max_leaves,
BaseFloat  cluster_thresh,
int32  P 
)

BuildTree is the normal way to build a set of decision trees.

The sets "phone_sets" dictate how we set up the roots of the decision trees. Each set of phones phone_sets[i] has shared decision-tree roots, and if the corresponding variable share_roots[i] is true, the root will also be shared across the different HMM-positions in the phone. All phones in "phone_sets" should be in the stats (use FixUnseenPhones to ensure this). If do_split[i] is false for any i, we will not do any tree splitting for phones in that set.

Parameters
qopts[in] Questions options class, contains questions for each key (e.g. each phone position)
phone_sets[in] Each element of phone_sets is a set of phones whose roots are shared together (prior to decision-tree splitting).
phone2num_pdf_classes[in] A map from phones to the number of pdf-classes in the phone (this info is derived from the HmmTopology object)
share_roots[in] A vector the same size as phone_sets; says for each phone set whether the root should be shared among all the pdf-classes or not.
do_split[in] A vector the same size as phone_sets; says for each phone set whether decision-tree splitting should be done (generally true for non-silence phones).
stats[in] The statistics used in tree-building.
thresh[in] Threshold used in decision-tree splitting (e.g. 1000), or you may use 0 in which case max_leaves becomes the constraint.
max_leaves[in] Maximum number of leaves it will create; set this to a large number if you want to just specify "thresh".
cluster_thresh[in] Threshold for clustering leaves after decision-tree splitting (only within each phone-set); leaves will be combined if log-likelihood change is less than this. A value about equal to "thresh" is suitable if thresh != 0; otherwise, zero will mean no clustering is done, or a negative value (e.g. -1) sets it to the smallest likelihood change seen during the splitting algorithm; this typically causes about a 20% reduction in the number of leaves.
P[in] The central position of the phone context window, e.g. 1 for a triphone system.
Returns
Returns a pointer to an EventMap object that is the tree.
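
The three regimes of cluster_thresh described above (negative, zero, positive) can be summarized in a few lines. This is a minimal sketch that mirrors the checks in the listing below; the struct ClusterPlan and the function ResolveClusterThresh are hypothetical names introduced for illustration and are not part of Kaldi.

```cpp
#include <cassert>

// Hypothetical summary of the post-splitting clustering decision.
struct ClusterPlan {
  bool do_cluster;  // whether leaves are clustered after splitting
  float thresh;     // likelihood-change threshold actually used
};

// cluster_thresh : value passed by the caller.
// smallest_split : smallest likelihood change seen during tree splitting.
ClusterPlan ResolveClusterThresh(float cluster_thresh, float smallest_split) {
  // A negative value means "use the smallest split seen while splitting".
  if (cluster_thresh < 0.0f)
    cluster_thresh = smallest_split;
  ClusterPlan plan;
  // Exactly zero disables clustering altogether.
  plan.do_cluster = (cluster_thresh != 0.0f);
  plan.thresh = cluster_thresh;
  return plan;
}
```

For instance, passing -1 together with a smallest split of 250 clusters with threshold 250, while passing 0 skips clustering regardless of the smallest split.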

Definition at line 135 of file build-tree.cc.

References kaldi::ClusterEventMapRestrictedByMap(), kaldi::FilterStatsByKey(), kaldi::GetStubMap(), rnnlm::i, kaldi::IsSortedAndUniq(), KALDI_ASSERT, KALDI_LOG, KALDI_VLOG, kaldi::ObjfGivenMap(), kaldi::RenumberEventMap(), kaldi::SplitDecisionTree(), and kaldi::SumNormalizer().

Referenced by kaldi::BuildTreeTwoLevel(), kaldi::GenRandContextDependency(), kaldi::GenRandContextDependencyLarge(), main(), and kaldi::TestBuildTree().

{
  KALDI_ASSERT(thresh > 0 || max_leaves > 0);
  KALDI_ASSERT(stats.size() != 0);
  KALDI_ASSERT(!phone_sets.empty()
               && phone_sets.size() == share_roots.size()
               && do_split.size() == phone_sets.size());

  // the inputs will be further checked in GetStubMap.
  int32 num_leaves = 0;  // allocator for leaves.

  EventMap *tree_stub = GetStubMap(P,
                                   phone_sets,
                                   phone2num_pdf_classes,
                                   share_roots,
                                   &num_leaves);
  KALDI_LOG << "BuildTree: before building trees, map has "<< num_leaves << " leaves.";

  BaseFloat impr;
  BaseFloat smallest_split = 1.0e+10;

  std::vector<int32> nonsplit_phones;
  for (size_t i = 0; i < phone_sets.size(); i++)
    if (!do_split[i])
      nonsplit_phones.insert(nonsplit_phones.end(), phone_sets[i].begin(), phone_sets[i].end());

  std::sort(nonsplit_phones.begin(), nonsplit_phones.end());

  KALDI_ASSERT(IsSortedAndUniq(nonsplit_phones));
  BuildTreeStatsType filtered_stats;
  FilterStatsByKey(stats, P, nonsplit_phones, false,  // retain only those not
                                                      // in "nonsplit_phones"
                   &filtered_stats);

  EventMap *tree_split = SplitDecisionTree(*tree_stub,
                                           filtered_stats,
                                           qopts, thresh, max_leaves,
                                           &num_leaves, &impr, &smallest_split);

  if (cluster_thresh < 0.0) {
    KALDI_LOG << "Setting clustering threshold to smallest split " << smallest_split;
    cluster_thresh = smallest_split;
  }

  BaseFloat normalizer = SumNormalizer(stats),
      impr_normalized = impr / normalizer,
      normalizer_filt = SumNormalizer(filtered_stats),
      impr_normalized_filt = impr / normalizer_filt;

  KALDI_VLOG(1) << "After decision tree split, num-leaves = " << num_leaves
                << ", like-impr = " << impr_normalized << " per frame over "
                << normalizer << " frames.";

  KALDI_VLOG(1) << "Including just phones that were split, improvement is "
                << impr_normalized_filt << " per frame over "
                << normalizer_filt << " frames.";

  if (cluster_thresh != 0.0) {  // Cluster the tree.
    BaseFloat objf_before_cluster = ObjfGivenMap(stats, *tree_split);

    // Now do the clustering.
    int32 num_removed = 0;
    EventMap *tree_clustered = ClusterEventMapRestrictedByMap(*tree_split,
                                                              stats,
                                                              cluster_thresh,
                                                              *tree_stub,
                                                              &num_removed);
    KALDI_LOG << "BuildTree: removed "<< num_removed << " leaves.";

    int32 num_leaves = 0;
    EventMap *tree_renumbered = RenumberEventMap(*tree_clustered, &num_leaves);

    BaseFloat objf_after_cluster = ObjfGivenMap(stats, *tree_renumbered);

    KALDI_VLOG(1) << "Objf change due to clustering "
                  << ((objf_after_cluster-objf_before_cluster) / normalizer)
                  << " per frame.";
    KALDI_VLOG(1) << "Normalizing over only split phones, this is: "
                  << ((objf_after_cluster-objf_before_cluster) / normalizer_filt)
                  << " per frame.";
    KALDI_VLOG(1) << "Num-leaves is now "<< num_leaves;

    delete tree_clustered;
    delete tree_split;
    delete tree_stub;
    return tree_renumbered;
  } else {
    delete tree_stub;
    return tree_split;
  }
}
EventMap * BuildTreeTwoLevel ( Questions &  qopts,
const std::vector< std::vector< int32 > > &  phone_sets,
const std::vector< int32 > &  phone2num_pdf_classes,
const std::vector< bool > &  share_roots,
const std::vector< bool > &  do_split,
const BuildTreeStatsType &  stats,
int32  max_leaves_first,
int32  max_leaves_second,
bool  cluster_leaves,
int32  P,
std::vector< int32 > *  leaf_map 
)

BuildTreeTwoLevel builds a two-level tree, useful for example in building tied mixture systems with multiple codebooks.

It first builds a small tree by splitting until it has at most "max_leaves_first" leaves. It then splits further at the leaves of that first tree (think of this as creating multiple little trees at the leaves of the first tree), until the total number of leaves reaches "max_leaves_second". It then outputs the second tree, along with a mapping from the leaf-ids of the second tree to the leaf-ids of the first tree. Note that the interface is similar to BuildTree, and in fact it calls BuildTree internally.

The sets "phone_sets" dictate how we set up the roots of the decision trees. Each set of phones phone_sets[i] has shared decision-tree roots, and if the corresponding variable share_roots[i] is true, the root will also be shared across the different HMM-positions in the phone. All phones in "phone_sets" should be in the stats (use FixUnseenPhones to ensure this). If do_split[i] is false for any i, we will not do any tree splitting for phones in that set.

Parameters
qopts[in] Questions options class, contains questions for each key (e.g. each phone position)
phone_sets[in] Each element of phone_sets is a set of phones whose roots are shared together (prior to decision-tree splitting).
phone2num_pdf_classes[in] A map from phones to the number of pdf-classes in the phone (this info is derived from the HmmTopology object)
share_roots[in] A vector the same size as phone_sets; says for each phone set whether the root should be shared among all the pdf-classes or not.
do_split[in] A vector the same size as phone_sets; says for each phone set whether decision-tree splitting should be done (generally true for non-silence phones).
stats[in] The statistics used in tree-building.
max_leaves_first[in] Maximum number of leaves it will create in first level of decision tree.
max_leaves_second[in] Maximum number of leaves it will create in second level of decision tree. Must be > max_leaves_first.
cluster_leaves[in] Boolean value; if true, we post-cluster the leaves produced in the second level of decision-tree split; if false, we don't. The threshold for post-clustering is the log-like change of the last decision-tree split; this typically causes about a 20% reduction in the number of leaves.
P[in] The central position of the phone context window, e.g. 1 for a triphone system.
leaf_map[out] Will be set to be a mapping from the leaves of the "big" tree to the leaves of the "little" tree, which you can view as cluster centers.
Returns
Returns a pointer to an EventMap object that is the (big) tree.
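
The final renumbering step, which makes big-tree leaves that share a first-level leaf contiguous, can be sketched in isolation. The sketch below follows the same sort-by-pair approach as the listing further down but uses only the standard library; the function name MakeLeavesContiguous is hypothetical, and plain int stands in for int32.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-alone version of the renumbering step.
// On input, (*leaf_map)[b] is the first-level ("little" tree) leaf that
// big-tree leaf b belongs to. On output, *leaf_map is indexed by the new
// big-tree leaf ids. The return value maps old big-tree ids to new ones.
std::vector<int> MakeLeavesContiguous(std::vector<int> *leaf_map) {
  assert(leaf_map != NULL);
  // Pairs of (little-tree id, big-tree id); sorting groups by little-tree id.
  std::vector<std::pair<int, int> > leaf_pairs;
  for (std::size_t i = 0; i < leaf_map->size(); i++)
    leaf_pairs.push_back(std::make_pair((*leaf_map)[i], static_cast<int>(i)));
  std::sort(leaf_pairs.begin(), leaf_pairs.end());

  std::vector<int> old2new(leaf_map->size()), new_leaf_map(leaf_map->size());
  for (std::size_t i = 0; i < leaf_pairs.size(); i++) {
    int old_number = leaf_pairs[i].second, new_number = static_cast<int>(i);
    old2new[old_number] = new_number;
    new_leaf_map[new_number] = (*leaf_map)[old_number];
  }
  *leaf_map = new_leaf_map;
  return old2new;
}
```

Starting from a leaf map {1, 0, 1, 0}, the big-tree leaves are renumbered so that the map becomes {0, 0, 1, 1}; in the real function the old-to-new mapping is then applied to the tree with MapEventMapLeaves.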

Definition at line 313 of file build-tree.cc.

References kaldi::BuildTree(), kaldi::ClusterEventMapRestrictedByMap(), kaldi::ComputeTreeMapping(), kaldi::FilterStatsByKey(), rnnlm::i, kaldi::IsSortedAndUniq(), KALDI_ASSERT, KALDI_LOG, kaldi::MapEventMapLeaves(), EventMap::MaxResult(), kaldi::ObjfGivenMap(), kaldi::RenumberEventMap(), kaldi::SplitDecisionTree(), and kaldi::SumNormalizer().

Referenced by main().

{

  KALDI_LOG << "****BuildTreeTwoLevel: building first level tree";
  EventMap *first_level_tree = BuildTree(qopts, phone_sets,
                                         phone2num_pdf_classes,
                                         share_roots, do_split, stats, 0.0,
                                         max_leaves_first, 0.0, P);
  KALDI_ASSERT(first_level_tree != NULL);
  KALDI_LOG << "****BuildTreeTwoLevel: done building first level tree";

  std::vector<int32> nonsplit_phones;
  for (size_t i = 0; i < phone_sets.size(); i++)
    if (!do_split[i])
      nonsplit_phones.insert(nonsplit_phones.end(), phone_sets[i].begin(), phone_sets[i].end());
  std::sort(nonsplit_phones.begin(), nonsplit_phones.end());

  KALDI_ASSERT(IsSortedAndUniq(nonsplit_phones));
  BuildTreeStatsType filtered_stats;
  FilterStatsByKey(stats, P, nonsplit_phones, false,  // retain only those not
                                                      // in "nonsplit_phones"
                   &filtered_stats);

  int32 num_leaves = first_level_tree->MaxResult() + 1,
      old_num_leaves = num_leaves;

  BaseFloat smallest_split = 0.0;

  BaseFloat impr;
  EventMap *tree = SplitDecisionTree(*first_level_tree,
                                     filtered_stats,
                                     qopts, 0.0, max_leaves_second,
                                     &num_leaves, &impr, &smallest_split);

  KALDI_LOG << "Building second-level tree: increased #leaves from "
            << old_num_leaves << " to " << num_leaves << ", smallest split was "
            << smallest_split;

  BaseFloat normalizer = SumNormalizer(stats),
      impr_normalized = impr / normalizer;

  KALDI_LOG << "After second decision tree split, num-leaves = "
            << num_leaves << ", like-impr = " << impr_normalized
            << " per frame over " << normalizer << " frames.";

  if (cluster_leaves) {  // Cluster the leaves of the tree.
    KALDI_LOG << "Clustering leaves of larger tree.";
    BaseFloat objf_before_cluster = ObjfGivenMap(stats, *tree);

    // Now do the clustering.
    int32 num_removed = 0;
    EventMap *tree_clustered = ClusterEventMapRestrictedByMap(*tree,
                                                              stats,
                                                              smallest_split,
                                                              *first_level_tree,
                                                              &num_removed);
    KALDI_LOG << "BuildTreeTwoLevel: removed " << num_removed << " leaves.";

    int32 num_leaves = 0;
    EventMap *tree_renumbered = RenumberEventMap(*tree_clustered, &num_leaves);

    BaseFloat objf_after_cluster = ObjfGivenMap(stats, *tree_renumbered);

    KALDI_LOG << "Objf change due to clustering "
              << ((objf_after_cluster-objf_before_cluster) / SumNormalizer(stats))
              << " per frame.";
    KALDI_LOG << "Num-leaves now "<< num_leaves;
    delete tree;
    delete tree_clustered;
    tree = tree_renumbered;
  }

  ComputeTreeMapping(*first_level_tree,
                     *tree,
                     stats,
                     leaf_map);

  {  // Next do another renumbering of "tree" so that leaves with the
     // same value in "first_level_tree" are contiguous.
    std::vector<std::pair<int32, int32> > leaf_pairs;
    for (size_t i = 0; i < leaf_map->size(); i++)
      leaf_pairs.push_back(std::make_pair((*leaf_map)[i], static_cast<int32>(i)));
    // pair of (small-tree-number, big-tree-number).
    std::sort(leaf_pairs.begin(), leaf_pairs.end());
    std::vector<int32> old2new_map(leaf_map->size()),
        new_leaf_map(leaf_map->size());
    // Note: old2new_map maps from old indices to new indices, in the
    // renumbering; new_leaf_map maps from 2nd-level tree indices to
    // 1st-level tree indices.
    for (size_t i = 0; i < leaf_pairs.size(); i++) {
      int32 old_number = leaf_pairs[i].second, new_number = i;
      old2new_map[old_number] = new_number;
      new_leaf_map[new_number] = (*leaf_map)[old_number];
    }
    *leaf_map = new_leaf_map;
    EventMap *renumbered_tree = MapEventMapLeaves(*tree, old2new_map);
    delete tree;
    tree = renumbered_tree;
  }

  delete first_level_tree;
  return tree;
}
void GenRandStats ( int32  dim,
int32  num_stats,
int32  N,
int32  P,
const std::vector< int32 > &  phone_ids,
const std::vector< int32 > &  hmm_lengths,
const std::vector< bool > &  is_ctx_dep,
bool  ensure_all_phones_covered,
BuildTreeStatsType *  stats_out 
)

GenRandStats generates random statistics of the form used by BuildTree.

It tries to do so in such a way that they mimic "real" stats. The event keys and their corresponding values are:

  • key == -1 == kPdfClass -> pdf-class, generally corresponds to zero-based position in HMM (0, 1, 2 .. hmm_lengths[phone]-1)
  • key == 0 -> phone-id of left-most context phone.
  • key == 1 -> phone-id of one-from-left-most context phone.
  • key == P -> phone-id of central phone (P is zero-based, matching the check pos == P in the source).
  • key == N-1 -> phone-id of right-most context phone.

GenRandStats is useful only for testing, but it serves to document the format of stats used by BuildTree. If is_ctx_dep[phone] is false for the central phone, GenRandStats defines no phone keys other than key P (the central phone itself).

Parameters
dim[in] Dimension of features.
num_stats[in] Approximate number of separate phones-in-context wanted.
N[in] Context size (typically 3).
P[in] Central-phone position in zero-based numbering (typically 1).
phone_ids[in] Integer ids of phones.
hmm_lengths[in] Lengths of HMM for each phone, indexed by phone.
is_ctx_dep[in] Boolean array indexed by phone, saying whether each phone is context dependent.
ensure_all_phones_covered[in] Boolean argument: if true, GenRandStats ensures that every phone is seen at least once in the central position (P).
stats_out[out] The statistics that this routine outputs.
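
The event layout can be made concrete with a small stand-alone sketch. It follows the loop in the source listing, which records the pdf-class under key -1 and then one (position, phone-id) pair per context position, skipping non-central positions for context-independent phones. The names Event, kPdfClassKey and MakeEvent are hypothetical stand-ins for illustration (Kaldi's own types are EventType and kPdfClass).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for Kaldi's EventType: a list of (key, value) pairs.
typedef std::vector<std::pair<int, int> > Event;
// Same numeric value as kPdfClass according to the key list above.
const int kPdfClassKey = -1;

// Builds the event for one HMM position of the central phone.
//   phones_in_context : the N phone ids of the context window, left to right
//   P                 : zero-based index of the central phone
//   pdf_class         : zero-based position within the HMM
//   is_ctx_dep        : whether the central phone is context dependent
Event MakeEvent(const std::vector<int> &phones_in_context, int P,
                int pdf_class, bool is_ctx_dep) {
  assert(P >= 0 && static_cast<std::size_t>(P) < phones_in_context.size());
  Event event;
  event.push_back(std::make_pair(kPdfClassKey, pdf_class));
  for (std::size_t pos = 0; pos < phones_in_context.size(); pos++) {
    // Context-independent phones (e.g. silence) only record the central key.
    if (pos == static_cast<std::size_t>(P) || is_ctx_dep)
      event.push_back(std::make_pair(static_cast<int>(pos),
                                     phones_in_context[pos]));
  }
  return event;
}
```

For a triphone window (10, 20, 30) with P = 1, a context-dependent central phone produces four pairs (keys -1, 0, 1, 2), while a context-independent one produces only the pairs for keys -1 and 1.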

Definition at line 29 of file build-tree.cc.

References GaussClusterable::AddStats(), VectorBase< Real >::AddVec(), kaldi::CopyMapToVector(), count, rnnlm::d, rnnlm::i, rnnlm::j, KALDI_ASSERT, kaldi::kPdfClass, kaldi::Rand(), kaldi::RandGauss(), kaldi::RandUniform(), MatrixBase< Real >::Row(), VectorBase< Real >::Scale(), kaldi::SortAndUniq(), and VectorBase< Real >::Sum().

Referenced by kaldi::GenRandContextDependency(), kaldi::GenRandContextDependencyLarge(), kaldi::TestBuildTree(), and kaldi::TestGenRandStats().

{

  KALDI_ASSERT(dim > 0);
  KALDI_ASSERT(num_stats > 0);
  KALDI_ASSERT(N > 0);
  KALDI_ASSERT(P < N);
  KALDI_ASSERT(phone_ids.size() != 0);
  KALDI_ASSERT(stats_out != NULL && stats_out->empty());
  int32 max_phone = *std::max_element(phone_ids.begin(), phone_ids.end());
  KALDI_ASSERT(phone2hmm_length.size() >= static_cast<size_t>(1 + max_phone));
  KALDI_ASSERT(is_ctx_dep.size() >= static_cast<size_t>(1 + max_phone));

  // Make sure phone id's distinct.
  {
    std::vector<int32> tmp(phone_ids);
    SortAndUniq(&tmp);
    KALDI_ASSERT(tmp.size() == phone_ids.size());
  }
  size_t num_phones = phone_ids.size();

  // Decide on an underlying "mean" for phones...
  Matrix<BaseFloat> phone_vecs(max_phone+1, dim);
  for (int32 i = 0;i < max_phone+1;i++)
    for (int32 j = 0;j < dim;j++) phone_vecs(i, j) = RandGauss() * (2.0 / (j+1));


  std::map<EventType, Clusterable*> stats_tmp;

  std::vector<bool> covered(1 + max_phone, false);

  bool all_covered = false;
  for (int32 i = 0;i < num_stats || (ensure_all_phones_covered && !all_covered);i++) {
    // decide randomly on a phone-in-context.
    std::vector<int32> phone_vec(N);
    for (size_t i = 0;i < (size_t)N;i++) phone_vec[i] = phone_ids[(Rand() % num_phones)];

    int32 hmm_length = phone2hmm_length[phone_vec[P]];
    KALDI_ASSERT(hmm_length > 0);
    covered[phone_vec[P]] = true;

    // For each position [in the central phone]...
    for (int32 j = 0; j < hmm_length; j++) {
      // create event vector.
      EventType event_vec;
      event_vec.push_back(std::make_pair(kPdfClass, (EventValueType)j)); // record the position.
      for (size_t pos = 0; pos < (size_t)N; pos++) {
        if (pos == (size_t)(P) || is_ctx_dep[phone_vec[P]])
          event_vec.push_back(std::make_pair((EventKeyType)pos, (EventValueType)phone_vec[pos]));
        // The if-statement above ensures we do not record the context of "context-free"
        // phone (e.g., silence).
      }

      Vector<BaseFloat> mean(dim); // mean of Gaussian.
      GaussClusterable *this_stats = new GaussClusterable(dim, 0.1); // 0.1 is var floor.
      { // compute stats; this block attempts to simulate the process of "real" data
        // collection and does not correspond to any code you would write in a real
        // scenario.
        Vector<BaseFloat> weights(N); // weight of each component.
        for (int32 k = 0; k < N; k++) {
          BaseFloat k_pos = (N - 0.5 - k) / N; // between 0 and 1, less for lower k...
          BaseFloat j_pos = (hmm_length - 0.5 - j) / hmm_length;
          // j_pos is between 0 and 1, less for lower j.

          BaseFloat weight = j_pos*k_pos + (1.0-j_pos)*(1.0-k_pos);
          // if j_pos close to zero, gives larger weight to k_pos close
          // to zero.
          if (k == P) weight += 1.0;
          weights(k) = weight;
        }
        KALDI_ASSERT(weights.Sum() != 0);
        weights.Scale(1.0 / weights.Sum());
        for (int32 k = 0; k < N; k++)
          mean.AddVec(weights(k), phone_vecs.Row(phone_vec[k]));
        BaseFloat count;
        if (Rand() % 2 == 0) count = 1000.0 * RandUniform();
        else count = 100.0 * RandUniform();

        int32 num_samples = 10;
        for (size_t p = 0;p < (size_t)num_samples; p++) {
          Vector<BaseFloat> sample(mean); // copy mean.
          for (size_t d = 0; d < (size_t)dim; d++) sample(d) += RandGauss(); // unit var.
          this_stats->AddStats(sample, count / num_samples);
        }
      }

      if (stats_tmp.count(event_vec) != 0) {
        stats_tmp[event_vec]->Add(*this_stats);
        delete this_stats;
      } else {
        stats_tmp[event_vec] = this_stats;
      }
    }
    all_covered = true;
    for (size_t i = 0; i< num_phones; i++) if (!covered[phone_ids[i]]) all_covered = false;
  }
  CopyMapToVector(stats_tmp, stats_out);
  KALDI_ASSERT(stats_out->size() > 0);
}
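The weighting step in the listing above decides how much each phone in the window contributes to the simulated Gaussian mean. The following standalone sketch isolates that arithmetic so it can be inspected on its own. ContextWeights is a hypothetical name, and plain float and std::vector stand in for BaseFloat and Vector<BaseFloat>.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: returns the normalized mixing weights over the N
// window positions for HMM position j (zero-based) out of hmm_length.
std::vector<float> ContextWeights(int N, int P, int hmm_length, int j) {
  assert(N > 0 && P >= 0 && P < N);
  assert(hmm_length > 0 && j >= 0 && j < hmm_length);
  std::vector<float> weights(N);
  float sum = 0.0f;
  // j_pos lies strictly between 0 and 1 and decreases as j grows.
  const float j_pos = (hmm_length - 0.5f - j) / hmm_length;
  for (int k = 0; k < N; k++) {
    // k_pos lies strictly between 0 and 1 and decreases as k grows.
    const float k_pos = (N - 0.5f - k) / N;
    float w = j_pos * k_pos + (1.0f - j_pos) * (1.0f - k_pos);
    if (k == P) w += 1.0f;  // the central phone always dominates
    weights[k] = w;
    sum += w;
  }
  for (int k = 0; k < N; k++) weights[k] /= sum;  // weights now sum to 1
  return weights;
}
```

For N = 3, P = 1 and a three-state HMM, this gives the left context more weight than the right context at the first state and the reverse at the last state, which is one plausible reading of what "mimic real stats" means here.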
void KMeansClusterPhones ( BuildTreeStatsType &  stats,
const std::vector< std::vector< int32 > > &  phone_sets_in,
const std::vector< int32 > &  all_pdf_classes_in,
int32  P,
int32  num_classes,
std::vector< std::vector< int32 > > *  sets_out 
)

This function clusters the phones (or some initially specified sets of phones) into sets of phones, using a k-means algorithm.

Useful, for example, in building simple models for purposes of adaptation.

Definition at line 674 of file build-tree.cc.

References kaldi::ClusterKMeans(), count, kaldi::DeletePointers(), kaldi::EnsureClusterableVectorNotNull(), kaldi::FilterStatsByKey(), rnnlm::i, kaldi::IsSortedAndUniq(), rnnlm::j, KALDI_ASSERT, KALDI_ERR, KALDI_LOG, KALDI_WARN, kaldi::kPdfClass, kaldi::SortAndUniq(), kaldi::SplitStatsByKey(), kaldi::SumClusterableNormalizer(), and kaldi::SumStatsVec().

Referenced by main().

{
  std::vector<std::vector<int32> > phone_sets(phone_sets_in);
  std::vector<int32> phones;
  for (size_t i = 0; i < phone_sets.size() ;i++) {
    std::sort(phone_sets[i].begin(), phone_sets[i].end());
    if (phone_sets[i].empty())
      KALDI_ERR << "Empty phone set in AutomaticallyObtainQuestions";
    if (!IsSortedAndUniq(phone_sets[i]))
      KALDI_ERR << "Phone set in AutomaticallyObtainQuestions contains duplicate phones";
    for (size_t j = 0; j < phone_sets[i].size(); j++)
      phones.push_back(phone_sets[i][j]);
  }
  std::sort(phones.begin(), phones.end());
  if (!IsSortedAndUniq(phones))
    KALDI_ERR << "Phones are present in more than one phone set.";
  if (phones.empty())
    KALDI_ERR << "No phones provided.";

  std::vector<int32> all_pdf_classes(all_pdf_classes_in);
  SortAndUniq(&all_pdf_classes);
  KALDI_ASSERT(!all_pdf_classes.empty());

  BuildTreeStatsType retained_stats;
  FilterStatsByKey(stats, kPdfClass, all_pdf_classes,
                   true, // retain only the listed positions
                   &retained_stats);


  std::vector<BuildTreeStatsType> split_stats; // split by phone.
  SplitStatsByKey(retained_stats, P, &split_stats);

  std::vector<Clusterable*> summed_stats; // summed up by phone.
  SumStatsVec(split_stats, &summed_stats);

  int32 max_phone = phones.back();
  if (static_cast<int32>(summed_stats.size()) < max_phone+1) {
    // this can happen if the last phone had no data.. if we are using
    // stress-marked, position-marked phones, this can happen. The later
    // code will assume that a summed_stats entry exists for all phones.
    summed_stats.resize(max_phone+1, NULL);
  }

  for (int32 i = 0; static_cast<size_t>(i) < summed_stats.size(); i++) {
    // just a check.
    if (summed_stats[i] != NULL &&
        !binary_search(phones.begin(), phones.end(), i)) {
      KALDI_WARN << "Phone "<< i << " is present in stats but is not in phone list [make sure you intended this].";
    }
  }

  EnsureClusterableVectorNotNull(&summed_stats); // make sure no NULL pointers in summed_stats.
  // will replace them with pointers to empty stats.

  std::vector<Clusterable*> summed_stats_per_set(phone_sets.size(), NULL); // summed up by set.
  for (size_t i = 0; i < phone_sets.size(); i++) {
    const std::vector<int32> &this_set = phone_sets[i];
    summed_stats_per_set[i] = summed_stats[this_set[0]]->Copy();
    for (size_t j = 1; j < this_set.size(); j++)
      summed_stats_per_set[i]->Add(*(summed_stats[this_set[j]]));
  }

  for (size_t i = 0; i < summed_stats_per_set.size(); i++) { // A check.
    if (summed_stats_per_set[i]->Normalizer() == 0.0) {
      std::ostringstream ss;
      ss << "AutomaticallyObtainQuestions: no stats available for phone set: ";
      for (size_t j = 0; j < phone_sets[i].size(); j++)
        ss << phone_sets[i][j] << ' ' ;
      KALDI_WARN << ss.str();
    }
  }

  ClusterKMeansOptions opts; // Just using the default options which are a reasonable
  // compromise between speed and accuracy.

  std::vector<int32> assignments;
  BaseFloat objf_impr = ClusterKMeans(summed_stats_per_set,
                                      num_classes,
                                      NULL,
                                      &assignments,
                                      opts);

  BaseFloat count = SumClusterableNormalizer(summed_stats_per_set);

  KALDI_LOG << "ClusterKMeans: objf change from clustering [versus single set] is "
            << (objf_impr/count) << " over " << count << " frames.";

  sets_out->resize(num_classes);
  KALDI_ASSERT(assignments.size() == phone_sets.size());
  for (size_t i = 0; i < assignments.size(); i++) {
    int32 class_idx = assignments[i];
    KALDI_ASSERT(static_cast<size_t>(class_idx) < sets_out->size());
    for (size_t j = 0; j < phone_sets[i].size(); j++)
      (*sets_out)[class_idx].push_back(phone_sets[i][j]);
  }
  for (size_t i = 0; i < sets_out->size(); i++) {
    std::sort( (*sets_out)[i].begin(), (*sets_out)[i].end() ); // just good
    // practice to have them sorted as who knows if whatever we need them for
    // will require sorting...
    KALDI_ASSERT(IsSortedAndUniq( (*sets_out)[i] ));
  }
  DeletePointers(&summed_stats);
  DeletePointers(&summed_stats_per_set);
}
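The final stage of the listing above turns the per-set cluster assignments into the output phone sets. A minimal standalone sketch of that regrouping step follows. It assumes the assignments are already available (here they are passed in rather than computed by k-means), and GroupSetsByAssignment is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: merges the input phone sets into num_classes output
// sets according to assignments[i] (the class of input set i), then sorts
// each output set.
std::vector<std::vector<int32_t> > GroupSetsByAssignment(
    const std::vector<std::vector<int32_t> > &phone_sets,
    const std::vector<int32_t> &assignments, int32_t num_classes) {
  assert(num_classes > 0);
  assert(assignments.size() == phone_sets.size());
  std::vector<std::vector<int32_t> > sets_out(num_classes);
  for (std::size_t i = 0; i < assignments.size(); i++) {
    assert(assignments[i] >= 0 && assignments[i] < num_classes);
    std::vector<int32_t> &dest = sets_out[assignments[i]];
    dest.insert(dest.end(), phone_sets[i].begin(), phone_sets[i].end());
  }
  for (std::size_t c = 0; c < sets_out.size(); c++)
    std::sort(sets_out[c].begin(), sets_out[c].end());
  return sets_out;
}
```

Because the input sets are disjoint (the function checks this up front), the merged sets contain no duplicates, which is why the listing can assert IsSortedAndUniq after sorting.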
void ReadRootsFile ( std::istream &  is,
std::vector< std::vector< int32 > > *  phone_sets,
std::vector< bool > *  is_shared_root,
std::vector< bool > *  is_split_root 
)

Reads the roots file (throws on error).

Format is lines like: "shared split 1 2 3 4", "not-shared not-split 5", and so on. The numbers are indexes of phones.

Definition at line 783 of file build-tree.cc.

References rnnlm::i, kaldi::IsSortedAndUniq(), KALDI_ASSERT, KALDI_ERR, and line_number.

Referenced by main().

{
  KALDI_ASSERT(phone_sets != NULL && is_shared_root != NULL &&
               is_split_root != NULL && phone_sets->empty()
               && is_shared_root->empty() && is_split_root->empty());

  std::string line;
  int line_number = 0;
  while ( ! getline(is, line).fail() ) {
    line_number++;
    std::istringstream ss(line);
    std::string shared;
    ss >> shared;
    if (ss.fail() && shared != "shared" && shared != "not-shared")
      KALDI_ERR << "Bad line in roots file: line "<< line_number << ": " << line;
    is_shared_root->push_back(shared == "shared");

    std::string split;
    ss >> split;
    if (ss.fail() && shared != "split" && shared != "not-split")
      KALDI_ERR << "Bad line in roots file: line "<< line_number << ": " << line;
    is_split_root->push_back(split == "split");

    phone_sets->push_back(std::vector<int32>());
    int32 i;
    while ( !(ss >> i).fail() ) {
      phone_sets->back().push_back(i);
    }
    std::sort(phone_sets->back().begin(), phone_sets->back().end());
    if (!IsSortedAndUniq(phone_sets->back()) || phone_sets->back().empty()
        || phone_sets->back().front() <= 0)
      KALDI_ERR << "Bad line in roots file [empty, or contains non-positive "
                << " or duplicate phone-ids]: line " << line_number << ": "
                << line;
  }
  if (phone_sets->empty())
    KALDI_ERR << "Empty roots file ";
}
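As written, the two keyword checks in the listing above fire only when the stream extraction itself fails, so an unrecognized keyword is silently read as "not-shared" or "not-split". The standalone sketch below parses a single roots-file line with stricter checks. It is an illustration rather than Kaldi's code: RootsLine and ParseRootsLine are hypothetical names, and it returns false instead of throwing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for one parsed line of a roots file.
struct RootsLine {
  bool shared;
  bool split;
  std::vector<int32_t> phones;
};

// Parses one line such as "shared split 1 2 3 4". Returns false on a
// malformed line. Unlike the listing, it also rejects unknown keywords
// and trailing non-integer tokens.
bool ParseRootsLine(const std::string &line, RootsLine *out) {
  assert(out != nullptr);
  std::istringstream ss(line);
  std::string shared, split;
  if (!(ss >> shared >> split)) return false;
  if (shared != "shared" && shared != "not-shared") return false;
  if (split != "split" && split != "not-split") return false;
  out->shared = (shared == "shared");
  out->split = (split == "split");
  out->phones.clear();
  int32_t phone;
  while (ss >> phone) out->phones.push_back(phone);
  if (!ss.eof()) return false;  // stopped on something that is not an integer
  std::sort(out->phones.begin(), out->phones.end());
  // Require a non-empty list of distinct, positive phone ids.
  if (out->phones.empty() || out->phones.front() <= 0) return false;
  return std::adjacent_find(out->phones.begin(), out->phones.end()) ==
         out->phones.end();
}
```

Checking the keyword with a plain comparison, independent of the stream state, is the design choice that makes a line like "shared maybe 1 2" an error here.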
void ReadSymbolTableAsIntegers ( std::string  filename,
bool  include_eps,
std::vector< int32 > *  syms 
)

Included here because it's used in some tree-building calling code.

Reads an OpenFst symbol table, discards the symbols and outputs the integers.

Definition at line 428 of file build-tree.cc.

References KALDI_ASSERT, KALDI_ERR, KALDI_WARN, and kaldi::SortAndUniq().

{
  std::ifstream is(filename.c_str());
  if (!is.good())
    KALDI_ERR << "ReadSymbolTableAsIntegers: could not open symbol table "<<filename;
  std::string line;
  KALDI_ASSERT(syms != NULL);
  syms->clear();
  while (getline(is, line)) {
    std::string sym;
    int64 index;
    std::istringstream ss(line);
    ss >> sym >> index >> std::ws;
    if (ss.fail() || !ss.eof()) {
      KALDI_ERR << "Bad line in symbol table: "<< line<<", file is: "<<filename;
    }
    if (include_eps || index != 0)
      syms->push_back(index);
    if (index == 0 && sym != "<eps>") {
      KALDI_WARN << "Symbol zero is "<<sym<<", traditionally <eps> is used. Make sure this is not a \"real\" symbol.";
    }
  }
  size_t sz = syms->size();
  SortAndUniq(syms);
  if (syms->size() != sz)
    KALDI_ERR << "Symbol table "<<filename<<" seems to contain duplicate symbols.";
}
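The parsing loop above can be tried without a file on disk. The following standalone sketch reads "symbol index" lines from any std::istream and reports errors through its return value rather than KALDI_ERR. ReadSymbolIndexes is a hypothetical name, and the end-of-line check uses an extra token read instead of std::ws, which is an assumption made here for portability rather than what the listing does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects the integer column of a symbol table,
// sorted. Returns false on a malformed line or a duplicate index.
bool ReadSymbolIndexes(std::istream &is, bool include_eps,
                       std::vector<int32_t> *syms) {
  assert(syms != nullptr);
  syms->clear();
  std::string line;
  while (std::getline(is, line)) {
    std::istringstream ss(line);
    std::string sym;
    int64_t index;
    if (!(ss >> sym >> index)) return false;  // need "symbol integer"
    std::string extra;
    if (ss >> extra) return false;  // more than two fields on the line
    if (include_eps || index != 0)
      syms->push_back(static_cast<int32_t>(index));
  }
  std::sort(syms->begin(), syms->end());
  return std::adjacent_find(syms->begin(), syms->end()) == syms->end();
}
```

Passing a std::istringstream holding three lines such as "<eps> 0", "a 1" and "b 2" with include_eps set to false leaves only the indexes 1 and 2 in the output.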