All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Modules Pages
Intermediate-level functions used in building the tree

These functions are are used in top-level tree-building code (Top-level tree-building functions); see Decision tree internals for documentation. More...

Collaboration diagram for Intermediate-level functions used in building the tree:

Functions

EventMap * TrivialTree (int32 *num_leaves)
 Returns a tree with just one node. More...
 
EventMap * DoTableSplit (const EventMap &orig, EventKeyType key, const BuildTreeStatsType &stats, int32 *num_leaves)
 DoTableSplit does a complete split on this key (e.g. More...
 
EventMap * DoTableSplitMultiple (const EventMap &orig, const std::vector< EventKeyType > &keys, const BuildTreeStatsType &stats, int32 *num_leaves)
 DoTableSplitMultiple does a complete split on all the keys, in order from keys[0], keys[1] and so on. More...
 
int ClusterEventMapGetMapping (const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, std::vector< EventMap * > *mapping)
 "ClusterEventMapGetMapping" clusters the leaves of the EventMap, with "thresh" a delta-likelihood threshold to control how many leaves we combine (might be the same as the delta-like threshold used in splitting. More...
 
EventMap * ClusterEventMap (const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, int32 *num_removed)
 This is as ClusterEventMapGetMapping but a more convenient interface that exposes less of the internals. More...
 
EventMap * ClusterEventMapRestrictedByKeys (const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, const std::vector< EventKeyType > &keys, int32 *num_removed)
 This is as ClusterEventMap, but first splits the stats on the keys specified in "keys" (e.g. More...
 
EventMap * ClusterEventMapRestrictedByMap (const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, const EventMap &e_restrict, int32 *num_removed)
 This version of ClusterEventMapRestricted restricts the clustering to only allow things that "e_restrict" maps to the same value to be clustered together. More...
 
EventMap * RenumberEventMap (const EventMap &e_in, int32 *num_leaves)
 RenumberEventMap [intended to be used after calling ClusterEventMap] renumbers an EventMap so its leaves are consecutive. More...
 
EventMap * MapEventMapLeaves (const EventMap &e_in, const std::vector< int32 > &mapping)
 This function remaps the event-map leaves using this mapping, indexed by the number at leaf. More...
 
EventMap * ShareEventMapLeaves (const EventMap &e_in, EventKeyType key, std::vector< std::vector< EventValueType > > &values, int32 *num_leaves)
 ShareEventMapLeaves performs a quite specific function that allows us to generate trees where, for a certain list of phones, and for all states in the phone, all the pdf's are shared. More...
 
EventMap * SplitDecisionTree (const EventMap &orig, const BuildTreeStatsType &stats, Questions &qcfg, BaseFloat thresh, int32 max_leaves,int32 *num_leaves, BaseFloat *objf_impr_out, BaseFloat *smallest_split_change_out)
 Does a decision-tree split at the leaves of an EventMap. More...
 
void CreateRandomQuestions (const BuildTreeStatsType &stats, int32 num_quest, Questions *cfg_out)
 CreateRandomQuestions will initialize a Questions randomly, in a reasonable way [for testing purposes, or when hand-designed questions are not available]. More...
 
BaseFloat FindBestSplitForKey (const BuildTreeStatsType &stats, const Questions &qcfg, EventKeyType key, std::vector< EventValueType > *yes_set)
 FindBestSplitForKey is a function used in DoDecisionTreeSplit. More...
 
EventMap * GetStubMap (int32 P, const std::vector< std::vector< int32 > > &phone_sets, const std::vector< int32 > &phone2num_pdf_classes, const std::vector< bool > &share_roots,int32 *num_leaves)
 GetStubMap is used in tree-building functions to get the initial to-states map, before the decision-tree-building process. More...
 

Detailed Description

These functions are are used in top-level tree-building code (Top-level tree-building functions); see Decision tree internals for documentation.

Function Documentation

EventMap * ClusterEventMap ( const EventMap &  e_in,
const BuildTreeStatsType &  stats,
BaseFloat  thresh,
int32 *  num_removed 
)

This is as ClusterEventMapGetMapping but a more convenient interface that exposes less of the internals.

It uses a bottom-up clustering to combine the leaves, until the log-likelihood decrease from combinging two leaves exceeds the threshold.

Definition at line 699 of file build-tree-utils.cc.

References kaldi::ClusterEventMapGetMapping(), EventMap::Copy(), and kaldi::DeletePointers().

Referenced by kaldi::TestClusterEventMap(), and kaldi::TestClusterEventMapRestricted().

700  {
701  std::vector<EventMap*> mapping;
702  int32 num_removed = ClusterEventMapGetMapping(e_in, stats, thresh, &mapping);
703  EventMap *ans = e_in.Copy(mapping);
704  DeletePointers(&mapping);
705  if (num_removed_ptr != NULL) *num_removed_ptr = num_removed;
706  return ans;
707 
708 }
int ClusterEventMapGetMapping(const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, std::vector< EventMap * > *mapping)
"ClusterEventMapGetMapping" clusters the leaves of the EventMap, with "thresh" a delta-likelihood thr...
void DeletePointers(std::vector< A * > *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:198
int ClusterEventMapGetMapping ( const EventMap &  e_in,
const BuildTreeStatsType &  stats,
BaseFloat  thresh,
std::vector< EventMap * > *  mapping 
)

"ClusterEventMapGetMapping" clusters the leaves of the EventMap, with "thresh" a delta-likelihood threshold to control how many leaves we combine (might be the same as the delta-like threshold used in splitting.

Definition at line 599 of file build-tree-utils.cc.

References kaldi::ClusterBottomUp(), kaldi::DeletePointers(), rnnlm::i, KALDI_ASSERT, KALDI_VLOG, KALDI_WARN, kaldi::SplitStatsByMap(), kaldi::SumClusterableNormalizer(), and kaldi::SumStatsVec().

Referenced by kaldi::ClusterEventMap(), kaldi::ClusterEventMapRestrictedByMap(), kaldi::ClusterEventMapRestrictedHelper(), kaldi::TestClusterEventMapGetMappingAndRenumberEventMap(), and kaldi::TestClusterEventMapGetMappingAndRenumberEventMap2().

602  {
603  // First map stats
604  KALDI_ASSERT(stats.size() != 0);
605  std::vector<BuildTreeStatsType> split_stats;
606  SplitStatsByMap(stats, e_in, &split_stats);
607  std::vector<Clusterable*> summed_stats;
608  SumStatsVec(split_stats, &summed_stats);
609 
610  std::vector<int32> indexes;
611  std::vector<Clusterable*> summed_stats_contiguous;
612  size_t max_index = 0;
613  for (size_t i = 0;i < summed_stats.size();i++) {
614  if (summed_stats[i] != NULL) {
615  indexes.push_back(i);
616  summed_stats_contiguous.push_back(summed_stats[i]);
617  if (i > max_index) max_index = i;
618  }
619  }
620  if (summed_stats_contiguous.empty()) {
621  KALDI_WARN << "ClusterBottomUp: nothing to cluster.";
622  return 0; // nothing merged.
623  }
624 
625  std::vector<int32> assignments;
626  BaseFloat normalizer = SumClusterableNormalizer(summed_stats_contiguous),
627  change;
628  change = ClusterBottomUp(summed_stats_contiguous,
629  thresh,
630  0, // no min-clust: use threshold for now.
631  NULL, // don't need clusters out.
632  &assignments); // this algorithm is quadratic, so might be quite slow.
633 
634 
635  KALDI_ASSERT(assignments.size() == summed_stats_contiguous.size() && !assignments.empty());
636  size_t num_clust = * std::max_element(assignments.begin(), assignments.end()) + 1;
637  int32 num_combined = summed_stats_contiguous.size() - num_clust;
638  KALDI_ASSERT(num_combined >= 0);
639 
640  KALDI_VLOG(2) << "ClusterBottomUp combined "<< num_combined
641  << " leaves and gave a likelihood change of " << change
642  << ", normalized = " << (change/normalizer)
643  << ", normalizer = " << normalizer;
644  KALDI_ASSERT(change < 0.0001); // should be negative or zero.
645 
646  KALDI_ASSERT(mapping != NULL);
647  if (max_index >= mapping->size()) mapping->resize(max_index+1, NULL);
648 
649  for (size_t i = 0;i < summed_stats_contiguous.size();i++) {
650  size_t index = indexes[i];
651  size_t new_index = indexes[assignments[i]]; // index assigned by clusterig-- map to existing indices in the map,
652  // that we clustered from, so we don't conflict with indices in other parts
653  // of the tree.
654  KALDI_ASSERT((*mapping)[index] == NULL || "Error: Cluster seems to have been "
655  "called for different parts of the tree with overlapping sets of "
656  "indices.");
657  (*mapping)[index] = new ConstantEventMap(new_index);
658  }
659  DeletePointers(&summed_stats);
660  return num_combined;
661 }
BaseFloat SumClusterableNormalizer(const std::vector< Clusterable * > &vec)
Returns the total normalizer (usually count) of the cluster (pointers may be NULL).
void SplitStatsByMap(const BuildTreeStatsType &stats, const EventMap &e, std::vector< BuildTreeStatsType > *stats_out)
Splits stats according to the EventMap, indexing them at output by the leaf type. ...
float BaseFloat
Definition: kaldi-types.h:29
BaseFloat ClusterBottomUp(const std::vector< Clusterable * > &points, BaseFloat max_merge_thresh, int32 min_clust, std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out)
A bottom-up clustering algorithm.
void SumStatsVec(const std::vector< BuildTreeStatsType > &stats_in, std::vector< Clusterable * > *stats_out)
Sum a vector of stats.
#define KALDI_WARN
Definition: kaldi-error.h:130
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169
#define KALDI_VLOG(v)
Definition: kaldi-error.h:136
void DeletePointers(std::vector< A * > *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:198
EventMap * ClusterEventMapRestrictedByKeys ( const EventMap &  e_in,
const BuildTreeStatsType &  stats,
BaseFloat  thresh,
const std::vector< EventKeyType > &  keys,
int32 *  num_removed 
)

This is as ClusterEventMap, but first splits the stats on the keys specified in "keys" (e.g.

typically keys = [ -1, P ]), and only clusters within the classes defined by that splitting. Note– leaves will be non-consecutive at output, use RenumberEventMap.

Definition at line 824 of file build-tree-utils.cc.

References kaldi::ClusterEventMapRestrictedHelper(), EventMap::Copy(), and kaldi::DeletePointers().

Referenced by kaldi::TestClusterEventMapRestricted().

828  {
829  std::vector<EventMap*> leaf_mapping; // For output of ClusterEventMapGetMapping.
830 
831  int32 nr = ClusterEventMapRestrictedHelper(e_in, stats, thresh, keys, &leaf_mapping);
832  if (num_removed != NULL) *num_removed = nr;
833 
834  EventMap *ans = e_in.Copy(leaf_mapping);
835  DeletePointers(&leaf_mapping);
836  return ans;
837 }
static int32 ClusterEventMapRestrictedHelper(const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, std::vector< EventKeyType > keys, std::vector< EventMap * > *leaf_mapping)
void DeletePointers(std::vector< A * > *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:198
EventMap * ClusterEventMapRestrictedByMap ( const EventMap &  e_in,
const BuildTreeStatsType &  stats,
BaseFloat  thresh,
const EventMap &  e_restrict,
int32 *  num_removed 
)

This version of ClusterEventMapRestricted restricts the clustering to only allow things that "e_restrict" maps to the same value to be clustered together.

Definition at line 840 of file build-tree-utils.cc.

References kaldi::ClusterEventMapGetMapping(), EventMap::Copy(), kaldi::DeletePointers(), rnnlm::i, and kaldi::SplitStatsByMap().

Referenced by kaldi::BuildTree(), kaldi::BuildTreeTwoLevel(), and kaldi::TestClusterEventMapRestricted().

844  {
845  std::vector<EventMap*> leaf_mapping;
846 
847  std::vector<BuildTreeStatsType> split_stats;
848  int num_removed = 0;
849  SplitStatsByMap(stats, e_restrict, &split_stats);
850  for (size_t i = 0; i < split_stats.size(); i++) {
851  if (!split_stats[i].empty())
852  num_removed += ClusterEventMapGetMapping(e_in, split_stats[i], thresh,
853  &leaf_mapping);
854  }
855 
856  if (num_removed_ptr != NULL) *num_removed_ptr = num_removed;
857 
858  EventMap *ans = e_in.Copy(leaf_mapping);
859  DeletePointers(&leaf_mapping);
860  return ans;
861 }
void SplitStatsByMap(const BuildTreeStatsType &stats, const EventMap &e, std::vector< BuildTreeStatsType > *stats_out)
Splits stats according to the EventMap, indexing them at output by the leaf type. ...
int ClusterEventMapGetMapping(const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, std::vector< EventMap * > *mapping)
"ClusterEventMapGetMapping" clusters the leaves of the EventMap, with "thresh" a delta-likelihood thr...
void DeletePointers(std::vector< A * > *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:198
void kaldi::CreateRandomQuestions ( const BuildTreeStatsType &  stats,
int32  num_quest,
Questions *  cfg_out 
)

CreateRandomQuestions will initialize a Questions randomly, in a reasonable way [for testing purposes, or when hand-designed questions are not available].

e.g. num_quest = 5 might be a reasonable value if num_iters > 0, or num_quest = 20 otherwise.

EventMap * DoTableSplit ( const EventMap &  orig,
EventKeyType  key,
const BuildTreeStatsType &  stats,
int32 *  num_leaves 
)

DoTableSplit does a complete split on this key (e.g.

might correspond to central phone (key = P-1), or HMM-state position (key == kPdfClass == -1). Stats used to work out possible values of the event. "num_leaves" is used to allocate new leaves. All stats must have this key defined, or this function will crash.

Definition at line 126 of file build-tree-utils.cc.

References EventMap::Copy(), kaldi::DeletePointers(), KALDI_ASSERT, kaldi::PossibleValues(), and kaldi::SplitStatsByMap().

Referenced by kaldi::DoTableSplitMultiple(), kaldi::TestClusterEventMap(), kaldi::TestClusterEventMapGetMappingAndRenumberEventMap(), kaldi::TestClusterEventMapGetMappingAndRenumberEventMap2(), and kaldi::TestDoTableSplit().

127  {
128  // First-- map the stats to each leaf in the EventMap.
129  std::vector<BuildTreeStatsType> split_stats;
130  SplitStatsByMap(stats, orig, &split_stats);
131  // Now for each leaf that has stats in it, do the table split according to the given name.
132  std::vector<EventMap*> splits(split_stats.size(), NULL);
133  for (EventAnswerType leaf = 0; leaf < (EventAnswerType)split_stats.size(); leaf++) {
134  if (!split_stats[leaf].empty()) {
135  // first work out the possible values the name takes.
136  std::vector<EventValueType> vals; // vals are put here, sorted.
137  bool all_present = PossibleValues(key, split_stats[leaf], &vals);
138  KALDI_ASSERT(all_present); // currently do not support mapping undefined values.
139  KALDI_ASSERT(!vals.empty() && vals.front() >= 0); // don't support mapping negative values
140  // at present time-- would need different EventMap type, not TableEventMap.
141  std::vector<EventMap*> table(vals.back()+1, (EventMap*)NULL);
142  for (size_t idx = 0;idx < vals.size();idx++) {
143  EventValueType val = vals[idx];
144  if (idx == 0) table[val] = new ConstantEventMap(leaf); // reuse current leaf.
145  else table[val] = new ConstantEventMap( (*num_leaves)++ ); // else take new leaf id.
146  }
147  // takes ownershipof stats.
148  splits[leaf] = new TableEventMap(key, table);
149  }
150  }
151  EventMap *ans = orig.Copy(splits);
152  DeletePointers(&splits);
153  return ans;
154 }
void SplitStatsByMap(const BuildTreeStatsType &stats, const EventMap &e, std::vector< BuildTreeStatsType > *stats_out)
Splits stats according to the EventMap, indexing them at output by the leaf type. ...
bool PossibleValues(EventKeyType key, const BuildTreeStatsType &stats, std::vector< EventValueType > *ans)
Convenience function e.g.
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169
int32 EventAnswerType
As far as the event-map code itself is concerned, things of type EventAnswerType may take any value e...
Definition: event-map.h:56
int32 EventValueType
Given current code, things of type EventValueType should generally be nonnegative and in a reasonably...
Definition: event-map.h:51
void DeletePointers(std::vector< A * > *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:198
EventMap * DoTableSplitMultiple ( const EventMap &  orig,
const std::vector< EventKeyType > &  keys,
const BuildTreeStatsType &  stats,
int32 *  num_leaves 
)

DoTableSplitMultiple does a complete split on all the keys, in order from keys[0], keys[1] and so on.

The stats are used to work out possible values corresponding to the key. "num_leaves" is used to allocate new leaves. All stats must have the keys defined, or this function will crash. Returns a newly allocated event map.

Definition at line 156 of file build-tree-utils.cc.

References EventMap::Copy(), kaldi::DoTableSplit(), and rnnlm::i.

Referenced by kaldi::TestClusterEventMapRestricted(), and kaldi::TestShareEventMapLeaves().

156  {
157 
158  if (keys.empty()) return orig.Copy();
159  else {
160  EventMap *cur = NULL; // would make it &orig, except for const issues.
161  for (size_t i = 0; i < keys.size(); i++) {
162  EventMap *next = DoTableSplit( (cur ? *cur : orig), keys[i], stats, num_leaves);
163  delete cur; // delete intermediate maps.
164  cur = next;
165  }
166  return cur;
167  }
168 }
EventMap * DoTableSplit(const EventMap &orig, EventKeyType key, const BuildTreeStatsType &stats, int32 *num_leaves)
DoTableSplit does a complete split on this key (e.g.
BaseFloat FindBestSplitForKey ( const BuildTreeStatsType &  stats,
const Questions &  qcfg,
EventKeyType  key,
std::vector< EventValueType > *  yes_set 
)

FindBestSplitForKey is a function used in DoDecisionTreeSplit.

It finds the best split for this key, given these stats. It will return 0 if the key was not always defined for the stats.

Definition at line 348 of file build-tree-utils.cc.

References kaldi::AddToClusters(), kaldi::ApproxEqual(), kaldi::ComputeInitialSplit(), kaldi::DeletePointers(), kaldi::EnsureClusterableVectorNotNull(), Questions::GetQuestionsOf(), rnnlm::i, KALDI_ASSERT, KALDI_WARN, RefineClustersOptions::num_iters, kaldi::PossibleValues(), QuestionsForKey::refine_opts, kaldi::RefineClusters(), kaldi::SplitStatsByKey(), and kaldi::SumStatsVec().

Referenced by DecisionTreeSplitter::FindBestSplit().

351  {
352  if (stats.size()<=1) return 0.0; // cannot split if only zero or one instance of stats.
353  if (!PossibleValues(key, stats, NULL)) {
354  yes_set_out->clear();
355  return 0.0; // Can't split as key not always defined.
356  }
357  std::vector<Clusterable*> summed_stats; // indexed by value corresponding to key. owned here.
358  { // compute summed_stats
359  std::vector<BuildTreeStatsType> split_stats;
360  SplitStatsByKey(stats, key, &split_stats);
361  SumStatsVec(split_stats, &summed_stats);
362  }
363 
364  std::vector<EventValueType> yes_set;
365  BaseFloat improvement = ComputeInitialSplit(summed_stats,
366  q_opts, key, &yes_set);
367  // find best basic question.
368 
369  std::vector<int32> assignments(summed_stats.size(), 0); // assigns to "no" (0) by default.
370  for (std::vector<EventValueType>::const_iterator iter = yes_set.begin(); iter != yes_set.end(); ++iter) {
371  KALDI_ASSERT(*iter>=0);
372  if (*iter < (EventValueType)assignments.size()) {
373  // this guard necessary in case stats did not have all the
374  // values present in "yes_set".
375  assignments[*iter] = 1; // assign to "yes" (1).
376  }
377  }
378  std::vector<Clusterable*> clusters(2, (Clusterable*)NULL); // no, yes.
379  kaldi::AddToClusters(summed_stats, assignments, &clusters);
380 
381  EnsureClusterableVectorNotNull(&summed_stats);
383 
384  // even if improvement == 0 we continue; if we do RefineClusters we may get further improvement.
385  // now do the RefineClusters stuff. Note that this is null-op if
386  // q_opts.GetQuestionsOf(key).refine_opts.num_iters == 0. We could check for this but don't bother;
387  // it happens in RefineClusters anyway.
388 
389  if (q_opts.GetQuestionsOf(key).refine_opts.num_iters > 0) {
390  // If we want to refine the questions... (a bit like k-means w/ 2 classes).
391  // Note: the only reason we introduced the if-statement is so the yes_set
392  // doesn't get modified (truncated, actually) if we do the refine stuff with
393  // zero iters.
394  BaseFloat refine_impr = RefineClusters(summed_stats, &clusters, &assignments,
395  q_opts.GetQuestionsOf(key).refine_opts);
396  KALDI_ASSERT(refine_impr > std::min(-1.0, -0.1*fabs(improvement)));
397  // refine_impr should always be positive
398  improvement += refine_impr;
399  yes_set.clear();
400  for (size_t i = 0;i < assignments.size();i++) if (assignments[i] == 1) yes_set.push_back(i);
401  }
402  *yes_set_out = yes_set;
403 
404  DeletePointers(&clusters);
405 #ifdef KALDI_PARANOID
406  { // Check the "ans" is correct.
407  KALDI_ASSERT(clusters.size() == 2 && clusters[0] == 0 && clusters[1] == 0);
408  AddToClusters(summed_stats, assignments, &clusters);
409  BaseFloat impr;
410  if (clusters[0] == NULL || clusters[1] == NULL) impr = 0.0;
411  else impr = clusters[0]->Distance(*(clusters[1]));
412  if (!ApproxEqual(impr, improvement) && fabs(impr-improvement) > 0.01) {
413  KALDI_WARN << "FindBestSplitForKey: improvements do not agree: "<< impr
414  << " vs. " << improvement;
415  }
416  DeletePointers(&clusters);
417  }
418 #endif
419  DeletePointers(&summed_stats);
420  return improvement; // objective-function improvement.
421 }
BaseFloat RefineClusters(const std::vector< Clusterable * > &points, std::vector< Clusterable * > *clusters, std::vector< int32 > *assignments, RefineClustersOptions cfg)
RefineClusters is mainly used internally by other clustering algorithms.
void SplitStatsByKey(const BuildTreeStatsType &stats_in, EventKeyType key, std::vector< BuildTreeStatsType > *stats_out)
SplitStatsByKey splits stats up according to the value of a particular key, which must be always defi...
bool PossibleValues(EventKeyType key, const BuildTreeStatsType &stats, std::vector< EventValueType > *ans)
Convenience function e.g.
float BaseFloat
Definition: kaldi-types.h:29
void EnsureClusterableVectorNotNull(std::vector< Clusterable * > *stats)
Fills in any (NULL) holes in "stats" vector, with empty stats, because certain algorithms require non...
void SumStatsVec(const std::vector< BuildTreeStatsType > &stats_in, std::vector< Clusterable * > *stats_out)
Sum a vector of stats.
void AddToClusters(const std::vector< Clusterable * > &stats, const std::vector< int32 > &assignments, std::vector< Clusterable * > *clusters)
Given stats and a vector "assignments" of the same size (that maps to cluster indices), sums the stats up into "clusters." It will add to any stats already present in "clusters" (although typically "clusters" will be empty when called), and it will extend with NULL pointers for any unseen indices.
#define KALDI_WARN
Definition: kaldi-error.h:130
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169
BaseFloat ComputeInitialSplit(const std::vector< Clusterable * > &summed_stats, const Questions &q_opts, EventKeyType key, std::vector< EventValueType > *yes_set)
int32 EventValueType
Given current code, things of type EventValueType should generally be nonnegative and in a reasonably...
Definition: event-map.h:51
void DeletePointers(std::vector< A * > *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:198
static bool ApproxEqual(float a, float b, float relative_tolerance=0.001)
return abs(a - b) <= relative_tolerance * (abs(a)+abs(b)).
Definition: kaldi-math.h:262
EventMap * GetStubMap ( int32  P,
const std::vector< std::vector< int32 > > &  phone_sets,
const std::vector< int32 > &  phone2num_pdf_classes,
const std::vector< bool > &  share_roots,
int32 *  num_leaves 
)

GetStubMap is used in tree-building functions to get the initial to-states map, before the decision-tree-building process.

It creates a simple map that splits on groups of phones. For the set of phones in phone_sets[i] it creates either: if share_roots[i] == true, a single leaf node, or if share_roots[i] == false, separate root nodes for each HMM-position (it goes up to the highest position for any phone in the set, although it will warn if you share roots between phones with different numbers of states, which is a weird thing to do but should still work. If any phone is present in "phone_sets" but "phone2num_pdf_classes" does not map it to a length, it is an error. Note that the behaviour of the resulting map is undefined for phones not present in "phone_sets". At entry, this function should be called with (*num_leaves == 0). It will number the leaves starting from (*num_leaves).

Definition at line 865 of file build-tree-utils.cc.

References rnnlm::i, kaldi::IsSortedAndUniq(), rnnlm::j, KALDI_ASSERT, KALDI_WARN, and kaldi::kPdfClass.

Referenced by kaldi::BuildTree(), kaldi::MonophoneContextDependency(), and kaldi::MonophoneContextDependencyShared().

869  {
870 
871  { // Checking inputs.
872  KALDI_ASSERT(!phone_sets.empty() && share_roots.size() == phone_sets.size());
873  std::set<int32> all_phones;
874  for (size_t i = 0; i < phone_sets.size(); i++) {
875  KALDI_ASSERT(IsSortedAndUniq(phone_sets[i]));
876  KALDI_ASSERT(!phone_sets[i].empty());
877  for (size_t j = 0; j < phone_sets[i].size(); j++) {
878  KALDI_ASSERT(all_phones.count(phone_sets[i][j]) == 0); // check not present.
879  all_phones.insert(phone_sets[i][j]);
880  }
881  }
882  }
883 
884  // Initially create a single leaf for each phone set.
885 
886  size_t max_set_size = 0;
887  int32 highest_numbered_phone = 0;
888  for (size_t i = 0; i < phone_sets.size(); i++) {
889  max_set_size = std::max(max_set_size, phone_sets[i].size());
890  highest_numbered_phone =
891  std::max(highest_numbered_phone,
892  * std::max_element(phone_sets[i].begin(), phone_sets[i].end()));
893  }
894 
895  if (phone_sets.size() == 1) { // there is only one set so the recursion finishes.
896  if (share_roots[0]) { // if "shared roots" return a single leaf.
897  return new ConstantEventMap( (*num_leaves_out)++ );
898  } else { // not sharing roots -> work out the length and return a
899  // TableEventMap splitting on length.
900  EventAnswerType max_len = 0;
901  for (size_t i = 0; i < phone_sets[0].size(); i++) {
902  EventAnswerType len;
903  EventValueType phone = phone_sets[0][i];
904  KALDI_ASSERT(static_cast<size_t>(phone) < phone2num_pdf_classes.size());
905  len = phone2num_pdf_classes[phone];
906  KALDI_ASSERT(len > 0);
907  if (i == 0) max_len = len;
908  else {
909  if (len != max_len) {
910  KALDI_WARN << "Mismatching lengths within a phone set: " << len
911  << " vs. " << max_len << " [unusual, but not necessarily fatal]. ";
912  max_len = std::max(len, max_len);
913  }
914  }
915  }
916  std::map<EventValueType, EventAnswerType> m;
917  for (EventAnswerType p = 0; p < max_len; p++)
918  m[p] = (*num_leaves_out)++;
919  return new TableEventMap(kPdfClass, // split on hmm-position
920  m);
921  }
922  } else if (max_set_size == 1
923  && static_cast<int32>(phone_sets.size()) <= 2*highest_numbered_phone) {
924  // create table map splitting on phone-- more efficient.
925  // the part after the && checks that this would not contain a very sparse vector.
926  std::map<EventValueType, EventMap*> m;
927  for (size_t i = 0; i < phone_sets.size(); i++) {
928  std::vector<std::vector<int32> > phone_sets_tmp;
929  phone_sets_tmp.push_back(phone_sets[i]);
930  std::vector<bool> share_roots_tmp;
931  share_roots_tmp.push_back(share_roots[i]);
932  EventMap *this_stub = GetStubMap(P, phone_sets_tmp, phone2num_pdf_classes,
933  share_roots_tmp,
934  num_leaves_out);
935  KALDI_ASSERT(m.count(phone_sets_tmp[0][0]) == 0);
936  m[phone_sets_tmp[0][0]] = this_stub;
937  }
938  return new TableEventMap(P, m);
939  } else {
940  // Do a split. Recurse.
941  size_t half_sz = phone_sets.size() / 2;
942  std::vector<std::vector<int32> >::const_iterator half_phones =
943  phone_sets.begin() + half_sz;
944  std::vector<bool>::const_iterator half_share =
945  share_roots.begin() + half_sz;
946  std::vector<std::vector<int32> > phone_sets_1, phone_sets_2;
947  std::vector<bool> share_roots_1, share_roots_2;
948  phone_sets_1.insert(phone_sets_1.end(), phone_sets.begin(), half_phones);
949  phone_sets_2.insert(phone_sets_2.end(), half_phones, phone_sets.end());
950  share_roots_1.insert(share_roots_1.end(), share_roots.begin(), half_share);
951  share_roots_2.insert(share_roots_2.end(), half_share, share_roots.end());
952 
953  EventMap *map1 = GetStubMap(P, phone_sets_1, phone2num_pdf_classes, share_roots_1, num_leaves_out);
954  EventMap *map2 = GetStubMap(P, phone_sets_2, phone2num_pdf_classes, share_roots_2, num_leaves_out);
955 
956  std::vector<EventKeyType> all_in_first_set;
957  for (size_t i = 0; i < half_sz; i++)
958  for (size_t j = 0; j < phone_sets[i].size(); j++)
959  all_in_first_set.push_back(phone_sets[i][j]);
960  std::sort(all_in_first_set.begin(), all_in_first_set.end());
961  KALDI_ASSERT(IsSortedAndUniq(all_in_first_set));
962  return new SplitEventMap(P, all_in_first_set, map1, map2);
963  }
964 }
static const EventKeyType kPdfClass
Definition: context-dep.h:39
#define KALDI_WARN
Definition: kaldi-error.h:130
EventMap * GetStubMap(int32 P, const std::vector< std::vector< int32 > > &phone_sets, const std::vector< int32 > &phone2num_pdf_classes, const std::vector< bool > &share_roots, int32 *num_leaves_out)
GetStubMap is used in tree-building functions to get the initial to-states map, before the decision-t...
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169
int32 EventAnswerType
As far as the event-map code itself is concerned, things of type EventAnswerType may take any value e...
Definition: event-map.h:56
int32 EventValueType
Given current code, things of type EventValueType should generally be nonnegative and in a reasonably...
Definition: event-map.h:51
bool IsSortedAndUniq(const std::vector< T > &vec)
Returns true if the vector is sorted and contains each element only once.
Definition: stl-utils.h:75
EventMap * MapEventMapLeaves ( const EventMap &  e_in,
const std::vector< int32 > &  mapping 
)

This function remaps the event-map leaves using this mapping, indexed by the number at leaf.

Definition at line 689 of file build-tree-utils.cc.

References EventMap::Copy(), kaldi::DeletePointers(), and rnnlm::i.

Referenced by kaldi::BuildTreeTwoLevel().

690  {
691  std::vector<EventMap*> mapping(mapping_in.size());
692  for (size_t i = 0; i < mapping_in.size(); i++)
693  mapping[i] = new ConstantEventMap(mapping_in[i]);
694  EventMap *ans = e_in.Copy(mapping);
695  DeletePointers(&mapping);
696  return ans;
697 }
void DeletePointers(std::vector< A * > *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:198
EventMap * RenumberEventMap ( const EventMap &  e_in,
int32 *  num_leaves 
)

RenumberEventMap [intended to be used after calling ClusterEventMap] renumbers an EventMap so its leaves are consecutive.

It puts the number of leaves in *num_leaves. If later you need the mapping of the leaves, modify the function and add a new argument.

Definition at line 664 of file build-tree-utils.cc.

References EventMap::Copy(), kaldi::DeletePointers(), KALDI_ASSERT, EventMap::MultiMap(), and kaldi::SortAndUniq().

Referenced by kaldi::BuildTree(), kaldi::BuildTreeTwoLevel(), kaldi::ShareEventMapLeaves(), kaldi::TestClusterEventMapGetMappingAndRenumberEventMap(), and kaldi::TestClusterEventMapGetMappingAndRenumberEventMap2().

664  {
665  EventType empty_vec;
666  std::vector<EventAnswerType> initial_leaves; // before renumbering.
667  e_in.MultiMap(empty_vec, &initial_leaves);
668  if (initial_leaves.empty()) {
669  KALDI_ASSERT(num_leaves);
670  if (num_leaves) *num_leaves = 0;
671  return e_in.Copy();
672  }
673  SortAndUniq(&initial_leaves);
674  EventAnswerType max_leaf_plus_one = initial_leaves.back() + 1; // will typically, but not always, equal *num_leaves.
675  std::vector<EventMap*> mapping(max_leaf_plus_one, (EventMap*)NULL);
676  std::vector<EventAnswerType>::iterator iter = initial_leaves.begin(), end = initial_leaves.end();
677  EventAnswerType cur_leaf = 0;
678  for (; iter != end; ++iter) {
679  KALDI_ASSERT(*iter >= 0 && *iter<max_leaf_plus_one);
680  mapping[*iter] = new ConstantEventMap(cur_leaf++);
681  }
682  EventMap *ans = e_in.Copy(mapping);
683  DeletePointers(&mapping);
684  KALDI_ASSERT((size_t)cur_leaf == initial_leaves.size());
685  if (num_leaves) *num_leaves = cur_leaf;
686  return ans;
687 }
void SortAndUniq(std::vector< T > *vec)
Sorts and uniq's (removes duplicates) from a vector.
Definition: stl-utils.h:51
std::vector< std::pair< EventKeyType, EventValueType > > EventType
Definition: event-map.h:58
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169
int32 EventAnswerType
As far as the event-map code itself is concerned, things of type EventAnswerType may take any value e...
Definition: event-map.h:56
void DeletePointers(std::vector< A * > *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:198
EventMap * ShareEventMapLeaves ( const EventMap &  e_in,
EventKeyType  key,
std::vector< std::vector< EventValueType > > &  values,
int32 *  num_leaves 
)

ShareEventMapLeaves performs a quite specific function that allows us to generate trees where, for a certain list of phones, and for all states in the phone, all the pdf's are shared.

Each element of "values" contains a list of phones (may be just one phone), all states of which we want shared together). Typically at input, "key" will equal P, the central-phone position, and "values" will contain just one list containing the silence phone. This function renumbers the event map leaves after doing the sharing, to make the event-map leaves contiguous.

Definition at line 710 of file build-tree-utils.cc.

References EventMap::Copy(), kaldi::DeletePointers(), rnnlm::i, rnnlm::j, KALDI_ASSERT, KALDI_WARN, kaldi::MakeEventPair(), EventMap::MultiMap(), kaldi::RenumberEventMap(), and kaldi::SortAndUniq().

Referenced by kaldi::TestShareEventMapLeaves().

712  {
713  // the use of "pdfs" as the name of the next variable reflects the anticipated
714  // use of this function.
715  std::vector<std::vector<EventAnswerType> > pdfs(values.size());
716  for (size_t i = 0; i < values.size(); i++) {
717  EventType evec;
718  for (size_t j = 0; j < values[i].size(); j++) {
719  evec.push_back(MakeEventPair(key, values[i][j]));
720  size_t size_at_start = pdfs[i].size();
721  e_in.MultiMap(evec, &(pdfs[i])); // append any corresponding pdfs to pdfs[i].
722  if (pdfs[i].size() == size_at_start) { // Nothing added... unexpected.
723  KALDI_WARN << "ShareEventMapLeaves: had no leaves for key = " << key
724  << ", value = " << (values[i][j]);
725  }
726  }
727  SortAndUniq(&(pdfs[i]));
728  }
729  std::vector<EventMap*> remapping;
730  for (size_t i = 0; i < values.size(); i++) {
731  if (pdfs[i].empty())
732  KALDI_WARN << "ShareEventMapLeaves: no leaves in one bucket."; // not expected.
733  else {
734  EventAnswerType map_to_this = pdfs[i][0]; // map all in this bucket
735  // to this value.
736  for (size_t j = 1; j < pdfs[i].size(); j++) {
737  EventAnswerType leaf = pdfs[i][j];
738  KALDI_ASSERT(leaf>=0);
739  if (remapping.size() <= static_cast<size_t>(leaf))
740  remapping.resize(leaf+1, NULL);
741  KALDI_ASSERT(remapping[leaf] == NULL);
742  remapping[leaf] = new ConstantEventMap(map_to_this);
743  }
744  }
745  }
746  EventMap *shared = e_in.Copy(remapping);
747  DeletePointers(&remapping);
748  EventMap *renumbered = RenumberEventMap(*shared, num_leaves);
749  delete shared;
750  return renumbered;
751 }
std::pair< EventKeyType, EventValueType > MakeEventPair(EventKeyType k, EventValueType v)
Definition: event-map.h:62
void SortAndUniq(std::vector< T > *vec)
Sorts and uniq's (removes duplicates) from a vector.
Definition: stl-utils.h:51
std::vector< std::pair< EventKeyType, EventValueType > > EventType
Definition: event-map.h:58
#define KALDI_WARN
Definition: kaldi-error.h:130
EventMap * RenumberEventMap(const EventMap &e_in, int32 *num_leaves)
RenumberEventMap [intended to be used after calling ClusterEventMap] renumbers an EventMap so its lea...
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169
int32 EventAnswerType
As far as the event-map code itself is concerned, things of type EventAnswerType may take any value e...
Definition: event-map.h:56
void DeletePointers(std::vector< A * > *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:198
EventMap * SplitDecisionTree ( const EventMap &  orig,
const BuildTreeStatsType &  stats,
Questions &  qcfg,
BaseFloat  thresh,
int32  max_leaves,
int32 *  num_leaves,
BaseFloat *  objf_impr_out,
BaseFloat *  smallest_split_change_out 
)

Does a decision-tree split at the leaves of an EventMap.

Parameters
orig[in] The EventMap whose leaves we want to split. [may be either a trivial or a non-trivial one].
stats[in] The statistics for splitting the tree; if you do not want a particular subset of leaves to be split, make sure the stats corresponding to those leaves are not present in "stats".
qcfg[in] Configuration class that contains initial questions (e.g. sets of phones) for each key and says whether to refine these questions during tree building.
thresh[in] A log-likelihood threshold (e.g. 300) that can be used to limit the number of leaves; you can use zero and set max_leaves instead.
max_leaves[in] Will stop leaves being split after they reach this number.
num_leaves[in,out] A pointer used to allocate leaves; always corresponds to the current number of leaves (is incremented when this is increased).
objf_impr_out[out] If non-NULL, will be set to the objective improvement due to splitting (not normalized by the number of frames).
smallest_split_change_outIf non-NULL, will be set to the smallest objective-function improvement that we got from splitting any leaf; useful to provide a threshold for ClusterEventMap.
Returns
The EventMap after splitting is returned; pointer is owned by caller.

Definition at line 532 of file build-tree-utils.cc.

References EventMap::Copy(), count, rnnlm::i, KALDI_ASSERT, KALDI_LOG, and kaldi::SplitStatsByMap().

Referenced by kaldi::BuildTree(), kaldi::BuildTreeTwoLevel(), kaldi::TestClusterEventMapRestricted(), kaldi::TestShareEventMapLeaves(), and kaldi::TestSplitDecisionTree().

539  {
540  KALDI_ASSERT(num_leaves != NULL && *num_leaves > 0); // can't be 0 or input_map would be empty.
541  int32 num_empty_leaves = 0;
542  BaseFloat like_impr = 0.0;
543  BaseFloat smallest_split_change = 1.0e+20;
544  std::vector<DecisionTreeSplitter*> builders;
545  { // set up "builders" [one for each current leaf]. This array is never extended.
546  // the structures generated during splitting remain as trees at each array location.
547  std::vector<BuildTreeStatsType> split_stats;
548  SplitStatsByMap(stats, input_map, &split_stats);
549  KALDI_ASSERT(split_stats.size() != 0);
550  builders.resize(split_stats.size()); // size == #leaves.
551  for (size_t i = 0;i < split_stats.size();i++) {
552  EventAnswerType leaf = static_cast<EventAnswerType>(i);
553  if (split_stats[i].size() == 0) num_empty_leaves++;
554  builders[i] = new DecisionTreeSplitter(leaf, split_stats[i], q_opts);
555  }
556  }
557 
558  { // Do the splitting.
559  int32 count = 0;
560  std::priority_queue<std::pair<BaseFloat, size_t> > queue; // use size_t because logically these
561  // are just indexes into the array, not leaf-ids (after splitting they are no longer leaf id's).
562  // Initialize queue.
563  for (size_t i = 0; i < builders.size(); i++)
564  queue.push(std::make_pair(builders[i]->BestSplit(), i));
565  // Note-- queue's size never changes from now. All the alternatives leaves to split are
566  // inside the "DecisionTreeSplitter*" objects, in a tree structure.
567  while (queue.top().first > thresh
568  && (max_leaves<=0 || *num_leaves < max_leaves)) {
569  smallest_split_change = std::min(smallest_split_change, queue.top().first);
570  size_t i = queue.top().second;
571  like_impr += queue.top().first;
572  builders[i]->DoSplit(num_leaves);
573  queue.pop();
574  queue.push(std::make_pair(builders[i]->BestSplit(), i));
575  count++;
576  }
577  KALDI_LOG << "DoDecisionTreeSplit: split "<< count << " times, #leaves now " << (*num_leaves);
578  }
579 
580  if (smallest_split_change_out)
581  *smallest_split_change_out = smallest_split_change;
582 
583  EventMap *answer = NULL;
584 
585  { // Create the output EventMap.
586  std::vector<EventMap*> sub_trees(builders.size());
587  for (size_t i = 0; i < sub_trees.size();i++) sub_trees[i] = builders[i]->GetMap();
588  answer = input_map.Copy(sub_trees);
589  for (size_t i = 0; i < sub_trees.size();i++) delete sub_trees[i];
590  }
591  // Free up memory.
592  for (size_t i = 0;i < builders.size();i++) delete builders[i];
593 
594  if (obj_impr_out != NULL) *obj_impr_out = like_impr;
595  return answer;
596 }
void SplitStatsByMap(const BuildTreeStatsType &stats, const EventMap &e, std::vector< BuildTreeStatsType > *stats_out)
Splits stats according to the EventMap, indexing them at output by the leaf type. ...
const size_t count
float BaseFloat
Definition: kaldi-types.h:29
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169
int32 EventAnswerType
As far as the event-map code itself is concerned, things of type EventAnswerType may take any value e...
Definition: event-map.h:56
#define KALDI_LOG
Definition: kaldi-error.h:133
EventMap* kaldi::TrivialTree ( int32 *  num_leaves)
inline

Returns a tree with just one node.

Used @ start of tree-building process. Not really used in current recipes.

Definition at line 152 of file build-tree-utils.h.

Referenced by kaldi::TestClusterEventMap(), kaldi::TestClusterEventMapGetMappingAndRenumberEventMap(), kaldi::TestClusterEventMapGetMappingAndRenumberEventMap2(), kaldi::TestClusterEventMapRestricted(), kaldi::TestDoTableSplit(), kaldi::TestShareEventMapLeaves(), kaldi::TestSplitDecisionTree(), and kaldi::TestTrivialTree().

152  {
153  KALDI_ASSERT(*num_leaves == 0); // in envisaged usage.
154  return new ConstantEventMap( (*num_leaves)++ );
155 }
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169