Intermediate-level functions used in building the tree

These functions are are used in top-level tree-building code (Top-level tree-building functions); see Decision tree internals for documentation. More...

Collaboration diagram for Intermediate-level functions used in building the tree:

Functions

EventMapTrivialTree (int32 *num_leaves)
 Returns a tree with just one node. More...
 
EventMapDoTableSplit (const EventMap &orig, EventKeyType key, const BuildTreeStatsType &stats, int32 *num_leaves)
 DoTableSplit does a complete split on this key (e.g. More...
 
EventMapDoTableSplitMultiple (const EventMap &orig, const std::vector< EventKeyType > &keys, const BuildTreeStatsType &stats, int32 *num_leaves)
 DoTableSplitMultiple does a complete split on all the keys, in order from keys[0], keys[1] and so on. More...
 
int ClusterEventMapGetMapping (const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, std::vector< EventMap * > *mapping)
 "ClusterEventMapGetMapping" clusters the leaves of the EventMap, with "thresh" a delta-likelihood threshold to control how many leaves we combine (might be the same as the delta-like threshold used in splitting. More...
 
EventMapClusterEventMap (const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, int32 *num_removed)
 This is as ClusterEventMapGetMapping but a more convenient interface that exposes less of the internals. More...
 
EventMapClusterEventMapRestrictedByKeys (const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, const std::vector< EventKeyType > &keys, int32 *num_removed)
 This is as ClusterEventMap, but first splits the stats on the keys specified in "keys" (e.g. More...
 
EventMapClusterEventMapRestrictedByMap (const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, const EventMap &e_restrict, int32 *num_removed)
 This version of ClusterEventMapRestricted restricts the clustering to only allow things that "e_restrict" maps to the same value to be clustered together. More...
 
EventMapClusterEventMapToNClustersRestrictedByMap (const EventMap &e_in, const BuildTreeStatsType &stats, int32 num_clusters, const EventMap &e_restrict, int32 *num_removed)
 This version of ClusterEventMapRestrictedByMap clusters to get a specific number of clusters as specified by 'num_clusters'. More...
 
EventMapRenumberEventMap (const EventMap &e_in, int32 *num_leaves)
 RenumberEventMap [intended to be used after calling ClusterEventMap] renumbers an EventMap so its leaves are consecutive. More...
 
EventMapMapEventMapLeaves (const EventMap &e_in, const std::vector< int32 > &mapping)
 This function remaps the event-map leaves using this mapping, indexed by the number at leaf. More...
 
EventMapShareEventMapLeaves (const EventMap &e_in, EventKeyType key, std::vector< std::vector< EventValueType > > &values, int32 *num_leaves)
 ShareEventMapLeaves performs a quite specific function that allows us to generate trees where, for a certain list of phones, and for all states in the phone, all the pdf's are shared. More...
 
EventMapSplitDecisionTree (const EventMap &orig, const BuildTreeStatsType &stats, Questions &qcfg, BaseFloat thresh, int32 max_leaves, int32 *num_leaves, BaseFloat *objf_impr_out, BaseFloat *smallest_split_change_out)
 Does a decision-tree split at the leaves of an EventMap. More...
 
void CreateRandomQuestions (const BuildTreeStatsType &stats, int32 num_quest, Questions *cfg_out)
 CreateRandomQuestions will initialize a Questions randomly, in a reasonable way [for testing purposes, or when hand-designed questions are not available]. More...
 
BaseFloat FindBestSplitForKey (const BuildTreeStatsType &stats, const Questions &qcfg, EventKeyType key, std::vector< EventValueType > *yes_set)
 FindBestSplitForKey is a function used in DoDecisionTreeSplit. More...
 
EventMapGetStubMap (int32 P, const std::vector< std::vector< int32 > > &phone_sets, const std::vector< int32 > &phone2num_pdf_classes, const std::vector< bool > &share_roots, int32 *num_leaves)
 GetStubMap is used in tree-building functions to get the initial to-states map, before the decision-tree-building process. More...
 

Detailed Description

These functions are are used in top-level tree-building code (Top-level tree-building functions); see Decision tree internals for documentation.

Function Documentation

◆ ClusterEventMap()

EventMap * ClusterEventMap ( const EventMap e_in,
const BuildTreeStatsType stats,
BaseFloat  thresh,
int32 num_removed 
)

This is as ClusterEventMapGetMapping but a more convenient interface that exposes less of the internals.

It uses a bottom-up clustering to combine the leaves, until the log-likelihood decrease from combinging two leaves exceeds the threshold.

Definition at line 699 of file build-tree-utils.cc.

References kaldi::ClusterEventMapGetMapping(), EventMap::Copy(), and kaldi::DeletePointers().

Referenced by kaldi::TestClusterEventMap(), and kaldi::TestClusterEventMapRestricted().

700  {
701  std::vector<EventMap*> mapping;
702  int32 num_removed = ClusterEventMapGetMapping(e_in, stats, thresh, &mapping);
703  EventMap *ans = e_in.Copy(mapping);
704  DeletePointers(&mapping);
705  if (num_removed_ptr != NULL) *num_removed_ptr = num_removed;
706  return ans;
707 
708 }
void DeletePointers(std::vector< A *> *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:184
kaldi::int32 int32
int ClusterEventMapGetMapping(const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, std::vector< EventMap *> *mapping)
"ClusterEventMapGetMapping" clusters the leaves of the EventMap, with "thresh" a delta-likelihood thr...

◆ ClusterEventMapGetMapping()

int ClusterEventMapGetMapping ( const EventMap e_in,
const BuildTreeStatsType stats,
BaseFloat  thresh,
std::vector< EventMap * > *  mapping 
)

"ClusterEventMapGetMapping" clusters the leaves of the EventMap, with "thresh" a delta-likelihood threshold to control how many leaves we combine (might be the same as the delta-like threshold used in splitting.

Definition at line 599 of file build-tree-utils.cc.

References kaldi::ClusterBottomUp(), kaldi::DeletePointers(), rnnlm::i, KALDI_ASSERT, KALDI_VLOG, KALDI_WARN, kaldi::SplitStatsByMap(), kaldi::SumClusterableNormalizer(), and kaldi::SumStatsVec().

Referenced by kaldi::ClusterEventMap(), kaldi::ClusterEventMapRestrictedByMap(), kaldi::ClusterEventMapRestrictedHelper(), kaldi::TestClusterEventMapGetMappingAndRenumberEventMap(), and kaldi::TestClusterEventMapGetMappingAndRenumberEventMap2().

602  {
603  // First map stats
604  KALDI_ASSERT(stats.size() != 0);
605  std::vector<BuildTreeStatsType> split_stats;
606  SplitStatsByMap(stats, e_in, &split_stats);
607  std::vector<Clusterable*> summed_stats;
608  SumStatsVec(split_stats, &summed_stats);
609 
610  std::vector<int32> indexes;
611  std::vector<Clusterable*> summed_stats_contiguous;
612  size_t max_index = 0;
613  for (size_t i = 0;i < summed_stats.size();i++) {
614  if (summed_stats[i] != NULL) {
615  indexes.push_back(i);
616  summed_stats_contiguous.push_back(summed_stats[i]);
617  if (i > max_index) max_index = i;
618  }
619  }
620  if (summed_stats_contiguous.empty()) {
621  KALDI_WARN << "ClusterBottomUp: nothing to cluster.";
622  return 0; // nothing merged.
623  }
624 
625  std::vector<int32> assignments;
626  BaseFloat normalizer = SumClusterableNormalizer(summed_stats_contiguous),
627  change;
628  change = ClusterBottomUp(summed_stats_contiguous,
629  thresh,
630  0, // no min-clust: use threshold for now.
631  NULL, // don't need clusters out.
632  &assignments); // this algorithm is quadratic, so might be quite slow.
633 
634 
635  KALDI_ASSERT(assignments.size() == summed_stats_contiguous.size() && !assignments.empty());
636  size_t num_clust = * std::max_element(assignments.begin(), assignments.end()) + 1;
637  int32 num_combined = summed_stats_contiguous.size() - num_clust;
638  KALDI_ASSERT(num_combined >= 0);
639 
640  KALDI_VLOG(2) << "ClusterBottomUp combined "<< num_combined
641  << " leaves and gave a likelihood change of " << change
642  << ", normalized = " << (change/normalizer)
643  << ", normalizer = " << normalizer;
644  KALDI_ASSERT(change < 0.0001); // should be negative or zero.
645 
646  KALDI_ASSERT(mapping != NULL);
647  if (max_index >= mapping->size()) mapping->resize(max_index+1, NULL);
648 
649  for (size_t i = 0;i < summed_stats_contiguous.size();i++) {
650  size_t index = indexes[i];
651  size_t new_index = indexes[assignments[i]]; // index assigned by clusterig-- map to existing indices in the map,
652  // that we clustered from, so we don't conflict with indices in other parts
653  // of the tree.
654  KALDI_ASSERT((*mapping)[index] == NULL || "Error: Cluster seems to have been "
655  "called for different parts of the tree with overlapping sets of "
656  "indices.");
657  (*mapping)[index] = new ConstantEventMap(new_index);
658  }
659  DeletePointers(&summed_stats);
660  return num_combined;
661 }
void DeletePointers(std::vector< A *> *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:184
BaseFloat SumClusterableNormalizer(const std::vector< Clusterable *> &vec)
Returns the total normalizer (usually count) of the cluster (pointers may be NULL).
void SplitStatsByMap(const BuildTreeStatsType &stats, const EventMap &e, std::vector< BuildTreeStatsType > *stats_out)
Splits stats according to the EventMap, indexing them at output by the leaf type. ...
kaldi::int32 int32
float BaseFloat
Definition: kaldi-types.h:29
BaseFloat ClusterBottomUp(const std::vector< Clusterable *> &points, BaseFloat max_merge_thresh, int32 min_clust, std::vector< Clusterable *> *clusters_out, std::vector< int32 > *assignments_out)
A bottom-up clustering algorithm.
void SumStatsVec(const std::vector< BuildTreeStatsType > &stats_in, std::vector< Clusterable *> *stats_out)
Sum a vector of stats.
#define KALDI_WARN
Definition: kaldi-error.h:150
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185
#define KALDI_VLOG(v)
Definition: kaldi-error.h:156

◆ ClusterEventMapRestrictedByKeys()

EventMap * ClusterEventMapRestrictedByKeys ( const EventMap e_in,
const BuildTreeStatsType stats,
BaseFloat  thresh,
const std::vector< EventKeyType > &  keys,
int32 num_removed 
)

This is as ClusterEventMap, but first splits the stats on the keys specified in "keys" (e.g.

typically keys = [ -1, P ]), and only clusters within the classes defined by that splitting. Note– leaves will be non-consecutive at output, use RenumberEventMap.

Definition at line 822 of file build-tree-utils.cc.

References kaldi::ClusterEventMapRestrictedHelper(), EventMap::Copy(), and kaldi::DeletePointers().

Referenced by kaldi::TestClusterEventMapRestricted().

826  {
827  std::vector<EventMap*> leaf_mapping; // For output of ClusterEventMapGetMapping.
828 
829  int32 nr = ClusterEventMapRestrictedHelper(e_in, stats, thresh, keys, &leaf_mapping);
830  if (num_removed != NULL) *num_removed = nr;
831 
832  EventMap *ans = e_in.Copy(leaf_mapping);
833  DeletePointers(&leaf_mapping);
834  return ans;
835 }
void DeletePointers(std::vector< A *> *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:184
static int32 ClusterEventMapRestrictedHelper(const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, std::vector< EventKeyType > keys, std::vector< EventMap *> *leaf_mapping)
kaldi::int32 int32

◆ ClusterEventMapRestrictedByMap()

EventMap * ClusterEventMapRestrictedByMap ( const EventMap e_in,
const BuildTreeStatsType stats,
BaseFloat  thresh,
const EventMap e_restrict,
int32 num_removed 
)

This version of ClusterEventMapRestricted restricts the clustering to only allow things that "e_restrict" maps to the same value to be clustered together.

Definition at line 838 of file build-tree-utils.cc.

References kaldi::ClusterEventMapGetMapping(), EventMap::Copy(), kaldi::DeletePointers(), rnnlm::i, and kaldi::SplitStatsByMap().

Referenced by kaldi::BuildTree(), kaldi::BuildTreeTwoLevel(), and kaldi::TestClusterEventMapRestricted().

842  {
843  std::vector<EventMap*> leaf_mapping;
844 
845  std::vector<BuildTreeStatsType> split_stats;
846  int num_removed = 0;
847  SplitStatsByMap(stats, e_restrict, &split_stats);
848  for (size_t i = 0; i < split_stats.size(); i++) {
849  if (!split_stats[i].empty())
850  num_removed += ClusterEventMapGetMapping(e_in, split_stats[i], thresh,
851  &leaf_mapping);
852  }
853 
854  if (num_removed_ptr != NULL) *num_removed_ptr = num_removed;
855 
856  EventMap *ans = e_in.Copy(leaf_mapping);
857  DeletePointers(&leaf_mapping);
858  return ans;
859 }
void DeletePointers(std::vector< A *> *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:184
void SplitStatsByMap(const BuildTreeStatsType &stats, const EventMap &e, std::vector< BuildTreeStatsType > *stats_out)
Splits stats according to the EventMap, indexing them at output by the leaf type. ...
int ClusterEventMapGetMapping(const EventMap &e_in, const BuildTreeStatsType &stats, BaseFloat thresh, std::vector< EventMap *> *mapping)
"ClusterEventMapGetMapping" clusters the leaves of the EventMap, with "thresh" a delta-likelihood thr...

◆ ClusterEventMapToNClustersRestrictedByMap()

EventMap * ClusterEventMapToNClustersRestrictedByMap ( const EventMap e_in,
const BuildTreeStatsType stats,
int32  num_clusters_required,
const EventMap e_restrict,
int32 num_removed_ptr 
)

This version of ClusterEventMapRestrictedByMap clusters to get a specific number of clusters as specified by 'num_clusters'.

Definition at line 861 of file build-tree-utils.cc.

References kaldi::ClusterBottomUpCompartmentalized(), EventMap::Copy(), kaldi::DeletePointers(), rnnlm::i, rnnlm::j, KALDI_ASSERT, KALDI_VLOG, KALDI_WARN, kaldi::SplitStatsByMap(), kaldi::SumClusterableNormalizer(), and kaldi::SumStatsVec().

Referenced by kaldi::BuildTree().

866  {
867  std::vector<BuildTreeStatsType> split_stats;
868  SplitStatsByMap(stats, e_restrict, &split_stats);
869 
870  if (num_clusters_required < split_stats.size()) {
871  KALDI_WARN << "num-clusters-required is less than size of map. Not doing anything.";
872  if (num_removed_ptr) *num_removed_ptr = 0;
873  return e_in.Copy();
874  }
875 
876  std::vector<std::vector<int32> > indexes(split_stats.size());
877  std::vector<std::vector<Clusterable*> > summed_stats_contiguous(split_stats.size());
878 
879  BaseFloat normalizer = 0.0;
880 
881  size_t max_index = 0;
882 
883  int32 num_non_empty_clusters_required = num_clusters_required;
884 
885  int32 num_non_empty_clusters_in_map = 0;
886  int32 num_non_empty_clusters = 0;
887 
888  for (size_t i = 0; i < split_stats.size(); i++) {
889  if (!split_stats[i].empty()) {
890  num_non_empty_clusters_in_map++;
891 
892  std::vector<BuildTreeStatsType> split_stats_i;
893  SplitStatsByMap(split_stats[i], e_in, &split_stats_i);
894  std::vector<Clusterable*> summed_stats_i;
895  SumStatsVec(split_stats_i, &summed_stats_i);
896 
897  for (size_t j = 0; j < summed_stats_i.size(); j++) {
898  if (summed_stats_i[j] != NULL) {
899  num_non_empty_clusters++;
900  indexes[i].push_back(j);
901  summed_stats_contiguous[i].push_back(summed_stats_i[j]);
902  if (j > max_index) max_index = j;
903  }
904  }
905 
906  normalizer += SumClusterableNormalizer(summed_stats_contiguous[i]);
907  } else {
908  // Even if split_stats[i] is empty, a cluster will be assigned to
909  // that. To compensate, we decrease the num-clusters required.
910  num_non_empty_clusters_required--;
911  }
912  }
913 
914  KALDI_VLOG(1) << "Number of non-empty clusters in map = " << num_non_empty_clusters_in_map;
915  KALDI_VLOG(1) << "Number of non-empty clusters = " << num_non_empty_clusters;
916 
917  if (num_non_empty_clusters_required > num_non_empty_clusters) {
918  KALDI_WARN << "Cannot get required num-clusters " << num_clusters_required
919  << " as number of non-empty clusters required is larger than "
920  << " number of non-empty clusters: " << num_non_empty_clusters_required
921  << " > " << num_non_empty_clusters;
922  if (num_removed_ptr) *num_removed_ptr = 0;
923  return e_in.Copy();
924  }
925 
926  std::vector<std::vector<int32> > assignments;
928  summed_stats_contiguous,
929  std::numeric_limits<BaseFloat>::infinity(),
930  num_non_empty_clusters_required,
931  NULL, // don't need clusters out.
932  &assignments); // this algorithm is quadratic, so might be quite slow.
933 
934  KALDI_ASSERT(assignments.size() == split_stats.size());
935 
936  int32 num_combined = 0;
937  for (size_t i = 0; i < split_stats.size(); i++) {
938  KALDI_ASSERT(assignments[i].size() == summed_stats_contiguous[i].size());
939  if (assignments[i].size() == 0) continue;
940  size_t num_clust_i = *std::max_element(assignments[i].begin(),
941  assignments[i].end()) + 1;
942  num_combined += summed_stats_contiguous[i].size() - num_clust_i;
943  }
944 
945  KALDI_VLOG(2) << "ClusterBottomUpCompartmentalized combined " << num_combined
946  << " leaves and gave a likelihood change of " << change
947  << ", normalized = " << (change / normalizer)
948  << ", normalizer = " << normalizer;
949  KALDI_ASSERT(change < 0.0001); // should be negative or zero.
950 
951  std::vector<EventMap*> leaf_mapping(max_index + 1, NULL);
952 
953  for (size_t i = 0; i < split_stats.size(); i++) {
954  for (size_t j = 0; j < summed_stats_contiguous[i].size(); j++) {
955  size_t index = indexes[i][j];
956  size_t new_index = indexes[i][assignments[i][j]];
957  // index assigned by clusterig-- map to existing indices in the map,
958  // that we clustered from, so we don't conflict with indices in other parts
959  // of the tree.
960  KALDI_ASSERT(leaf_mapping[index] == NULL || "Error: Cluster seems to have been "
961  "called for different parts of the tree with overlapping sets of "
962  "indices.");
963  leaf_mapping[index] = new ConstantEventMap(new_index);
964  }
965  DeletePointers(&summed_stats_contiguous[i]);
966  }
967 
968  if (num_removed_ptr) *num_removed_ptr = num_combined;
969 
970  EventMap *ans = e_in.Copy(leaf_mapping);
971  DeletePointers(&leaf_mapping);
972  return ans;
973 }
void DeletePointers(std::vector< A *> *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:184
BaseFloat SumClusterableNormalizer(const std::vector< Clusterable *> &vec)
Returns the total normalizer (usually count) of the cluster (pointers may be NULL).
void SplitStatsByMap(const BuildTreeStatsType &stats, const EventMap &e, std::vector< BuildTreeStatsType > *stats_out)
Splits stats according to the EventMap, indexing them at output by the leaf type. ...
kaldi::int32 int32
BaseFloat ClusterBottomUpCompartmentalized(const std::vector< std::vector< Clusterable *> > &points, BaseFloat thresh, int32 min_clust, std::vector< std::vector< Clusterable *> > *clusters_out, std::vector< std::vector< int32 > > *assignments_out)
This is a bottom-up clustering where the points are pre-clustered in a set of compartments, such that only points in the same compartment are clustered together.
float BaseFloat
Definition: kaldi-types.h:29
void SumStatsVec(const std::vector< BuildTreeStatsType > &stats_in, std::vector< Clusterable *> *stats_out)
Sum a vector of stats.
#define KALDI_WARN
Definition: kaldi-error.h:150
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185
#define KALDI_VLOG(v)
Definition: kaldi-error.h:156

◆ CreateRandomQuestions()

void kaldi::CreateRandomQuestions ( const BuildTreeStatsType stats,
int32  num_quest,
Questions cfg_out 
)

CreateRandomQuestions will initialize a Questions randomly, in a reasonable way [for testing purposes, or when hand-designed questions are not available].

e.g. num_quest = 5 might be a reasonable value if num_iters > 0, or num_quest = 20 otherwise.

◆ DoTableSplit()

EventMap * DoTableSplit ( const EventMap orig,
EventKeyType  key,
const BuildTreeStatsType stats,
int32 num_leaves 
)

DoTableSplit does a complete split on this key (e.g.

might correspond to central phone (key = P-1), or HMM-state position (key == kPdfClass == -1). Stats used to work out possible values of the event. "num_leaves" is used to allocate new leaves. All stats must have this key defined, or this function will crash.

Definition at line 126 of file build-tree-utils.cc.

References EventMap::Copy(), kaldi::DeletePointers(), KALDI_ASSERT, kaldi::PossibleValues(), and kaldi::SplitStatsByMap().

Referenced by kaldi::DoTableSplitMultiple(), kaldi::TestClusterEventMap(), kaldi::TestClusterEventMapGetMappingAndRenumberEventMap(), kaldi::TestClusterEventMapGetMappingAndRenumberEventMap2(), and kaldi::TestDoTableSplit().

127  {
128  // First-- map the stats to each leaf in the EventMap.
129  std::vector<BuildTreeStatsType> split_stats;
130  SplitStatsByMap(stats, orig, &split_stats);
131  // Now for each leaf that has stats in it, do the table split according to the given name.
132  std::vector<EventMap*> splits(split_stats.size(), NULL);
133  for (EventAnswerType leaf = 0; leaf < (EventAnswerType)split_stats.size(); leaf++) {
134  if (!split_stats[leaf].empty()) {
135  // first work out the possible values the name takes.
136  std::vector<EventValueType> vals; // vals are put here, sorted.
137  bool all_present = PossibleValues(key, split_stats[leaf], &vals);
138  KALDI_ASSERT(all_present); // currently do not support mapping undefined values.
139  KALDI_ASSERT(!vals.empty() && vals.front() >= 0); // don't support mapping negative values
140  // at present time-- would need different EventMap type, not TableEventMap.
141  std::vector<EventMap*> table(vals.back()+1, (EventMap*)NULL);
142  for (size_t idx = 0;idx < vals.size();idx++) {
143  EventValueType val = vals[idx];
144  if (idx == 0) table[val] = new ConstantEventMap(leaf); // reuse current leaf.
145  else table[val] = new ConstantEventMap( (*num_leaves)++ ); // else take new leaf id.
146  }
147  // takes ownershipof stats.
148  splits[leaf] = new TableEventMap(key, table);
149  }
150  }
151  EventMap *ans = orig.Copy(splits);
152  DeletePointers(&splits);
153  return ans;
154 }
void DeletePointers(std::vector< A *> *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:184
void SplitStatsByMap(const BuildTreeStatsType &stats, const EventMap &e, std::vector< BuildTreeStatsType > *stats_out)
Splits stats according to the EventMap, indexing them at output by the leaf type. ...
bool PossibleValues(EventKeyType key, const BuildTreeStatsType &stats, std::vector< EventValueType > *ans)
Convenience function e.g.
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185
int32 EventAnswerType
As far as the event-map code itself is concerned, things of type EventAnswerType may take any value e...
Definition: event-map.h:56
int32 EventValueType
Given current code, things of type EventValueType should generally be nonnegative and in a reasonably...
Definition: event-map.h:51

◆ DoTableSplitMultiple()

EventMap * DoTableSplitMultiple ( const EventMap orig,
const std::vector< EventKeyType > &  keys,
const BuildTreeStatsType stats,
int32 num_leaves 
)

DoTableSplitMultiple does a complete split on all the keys, in order from keys[0], keys[1] and so on.

The stats are used to work out possible values corresponding to the key. "num_leaves" is used to allocate new leaves. All stats must have the keys defined, or this function will crash. Returns a newly allocated event map.

Definition at line 156 of file build-tree-utils.cc.

References EventMap::Copy(), kaldi::DoTableSplit(), and rnnlm::i.

Referenced by kaldi::TestClusterEventMapRestricted(), and kaldi::TestShareEventMapLeaves().

156  {
157 
158  if (keys.empty()) return orig.Copy();
159  else {
160  EventMap *cur = NULL; // would make it &orig, except for const issues.
161  for (size_t i = 0; i < keys.size(); i++) {
162  EventMap *next = DoTableSplit( (cur ? *cur : orig), keys[i], stats, num_leaves);
163  delete cur; // delete intermediate maps.
164  cur = next;
165  }
166  return cur;
167  }
168 }
EventMap * DoTableSplit(const EventMap &orig, EventKeyType key, const BuildTreeStatsType &stats, int32 *num_leaves)
DoTableSplit does a complete split on this key (e.g.

◆ FindBestSplitForKey()

BaseFloat FindBestSplitForKey ( const BuildTreeStatsType stats,
const Questions qcfg,
EventKeyType  key,
std::vector< EventValueType > *  yes_set 
)

FindBestSplitForKey is a function used in DoDecisionTreeSplit.

It finds the best split for this key, given these stats. It will return 0 if the key was not always defined for the stats.

Definition at line 348 of file build-tree-utils.cc.

References kaldi::AddToClusters(), kaldi::ApproxEqual(), kaldi::ComputeInitialSplit(), kaldi::DeletePointers(), kaldi::EnsureClusterableVectorNotNull(), Questions::GetQuestionsOf(), rnnlm::i, KALDI_ASSERT, KALDI_WARN, RefineClustersOptions::num_iters, kaldi::PossibleValues(), QuestionsForKey::refine_opts, kaldi::RefineClusters(), kaldi::SplitStatsByKey(), and kaldi::SumStatsVec().

Referenced by DecisionTreeSplitter::FindBestSplit().

351  {
352  if (stats.size()<=1) return 0.0; // cannot split if only zero or one instance of stats.
353  if (!PossibleValues(key, stats, NULL)) {
354  yes_set_out->clear();
355  return 0.0; // Can't split as key not always defined.
356  }
357  std::vector<Clusterable*> summed_stats; // indexed by value corresponding to key. owned here.
358  { // compute summed_stats
359  std::vector<BuildTreeStatsType> split_stats;
360  SplitStatsByKey(stats, key, &split_stats);
361  SumStatsVec(split_stats, &summed_stats);
362  }
363 
364  std::vector<EventValueType> yes_set;
365  BaseFloat improvement = ComputeInitialSplit(summed_stats,
366  q_opts, key, &yes_set);
367  // find best basic question.
368 
369  std::vector<int32> assignments(summed_stats.size(), 0); // assigns to "no" (0) by default.
370  for (std::vector<EventValueType>::const_iterator iter = yes_set.begin(); iter != yes_set.end(); ++iter) {
371  KALDI_ASSERT(*iter>=0);
372  if (*iter < (EventValueType)assignments.size()) {
373  // this guard necessary in case stats did not have all the
374  // values present in "yes_set".
375  assignments[*iter] = 1; // assign to "yes" (1).
376  }
377  }
378  std::vector<Clusterable*> clusters(2, (Clusterable*)NULL); // no, yes.
379  kaldi::AddToClusters(summed_stats, assignments, &clusters);
380 
381  EnsureClusterableVectorNotNull(&summed_stats);
383 
384  // even if improvement == 0 we continue; if we do RefineClusters we may get further improvement.
385  // now do the RefineClusters stuff. Note that this is null-op if
386  // q_opts.GetQuestionsOf(key).refine_opts.num_iters == 0. We could check for this but don't bother;
387  // it happens in RefineClusters anyway.
388 
389  if (q_opts.GetQuestionsOf(key).refine_opts.num_iters > 0) {
390  // If we want to refine the questions... (a bit like k-means w/ 2 classes).
391  // Note: the only reason we introduced the if-statement is so the yes_set
392  // doesn't get modified (truncated, actually) if we do the refine stuff with
393  // zero iters.
394  BaseFloat refine_impr = RefineClusters(summed_stats, &clusters, &assignments,
395  q_opts.GetQuestionsOf(key).refine_opts);
396  KALDI_ASSERT(refine_impr > std::min(-1.0, -0.1*fabs(improvement)));
397  // refine_impr should always be positive
398  improvement += refine_impr;
399  yes_set.clear();
400  for (size_t i = 0;i < assignments.size();i++) if (assignments[i] == 1) yes_set.push_back(i);
401  }
402  *yes_set_out = yes_set;
403 
404  DeletePointers(&clusters);
405 #ifdef KALDI_PARANOID
406  { // Check the "ans" is correct.
407  KALDI_ASSERT(clusters.size() == 2 && clusters[0] == 0 && clusters[1] == 0);
408  AddToClusters(summed_stats, assignments, &clusters);
409  BaseFloat impr;
410  if (clusters[0] == NULL || clusters[1] == NULL) impr = 0.0;
411  else impr = clusters[0]->Distance(*(clusters[1]));
412  if (!ApproxEqual(impr, improvement) && fabs(impr-improvement) > 0.01) {
413  KALDI_WARN << "FindBestSplitForKey: improvements do not agree: "<< impr
414  << " vs. " << improvement;
415  }
416  DeletePointers(&clusters);
417  }
418 #endif
419  DeletePointers(&summed_stats);
420  return improvement; // objective-function improvement.
421 }
void DeletePointers(std::vector< A *> *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:184
BaseFloat RefineClusters(const std::vector< Clusterable *> &points, std::vector< Clusterable *> *clusters, std::vector< int32 > *assignments, RefineClustersOptions cfg)
RefineClusters is mainly used internally by other clustering algorithms.
void SplitStatsByKey(const BuildTreeStatsType &stats_in, EventKeyType key, std::vector< BuildTreeStatsType > *stats_out)
SplitStatsByKey splits stats up according to the value of a particular key, which must be always defi...
bool PossibleValues(EventKeyType key, const BuildTreeStatsType &stats, std::vector< EventValueType > *ans)
Convenience function e.g.
float BaseFloat
Definition: kaldi-types.h:29
void EnsureClusterableVectorNotNull(std::vector< Clusterable *> *stats)
Fills in any (NULL) holes in "stats" vector, with empty stats, because certain algorithms require non...
void SumStatsVec(const std::vector< BuildTreeStatsType > &stats_in, std::vector< Clusterable *> *stats_out)
Sum a vector of stats.
void AddToClusters(const std::vector< Clusterable *> &stats, const std::vector< int32 > &assignments, std::vector< Clusterable *> *clusters)
Given stats and a vector "assignments" of the same size (that maps to cluster indices), sums the stats up into "clusters." It will add to any stats already present in "clusters" (although typically "clusters" will be empty when called), and it will extend with NULL pointers for any unseen indices.
#define KALDI_WARN
Definition: kaldi-error.h:150
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185
BaseFloat ComputeInitialSplit(const std::vector< Clusterable *> &summed_stats, const Questions &q_opts, EventKeyType key, std::vector< EventValueType > *yes_set)
int32 EventValueType
Given current code, things of type EventValueType should generally be nonnegative and in a reasonably...
Definition: event-map.h:51
static bool ApproxEqual(float a, float b, float relative_tolerance=0.001)
return abs(a - b) <= relative_tolerance * (abs(a)+abs(b)).
Definition: kaldi-math.h:265

◆ GetStubMap()

EventMap * GetStubMap ( int32  P,
const std::vector< std::vector< int32 > > &  phone_sets,
const std::vector< int32 > &  phone2num_pdf_classes,
const std::vector< bool > &  share_roots,
int32 num_leaves 
)

GetStubMap is used in tree-building functions to get the initial to-states map, before the decision-tree-building process.

It creates a simple map that splits on groups of phones. For the set of phones in phone_sets[i] it creates either: if share_roots[i] == true, a single leaf node, or if share_roots[i] == false, separate root nodes for each HMM-position (it goes up to the highest position for any phone in the set, although it will warn if you share roots between phones with different numbers of states, which is a weird thing to do but should still work. If any phone is present in "phone_sets" but "phone2num_pdf_classes" does not map it to a length, it is an error. Note that the behaviour of the resulting map is undefined for phones not present in "phone_sets". At entry, this function should be called with (*num_leaves == 0). It will number the leaves starting from (*num_leaves).

Definition at line 975 of file build-tree-utils.cc.

References rnnlm::i, kaldi::IsSortedAndUniq(), rnnlm::j, KALDI_ASSERT, KALDI_WARN, and kaldi::kPdfClass.

Referenced by kaldi::BuildTree(), kaldi::MonophoneContextDependency(), and kaldi::MonophoneContextDependencyShared().

979  {
980 
981  { // Checking inputs.
982  KALDI_ASSERT(!phone_sets.empty() && share_roots.size() == phone_sets.size());
983  std::set<int32> all_phones;
984  for (size_t i = 0; i < phone_sets.size(); i++) {
985  KALDI_ASSERT(IsSortedAndUniq(phone_sets[i]));
986  KALDI_ASSERT(!phone_sets[i].empty());
987  for (size_t j = 0; j < phone_sets[i].size(); j++) {
988  KALDI_ASSERT(all_phones.count(phone_sets[i][j]) == 0); // check not present.
989  all_phones.insert(phone_sets[i][j]);
990  }
991  }
992  }
993 
994  // Initially create a single leaf for each phone set.
995 
996  size_t max_set_size = 0;
997  int32 highest_numbered_phone = 0;
998  for (size_t i = 0; i < phone_sets.size(); i++) {
999  max_set_size = std::max(max_set_size, phone_sets[i].size());
1000  highest_numbered_phone =
1001  std::max(highest_numbered_phone,
1002  * std::max_element(phone_sets[i].begin(), phone_sets[i].end()));
1003  }
1004 
1005  if (phone_sets.size() == 1) { // there is only one set so the recursion finishes.
1006  if (share_roots[0]) { // if "shared roots" return a single leaf.
1007  return new ConstantEventMap( (*num_leaves_out)++ );
1008  } else { // not sharing roots -> work out the length and return a
1009  // TableEventMap splitting on length.
1010  EventAnswerType max_len = 0;
1011  for (size_t i = 0; i < phone_sets[0].size(); i++) {
1012  EventAnswerType len;
1013  EventValueType phone = phone_sets[0][i];
1014  KALDI_ASSERT(static_cast<size_t>(phone) < phone2num_pdf_classes.size());
1015  len = phone2num_pdf_classes[phone];
1016  KALDI_ASSERT(len > 0);
1017  if (i == 0) max_len = len;
1018  else {
1019  if (len != max_len) {
1020  KALDI_WARN << "Mismatching lengths within a phone set: " << len
1021  << " vs. " << max_len << " [unusual, but not necessarily fatal]. ";
1022  max_len = std::max(len, max_len);
1023  }
1024  }
1025  }
1026  std::map<EventValueType, EventAnswerType> m;
1027  for (EventAnswerType p = 0; p < max_len; p++)
1028  m[p] = (*num_leaves_out)++;
1029  return new TableEventMap(kPdfClass, // split on hmm-position
1030  m);
1031  }
1032  } else if (max_set_size == 1
1033  && static_cast<int32>(phone_sets.size()) <= 2*highest_numbered_phone) {
1034  // create table map splitting on phone-- more efficient.
1035  // the part after the && checks that this would not contain a very sparse vector.
1036  std::map<EventValueType, EventMap*> m;
1037  for (size_t i = 0; i < phone_sets.size(); i++) {
1038  std::vector<std::vector<int32> > phone_sets_tmp;
1039  phone_sets_tmp.push_back(phone_sets[i]);
1040  std::vector<bool> share_roots_tmp;
1041  share_roots_tmp.push_back(share_roots[i]);
1042  EventMap *this_stub = GetStubMap(P, phone_sets_tmp, phone2num_pdf_classes,
1043  share_roots_tmp,
1044  num_leaves_out);
1045  KALDI_ASSERT(m.count(phone_sets_tmp[0][0]) == 0);
1046  m[phone_sets_tmp[0][0]] = this_stub;
1047  }
1048  return new TableEventMap(P, m);
1049  } else {
1050  // Do a split. Recurse.
1051  size_t half_sz = phone_sets.size() / 2;
1052  std::vector<std::vector<int32> >::const_iterator half_phones =
1053  phone_sets.begin() + half_sz;
1054  std::vector<bool>::const_iterator half_share =
1055  share_roots.begin() + half_sz;
1056  std::vector<std::vector<int32> > phone_sets_1, phone_sets_2;
1057  std::vector<bool> share_roots_1, share_roots_2;
1058  phone_sets_1.insert(phone_sets_1.end(), phone_sets.begin(), half_phones);
1059  phone_sets_2.insert(phone_sets_2.end(), half_phones, phone_sets.end());
1060  share_roots_1.insert(share_roots_1.end(), share_roots.begin(), half_share);
1061  share_roots_2.insert(share_roots_2.end(), half_share, share_roots.end());
1062 
1063  EventMap *map1 = GetStubMap(P, phone_sets_1, phone2num_pdf_classes, share_roots_1, num_leaves_out);
1064  EventMap *map2 = GetStubMap(P, phone_sets_2, phone2num_pdf_classes, share_roots_2, num_leaves_out);
1065 
1066  std::vector<EventKeyType> all_in_first_set;
1067  for (size_t i = 0; i < half_sz; i++)
1068  for (size_t j = 0; j < phone_sets[i].size(); j++)
1069  all_in_first_set.push_back(phone_sets[i][j]);
1070  std::sort(all_in_first_set.begin(), all_in_first_set.end());
1071  KALDI_ASSERT(IsSortedAndUniq(all_in_first_set));
1072  return new SplitEventMap(P, all_in_first_set, map1, map2);
1073  }
1074 }
kaldi::int32 int32
static const EventKeyType kPdfClass
Definition: context-dep.h:39
#define KALDI_WARN
Definition: kaldi-error.h:150
EventMap * GetStubMap(int32 P, const std::vector< std::vector< int32 > > &phone_sets, const std::vector< int32 > &phone2num_pdf_classes, const std::vector< bool > &share_roots, int32 *num_leaves_out)
GetStubMap is used in tree-building functions to get the initial to-states map, before the decision-t...
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185
int32 EventAnswerType
As far as the event-map code itself is concerned, things of type EventAnswerType may take any value e...
Definition: event-map.h:56
int32 EventValueType
Given current code, things of type EventValueType should generally be nonnegative and in a reasonably...
Definition: event-map.h:51
bool IsSortedAndUniq(const std::vector< T > &vec)
Returns true if the vector is sorted and contains each element only once.
Definition: stl-utils.h:63

◆ MapEventMapLeaves()

EventMap * MapEventMapLeaves ( const EventMap e_in,
const std::vector< int32 > &  mapping 
)

This function remaps the event-map leaves using this mapping, indexed by the number at leaf.

Definition at line 689 of file build-tree-utils.cc.

References EventMap::Copy(), kaldi::DeletePointers(), and rnnlm::i.

Referenced by kaldi::BuildTreeTwoLevel().

690  {
691  std::vector<EventMap*> mapping(mapping_in.size());
692  for (size_t i = 0; i < mapping_in.size(); i++)
693  mapping[i] = new ConstantEventMap(mapping_in[i]);
694  EventMap *ans = e_in.Copy(mapping);
695  DeletePointers(&mapping);
696  return ans;
697 }
void DeletePointers(std::vector< A *> *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:184

◆ RenumberEventMap()

EventMap * RenumberEventMap ( const EventMap e_in,
int32 num_leaves 
)

RenumberEventMap [intended to be used after calling ClusterEventMap] renumbers an EventMap so its leaves are consecutive.

It puts the number of leaves in *num_leaves. If later you need the mapping of the leaves, modify the function and add a new argument.

Definition at line 664 of file build-tree-utils.cc.

References EventMap::Copy(), kaldi::DeletePointers(), KALDI_ASSERT, EventMap::MultiMap(), and kaldi::SortAndUniq().

Referenced by kaldi::BuildTree(), kaldi::BuildTreeTwoLevel(), kaldi::ShareEventMapLeaves(), kaldi::TestClusterEventMapGetMappingAndRenumberEventMap(), and kaldi::TestClusterEventMapGetMappingAndRenumberEventMap2().

664  {
665  EventType empty_vec;
666  std::vector<EventAnswerType> initial_leaves; // before renumbering.
667  e_in.MultiMap(empty_vec, &initial_leaves);
668  if (initial_leaves.empty()) {
669  KALDI_ASSERT(num_leaves);
670  if (num_leaves) *num_leaves = 0;
671  return e_in.Copy();
672  }
673  SortAndUniq(&initial_leaves);
674  EventAnswerType max_leaf_plus_one = initial_leaves.back() + 1; // will typically, but not always, equal *num_leaves.
675  std::vector<EventMap*> mapping(max_leaf_plus_one, (EventMap*)NULL);
676  std::vector<EventAnswerType>::iterator iter = initial_leaves.begin(), end = initial_leaves.end();
677  EventAnswerType cur_leaf = 0;
678  for (; iter != end; ++iter) {
679  KALDI_ASSERT(*iter >= 0 && *iter<max_leaf_plus_one);
680  mapping[*iter] = new ConstantEventMap(cur_leaf++);
681  }
682  EventMap *ans = e_in.Copy(mapping);
683  DeletePointers(&mapping);
684  KALDI_ASSERT((size_t)cur_leaf == initial_leaves.size());
685  if (num_leaves) *num_leaves = cur_leaf;
686  return ans;
687 }
void DeletePointers(std::vector< A *> *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:184
void SortAndUniq(std::vector< T > *vec)
Sorts and uniq&#39;s (removes duplicates) from a vector.
Definition: stl-utils.h:39
std::vector< std::pair< EventKeyType, EventValueType > > EventType
Definition: event-map.h:58
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185
int32 EventAnswerType
As far as the event-map code itself is concerned, things of type EventAnswerType may take any value e...
Definition: event-map.h:56

◆ ShareEventMapLeaves()

EventMap * ShareEventMapLeaves ( const EventMap e_in,
EventKeyType  key,
std::vector< std::vector< EventValueType > > &  values,
int32 num_leaves 
)

ShareEventMapLeaves performs a quite specific function that allows us to generate trees where, for a certain list of phones, and for all states in the phone, all the pdf's are shared.

Each element of "values" contains a list of phones (may be just one phone), all states of which we want shared together). Typically at input, "key" will equal P, the central-phone position, and "values" will contain just one list containing the silence phone. This function renumbers the event map leaves after doing the sharing, to make the event-map leaves contiguous.

Definition at line 710 of file build-tree-utils.cc.

References EventMap::Copy(), kaldi::DeletePointers(), rnnlm::i, rnnlm::j, KALDI_ASSERT, KALDI_WARN, kaldi::MakeEventPair(), EventMap::MultiMap(), kaldi::RenumberEventMap(), and kaldi::SortAndUniq().

Referenced by kaldi::TestShareEventMapLeaves().

712  {
713  // the use of "pdfs" as the name of the next variable reflects the anticipated
714  // use of this function.
715  std::vector<std::vector<EventAnswerType> > pdfs(values.size());
716  for (size_t i = 0; i < values.size(); i++) {
717  EventType evec;
718  for (size_t j = 0; j < values[i].size(); j++) {
719  evec.push_back(MakeEventPair(key, values[i][j]));
720  size_t size_at_start = pdfs[i].size();
721  e_in.MultiMap(evec, &(pdfs[i])); // append any corresponding pdfs to pdfs[i].
722  if (pdfs[i].size() == size_at_start) { // Nothing added... unexpected.
723  KALDI_WARN << "ShareEventMapLeaves: had no leaves for key = " << key
724  << ", value = " << (values[i][j]);
725  }
726  }
727  SortAndUniq(&(pdfs[i]));
728  }
729  std::vector<EventMap*> remapping;
730  for (size_t i = 0; i < values.size(); i++) {
731  if (pdfs[i].empty())
732  KALDI_WARN << "ShareEventMapLeaves: no leaves in one bucket."; // not expected.
733  else {
734  EventAnswerType map_to_this = pdfs[i][0]; // map all in this bucket
735  // to this value.
736  for (size_t j = 1; j < pdfs[i].size(); j++) {
737  EventAnswerType leaf = pdfs[i][j];
738  KALDI_ASSERT(leaf>=0);
739  if (remapping.size() <= static_cast<size_t>(leaf))
740  remapping.resize(leaf+1, NULL);
741  KALDI_ASSERT(remapping[leaf] == NULL);
742  remapping[leaf] = new ConstantEventMap(map_to_this);
743  }
744  }
745  }
746  EventMap *shared = e_in.Copy(remapping);
747  DeletePointers(&remapping);
748  EventMap *renumbered = RenumberEventMap(*shared, num_leaves);
749  delete shared;
750  return renumbered;
751 }
void DeletePointers(std::vector< A *> *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:184
std::pair< EventKeyType, EventValueType > MakeEventPair(EventKeyType k, EventValueType v)
Definition: event-map.h:62
void SortAndUniq(std::vector< T > *vec)
Sorts and uniq&#39;s (removes duplicates) from a vector.
Definition: stl-utils.h:39
std::vector< std::pair< EventKeyType, EventValueType > > EventType
Definition: event-map.h:58
#define KALDI_WARN
Definition: kaldi-error.h:150
EventMap * RenumberEventMap(const EventMap &e_in, int32 *num_leaves)
RenumberEventMap [intended to be used after calling ClusterEventMap] renumbers an EventMap so its lea...
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185
int32 EventAnswerType
As far as the event-map code itself is concerned, things of type EventAnswerType may take any value e...
Definition: event-map.h:56

◆ SplitDecisionTree()

EventMap * SplitDecisionTree ( const EventMap orig,
const BuildTreeStatsType stats,
Questions qcfg,
BaseFloat  thresh,
int32  max_leaves,
int32 num_leaves,
BaseFloat objf_impr_out,
BaseFloat smallest_split_change_out 
)

Does a decision-tree split at the leaves of an EventMap.

Parameters
orig[in] The EventMap whose leaves we want to split. [may be either a trivial or a non-trivial one].
stats[in] The statistics for splitting the tree; if you do not want a particular subset of leaves to be split, make sure the stats corresponding to those leaves are not present in "stats".
qcfg[in] Configuration class that contains initial questions (e.g. sets of phones) for each key and says whether to refine these questions during tree building.
thresh[in] A log-likelihood threshold (e.g. 300) that can be used to limit the number of leaves; you can use zero and set max_leaves instead.
max_leaves[in] Will stop leaves being split after they reach this number.
num_leaves[in,out] A pointer used to allocate leaves; always corresponds to the current number of leaves (is incremented when this is increased).
objf_impr_out[out] If non-NULL, will be set to the objective improvement due to splitting (not normalized by the number of frames).
smallest_split_change_outIf non-NULL, will be set to the smallest objective-function improvement that we got from splitting any leaf; useful to provide a threshold for ClusterEventMap.
Returns
The EventMap after splitting is returned; pointer is owned by caller.

Definition at line 532 of file build-tree-utils.cc.

References DecisionTreeSplitter::BestSplit(), EventMap::Copy(), count, DecisionTreeSplitter::DecisionTreeSplitter(), DecisionTreeSplitter::GetMap(), rnnlm::i, KALDI_ASSERT, KALDI_LOG, and kaldi::SplitStatsByMap().

Referenced by kaldi::BuildTree(), kaldi::BuildTreeTwoLevel(), kaldi::TestClusterEventMapRestricted(), kaldi::TestShareEventMapLeaves(), and kaldi::TestSplitDecisionTree().

539  {
540  KALDI_ASSERT(num_leaves != NULL && *num_leaves > 0); // can't be 0 or input_map would be empty.
541  int32 num_empty_leaves = 0;
542  BaseFloat like_impr = 0.0;
543  BaseFloat smallest_split_change = 1.0e+20;
544  std::vector<DecisionTreeSplitter*> builders;
545  { // set up "builders" [one for each current leaf]. This array is never extended.
546  // the structures generated during splitting remain as trees at each array location.
547  std::vector<BuildTreeStatsType> split_stats;
548  SplitStatsByMap(stats, input_map, &split_stats);
549  KALDI_ASSERT(split_stats.size() != 0);
550  builders.resize(split_stats.size()); // size == #leaves.
551  for (size_t i = 0;i < split_stats.size();i++) {
552  EventAnswerType leaf = static_cast<EventAnswerType>(i);
553  if (split_stats[i].size() == 0) num_empty_leaves++;
554  builders[i] = new DecisionTreeSplitter(leaf, split_stats[i], q_opts);
555  }
556  }
557 
558  { // Do the splitting.
559  int32 count = 0;
560  std::priority_queue<std::pair<BaseFloat, size_t> > queue; // use size_t because logically these
561  // are just indexes into the array, not leaf-ids (after splitting they are no longer leaf id's).
562  // Initialize queue.
563  for (size_t i = 0; i < builders.size(); i++)
564  queue.push(std::make_pair(builders[i]->BestSplit(), i));
565  // Note-- queue's size never changes from now. All the alternatives leaves to split are
566  // inside the "DecisionTreeSplitter*" objects, in a tree structure.
567  while (queue.top().first > thresh
568  && (max_leaves<=0 || *num_leaves < max_leaves)) {
569  smallest_split_change = std::min(smallest_split_change, queue.top().first);
570  size_t i = queue.top().second;
571  like_impr += queue.top().first;
572  builders[i]->DoSplit(num_leaves);
573  queue.pop();
574  queue.push(std::make_pair(builders[i]->BestSplit(), i));
575  count++;
576  }
577  KALDI_LOG << "DoDecisionTreeSplit: split "<< count << " times, #leaves now " << (*num_leaves);
578  }
579 
580  if (smallest_split_change_out)
581  *smallest_split_change_out = smallest_split_change;
582 
583  EventMap *answer = NULL;
584 
585  { // Create the output EventMap.
586  std::vector<EventMap*> sub_trees(builders.size());
587  for (size_t i = 0; i < sub_trees.size();i++) sub_trees[i] = builders[i]->GetMap();
588  answer = input_map.Copy(sub_trees);
589  for (size_t i = 0; i < sub_trees.size();i++) delete sub_trees[i];
590  }
591  // Free up memory.
592  for (size_t i = 0;i < builders.size();i++) delete builders[i];
593 
594  if (obj_impr_out != NULL) *obj_impr_out = like_impr;
595  return answer;
596 }
void SplitStatsByMap(const BuildTreeStatsType &stats, const EventMap &e, std::vector< BuildTreeStatsType > *stats_out)
Splits stats according to the EventMap, indexing them at output by the leaf type. ...
kaldi::int32 int32
const size_t count
float BaseFloat
Definition: kaldi-types.h:29
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185
int32 EventAnswerType
As far as the event-map code itself is concerned, things of type EventAnswerType may take any value e...
Definition: event-map.h:56
#define KALDI_LOG
Definition: kaldi-error.h:153

◆ TrivialTree()

EventMap* kaldi::TrivialTree ( int32 num_leaves)
inline

Returns a tree with just one node.

Used @ start of tree-building process. Not really used in current recipes.

Definition at line 152 of file build-tree-utils.h.

Referenced by kaldi::TestClusterEventMap(), kaldi::TestClusterEventMapGetMappingAndRenumberEventMap(), kaldi::TestClusterEventMapGetMappingAndRenumberEventMap2(), kaldi::TestClusterEventMapRestricted(), kaldi::TestDoTableSplit(), kaldi::TestShareEventMapLeaves(), kaldi::TestSplitDecisionTree(), and kaldi::TestTrivialTree().

152  {
153  KALDI_ASSERT(*num_leaves == 0); // in envisaged usage.
154  return new ConstantEventMap( (*num_leaves)++ );
155 }
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185