Some simple functions used in clustering algorithms

See Clustering mechanisms in Kaldi for context.

Functions

BaseFloat SumClusterableObjf (const std::vector< Clusterable * > &vec)
 Returns the total objective function after adding up all the statistics in the vector (pointers may be NULL).
 
BaseFloat SumClusterableNormalizer (const std::vector< Clusterable * > &vec)
 Returns the total normalizer (usually count) of the cluster (pointers may be NULL).
 
Clusterable * SumClusterable (const std::vector< Clusterable * > &vec)
 Sums stats (pointers may be NULL). Returns NULL if no non-NULL stats are present.
 
void EnsureClusterableVectorNotNull (std::vector< Clusterable * > *stats)
 Fills in any (NULL) holes in "stats" vector, with empty stats, because certain algorithms require non-NULL stats.
 
void AddToClusters (const std::vector< Clusterable * > &stats, const std::vector< int32 > &assignments, std::vector< Clusterable * > *clusters)
 Given stats and a vector "assignments" of the same size (that maps to cluster indices), sums the stats up into "clusters." It will add to any stats already present in "clusters" (although typically "clusters" will be empty when called), and it will extend with NULL pointers for any unseen indices.
 
void AddToClustersOptimized (const std::vector< Clusterable * > &stats, const std::vector< int32 > &assignments, const Clusterable &total, std::vector< Clusterable * > *clusters)
 AddToClustersOptimized does the same as AddToClusters (it sums up the stats within each cluster), except that it uses the sum of all the stats ("total") to optimize the computation for speed, if possible.
 

Detailed Description

See Clustering mechanisms in Kaldi for context.

Function Documentation

◆ AddToClusters()

void AddToClusters ( const std::vector< Clusterable * > &  stats,
const std::vector< int32 > &  assignments,
std::vector< Clusterable * > *  clusters 
)

Given stats and a vector "assignments" of the same size (that maps to cluster indices), sums the stats up into "clusters." It will add to any stats already present in "clusters" (although typically "clusters" will be empty when called), and it will extend with NULL pointers for any unseen indices.

Call EnsureClusterableVectorNotNull afterwards if you need every entry of "clusters" to be non-NULL. Pointers in "clusters" are owned by the caller. Pointers in "stats" may be NULL.
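The behaviour described above (NULL inputs skipped, unseen cluster indices left as NULL, caller owning the results) can be illustrated with a small standalone sketch. ToyStats, ToyAddToClusters and DemoAddToClusters are hypothetical names introduced here, and ToyStats is only a stand-in for a real Clusterable; this is not the Kaldi implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for kaldi::Clusterable: tracks only a count and a sum.
struct ToyStats {
  double count;
  double sum;
  ToyStats *Copy() const { return new ToyStats(*this); }
  void Add(const ToyStats &other) { count += other.count; sum += other.sum; }
};

// Mirrors the documented behaviour of AddToClusters for the toy type.
void ToyAddToClusters(const std::vector<ToyStats*> &stats,
                      const std::vector<int> &assignments,
                      std::vector<ToyStats*> *clusters) {
  assert(assignments.size() == stats.size());
  if (stats.empty()) return;  // nothing to do
  assert(clusters != nullptr);
  int max_assignment =
      *std::max_element(assignments.begin(), assignments.end());
  if (static_cast<int>(clusters->size()) <= max_assignment)
    clusters->resize(max_assignment + 1, nullptr);  // unseen indices stay NULL
  for (std::size_t i = 0; i < stats.size(); i++) {
    if (stats[i] == nullptr) continue;  // NULL inputs are skipped
    ToyStats *&dest = (*clusters)[assignments[i]];
    if (dest == nullptr)
      dest = stats[i]->Copy();  // first stats for this cluster: copy
    else
      dest->Add(*stats[i]);     // later stats: accumulate
  }
}

// Three inputs (one NULL) assigned to clusters 2, 0 and 2.
// The caller owns the returned pointers and must delete them.
std::vector<ToyStats*> DemoAddToClusters() {
  ToyStats a{1.0, 2.0}, b{1.0, 5.0};
  std::vector<ToyStats*> stats = {&a, nullptr, &b};
  std::vector<int> assignments = {2, 0, 2};
  std::vector<ToyStats*> clusters;
  ToyAddToClusters(stats, assignments, &clusters);
  return clusters;
}
```

In this sketch the only non-NULL result is cluster 2; clusters 0 and 1 remain NULL because the single input assigned to cluster 0 was itself NULL and nothing was assigned to cluster 1.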

Definition at line 108 of file cluster-utils.cc.

References KALDI_ASSERT.

Referenced by kaldi::FindBestSplitForKey(), kaldi::TestAddToClusters(), and kaldi::TestAddToClustersOptimized().

void AddToClusters(const std::vector<Clusterable*> &stats,
                   const std::vector<int32> &assignments,
                   std::vector<Clusterable*> *clusters) {
  KALDI_ASSERT(assignments.size() == stats.size());
  int32 size = stats.size();
  if (size == 0) return;  // Nothing to do.
  KALDI_ASSERT(clusters != NULL);
  int32 max_assignment = *std::max_element(assignments.begin(),
                                           assignments.end());
  if (static_cast<int32> (clusters->size()) <= max_assignment)
    clusters->resize(max_assignment + 1, NULL);  // extend with NULLs.
  for (int32 i = 0; i < size; i++) {
    if (stats[i] != NULL) {
      if ((*clusters)[assignments[i]] == NULL)
        (*clusters)[assignments[i]] = stats[i]->Copy();
      else
        (*clusters)[assignments[i]]->Add(*(stats[i]));
    }
  }
}

◆ AddToClustersOptimized()

void AddToClustersOptimized ( const std::vector< Clusterable * > &  stats,
const std::vector< int32 > &  assignments,
const Clusterable &  total,
std::vector< Clusterable * > *  clusters 
)

AddToClustersOptimized does the same as AddToClusters (it sums up the stats within each cluster), except that it uses the sum of all the stats ("total") to optimize the computation for speed, if possible.

This will generally only be a significant speedup in the case where there are just two clusters, which can happen in algorithms that are doing binary splits; the idea is that we sum up all the stats in one cluster (the one with the fewest points in it), and then subtract from the total.
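The subtraction idea for a binary split can be sketched as follows. This is a simplified illustration restricted to exactly two clusters and non-NULL stats, not the Kaldi function itself; ToyStats and SplitBySubtraction are hypothetical names, and the num_ops counter exists only to make the saving visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for kaldi::Clusterable: tracks only a count and a sum.
struct ToyStats {
  double count;
  double sum;
  void Add(const ToyStats &o) { count += o.count; sum += o.sum; }
  void Sub(const ToyStats &o) { count -= o.count; sum -= o.sum; }
};

// Binary split: accumulate only the smaller cluster, and derive the larger
// cluster as (total - smaller). *num_ops receives the number of Add/Sub calls.
std::vector<ToyStats> SplitBySubtraction(const std::vector<ToyStats> &stats,
                                         const std::vector<int> &assignments,
                                         const ToyStats &total,
                                         int *num_ops) {
  assert(stats.size() == assignments.size());
  std::size_t in_one = 0;  // number of points assigned to cluster 1
  for (std::size_t i = 0; i < assignments.size(); i++) {
    assert(assignments[i] == 0 || assignments[i] == 1);
    if (assignments[i] == 1) in_one++;
  }
  const int larger = (2 * in_one > stats.size()) ? 1 : 0;
  const int smaller = 1 - larger;
  std::vector<ToyStats> clusters(2, ToyStats{0.0, 0.0});
  clusters[larger] = total;  // start the larger cluster from the known total
  *num_ops = 0;
  for (std::size_t i = 0; i < stats.size(); i++) {
    if (assignments[i] != smaller) continue;  // larger cluster is never visited
    clusters[smaller].Add(stats[i]);
    clusters[larger].Sub(stats[i]);
    *num_ops += 2;
  }
  return clusters;
}
```

With five points of which one falls in cluster 0, this sketch performs two operations (one Add and one Sub) where plain accumulation would perform five Adds. The saving shrinks as the split becomes more balanced, which is consistent with the remark that the speedup mainly matters for binary splits.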

Definition at line 130 of file cluster-utils.cc.

References Clusterable::Add(), kaldi::AssertEqual(), Clusterable::Copy(), KALDI_ASSERT, Clusterable::Normalizer(), and kaldi::SumClusterableNormalizer().

Referenced by kaldi::ComputeInitialSplit(), and kaldi::TestAddToClustersOptimized().

void AddToClustersOptimized(const std::vector<Clusterable*> &stats,
                            const std::vector<int32> &assignments,
                            const Clusterable &total,
                            std::vector<Clusterable*> *clusters) {
#ifdef KALDI_PARANOID
  // Make sure "total" is actually the sum of stats in "stats".
  {
    BaseFloat stats_norm = SumClusterableNormalizer(stats),
        tot_norm = total.Normalizer();
    AssertEqual(stats_norm, tot_norm, 0.01);
  }
#endif

  KALDI_ASSERT(assignments.size() == stats.size());
  int32 size = stats.size();
  if (size == 0) return;  // Nothing to do.
  KALDI_ASSERT(clusters != NULL);
  int32 num_clust = 1 + *std::max_element(assignments.begin(),
                                          assignments.end());
  if (static_cast<int32> (clusters->size()) < num_clust)
    clusters->resize(num_clust, NULL);  // extend with NULLs.
  // Count the non-NULL stats in total and per cluster.
  std::vector<int32> num_stats_for_cluster(num_clust, 0);
  int32 num_total_stats = 0;
  for (int32 i = 0; i < size; i++) {
    if (stats[i] != NULL) {
      num_total_stats++;
      num_stats_for_cluster[assignments[i]]++;
    }
  }
  if (num_total_stats == 0) return;  // Nothing to do.

  // it will only ever be efficient to subtract for at most one index.
  // If one cluster holds more than half of the stats, seed it with "total";
  // the stats of all other clusters are subtracted from it below.
  int32 subtract_index = -1;
  for (int32 c = 0; c < num_clust; c++) {
    if (num_stats_for_cluster[c] > num_total_stats - num_stats_for_cluster[c]) {
      subtract_index = c;
      if ((*clusters)[c] == NULL)
        (*clusters)[c] = total.Copy();
      else
        (*clusters)[c]->Add(total);
      break;
    }
  }

  for (int32 i = 0; i < size; i++) {
    if (stats[i] != NULL) {
      int32 assignment = assignments[i];
      if (assignment != (int32) subtract_index) {
        if ((*clusters)[assignment] == NULL)
          (*clusters)[assignment] = stats[i]->Copy();
        else
          (*clusters)[assignment]->Add(*(stats[i]));
      }
      if (subtract_index != -1 && assignment != subtract_index)
        (*clusters)[subtract_index]->Sub(*(stats[i]));
    }
  }
}

◆ EnsureClusterableVectorNotNull()

void EnsureClusterableVectorNotNull ( std::vector< Clusterable * > *  stats)

Fills in any (NULL) holes in "stats" vector, with empty stats, because certain algorithms require non-NULL stats.

If "stats" is nonempty, it must contain at least one non-NULL pointer that we can call Copy() on.
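The requirement for at least one non-NULL entry follows from how the holes are filled: each replacement is a zeroed copy of an existing object. A minimal sketch of that pattern is shown below, using a hypothetical ToyStats type and a hypothetical ToyEnsureNotNull helper. It throws std::runtime_error where the real function reports an error through KALDI_ERR.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for kaldi::Clusterable: tracks only a count and a sum.
struct ToyStats {
  double count;
  double sum;
  ToyStats *Copy() const { return new ToyStats(*this); }
  void SetZero() { count = 0.0; sum = 0.0; }
};

// Replaces each NULL entry with a zeroed copy of the first non-NULL entry.
// Throws if a nonempty vector contains no non-NULL entry at all.
void ToyEnsureNotNull(std::vector<ToyStats*> *stats) {
  assert(stats != nullptr);
  if (stats->empty()) return;  // nothing to do
  const ToyStats *example = nullptr;
  for (ToyStats *p : *stats) {
    if (p != nullptr) { example = p; break; }
  }
  if (example == nullptr)
    throw std::runtime_error("All stats are NULL.");
  ToyStats zero = *example;  // copy once, zero once
  zero.SetZero();
  for (ToyStats *&p : *stats) {
    if (p == nullptr) p = zero.Copy();  // each hole gets its own object
  }
}
```

Every filled entry is a separate heap object, so the owner of the vector deletes the filled entries the same way as the original ones.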

Definition at line 82 of file cluster-utils.cc.

References Clusterable::Copy(), KALDI_ASSERT, KALDI_ERR, and Clusterable::SetZero().

Referenced by kaldi::AutomaticallyObtainQuestions(), kaldi::FindBestSplitForKey(), kaldi::KMeansClusterPhones(), and kaldi::TestEnsureClusterableVectorNotNull().

void EnsureClusterableVectorNotNull(std::vector<Clusterable*> *stats) {
  KALDI_ASSERT(stats != NULL);
  std::vector<Clusterable*>::iterator itr = stats->begin(), end = stats->end();
  if (itr == end) return;  // Nothing to do.
  // Find any non-NULL entry to use as a template for the empty stats.
  Clusterable *nonNullExample = NULL;
  for (; itr != end; ++itr) {
    if (*itr != NULL) {
      nonNullExample = *itr;
      break;
    }
  }
  if (nonNullExample == NULL) {
    KALDI_ERR << "All stats are NULL.";  // crash. logic error.
  }
  itr = stats->begin();
  Clusterable *nonNullExampleCopy = nonNullExample->Copy();
  // sets stats to zero. do this just once (on copy) for efficiency.
  nonNullExampleCopy->SetZero();
  for (; itr != end; ++itr) {
    if (*itr == NULL) {
      *itr = nonNullExampleCopy->Copy();
    }
  }
  delete nonNullExampleCopy;
}

◆ SumClusterable()

Clusterable * SumClusterable ( const std::vector< Clusterable *> &  vec)

Sums stats (pointers may be NULL). Returns NULL if no non-NULL stats are present.
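Because the result can be NULL and, as the listing for this function shows, is otherwise a newly allocated copy, callers need to check the pointer and take ownership of it. One way to do that is sketched below with a hypothetical ToyStats type; ToySum and TotalCount are illustrative names, and wrapping the result in std::unique_ptr is a choice made for this sketch, not something the Kaldi sources prescribe.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for kaldi::Clusterable: tracks only a count and a sum.
struct ToyStats {
  double count;
  double sum;
  ToyStats *Copy() const { return new ToyStats(*this); }
  void Add(const ToyStats &o) { count += o.count; sum += o.sum; }
};

// Same pattern as SumClusterable: copy the first non-NULL entry, add the rest.
ToyStats *ToySum(const std::vector<ToyStats*> &vec) {
  ToyStats *ans = nullptr;
  for (std::size_t i = 0; i < vec.size(); i++) {
    if (vec[i] == nullptr) continue;
    if (ans == nullptr)
      ans = vec[i]->Copy();
    else
      ans->Add(*vec[i]);
  }
  return ans;  // NULL if every entry was NULL (or vec was empty)
}

// Caller side: returns the summed count, or -1.0 if there was nothing to sum.
double TotalCount(const std::vector<ToyStats*> &vec) {
  std::unique_ptr<ToyStats> total(ToySum(vec));  // owns the returned copy
  if (!total) return -1.0;                       // handle the NULL case
  return total->count;
}
```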

Definition at line 69 of file cluster-utils.cc.

References Clusterable::Add() and Clusterable::Copy().

Referenced by kaldi::ClusterKMeansOnce(), kaldi::ComputeInitialSplit(), TreeClusterer::Init(), kaldi::InitAmGmm(), kaldi::TestAddToClustersOptimized(), and kaldi::TestSum().

Clusterable *SumClusterable(const std::vector<Clusterable*> &vec) {
  Clusterable *ans = NULL;
  for (size_t i = 0; i < vec.size(); i++) {
    if (vec[i] != NULL) {
      if (ans == NULL)
        ans = vec[i]->Copy();
      else
        ans->Add(*(vec[i]));
    }
  }
  return ans;
}

◆ SumClusterableNormalizer()

BaseFloat SumClusterableNormalizer ( const std::vector< Clusterable *> &  vec)

Returns the total normalizer (usually count) of the cluster (pointers may be NULL).

Definition at line 54 of file cluster-utils.cc.

References KALDI_ISNAN and KALDI_WARN.

Referenced by kaldi::AddToClustersOptimized(), kaldi::ClusterEventMapGetMapping(), kaldi::ClusterEventMapToNClustersRestrictedByMap(), kaldi::ClusterKMeansOnce(), kaldi::KMeansClusterPhones(), kaldi::TestAddToClustersOptimized(), and kaldi::TestSumObjfAndSumNormalizer().

BaseFloat SumClusterableNormalizer(const std::vector<Clusterable*> &vec) {
  BaseFloat ans = 0.0;
  for (size_t i = 0; i < vec.size(); i++) {
    if (vec[i] != NULL) {
      // Note: the variable name and the warning text say "objf",
      // but the value summed here is the normalizer.
      BaseFloat objf = vec[i]->Normalizer();
      if (KALDI_ISNAN(objf)) {
        KALDI_WARN << "SumClusterableObjf, NaN objf";
      } else {
        ans += objf;
      }
    }
  }
  return ans;
}

◆ SumClusterableObjf()

BaseFloat SumClusterableObjf ( const std::vector< Clusterable * > &  vec)

Returns the total objective function after adding up all the statistics in the vector (pointers may be NULL).
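As the listing for this function shows, entries whose objective function is NaN are skipped with a warning rather than added. Skipping matters because a single NaN term would make the whole sum NaN. The sketch below illustrates this with a hypothetical ToyStats type and a hypothetical ToySumObjf helper; it counts the skipped values instead of logging through KALDI_WARN.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for kaldi::Clusterable: stores a precomputed objf.
struct ToyStats {
  double objf;
  double Objf() const { return objf; }
};

// Sums Objf() over the non-NULL entries, skipping NaN values.
// *num_nan receives the number of skipped NaN values.
double ToySumObjf(const std::vector<const ToyStats*> &vec, int *num_nan) {
  double ans = 0.0;
  *num_nan = 0;
  for (std::size_t i = 0; i < vec.size(); i++) {
    if (vec[i] == nullptr) continue;  // NULL pointers are allowed
    double objf = vec[i]->Objf();
    if (std::isnan(objf))
      (*num_nan)++;                   // skip it; adding it would poison the sum
    else
      ans += objf;
  }
  return ans;
}

// For contrast: a plain sum with no NaN check.
double NaiveSumObjf(const std::vector<const ToyStats*> &vec) {
  double ans = 0.0;
  for (std::size_t i = 0; i < vec.size(); i++)
    if (vec[i] != nullptr) ans += vec[i]->Objf();
  return ans;
}
```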

Definition at line 39 of file cluster-utils.cc.

References KALDI_ISNAN and KALDI_WARN.

Referenced by kaldi::ClusterKMeansOnce(), kaldi::ComputeInitialSplit(), kaldi::ObjfGivenMap(), kaldi::TestClusterKMeans(), kaldi::TestClusterKMeansVector(), kaldi::TestClusterTopDown(), kaldi::TestRefineClusters(), kaldi::TestSumObjfAndSumNormalizer(), and kaldi::TestTreeCluster().

BaseFloat SumClusterableObjf(const std::vector<Clusterable*> &vec) {
  BaseFloat ans = 0.0;
  for (size_t i = 0; i < vec.size(); i++) {
    if (vec[i] != NULL) {
      BaseFloat objf = vec[i]->Objf();
      if (KALDI_ISNAN(objf)) {
        KALDI_WARN << "SumClusterableObjf, NaN objf";
      } else {
        ans += objf;
      }
    }
  }
  return ans;
}