See Clustering mechanisms in Kaldi for context. More...

Collaboration diagram for Algorithms for clustering:

Classes

struct  RefineClustersOptions
 
struct  ClusterKMeansOptions
 
struct  TreeClusterOptions
 

Functions

BaseFloat ClusterBottomUp (const std::vector< Clusterable * > &points, BaseFloat thresh, int32 min_clust, std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out)
 A bottom-up clustering algorithm. More...
 
BaseFloat ClusterBottomUpCompartmentalized (const std::vector< std::vector< Clusterable * > > &points, BaseFloat thresh, int32 min_clust, std::vector< std::vector< Clusterable * > > *clusters_out, std::vector< std::vector< int32 > > *assignments_out)
 This is a bottom-up clustering where the points are pre-clustered in a set of compartments, such that only points in the same compartment are clustered together. More...
 
BaseFloat RefineClusters (const std::vector< Clusterable * > &points, std::vector< Clusterable * > *clusters, std::vector< int32 > *assignments, RefineClustersOptions cfg=RefineClustersOptions())
 RefineClusters is mainly used internally by other clustering algorithms. More...
 
BaseFloat ClusterKMeans (const std::vector< Clusterable * > &points, int32 num_clust, std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out, ClusterKMeansOptions cfg=ClusterKMeansOptions())
 ClusterKMeans is a K-means-like clustering algorithm. More...
 
BaseFloat TreeCluster (const std::vector< Clusterable * > &points, int32 max_clust, std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out, std::vector< int32 > *clust_assignments_out, int32 *num_leaves_out, TreeClusterOptions cfg=TreeClusterOptions())
 TreeCluster is a top-down clustering algorithm, using a binary tree (not necessarily balanced). More...
 
BaseFloat ClusterTopDown (const std::vector< Clusterable * > &points, int32 max_clust, std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out, TreeClusterOptions cfg=TreeClusterOptions())
 A clustering algorithm that internally uses TreeCluster, but does not give you the information about the structure of the tree. More...
 

Detailed Description

See Clustering mechanisms in Kaldi for context.

Function Documentation

◆ ClusterBottomUp()

BaseFloat ClusterBottomUp ( const std::vector< Clusterable * > &  points,
BaseFloat  thresh,
int32  min_clust,
std::vector< Clusterable * > *  clusters_out,
std::vector< int32 > *  assignments_out 
)

A bottom-up clustering algorithm.

There are two parameters that control how many clusters we get: a "max_merge_thresh" which is a threshold for merging clusters, and a min_clust which puts a floor on the number of clusters we want. Set max_merge_thresh = large to use the min_clust only, or min_clust to 0 to use the max_merge_thresh only.

The algorithm is:

while (num-clusters > min_clust && smallest_merge_cost <= max_merge_thresh)
merge the closest two clusters.
Parameters
points[in] Points to be clustered (may not contain NULL pointers)
thresh[in] Threshold on cost change from merging clusters; clusters won't be merged if the cost is more than this
min_clust[in] Minimum number of clusters desired; we'll stop merging after reaching this number.
clusters_out[out] If non-NULL, will be set to a vector of size equal to the number of output clusters, containing the clustered statistics. Must be empty when called.
assignments_out[out] If non-NULL, will be resized to the number of points, and each element is the index of the cluster that point was assigned to.
Returns
Returns the total objf change relative to all clusters being separate, which is a negative. Note that this is not the same as what the other clustering algorithms return.

Definition at line 407 of file cluster-utils.cc.

References BottomUpClusterer::Cluster(), kaldi::ContainsNullPointers(), KALDI_ASSERT, and KALDI_VLOG.

Referenced by kaldi::ClusterEventMapGetMapping(), kaldi::ClusterGaussiansToUbm(), and kaldi::TestClusterBottomUp().

411  {
412  KALDI_ASSERT(max_merge_thresh >= 0.0 && min_clust >= 0);
414  int32 npoints = points.size();
415  // make sure fits in uint_smaller and does not hit the -1 which is reserved.
416  KALDI_ASSERT(sizeof(uint_smaller)==sizeof(uint32) ||
417  npoints < static_cast<int32>(static_cast<uint_smaller>(-1)));
418 
419  KALDI_VLOG(2) << "Initializing clustering object.";
420  BottomUpClusterer bc(points, max_merge_thresh, min_clust, clusters_out, assignments_out);
421  BaseFloat ans = bc.Cluster();
422  if (clusters_out) KALDI_ASSERT(!ContainsNullPointers(*clusters_out));
423  return ans;
424 }
kaldi::int32 int32
bool ContainsNullPointers(const std::vector< A *> &v)
Returns true if the vector of pointers contains NULL pointers.
Definition: stl-utils.h:197
float BaseFloat
Definition: kaldi-types.h:29
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185
#define KALDI_VLOG(v)
Definition: kaldi-error.h:156
uint16 uint_smaller

◆ ClusterBottomUpCompartmentalized()

BaseFloat ClusterBottomUpCompartmentalized ( const std::vector< std::vector< Clusterable * > > &  points,
BaseFloat  thresh,
int32  min_clust,
std::vector< std::vector< Clusterable * > > *  clusters_out,
std::vector< std::vector< int32 > > *  assignments_out 
)

This is a bottom-up clustering where the points are pre-clustered in a set of compartments, such that only points in the same compartment are clustered together.

The compartment and pair of points with the smallest merge cost is selected and the points are clustered. The result stays in the same compartment. The code does not merge compartments, and hence assumes that the number of compartments is smaller than the 'min_clust' option. The clusters in "clusters_out" are newly allocated and owned by the caller.

Definition at line 653 of file cluster-utils.cc.

References CompartmentalizedBottomUpClusterer::Cluster(), kaldi::ContainsNullPointers(), and KALDI_ASSERT.

Referenced by kaldi::ClusterEventMapToNClustersRestrictedByMap(), and kaldi::ClusterGaussiansToUbm().

656  {
657  KALDI_ASSERT(thresh >= 0.0 && min_clust >= 0);
658  int32 npoints = 0, num_non_empty_compartments = 0;
659  for (vector< vector<Clusterable*> >::const_iterator itr = points.begin(),
660  end = points.end(); itr != end; ++itr) {
662  npoints += itr->size();
663  if (itr->size() > 0) num_non_empty_compartments++;
664  }
665  KALDI_ASSERT(min_clust >= num_non_empty_compartments); // Code does not merge compartments.
666  // make sure fits in uint_smaller and does not hit the -1 which is reserved.
667  KALDI_ASSERT(sizeof(uint_smaller)==sizeof(uint32) ||
668  npoints < static_cast<int32>(static_cast<uint_smaller>(-1)));
669 
670  CompartmentalizedBottomUpClusterer bc(points, thresh, min_clust);
671  BaseFloat ans = bc.Cluster(clusters_out, assignments_out);
672  if (clusters_out) {
673  for (vector< vector<Clusterable*> >::iterator itr = clusters_out->begin(),
674  end = clusters_out->end(); itr != end; ++itr) {
676  }
677  }
678  return ans;
679 }
kaldi::int32 int32
bool ContainsNullPointers(const std::vector< A *> &v)
Returns true if the vector of pointers contains NULL pointers.
Definition: stl-utils.h:197
float BaseFloat
Definition: kaldi-types.h:29
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185
uint16 uint_smaller

◆ ClusterKMeans()

BaseFloat ClusterKMeans ( const std::vector< Clusterable * > &  points,
int32  num_clust,
std::vector< Clusterable * > *  clusters_out,
std::vector< int32 > *  assignments_out,
ClusterKMeansOptions  cfg = ClusterKMeansOptions() 
)

ClusterKMeans is a K-means-like clustering algorithm.

It starts with pseudo-random initialization of points to clusters and uses RefineClusters to iteratively improve the cluster assignments. It does this for multiple iterations and picks the result with the best objective function.

ClusterKMeans implicitly uses Rand(). It will not necessarily return the same value on different calls. Use sRand() if you want consistent results. The algorithm used in ClusterKMeans is a "k-means-like" algorithm that tries to be as efficient as possible. Firstly, since the algorithm it uses includes random initialization, it tries the whole thing cfg.num_tries times and picks the one with the best objective function. Each try, it does as follows: it randomly initializes points to clusters, and then for cfg.num_iters iterations it calls RefineClusters(). The options to RefineClusters() are given by cfg.refine_cfg. Calling RefineClusters once will always be at least as good as doing one iteration of reassigning points to clusters, but will generally be quite a bit better (without taking too much extra time).

Parameters
points[in] points to be clustered (must be all non-NULL).
num_clust[in] number of clusters requested (it will always return exactly this many, or will fail if num_clust > points.size()).
clusters_out[out] may be NULL; if non-NULL, should be empty when called. Will be set to a vector of statistics corresponding to the output clusters.
assignments_out[out] may be NULL; if non-NULL, will be set to a vector of same size as "points", which says for each point which cluster it is assigned to.
cfg[in] configuration class specifying options to the algorithm.
Returns
Returns the objective function improvement versus everything being in the same cluster.

Definition at line 986 of file cluster-utils.cc.

References kaldi::ClusterKMeansOnce(), kaldi::ContainsNullPointers(), kaldi::DeletePointers(), rnnlm::i, KALDI_ASSERT, ClusterKMeansOptions::num_iters, and ClusterKMeansOptions::num_tries.

Referenced by ClusterKMeansOptions::ClusterKMeansOptions(), TreeClusterer::FindBestSplit(), kaldi::KMeansClusterPhones(), DiagGmm::MergeKmeans(), kaldi::TestClusterKMeans(), and kaldi::TestClusterKMeansVector().

990  {
991  if (points.size() == 0) {
992  if (clusters_out) KALDI_ASSERT(clusters_out->empty()); // or we wouldn't know whether to free the pointers.
993  if (assignments_out) assignments_out->clear();
994  return 0.0;
995  }
996  KALDI_ASSERT(cfg.num_tries>=1 && cfg.num_iters>=1);
997  if (clusters_out) KALDI_ASSERT(clusters_out->empty()); // or we wouldn't know whether to deallocate.
998  if (cfg.num_tries == 1) {
999  std::vector<int32> assignments;
1000  return ClusterKMeansOnce(points, num_clust, clusters_out, (assignments_out != NULL?assignments_out:&assignments), cfg);
1001  } else { // multiple tries.
1002  if (clusters_out) {
1003  KALDI_ASSERT(clusters_out->empty()); // we don't know the ownership of any pointers in there, otherwise.
1004  }
1005  BaseFloat best_ans = 0.0;
1006  for (int32 i = 0;i < cfg.num_tries;i++) {
1007  std::vector<Clusterable*> clusters_tmp;
1008  std::vector<int32> assignments_tmp;
1009  BaseFloat ans = ClusterKMeansOnce(points, num_clust, &clusters_tmp, &assignments_tmp, cfg);
1010  KALDI_ASSERT(!ContainsNullPointers(clusters_tmp));
1011  if (i == 0 || ans > best_ans) {
1012  best_ans = ans;
1013  if (clusters_out) {
1014  if (clusters_out->size()) DeletePointers(clusters_out);
1015  *clusters_out = clusters_tmp;
1016  clusters_tmp.clear(); // suppress deletion of pointers.
1017  }
1018  if (assignments_out) *assignments_out = assignments_tmp;
1019  }
1020  // delete anything remaining in clusters_tmp (we cleared it if we used
1021  // the pointers.
1022  DeletePointers(&clusters_tmp);
1023  }
1024  return best_ans;
1025  }
1026 }
void DeletePointers(std::vector< A *> *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:184
kaldi::int32 int32
bool ContainsNullPointers(const std::vector< A *> &v)
Returns true if the vector of pointers contains NULL pointers.
Definition: stl-utils.h:197
BaseFloat ClusterKMeansOnce(const std::vector< Clusterable *> &points, int32 num_clust, std::vector< Clusterable *> *clusters_out, std::vector< int32 > *assignments_out, ClusterKMeansOptions &cfg)
ClusterKMeansOnce is called internally by ClusterKMeans; it is equivalent to calling ClusterKMeans wi...
float BaseFloat
Definition: kaldi-types.h:29
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185

◆ ClusterTopDown()

BaseFloat ClusterTopDown ( const std::vector< Clusterable * > &  points,
int32  max_clust,
std::vector< Clusterable * > *  clusters_out,
std::vector< int32 > *  assignments_out,
TreeClusterOptions  cfg = TreeClusterOptions() 
)

A clustering algorithm that internally uses TreeCluster, but does not give you the information about the structure of the tree.

The "clusters_out" and "assignments_out" may be NULL if the outputs are not needed.

Parameters
points[in] points to be clustered (must be all non-NULL).
max_clust[in] Maximum number of clusters (you will get exactly this number, if there are at least this many points, except if you set the cfg.thresh value nonzero, in which case that threshold may limit the number of clusters.
clusters_out[out] may be NULL; if non-NULL, should be empty when called. Will be set to a vector of statistics corresponding to the output clusters.
assignments_out[out] may be NULL; if non-NULL, will be set to a vector of same size as "points", which says for each point which cluster it is assigned to.
cfg[in] Configuration object that controls clustering behavior. Most important value is "thresh", which provides an alternative mechanism [other than max_clust] to limit the number of leaves.

Definition at line 1261 of file cluster-utils.cc.

References rnnlm::j, and kaldi::TreeCluster().

Referenced by kaldi::TestClusterTopDown(), and TreeClusterOptions::TreeClusterOptions().

1265  {
1266  int32 num_leaves = 0;
1267  BaseFloat ans = TreeCluster(points, max_clust, clusters_out, assignments_out, NULL, &num_leaves, cfg);
1268  if (clusters_out != NULL) {
1269  for (size_t j = num_leaves;j<clusters_out->size();j++) delete (*clusters_out)[j];
1270  clusters_out->resize(num_leaves); // number of leaf-level clusters in tree.
1271  }
1272  return ans;
1273 }
kaldi::int32 int32
float BaseFloat
Definition: kaldi-types.h:29
BaseFloat TreeCluster(const std::vector< Clusterable *> &points, int32 max_clust, std::vector< Clusterable *> *clusters_out, std::vector< int32 > *assignments_out, std::vector< int32 > *clust_assignments_out, int32 *num_leaves_out, TreeClusterOptions cfg)
TreeCluster is a top-down clustering algorithm, using a binary tree (not necessarily balanced)...

◆ RefineClusters()

BaseFloat RefineClusters ( const std::vector< Clusterable * > &  points,
std::vector< Clusterable * > *  clusters,
std::vector< int32 > *  assignments,
RefineClustersOptions  cfg = RefineClustersOptions() 
)

RefineClusters is mainly used internally by other clustering algorithms.

It starts with a given assignment of points to clusters and keeps trying to improve it by moving points from cluster to cluster, up to a maximum number of iterations.

"clusters" and "assignments" are both input and output variables, and so both MUST be non-NULL.

"top_n" (>=2) is a pruning value: more is more exact, fewer is faster. The algorithm initially finds the "top_n" closest clusters to any given point, and from that point only consider move to those "top_n" clusters. Since RefineClusters is called multiple times from ClusterKMeans (for instance), this is not really a limitation.

Definition at line 895 of file cluster-utils.cc.

References kaldi::ContainsNullPointers(), KALDI_ASSERT, RefineClustersOptions::num_iters, and RefineClusterer::Refine().

Referenced by kaldi::ClusterKMeansOnce(), kaldi::FindBestSplitForKey(), RefineClustersOptions::RefineClustersOptions(), and kaldi::TestRefineClusters().

898  {
899 #ifndef KALDI_PARANOID // don't do this check in "paranoid" mode as we want to expose bugs.
900  if (cfg.num_iters <= 0) { return 0.0; } // nothing to do.
901 #endif
902  KALDI_ASSERT(clusters != NULL && assignments != NULL);
903  KALDI_ASSERT(!ContainsNullPointers(points) && !ContainsNullPointers(*clusters));
904  RefineClusterer rc(points, clusters, assignments, cfg);
905  BaseFloat ans = rc.Refine();
906  KALDI_ASSERT(!ContainsNullPointers(*clusters));
907  return ans;
908 }
bool ContainsNullPointers(const std::vector< A *> &v)
Returns true if the vector of pointers contains NULL pointers.
Definition: stl-utils.h:197
float BaseFloat
Definition: kaldi-types.h:29
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185

◆ TreeCluster()

BaseFloat TreeCluster ( const std::vector< Clusterable * > &  points,
int32  max_clust,
std::vector< Clusterable * > *  clusters_out,
std::vector< int32 > *  assignments_out,
std::vector< int32 > *  clust_assignments_out,
int32 num_leaves_out,
TreeClusterOptions  cfg = TreeClusterOptions() 
)

TreeCluster is a top-down clustering algorithm, using a binary tree (not necessarily balanced).

Returns objf improvement versus having all points in one cluster. The algorithm is:

  • Initialize to 1 cluster (tree with 1 node).
  • Maintain, for each cluster, a "best-binary-split" (using ClusterKMeans to do so). Always split the highest scoring cluster, until we can do no more splits.
Parameters
points[in] Data points to be clustered
max_clust[in] Maximum number of clusters (you will get exactly this number, if there are at least this many points, except if you set the cfg.thresh value nonzero, in which case that threshold may limit the number of clusters.
clusters_out[out] If non-NULL, will be set to the a vector whose first (*num_leaves_out) elements are the leaf clusters, and whose subsequent elements are the nonleaf nodes in the tree, in topological order with the root node last. Must be empty vector when this function is called.
assignments_out[out] If non-NULL, will be set to a vector to a vector the same size as "points", where assignments[i] is the leaf node index i to which the i'th point gets clustered.
clust_assignments_out[out] If non-NULL, will be set to a vector the same size as clusters_out which says for each node (leaf or nonleaf), the index of its parent. For the root node (which is last), assignments_out[i] == i. For each i, assignments_out[i]>=i, i.e. any node's parent is higher numbered than itself. If you don't need this information, consider using instead the ClusterTopDown function.
num_leaves_out[out] If non-NULL, will be set to the number of leaf nodes in the tree.
cfg[in] Configuration object that controls clustering behavior. Most important value is "thresh", which provides an alternative mechanism [other than max_clust] to limit the number of leaves.

Definition at line 1240 of file cluster-utils.cc.

References TreeClusterer::Cluster(), kaldi::ContainsNullPointers(), and KALDI_ASSERT.

Referenced by kaldi::AutomaticallyObtainQuestions(), RegressionTree::BuildTree(), kaldi::ClusterTopDown(), kaldi::TestTreeCluster(), and TreeClusterOptions::TreeClusterOptions().

1246  {
1247  if (points.size() == 0) {
1248  if (clusters_out) clusters_out->clear();
1249  if (assignments_out) assignments_out->clear();
1250  if (clust_assignments_out) clust_assignments_out->clear();
1251  if (num_leaves_out) *num_leaves_out = 0;
1252  return 0.0;
1253  }
1254  TreeClusterer tc(points, max_clust, cfg);
1255  BaseFloat ans = tc.Cluster(clusters_out, assignments_out, clust_assignments_out, num_leaves_out);
1256  if (clusters_out) KALDI_ASSERT(!ContainsNullPointers(*clusters_out));
1257  return ans;
1258 }
bool ContainsNullPointers(const std::vector< A *> &v)
Returns true if the vector of pointers contains NULL pointers.
Definition: stl-utils.h:197
float BaseFloat
Definition: kaldi-types.h:29
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:185