All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Modules Pages

See Clustering mechanisms in Kaldi for context. More...

Collaboration diagram for Algorithms for clustering:

Classes

struct  RefineClustersOptions
 
struct  ClusterKMeansOptions
 
struct  TreeClusterOptions
 

Functions

BaseFloat ClusterBottomUp (const std::vector< Clusterable * > &points, BaseFloat thresh, int32 min_clust, std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out)
 A bottom-up clustering algorithm. More...
 
BaseFloat ClusterBottomUpCompartmentalized (const std::vector< std::vector< Clusterable * > > &points, BaseFloat thresh, int32 min_clust, std::vector< std::vector< Clusterable * > > *clusters_out, std::vector< std::vector< int32 > > *assignments_out)
 This is a bottom-up clustering where the points are pre-clustered in a set of compartments, such that only points in the same compartment are clustered together. More...
 
BaseFloat RefineClusters (const std::vector< Clusterable * > &points, std::vector< Clusterable * > *clusters, std::vector< int32 > *assignments, RefineClustersOptions cfg=RefineClustersOptions())
 RefineClusters is mainly used internally by other clustering algorithms. More...
 
BaseFloat ClusterKMeans (const std::vector< Clusterable * > &points, int32 num_clust,std::vector< Clusterable * > *clusters_out,std::vector< int32 > *assignments_out,ClusterKMeansOptions cfg=ClusterKMeansOptions())
 ClusterKMeans is a K-means-like clustering algorithm. More...
 
BaseFloat TreeCluster (const std::vector< Clusterable * > &points, int32 max_clust,std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out, std::vector< int32 > *clust_assignments_out, int32 *num_leaves_out, TreeClusterOptions cfg=TreeClusterOptions())
 TreeCluster is a top-down clustering algorithm, using a binary tree (not necessarily balanced). More...
 
BaseFloat ClusterTopDown (const std::vector< Clusterable * > &points, int32 max_clust,std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out, TreeClusterOptions cfg=TreeClusterOptions())
 A clustering algorithm that internally uses TreeCluster, but does not give you the information about the structure of the tree. More...
 

Detailed Description

See Clustering mechanisms in Kaldi for context.

Function Documentation

BaseFloat ClusterBottomUp ( const std::vector< Clusterable * > &  points,
BaseFloat  thresh,
int32  min_clust,
std::vector< Clusterable * > *  clusters_out,
std::vector< int32 > *  assignments_out 
)

A bottom-up clustering algorithm.

There are two parameters that control how many clusters we get: a "max_merge_thresh" which is a threshold for merging clusters, and a min_clust which puts a floor on the number of clusters we want. Set max_merge_thresh = large to use the min_clust only, or min_clust to 0 to use the max_merge_thresh only.

The algorithm is:

while (num-clusters > min_clust && smallest_merge_cost <= max_merge_thresh)
merge the closest two clusters.
Parameters
points[in] Points to be clustered (may not contain NULL pointers)
thresh[in] Threshold on cost change from merging clusters; clusters won't be merged if the cost is more than this
min_clust[in] Minimum number of clusters desired; we'll stop merging after reaching this number.
clusters_out[out] If non-NULL, will be set to a vector of size equal to the number of output clusters, containing the clustered statistics. Must be empty when called.
assignments_out[out] If non-NULL, will be resized to the number of points, and each element is the index of the cluster that point was assigned to.
Returns
Returns the total objf change relative to all clusters being separate, which is a negative. Note that this is not the same as what the other clustering algorithms return.

Definition at line 407 of file cluster-utils.cc.

References BottomUpClusterer::Cluster(), kaldi::ContainsNullPointers(), KALDI_ASSERT, and KALDI_VLOG.

Referenced by kaldi::ClusterEventMapGetMapping(), kaldi::ClusterGaussiansToUbm(), and kaldi::TestClusterBottomUp().

411  {
412  KALDI_ASSERT(max_merge_thresh >= 0.0 && min_clust >= 0);
414  int32 npoints = points.size();
415  // make sure fits in uint_smaller and does not hit the -1 which is reserved.
416  KALDI_ASSERT(sizeof(uint_smaller)==sizeof(uint32) ||
417  npoints < static_cast<int32>(static_cast<uint_smaller>(-1)));
418 
419  KALDI_VLOG(2) << "Initializing clustering object.";
420  BottomUpClusterer bc(points, max_merge_thresh, min_clust, clusters_out, assignments_out);
421  BaseFloat ans = bc.Cluster();
422  if (clusters_out) KALDI_ASSERT(!ContainsNullPointers(*clusters_out));
423  return ans;
424 }
float BaseFloat
Definition: kaldi-types.h:29
bool ContainsNullPointers(const std::vector< A * > &v)
Returns true if the vector of pointers contains NULL pointers.
Definition: stl-utils.h:211
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169
#define KALDI_VLOG(v)
Definition: kaldi-error.h:136
uint16 uint_smaller
BaseFloat ClusterBottomUpCompartmentalized ( const std::vector< std::vector< Clusterable * > > &  points,
BaseFloat  thresh,
int32  min_clust,
std::vector< std::vector< Clusterable * > > *  clusters_out,
std::vector< std::vector< int32 > > *  assignments_out 
)

This is a bottom-up clustering where the points are pre-clustered in a set of compartments, such that only points in the same compartment are clustered together.

The compartment and pair of points with the smallest merge cost is selected and the points are clustered. The result stays in the same compartment. The code does not merge compartments, and hence assumes that the number of compartments is smaller than the 'min_clust' option. The clusters in "clusters_out" are newly allocated and owned by the caller.

Definition at line 653 of file cluster-utils.cc.

References CompartmentalizedBottomUpClusterer::Cluster(), kaldi::ContainsNullPointers(), and KALDI_ASSERT.

Referenced by kaldi::ClusterGaussiansToUbm().

656  {
657  KALDI_ASSERT(thresh >= 0.0 && min_clust >= 0);
658  KALDI_ASSERT(min_clust >= points.size()); // Code does not merge compartments.
659  int32 npoints = 0;
660  for (vector< vector<Clusterable*> >::const_iterator itr = points.begin(),
661  end = points.end(); itr != end; ++itr) {
663  npoints += itr->size();
664  }
665  // make sure fits in uint_smaller and does not hit the -1 which is reserved.
666  KALDI_ASSERT(sizeof(uint_smaller)==sizeof(uint32) ||
667  npoints < static_cast<int32>(static_cast<uint_smaller>(-1)));
668 
669  CompartmentalizedBottomUpClusterer bc(points, thresh, min_clust);
670  BaseFloat ans = bc.Cluster(clusters_out, assignments_out);
671  if (clusters_out) {
672  for (vector< vector<Clusterable*> >::iterator itr = clusters_out->begin(),
673  end = clusters_out->end(); itr != end; ++itr) {
675  }
676  }
677  return ans;
678 }
float BaseFloat
Definition: kaldi-types.h:29
bool ContainsNullPointers(const std::vector< A * > &v)
Returns true if the vector of pointers contains NULL pointers.
Definition: stl-utils.h:211
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169
uint16 uint_smaller
BaseFloat ClusterKMeans ( const std::vector< Clusterable * > &  points,
int32  num_clust,
std::vector< Clusterable * > *  clusters_out,
std::vector< int32 > *  assignments_out,
ClusterKMeansOptions  cfg = ClusterKMeansOptions() 
)

ClusterKMeans is a K-means-like clustering algorithm.

It starts with pseudo-random initialization of points to clusters and uses RefineClusters to iteratively improve the cluster assignments. It does this for multiple iterations and picks the result with the best objective function.

ClusterKMeans implicitly uses Rand(). It will not necessarily return the same value on different calls. Use sRand() if you want consistent results. The algorithm used in ClusterKMeans is a "k-means-like" algorithm that tries to be as efficient as possible. Firstly, since the algorithm it uses includes random initialization, it tries the whole thing cfg.num_tries times and picks the one with the best objective function. Each try, it does as follows: it randomly initializes points to clusters, and then for cfg.num_iters iterations it calls RefineClusters(). The options to RefineClusters() are given by cfg.refine_cfg. Calling RefineClusters once will always be at least as good as doing one iteration of reassigning points to clusters, but will generally be quite a bit better (without taking too much extra time).

Parameters
points[in] points to be clustered (must be all non-NULL).
num_clust[in] number of clusters requested (it will always return exactly this many, or will fail if num_clust > points.size()).
clusters_out[out] may be NULL; if non-NULL, should be empty when called. Will be set to a vector of statistics corresponding to the output clusters.
assignments_out[out] may be NULL; if non-NULL, will be set to a vector of same size as "points", which says for each point which cluster it is assigned to.
cfg[in] configuration class specifying options to the algorithm.
Returns
Returns the objective function improvement versus everything being in the same cluster.

Definition at line 985 of file cluster-utils.cc.

References kaldi::ClusterKMeansOnce(), kaldi::ContainsNullPointers(), kaldi::DeletePointers(), rnnlm::i, KALDI_ASSERT, ClusterKMeansOptions::num_iters, and ClusterKMeansOptions::num_tries.

Referenced by TreeClusterer::FindBestSplit(), kaldi::KMeansClusterPhones(), DiagGmm::MergeKmeans(), kaldi::TestClusterKMeans(), and kaldi::TestClusterKMeansVector().

989  {
990  if (points.size() == 0) {
991  if (clusters_out) KALDI_ASSERT(clusters_out->empty()); // or we wouldn't know whether to free the pointers.
992  if (assignments_out) assignments_out->clear();
993  return 0.0;
994  }
995  KALDI_ASSERT(cfg.num_tries>=1 && cfg.num_iters>=1);
996  if (clusters_out) KALDI_ASSERT(clusters_out->empty()); // or we wouldn't know whether to deallocate.
997  if (cfg.num_tries == 1) {
998  std::vector<int32> assignments;
999  return ClusterKMeansOnce(points, num_clust, clusters_out, (assignments_out != NULL?assignments_out:&assignments), cfg);
1000  } else { // multiple tries.
1001  if (clusters_out) {
1002  KALDI_ASSERT(clusters_out->empty()); // we don't know the ownership of any pointers in there, otherwise.
1003  }
1004  BaseFloat best_ans = 0.0;
1005  for (int32 i = 0;i < cfg.num_tries;i++) {
1006  std::vector<Clusterable*> clusters_tmp;
1007  std::vector<int32> assignments_tmp;
1008  BaseFloat ans = ClusterKMeansOnce(points, num_clust, &clusters_tmp, &assignments_tmp, cfg);
1009  KALDI_ASSERT(!ContainsNullPointers(clusters_tmp));
1010  if (i == 0 || ans > best_ans) {
1011  best_ans = ans;
1012  if (clusters_out) {
1013  if (clusters_out->size()) DeletePointers(clusters_out);
1014  *clusters_out = clusters_tmp;
1015  clusters_tmp.clear(); // suppress deletion of pointers.
1016  }
1017  if (assignments_out) *assignments_out = assignments_tmp;
1018  }
1019  // delete anything remaining in clusters_tmp (we cleared it if we used
1020  // the pointers.
1021  DeletePointers(&clusters_tmp);
1022  }
1023  return best_ans;
1024  }
1025 }
BaseFloat ClusterKMeansOnce(const std::vector< Clusterable * > &points, int32 num_clust, std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out, ClusterKMeansOptions &cfg)
ClusterKMeansOnce is called internally by ClusterKMeans; it is equivalent to calling ClusterKMeans wi...
float BaseFloat
Definition: kaldi-types.h:29
bool ContainsNullPointers(const std::vector< A * > &v)
Returns true if the vector of pointers contains NULL pointers.
Definition: stl-utils.h:211
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169
void DeletePointers(std::vector< A * > *v)
Deletes any non-NULL pointers in the vector v, and sets the corresponding entries of v to NULL...
Definition: stl-utils.h:198
BaseFloat ClusterTopDown ( const std::vector< Clusterable * > &  points,
int32  max_clust,
std::vector< Clusterable * > *  clusters_out,
std::vector< int32 > *  assignments_out,
TreeClusterOptions  cfg = TreeClusterOptions() 
)

A clustering algorithm that internally uses TreeCluster, but does not give you the information about the structure of the tree.

The "clusters_out" and "assignments_out" may be NULL if the outputs are not needed.

Parameters
points[in] points to be clustered (must be all non-NULL).
max_clust[in] Maximum number of clusters (you will get exactly this number, if there are at least this many points, except if you set the cfg.thresh value nonzero, in which case that threshold may limit the number of clusters.
clusters_out[out] may be NULL; if non-NULL, should be empty when called. Will be set to a vector of statistics corresponding to the output clusters.
assignments_out[out] may be NULL; if non-NULL, will be set to a vector of same size as "points", which says for each point which cluster it is assigned to.
cfg[in] Configuration object that controls clustering behavior. Most important value is "thresh", which provides an alternative mechanism [other than max_clust] to limit the number of leaves.

Definition at line 1260 of file cluster-utils.cc.

References rnnlm::j, and kaldi::TreeCluster().

Referenced by kaldi::TestClusterTopDown().

1264  {
1265  int32 num_leaves = 0;
1266  BaseFloat ans = TreeCluster(points, max_clust, clusters_out, assignments_out, NULL, &num_leaves, cfg);
1267  if (clusters_out != NULL) {
1268  for (size_t j = num_leaves;j<clusters_out->size();j++) delete (*clusters_out)[j];
1269  clusters_out->resize(num_leaves); // number of leaf-level clusters in tree.
1270  }
1271  return ans;
1272 }
BaseFloat TreeCluster(const std::vector< Clusterable * > &points, int32 max_clust, std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out, std::vector< int32 > *clust_assignments_out, int32 *num_leaves_out, TreeClusterOptions cfg)
TreeCluster is a top-down clustering algorithm, using a binary tree (not necessarily balanced)...
float BaseFloat
Definition: kaldi-types.h:29
BaseFloat RefineClusters ( const std::vector< Clusterable * > &  points,
std::vector< Clusterable * > *  clusters,
std::vector< int32 > *  assignments,
RefineClustersOptions  cfg = RefineClustersOptions() 
)

RefineClusters is mainly used internally by other clustering algorithms.

It starts with a given assignment of points to clusters and keeps trying to improve it by moving points from cluster to cluster, up to a maximum number of iterations.

"clusters" and "assignments" are both input and output variables, and so both MUST be non-NULL.

"top_n" (>=2) is a pruning value: more is more exact, fewer is faster. The algorithm initially finds the "top_n" closest clusters to any given point, and from that point only consider move to those "top_n" clusters. Since RefineClusters is called multiple times from ClusterKMeans (for instance), this is not really a limitation.

Definition at line 894 of file cluster-utils.cc.

References kaldi::ContainsNullPointers(), KALDI_ASSERT, RefineClustersOptions::num_iters, and RefineClusterer::Refine().

Referenced by kaldi::ClusterKMeansOnce(), kaldi::FindBestSplitForKey(), and kaldi::TestRefineClusters().

897  {
898 #ifndef KALDI_PARANOID // don't do this check in "paranoid" mode as we want to expose bugs.
899  if (cfg.num_iters <= 0) { return 0.0; } // nothing to do.
900 #endif
901  KALDI_ASSERT(clusters != NULL && assignments != NULL);
902  KALDI_ASSERT(!ContainsNullPointers(points) && !ContainsNullPointers(*clusters));
903  RefineClusterer rc(points, clusters, assignments, cfg);
904  BaseFloat ans = rc.Refine();
905  KALDI_ASSERT(!ContainsNullPointers(*clusters));
906  return ans;
907 }
float BaseFloat
Definition: kaldi-types.h:29
bool ContainsNullPointers(const std::vector< A * > &v)
Returns true if the vector of pointers contains NULL pointers.
Definition: stl-utils.h:211
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169
BaseFloat TreeCluster ( const std::vector< Clusterable * > &  points,
int32  max_clust,
std::vector< Clusterable * > *  clusters_out,
std::vector< int32 > *  assignments_out,
std::vector< int32 > *  clust_assignments_out,
int32 *  num_leaves_out,
TreeClusterOptions  cfg = TreeClusterOptions() 
)

TreeCluster is a top-down clustering algorithm, using a binary tree (not necessarily balanced).

Returns objf improvement versus having all points in one cluster. The algorithm is:

  • Initialize to 1 cluster (tree with 1 node).
  • Maintain, for each cluster, a "best-binary-split" (using ClusterKMeans to do so). Always split the highest scoring cluster, until we can do no more splits.
Parameters
points[in] Data points to be clustered
max_clust[in] Maximum number of clusters (you will get exactly this number, if there are at least this many points, except if you set the cfg.thresh value nonzero, in which case that threshold may limit the number of clusters.
clusters_out[out] If non-NULL, will be set to the a vector whose first (*num_leaves_out) elements are the leaf clusters, and whose subsequent elements are the nonleaf nodes in the tree, in topological order with the root node last. Must be empty vector when this function is called.
assignments_out[out] If non-NULL, will be set to a vector to a vector the same size as "points", where assignments[i] is the leaf node index i to which the i'th point gets clustered.
clust_assignments_out[out] If non-NULL, will be set to a vector the same size as clusters_out which says for each node (leaf or nonleaf), the index of its parent. For the root node (which is last), assignments_out[i] == i. For each i, assignments_out[i]>=i, i.e. any node's parent is higher numbered than itself. If you don't need this information, consider using instead the ClusterTopDown function.
num_leaves_out[out] If non-NULL, will be set to the number of leaf nodes in the tree.
cfg[in] Configuration object that controls clustering behavior. Most important value is "thresh", which provides an alternative mechanism [other than max_clust] to limit the number of leaves.

Definition at line 1239 of file cluster-utils.cc.

References TreeClusterer::Cluster(), kaldi::ContainsNullPointers(), and KALDI_ASSERT.

Referenced by kaldi::AutomaticallyObtainQuestions(), RegressionTree::BuildTree(), kaldi::ClusterTopDown(), and kaldi::TestTreeCluster().

1245  {
1246  if (points.size() == 0) {
1247  if (clusters_out) clusters_out->clear();
1248  if (assignments_out) assignments_out->clear();
1249  if (clust_assignments_out) clust_assignments_out->clear();
1250  if (num_leaves_out) *num_leaves_out = 0;
1251  return 0.0;
1252  }
1253  TreeClusterer tc(points, max_clust, cfg);
1254  BaseFloat ans = tc.Cluster(clusters_out, assignments_out, clust_assignments_out, num_leaves_out);
1255  if (clusters_out) KALDI_ASSERT(!ContainsNullPointers(*clusters_out));
1256  return ans;
1257 }
float BaseFloat
Definition: kaldi-types.h:29
bool ContainsNullPointers(const std::vector< A * > &v)
Returns true if the vector of pointers contains NULL pointers.
Definition: stl-utils.h:211
#define KALDI_ASSERT(cond)
Definition: kaldi-error.h:169