See Clustering mechanisms in Kaldi for context. More...

Collaboration diagram for Algorithms for clustering:

Classes
struct	RefineClustersOptions

struct	ClusterKMeansOptions

struct	TreeClusterOptions

Functions
BaseFloat	ClusterBottomUp (const std::vector< Clusterable * > &points, BaseFloat thresh, int32 min_clust, std::vector< Clusterable * > clusters_out, std::vector< int32 > assignments_out)
	A bottom-up clustering algorithm. More...

BaseFloat	ClusterBottomUpCompartmentalized (const std::vector< std::vector< Clusterable * > > &points, BaseFloat thresh, int32 min_clust, std::vector< std::vector< Clusterable * > > clusters_out, std::vector< std::vector< int32 > > assignments_out)
	This is a bottom-up clustering where the points are pre-clustered in a set of compartments, such that only points in the same compartment are clustered together. More...

BaseFloat	RefineClusters (const std::vector< Clusterable * > &points, std::vector< Clusterable * > clusters, std::vector< int32 > assignments, RefineClustersOptions cfg=RefineClustersOptions())
	RefineClusters is mainly used internally by other clustering algorithms. More...

BaseFloat	ClusterKMeans (const std::vector< Clusterable * > &points, int32 num_clust, std::vector< Clusterable * > clusters_out, std::vector< int32 > assignments_out, ClusterKMeansOptions cfg=ClusterKMeansOptions())
	ClusterKMeans is a K-means-like clustering algorithm. More...

BaseFloat	TreeCluster (const std::vector< Clusterable * > &points, int32 max_clust, std::vector< Clusterable * > clusters_out, std::vector< int32 > assignments_out, std::vector< int32 > clust_assignments_out, int32 num_leaves_out, TreeClusterOptions cfg=TreeClusterOptions())
	TreeCluster is a top-down clustering algorithm, using a binary tree (not necessarily balanced). More...

BaseFloat	ClusterTopDown (const std::vector< Clusterable * > &points, int32 max_clust, std::vector< Clusterable * > clusters_out, std::vector< int32 > assignments_out, TreeClusterOptions cfg=TreeClusterOptions())
	A clustering algorithm that internally uses TreeCluster, but does not give you the information about the structure of the tree. More...

Detailed Description

See Clustering mechanisms in Kaldi for context.

Function Documentation

◆ ClusterBottomUp()

BaseFloat ClusterBottomUp	(	const std::vector< Clusterable * > &	points,
		BaseFloat	thresh,
		int32	min_clust,
		std::vector< Clusterable * > *	clusters_out,
		std::vector< int32 > *	assignments_out
	)

A bottom-up clustering algorithm.

There are two parameters that control how many clusters we get: a "max_merge_thresh" which is a threshold for merging clusters, and a min_clust which puts a floor on the number of clusters we want. Set max_merge_thresh = large to use the min_clust only, or min_clust to 0 to use the max_merge_thresh only.

The algorithm is:

while (num-clusters > min_clust && smallest_merge_cost <= max_merge_thresh)

merge the closest two clusters.

Parameters

points	[in] Points to be clustered (may not contain NULL pointers)
thresh	[in] Threshold on cost change from merging clusters; clusters won't be merged if the cost is more than this
min_clust	[in] Minimum number of clusters desired; we'll stop merging after reaching this number.
clusters_out	[out] If non-NULL, will be set to a vector of size equal to the number of output clusters, containing the clustered statistics. Must be empty when called.
assignments_out	[out] If non-NULL, will be resized to the number of points, and each element is the index of the cluster that point was assigned to.

Returns: Returns the total objf change relative to all clusters being separate, which is a negative. Note that this is not the same as what the other clustering algorithms return.

Definition at line 407 of file cluster-utils.cc.

References BottomUpClusterer::Cluster(), kaldi::ContainsNullPointers(), KALDI_ASSERT, and KALDI_VLOG.

Referenced by kaldi::ClusterEventMapGetMapping(), kaldi::ClusterGaussiansToUbm(), and kaldi::TestClusterBottomUp().

                                                              {
   KALDI_ASSERT(max_merge_thresh >= 0.0 && min_clust >= 0);
   KALDI_ASSERT(!ContainsNullPointers(points));
   int32 npoints = points.size();
   // make sure fits in uint_smaller and does not hit the -1 which is reserved.
   KALDI_ASSERT(sizeof(uint_smaller)==sizeof(uint32) ||
                npoints < static_cast<int32>(static_cast<uint_smaller>(-1)));
 
   KALDI_VLOG(2) << "Initializing clustering object.";
   BottomUpClusterer bc(points, max_merge_thresh, min_clust, clusters_out, assignments_out);
   BaseFloat ans = bc.Cluster();
   if (clusters_out) KALDI_ASSERT(!ContainsNullPointers(*clusters_out));
   return ans;
 }

◆ ClusterBottomUpCompartmentalized()

BaseFloat ClusterBottomUpCompartmentalized	(	const std::vector< std::vector< Clusterable * > > &	points,
		BaseFloat	thresh,
		int32	min_clust,
		std::vector< std::vector< Clusterable * > > *	clusters_out,
		std::vector< std::vector< int32 > > *	assignments_out
	)

This is a bottom-up clustering where the points are pre-clustered in a set of compartments, such that only points in the same compartment are clustered together.

The compartment and pair of points with the smallest merge cost is selected and the points are clustered. The result stays in the same compartment. The code does not merge compartments, and hence assumes that the number of compartments is smaller than the 'min_clust' option. The clusters in "clusters_out" are newly allocated and owned by the caller.

Definition at line 653 of file cluster-utils.cc.

References CompartmentalizedBottomUpClusterer::Cluster(), kaldi::ContainsNullPointers(), and KALDI_ASSERT.

Referenced by kaldi::ClusterEventMapToNClustersRestrictedByMap(), and kaldi::ClusterGaussiansToUbm().

                                                     {
   KALDI_ASSERT(thresh >= 0.0 && min_clust >= 0);
   int32 npoints = 0, num_non_empty_compartments = 0;
   for (vector< vector<Clusterable*> >::const_iterator itr = points.begin(),
            end = points.end(); itr != end; ++itr) {
     KALDI_ASSERT(!ContainsNullPointers(*itr));
     npoints += itr->size();
     if (itr->size() > 0) num_non_empty_compartments++;
   }
   KALDI_ASSERT(min_clust >= num_non_empty_compartments);  // Code does not merge compartments.
   // make sure fits in uint_smaller and does not hit the -1 which is reserved.
   KALDI_ASSERT(sizeof(uint_smaller)==sizeof(uint32) ||
                npoints < static_cast<int32>(static_cast<uint_smaller>(-1)));
 
   CompartmentalizedBottomUpClusterer bc(points, thresh, min_clust);
   BaseFloat ans = bc.Cluster(clusters_out, assignments_out);
   if (clusters_out) {
     for (vector< vector<Clusterable*> >::iterator itr = clusters_out->begin(),
              end = clusters_out->end(); itr != end; ++itr) {
       KALDI_ASSERT(!ContainsNullPointers(*itr));
     }
   }
   return ans;
 }

◆ ClusterKMeans()

BaseFloat ClusterKMeans	(	const std::vector< Clusterable * > &	points,
		int32	num_clust,
		std::vector< Clusterable * > *	clusters_out,
		std::vector< int32 > *	assignments_out,
		ClusterKMeansOptions	cfg = `ClusterKMeansOptions()`
	)

ClusterKMeans is a K-means-like clustering algorithm.

It starts with pseudo-random initialization of points to clusters and uses RefineClusters to iteratively improve the cluster assignments. It does this for multiple iterations and picks the result with the best objective function.

ClusterKMeans implicitly uses Rand(). It will not necessarily return the same value on different calls. Use sRand() if you want consistent results. The algorithm used in ClusterKMeans is a "k-means-like" algorithm that tries to be as efficient as possible. Firstly, since the algorithm it uses includes random initialization, it tries the whole thing cfg.num_tries times and picks the one with the best objective function. Each try, it does as follows: it randomly initializes points to clusters, and then for cfg.num_iters iterations it calls RefineClusters(). The options to RefineClusters() are given by cfg.refine_cfg. Calling RefineClusters once will always be at least as good as doing one iteration of reassigning points to clusters, but will generally be quite a bit better (without taking too much extra time).

Parameters

points	[in] points to be clustered (must be all non-NULL).
num_clust	[in] number of clusters requested (it will always return exactly this many, or will fail if num_clust > points.size()).
clusters_out	[out] may be NULL; if non-NULL, should be empty when called. Will be set to a vector of statistics corresponding to the output clusters.
assignments_out	[out] may be NULL; if non-NULL, will be set to a vector of same size as "points", which says for each point which cluster it is assigned to.
cfg	[in] configuration class specifying options to the algorithm.

Returns: Returns the objective function improvement versus everything being in the same cluster.

Definition at line 986 of file cluster-utils.cc.

References kaldi::ClusterKMeansOnce(), kaldi::ContainsNullPointers(), kaldi::DeletePointers(), rnnlm::i, KALDI_ASSERT, ClusterKMeansOptions::num_iters, and ClusterKMeansOptions::num_tries.

Referenced by ClusterKMeansOptions::ClusterKMeansOptions(), TreeClusterer::FindBestSplit(), kaldi::KMeansClusterPhones(), DiagGmm::MergeKmeans(), kaldi::TestClusterKMeans(), and kaldi::TestClusterKMeansVector().

                                                   {
   if (points.size() == 0) {
     if (clusters_out) KALDI_ASSERT(clusters_out->empty());  // or we wouldn't know whether to free the pointers.
     if (assignments_out) assignments_out->clear();
     return 0.0;
   }
   KALDI_ASSERT(cfg.num_tries>=1 && cfg.num_iters>=1);
   if (clusters_out) KALDI_ASSERT(clusters_out->empty());  // or we wouldn't know whether to deallocate.
   if (cfg.num_tries == 1) {
     std::vector<int32> assignments;
     return ClusterKMeansOnce(points, num_clust, clusters_out, (assignments_out != NULL?assignments_out:&assignments), cfg);
   } else {  // multiple tries.
     if (clusters_out) {
       KALDI_ASSERT(clusters_out->empty());  // we don't know the ownership of any pointers in there, otherwise.
     }
     BaseFloat best_ans = 0.0;
     for (int32 i = 0;i < cfg.num_tries;i++) {
       std::vector<Clusterable*> clusters_tmp;
       std::vector<int32> assignments_tmp;
       BaseFloat ans = ClusterKMeansOnce(points, num_clust, &clusters_tmp, &assignments_tmp, cfg);
       KALDI_ASSERT(!ContainsNullPointers(clusters_tmp));
       if (i == 0 || ans > best_ans) {
         best_ans = ans;
         if (clusters_out) {
           if (clusters_out->size()) DeletePointers(clusters_out);
           *clusters_out = clusters_tmp;
           clusters_tmp.clear();  // suppress deletion of pointers.
         }
         if (assignments_out) *assignments_out = assignments_tmp;
       }
       // delete anything remaining in clusters_tmp (we cleared it if we used
       // the pointers.
       DeletePointers(&clusters_tmp);
     }
     return best_ans;
   }
 }

◆ ClusterTopDown()

BaseFloat ClusterTopDown	(	const std::vector< Clusterable * > &	points,
		int32	max_clust,
		std::vector< Clusterable * > *	clusters_out,
		std::vector< int32 > *	assignments_out,
		TreeClusterOptions	cfg = `TreeClusterOptions()`
	)

A clustering algorithm that internally uses TreeCluster, but does not give you the information about the structure of the tree.

The "clusters_out" and "assignments_out" may be NULL if the outputs are not needed.

Parameters

points	[in] points to be clustered (must be all non-NULL).
max_clust	[in] Maximum number of clusters (you will get exactly this number, if there are at least this many points, except if you set the cfg.thresh value nonzero, in which case that threshold may limit the number of clusters.
clusters_out	[out] may be NULL; if non-NULL, should be empty when called. Will be set to a vector of statistics corresponding to the output clusters.
assignments_out	[out] may be NULL; if non-NULL, will be set to a vector of same size as "points", which says for each point which cluster it is assigned to.
cfg	[in] Configuration object that controls clustering behavior. Most important value is "thresh", which provides an alternative mechanism [other than max_clust] to limit the number of leaves.

Definition at line 1261 of file cluster-utils.cc.

References rnnlm::j, and kaldi::TreeCluster().

Referenced by kaldi::TestClusterTopDown(), and TreeClusterOptions::TreeClusterOptions().

                                                  {
   int32 num_leaves = 0;
   BaseFloat ans = TreeCluster(points, max_clust, clusters_out, assignments_out, NULL, &num_leaves, cfg);
   if (clusters_out != NULL) {
     for (size_t j = num_leaves;j<clusters_out->size();j++) delete (*clusters_out)[j];
     clusters_out->resize(num_leaves);  // number of leaf-level clusters in tree.
   }
   return ans;
 }

◆ RefineClusters()

BaseFloat RefineClusters	(	const std::vector< Clusterable * > &	points,
		std::vector< Clusterable * > *	clusters,
		std::vector< int32 > *	assignments,
		RefineClustersOptions	cfg = `RefineClustersOptions()`
	)

RefineClusters is mainly used internally by other clustering algorithms.

It starts with a given assignment of points to clusters and keeps trying to improve it by moving points from cluster to cluster, up to a maximum number of iterations.

"clusters" and "assignments" are both input and output variables, and so both MUST be non-NULL.

"top_n" (>=2) is a pruning value: more is more exact, fewer is faster. The algorithm initially finds the "top_n" closest clusters to any given point, and from that point only consider move to those "top_n" clusters. Since RefineClusters is called multiple times from ClusterKMeans (for instance), this is not really a limitation.

Definition at line 895 of file cluster-utils.cc.

References kaldi::ContainsNullPointers(), KALDI_ASSERT, RefineClustersOptions::num_iters, and RefineClusterer::Refine().

Referenced by kaldi::ClusterKMeansOnce(), kaldi::FindBestSplitForKey(), RefineClustersOptions::RefineClustersOptions(), and kaldi::TestRefineClusters().

                                                     {
 #ifndef KALDI_PARANOID // don't do this check in "paranoid" mode as we want to expose bugs.
   if (cfg.num_iters <= 0) { return 0.0; } // nothing to do.
 #endif
   KALDI_ASSERT(clusters != NULL && assignments != NULL);
   KALDI_ASSERT(!ContainsNullPointers(points) && !ContainsNullPointers(*clusters));
   RefineClusterer rc(points, clusters, assignments, cfg);
   BaseFloat ans = rc.Refine();
   KALDI_ASSERT(!ContainsNullPointers(*clusters));
   return ans;
 }

◆ TreeCluster()

BaseFloat TreeCluster	(	const std::vector< Clusterable * > &	points,
		int32	max_clust,
		std::vector< Clusterable * > *	clusters_out,
		std::vector< int32 > *	assignments_out,
		std::vector< int32 > *	clust_assignments_out,
		int32 *	num_leaves_out,
		TreeClusterOptions	cfg = `TreeClusterOptions()`
	)

TreeCluster is a top-down clustering algorithm, using a binary tree (not necessarily balanced).

Returns objf improvement versus having all points in one cluster. The algorithm is:

Initialize to 1 cluster (tree with 1 node).
Maintain, for each cluster, a "best-binary-split" (using ClusterKMeans to do so). Always split the highest scoring cluster, until we can do no more splits.

Parameters

points	[in] Data points to be clustered
max_clust	[in] Maximum number of clusters (you will get exactly this number, if there are at least this many points, except if you set the cfg.thresh value nonzero, in which case that threshold may limit the number of clusters.
clusters_out	[out] If non-NULL, will be set to the a vector whose first (*num_leaves_out) elements are the leaf clusters, and whose subsequent elements are the nonleaf nodes in the tree, in topological order with the root node last. Must be empty vector when this function is called.
assignments_out	[out] If non-NULL, will be set to a vector to a vector the same size as "points", where assignments[i] is the leaf node index i to which the i'th point gets clustered.
clust_assignments_out	[out] If non-NULL, will be set to a vector the same size as clusters_out which says for each node (leaf or nonleaf), the index of its parent. For the root node (which is last), assignments_out[i] == i. For each i, assignments_out[i]>=i, i.e. any node's parent is higher numbered than itself. If you don't need this information, consider using instead the ClusterTopDown function.
num_leaves_out	[out] If non-NULL, will be set to the number of leaf nodes in the tree.
cfg	[in] Configuration object that controls clustering behavior. Most important value is "thresh", which provides an alternative mechanism [other than max_clust] to limit the number of leaves.

Definition at line 1240 of file cluster-utils.cc.

References TreeClusterer::Cluster(), kaldi::ContainsNullPointers(), and KALDI_ASSERT.

Referenced by kaldi::AutomaticallyObtainQuestions(), RegressionTree::BuildTree(), kaldi::ClusterTopDown(), kaldi::TestTreeCluster(), and TreeClusterOptions::TreeClusterOptions().

                                               {
   if (points.size() == 0) {
     if (clusters_out) clusters_out->clear();
     if (assignments_out) assignments_out->clear();
     if (clust_assignments_out) clust_assignments_out->clear();
     if (num_leaves_out) *num_leaves_out = 0;
     return 0.0;
   }
   TreeClusterer tc(points, max_clust, cfg);
   BaseFloat ans = tc.Cluster(clusters_out, assignments_out, clust_assignments_out, num_leaves_out);
   if (clusters_out) KALDI_ASSERT(!ContainsNullPointers(*clusters_out));
   return ans;
 }

Classes

Functions

Detailed Description

Function Documentation

◆ ClusterBottomUp()

◆ ClusterBottomUpCompartmentalized()

◆ ClusterKMeans()

◆ ClusterTopDown()

◆ RefineClusters()

◆ TreeCluster()