See Clustering mechanisms in Kaldi for context. More...
Classes | |
struct | RefineClustersOptions |
struct | ClusterKMeansOptions |
struct | TreeClusterOptions |
Functions | |
BaseFloat | ClusterBottomUp (const std::vector< Clusterable * > &points, BaseFloat thresh, int32 min_clust, std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out) |
A bottom-up clustering algorithm. More... | |
BaseFloat | ClusterBottomUpCompartmentalized (const std::vector< std::vector< Clusterable * > > &points, BaseFloat thresh, int32 min_clust, std::vector< std::vector< Clusterable * > > *clusters_out, std::vector< std::vector< int32 > > *assignments_out) |
This is a bottom-up clustering where the points are pre-clustered in a set of compartments, such that only points in the same compartment are clustered together. More... | |
BaseFloat | RefineClusters (const std::vector< Clusterable * > &points, std::vector< Clusterable * > *clusters, std::vector< int32 > *assignments, RefineClustersOptions cfg=RefineClustersOptions()) |
RefineClusters is mainly used internally by other clustering algorithms. More... | |
BaseFloat | ClusterKMeans (const std::vector< Clusterable * > &points, int32 num_clust, std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out, ClusterKMeansOptions cfg=ClusterKMeansOptions()) |
ClusterKMeans is a K-means-like clustering algorithm. More... | |
BaseFloat | TreeCluster (const std::vector< Clusterable * > &points, int32 max_clust, std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out, std::vector< int32 > *clust_assignments_out, int32 *num_leaves_out, TreeClusterOptions cfg=TreeClusterOptions()) |
TreeCluster is a top-down clustering algorithm, using a binary tree (not necessarily balanced). More... | |
BaseFloat | ClusterTopDown (const std::vector< Clusterable * > &points, int32 max_clust, std::vector< Clusterable * > *clusters_out, std::vector< int32 > *assignments_out, TreeClusterOptions cfg=TreeClusterOptions()) |
A clustering algorithm that internally uses TreeCluster, but does not give you the information about the structure of the tree. More... | |
See Clustering mechanisms in Kaldi for context.
BaseFloat ClusterBottomUp | ( | const std::vector< Clusterable * > & | points, |
BaseFloat | thresh, | ||
int32 | min_clust, | ||
std::vector< Clusterable * > * | clusters_out, | ||
std::vector< int32 > * | assignments_out | ||
) |
A bottom-up clustering algorithm.
There are two parameters that control how many clusters we get: a "max_merge_thresh" which is a threshold for merging clusters, and a min_clust which puts a floor on the number of clusters we want. Set max_merge_thresh = large to use the min_clust only, or min_clust to 0 to use the max_merge_thresh only.
The algorithm is:
points | [in] Points to be clustered (may not contain NULL pointers) |
thresh | [in] Threshold on cost change from merging clusters; clusters won't be merged if the cost is more than this |
min_clust | [in] Minimum number of clusters desired; we'll stop merging after reaching this number. |
clusters_out | [out] If non-NULL, will be set to a vector of size equal to the number of output clusters, containing the clustered statistics. Must be empty when called. |
assignments_out | [out] If non-NULL, will be resized to the number of points, and each element is the index of the cluster that point was assigned to. |
Definition at line 407 of file cluster-utils.cc.
References BottomUpClusterer::Cluster(), kaldi::ContainsNullPointers(), KALDI_ASSERT, and KALDI_VLOG.
Referenced by kaldi::ClusterEventMapGetMapping(), kaldi::ClusterGaussiansToUbm(), and kaldi::TestClusterBottomUp().
BaseFloat ClusterBottomUpCompartmentalized | ( | const std::vector< std::vector< Clusterable * > > & | points, |
BaseFloat | thresh, | ||
int32 | min_clust, | ||
std::vector< std::vector< Clusterable * > > * | clusters_out, | ||
std::vector< std::vector< int32 > > * | assignments_out | ||
) |
This is a bottom-up clustering where the points are pre-clustered in a set of compartments, such that only points in the same compartment are clustered together.
The compartment and pair of points with the smallest merge cost is selected and the points are clustered. The result stays in the same compartment. The code does not merge compartments, and hence assumes that the number of compartments is smaller than the 'min_clust' option. The clusters in "clusters_out" are newly allocated and owned by the caller.
Definition at line 653 of file cluster-utils.cc.
References CompartmentalizedBottomUpClusterer::Cluster(), kaldi::ContainsNullPointers(), and KALDI_ASSERT.
Referenced by kaldi::ClusterEventMapToNClustersRestrictedByMap(), and kaldi::ClusterGaussiansToUbm().
BaseFloat ClusterKMeans | ( | const std::vector< Clusterable * > & | points, |
int32 | num_clust, | ||
std::vector< Clusterable * > * | clusters_out, | ||
std::vector< int32 > * | assignments_out, | ||
ClusterKMeansOptions | cfg = ClusterKMeansOptions() |
||
) |
ClusterKMeans is a K-means-like clustering algorithm.
It starts with pseudo-random initialization of points to clusters and uses RefineClusters to iteratively improve the cluster assignments. It does this for multiple iterations and picks the result with the best objective function.
ClusterKMeans implicitly uses Rand(). It will not necessarily return the same value on different calls. Use sRand() if you want consistent results. The algorithm used in ClusterKMeans is a "k-means-like" algorithm that tries to be as efficient as possible. Firstly, since the algorithm it uses includes random initialization, it tries the whole thing cfg.num_tries times and picks the one with the best objective function. Each try, it does as follows: it randomly initializes points to clusters, and then for cfg.num_iters iterations it calls RefineClusters(). The options to RefineClusters() are given by cfg.refine_cfg. Calling RefineClusters once will always be at least as good as doing one iteration of reassigning points to clusters, but will generally be quite a bit better (without taking too much extra time).
points | [in] points to be clustered (must be all non-NULL). |
num_clust | [in] number of clusters requested (it will always return exactly this many, or will fail if num_clust > points.size()). |
clusters_out | [out] may be NULL; if non-NULL, should be empty when called. Will be set to a vector of statistics corresponding to the output clusters. |
assignments_out | [out] may be NULL; if non-NULL, will be set to a vector of same size as "points", which says for each point which cluster it is assigned to. |
cfg | [in] configuration class specifying options to the algorithm. |
Definition at line 986 of file cluster-utils.cc.
References kaldi::ClusterKMeansOnce(), kaldi::ContainsNullPointers(), kaldi::DeletePointers(), rnnlm::i, KALDI_ASSERT, ClusterKMeansOptions::num_iters, and ClusterKMeansOptions::num_tries.
Referenced by ClusterKMeansOptions::ClusterKMeansOptions(), TreeClusterer::FindBestSplit(), kaldi::KMeansClusterPhones(), DiagGmm::MergeKmeans(), kaldi::TestClusterKMeans(), and kaldi::TestClusterKMeansVector().
BaseFloat ClusterTopDown | ( | const std::vector< Clusterable * > & | points, |
int32 | max_clust, | ||
std::vector< Clusterable * > * | clusters_out, | ||
std::vector< int32 > * | assignments_out, | ||
TreeClusterOptions | cfg = TreeClusterOptions() |
||
) |
A clustering algorithm that internally uses TreeCluster, but does not give you the information about the structure of the tree.
The "clusters_out" and "assignments_out" may be NULL if the outputs are not needed.
points | [in] points to be clustered (must be all non-NULL). |
max_clust | [in] Maximum number of clusters (you will get exactly this number, if there are at least this many points, except if you set the cfg.thresh value nonzero, in which case that threshold may limit the number of clusters. |
clusters_out | [out] may be NULL; if non-NULL, should be empty when called. Will be set to a vector of statistics corresponding to the output clusters. |
assignments_out | [out] may be NULL; if non-NULL, will be set to a vector of same size as "points", which says for each point which cluster it is assigned to. |
cfg | [in] Configuration object that controls clustering behavior. Most important value is "thresh", which provides an alternative mechanism [other than max_clust] to limit the number of leaves. |
Definition at line 1261 of file cluster-utils.cc.
References rnnlm::j, and kaldi::TreeCluster().
Referenced by kaldi::TestClusterTopDown(), and TreeClusterOptions::TreeClusterOptions().
BaseFloat RefineClusters | ( | const std::vector< Clusterable * > & | points, |
std::vector< Clusterable * > * | clusters, | ||
std::vector< int32 > * | assignments, | ||
RefineClustersOptions | cfg = RefineClustersOptions() |
||
) |
RefineClusters is mainly used internally by other clustering algorithms.
It starts with a given assignment of points to clusters and keeps trying to improve it by moving points from cluster to cluster, up to a maximum number of iterations.
"clusters" and "assignments" are both input and output variables, and so both MUST be non-NULL.
"top_n" (>=2) is a pruning value: more is more exact, fewer is faster. The algorithm initially finds the "top_n" closest clusters to any given point, and from that point only consider move to those "top_n" clusters. Since RefineClusters is called multiple times from ClusterKMeans (for instance), this is not really a limitation.
Definition at line 895 of file cluster-utils.cc.
References kaldi::ContainsNullPointers(), KALDI_ASSERT, RefineClustersOptions::num_iters, and RefineClusterer::Refine().
Referenced by kaldi::ClusterKMeansOnce(), kaldi::FindBestSplitForKey(), RefineClustersOptions::RefineClustersOptions(), and kaldi::TestRefineClusters().
BaseFloat TreeCluster | ( | const std::vector< Clusterable * > & | points, |
int32 | max_clust, | ||
std::vector< Clusterable * > * | clusters_out, | ||
std::vector< int32 > * | assignments_out, | ||
std::vector< int32 > * | clust_assignments_out, | ||
int32 * | num_leaves_out, | ||
TreeClusterOptions | cfg = TreeClusterOptions() |
||
) |
TreeCluster is a top-down clustering algorithm, using a binary tree (not necessarily balanced).
Returns objf improvement versus having all points in one cluster. The algorithm is:
points | [in] Data points to be clustered |
max_clust | [in] Maximum number of clusters (you will get exactly this number, if there are at least this many points, except if you set the cfg.thresh value nonzero, in which case that threshold may limit the number of clusters. |
clusters_out | [out] If non-NULL, will be set to the a vector whose first (*num_leaves_out) elements are the leaf clusters, and whose subsequent elements are the nonleaf nodes in the tree, in topological order with the root node last. Must be empty vector when this function is called. |
assignments_out | [out] If non-NULL, will be set to a vector to a vector the same size as "points", where assignments[i] is the leaf node index i to which the i'th point gets clustered. |
clust_assignments_out | [out] If non-NULL, will be set to a vector the same size as clusters_out which says for each node (leaf or nonleaf), the index of its parent. For the root node (which is last), assignments_out[i] == i. For each i, assignments_out[i]>=i, i.e. any node's parent is higher numbered than itself. If you don't need this information, consider using instead the ClusterTopDown function. |
num_leaves_out | [out] If non-NULL, will be set to the number of leaf nodes in the tree. |
cfg | [in] Configuration object that controls clustering behavior. Most important value is "thresh", which provides an alternative mechanism [other than max_clust] to limit the number of leaves. |
Definition at line 1240 of file cluster-utils.cc.
References TreeClusterer::Cluster(), kaldi::ContainsNullPointers(), and KALDI_ASSERT.
Referenced by kaldi::AutomaticallyObtainQuestions(), RegressionTree::BuildTree(), kaldi::ClusterTopDown(), kaldi::TestTreeCluster(), and TreeClusterOptions::TreeClusterOptions().