FeatureTransformEstimate Class Reference

Class for computing a feature transform used for preconditioning of the training data in neural-network training. More...

#include <get-feature-transform.h>


Public Member Functions

void Estimate (const FeatureTransformEstimateOptions &opts, Matrix< BaseFloat > *M, TpMatrix< BaseFloat > *within_cholesky) const
 Estimates the LDA transform matrix M. More...
 
- Public Member Functions inherited from LdaEstimate
 LdaEstimate ()
 
void Init (int32 num_classes, int32 dimension)
 Allocates memory for accumulators. More...
 
int32 NumClasses () const
 Returns the number of classes. More...
 
int32 Dim () const
 Returns the dimensionality of the feature vectors. More...
 
void ZeroAccumulators ()
 Sets all accumulators to zero. More...
 
void Scale (BaseFloat f)
 Scales all accumulators. More...
 
double TotCount ()
 Returns the total count of the data. More...
 
void Accumulate (const VectorBase< BaseFloat > &data, int32 class_id, BaseFloat weight=1.0)
 Accumulates data. More...
 
void Estimate (const LdaEstimateOptions &opts, Matrix< BaseFloat > *M, Matrix< BaseFloat > *Mfull=NULL) const
 Estimates the LDA transform matrix M. More...
 
void Read (std::istream &in_stream, bool binary, bool add)
 
void Write (std::ostream &out_stream, bool binary) const
 

Static Protected Member Functions

static void EstimateInternal (const FeatureTransformEstimateOptions &opts, const SpMatrix< double > &total_covar, const SpMatrix< double > &between_covar, const Vector< double > &mean, Matrix< BaseFloat > *M, TpMatrix< BaseFloat > *C)
 
- Static Protected Member Functions inherited from LdaEstimate
static void AddMeanOffset (const VectorBase< double > &total_mean, Matrix< BaseFloat > *projection)
 This function modifies the LDA matrix so that it also subtracts the mean feature value. More...
 

Additional Inherited Members

- Protected Member Functions inherited from LdaEstimate
void GetStats (SpMatrix< double > *total_covar, SpMatrix< double > *between_covar, Vector< double > *total_mean, double *sum) const
 Extract a more processed form of the stats. More...
 
LdaEstimate & operator= (const LdaEstimate &other)
 
- Protected Attributes inherited from LdaEstimate
Vector< double > zero_acc_
 
Matrix< double > first_acc_
 
SpMatrix< double > total_second_acc_
 

Detailed Description

Class for computing a feature transform used for preconditioning of the training data in neural-network training.

By preconditioning here, all we really mean is an affine transform of the input data: if we set up the classification as going from feature vectors x_i to labels y_i, then we replace each x_i with x'_i = A x_i + b. The statistics we use to obtain this transform are the within-class and between-class variance statistics, and the global data mean, that we would use to estimate LDA.
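In Kaldi's convention (as with other Kaldi transforms; see Estimate() below), the offset b can be folded into the matrix as an extra column, so the affine map is applied by appending a 1 to the feature vector:

$ x'_i = A x_i + b = [\,A \;\; b\,] \left[ \begin{array}{c} x_i \\ 1 \end{array} \right] $

When designing this, we had a few principles in mind: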

  • We want to remove the global mean of the input features (this is well established, I think there is a paper by LeCun explaining why this is a good thing).
  • We would like the transform to make the training process roughly invariant to linear transformations of the input features: whatever invertible linear transformation is applied to the features beforehand, this transform should 'undo' it (a sketch of why the construction achieves this follows this list).
  • We want directions in which there is a lot of between-class variance to be given a higher variance than directions that have mostly within-class variance; it has been our experience that these 'nuisance directions' will interfere with the training if they are given too large a scaling. It is essential to our method that the number of classes is higher than the dimension of the input feature space, which is normal for speech recognition tasks (~5000 > ~250).
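
Why the construction below achieves this rough invariance (a sketch, ignoring the final singular-value ceiling): if the input features are transformed by an invertible matrix T, the covariance statistics and the estimated transform change as

$ \Sigma \rightarrow T \Sigma T^T, \qquad A \rightarrow A T^{-1} $

(up to an irrelevant orthogonal factor), so the preconditioned features $ A T^{-1} (T x_i) = A x_i $ are unchanged.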

Basically our method is as follows:

  • First subtract the mean.
  • Get the within-class and between-class stats, as for LDA.
  • Normalize the space as for LDA, so that the within-class covariance matrix is unit and the between-class covariance matrix is diagonalized.
  • At this stage, if the user asked for dimension reduction, reduce the dimension by discarding the dimensions with the least between-class variance [note: the current scripts do not do this by default].
  • Apply a transform that reduces the variance of dimensions with low between-class variance, as we describe below.
  • Finally, do an SVD of the resulting transform, A = U S V^T, apply a ceiling to the diagonal elements of the matrix S (e.g. 5.0), and reconstruct A' = U S' V^T; this is the final transform (written out in symbols below). The point of this stage is to stop the transform from 'blowing up' any dimensions of the space excessively; it was introduced in response to a problem we encountered at one point, and I think normally not very many elements of S end up getting capped.
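
In symbols, that last step is

$ A = U S V^T, \qquad S'_{ii} = \min(S_{ii},\, s_{\max}), \qquad A' = U S' V^T, $

where $ s_{\max} $ is the configurable ceiling (opts.max_singular_value).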

We need to explain the step that applies the dimension-specific scaling, which we described above as, "Apply a transform that reduces the variance of dimensions with low between-class variance". For a particular dimension, let the between-class diagonal covariance element be $ \lambda_i $; the within-class diagonal covariance is 1 at this point (since we have normalized the within-class covariance to unity), so the total variance is $ \lambda_i + 1 $. Below, "within-class-factor" is a constant that we set by default to 0.001. We scale the i'th dimension of the features by:

$ \sqrt{ (\text{within-class-factor} + \lambda_i) \,/\, (1 + \lambda_i) } $

If $ \lambda_i \gg 1 $, this scaling factor approaches 1 (we don't need to scale up dimensions with high between-class variance, as they already naturally have a higher variance than other dimensions). As $ \lambda_i $ becomes small, the scaling factor approaches $ \sqrt{\text{within-class-factor}} $, so dimensions with very small between-class variance get assigned a small variance equal to within-class-factor. Dimensions with intermediate between-class variance end up with a variance roughly equal to $ \lambda_i $: the variance was originally $ (1 + \lambda_i) $, so scaling the features by approximately $ \sqrt{ \lambda_i / (1 + \lambda_i) } $ makes the variance approximately $ \lambda_i $ [this is clear after noting that the variance gets scaled by the square of the feature scale].
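
To make these approximations concrete: the scaled variance works out exactly, since the original variance $ (1 + \lambda_i) $ gets multiplied by the square of the feature scale:

$ (1 + \lambda_i) \cdot \frac{f + \lambda_i}{1 + \lambda_i} = f + \lambda_i, \qquad f = \text{within-class-factor}. $

With the default $ f = 0.001 $: $ \lambda_i = 100 $ gives scale $ \approx 0.995 $ and variance $ \approx 100 $; $ \lambda_i = 1 $ gives scale $ \approx 0.71 $ and variance $ 1.001 $; $ \lambda_i = 0 $ gives scale $ \approx 0.032 $ and variance $ 0.001 $.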

Definition at line 133 of file get-feature-transform.h.

Member Function Documentation

◆ Estimate()

void Estimate ( const FeatureTransformEstimateOptions & opts,
                Matrix< BaseFloat > *  M,
                TpMatrix< BaseFloat > *  within_cholesky
              ) const

Estimates the LDA transform matrix M.

If opts.remove_offset == true, it will output the matrix with an extra column that corresponds to mean-offset removal (the matrix should be multiplied by the feature vector with a 1 appended to give the correct result, as with other Kaldi transforms). "within_cholesky" is a pointer to a TpMatrix that, if non-NULL, will be set to the Cholesky factor of the within-class covariance matrix. This is used for perturbing features.
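
A minimal usage sketch; the helper function, its arguments, and the data layout here are hypothetical illustrations, not part of the class (the real command-line driver is the main() referenced below):

#include <get-feature-transform.h>
#include <vector>
using namespace kaldi;

// Hypothetical helper: accumulate class-labeled frames, then estimate the
// preconditioning transform. 'feats' has one frame per row; 'labels[t]' is
// the class (e.g. pdf-id) of frame t.
void EstimatePreconditioner(const Matrix<BaseFloat> &feats,
                            const std::vector<int32> &labels,
                            int32 num_classes,
                            Matrix<BaseFloat> *M) {
  FeatureTransformEstimate fte;
  fte.Init(num_classes, feats.NumCols());  // inherited from LdaEstimate
  for (int32 t = 0; t < feats.NumRows(); t++)
    fte.Accumulate(feats.Row(t), labels[t], 1.0);
  FeatureTransformEstimateOptions opts;    // defaults; set opts.dim to reduce dim
  TpMatrix<BaseFloat> within_cholesky;     // receives the within-class Cholesky factor
  fte.Estimate(opts, M, &within_cholesky);
}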

Definition at line 28 of file get-feature-transform.cc.

References FeatureTransformEstimate::EstimateInternal(), LdaEstimate::GetStats(), and KALDI_LOG.

Referenced by main().

void FeatureTransformEstimate::Estimate(
    const FeatureTransformEstimateOptions &opts,
    Matrix<BaseFloat> *M, TpMatrix<BaseFloat> *C) const {
  double count;
  Vector<double> total_mean;
  SpMatrix<double> total_covar, between_covar;
  // Turn the raw accumulators into total/between-class covariances
  // and the global mean.
  GetStats(&total_covar, &between_covar, &total_mean, &count);
  KALDI_LOG << "Data count is " << count;
  EstimateInternal(opts, total_covar, between_covar, total_mean, M, C);
}

◆ EstimateInternal()

static void EstimateInternal ( const FeatureTransformEstimateOptions & opts,
                               const SpMatrix< double > &  total_covar,
                               const SpMatrix< double > &  between_covar,
                               const Vector< double > &  mean,
                               Matrix< BaseFloat > *  M,
                               TpMatrix< BaseFloat > *  C
                             )
static, protected

Definition at line 40 of file get-feature-transform.cc.

References SpMatrix< Real >::AddMat2Sp(), MatrixBase< Real >::AddMatMat(), LdaEstimate::AddMeanOffset(), SpMatrix< Real >::AddSp(), VectorBase< Real >::ApplyCeiling(), TpMatrix< Real >::Cholesky(), MatrixBase< Real >::CopyFromMat(), TpMatrix< Real >::CopyFromTp(), FeatureTransformEstimateOptions::dim, VectorBase< Real >::Dim(), MatrixBase< Real >::Invert(), KALDI_ASSERT, KALDI_LOG, kaldi::kNoTrans, kaldi::kTrans, VectorBase< Real >::Max(), FeatureTransformEstimateOptions::max_singular_value, MatrixBase< Real >::MulRowsVec(), MatrixBase< Real >::NumCols(), MatrixBase< Real >::NumRows(), PackedMatrix< Real >::NumRows(), MatrixBase< Real >::Range(), FeatureTransformEstimateOptions::remove_offset, TpMatrix< Real >::Resize(), Matrix< Real >::Resize(), MatrixBase< Real >::Row(), kaldi::SortSvd(), MatrixBase< Real >::Svd(), SpMatrix< Real >::Trace(), and FeatureTransformEstimateOptions::within_class_factor.

Referenced by FeatureTransformEstimate::Estimate(), and FeatureTransformEstimateMulti::EstimateTransformPart().

void FeatureTransformEstimate::EstimateInternal(
    const FeatureTransformEstimateOptions &opts,
    const SpMatrix<double> &total_covar,
    const SpMatrix<double> &between_covar,
    const Vector<double> &total_mean,
    Matrix<BaseFloat> *M, TpMatrix<BaseFloat> *C) {

  int32 target_dim = opts.dim, dim = total_covar.NumRows();
  // Interpret zero or negative target_dim as the full dim.
  if (target_dim <= 0)
    target_dim = dim;
  // The between-class covar has rank at most (num-classes - 1).
  KALDI_ASSERT(target_dim <= dim);

  // Within-class covariance = total covariance minus between-class covariance.
  SpMatrix<double> wc_covar(total_covar);
  wc_covar.AddSp(-1.0, between_covar);
  TpMatrix<double> wc_covar_sqrt(dim);
  try {
    wc_covar_sqrt.Cholesky(wc_covar);
    if (C != NULL) {
      C->Resize(dim);
      C->CopyFromTp(wc_covar_sqrt);
    }
  } catch (...) {
    // Smooth the diagonal and retry if the matrix was not positive definite.
    BaseFloat smooth = 1.0e-03 * wc_covar.Trace() / wc_covar.NumRows();
    KALDI_LOG << "Cholesky failed (possibly not +ve definite), so adding "
              << smooth << " to diagonal and trying again.\n";
    for (int32 i = 0; i < wc_covar.NumRows(); i++)
      wc_covar(i, i) += smooth;
    wc_covar_sqrt.Cholesky(wc_covar);
  }
  Matrix<double> wc_covar_sqrt_mat(wc_covar_sqrt);
  wc_covar_sqrt_mat.Invert();

  // Project the between-class covariance into the whitened space and
  // diagonalize it: tmp_sp = L^{-1} B L^{-T}.
  SpMatrix<double> tmp_sp(dim);
  tmp_sp.AddMat2Sp(1.0, wc_covar_sqrt_mat, kNoTrans, between_covar, 0.0);
  Matrix<double> tmp_mat(tmp_sp);
  Matrix<double> svd_u(dim, dim), svd_vt(dim, dim);
  Vector<double> svd_d(dim);
  tmp_mat.Svd(&svd_d, &svd_u, &svd_vt);
  SortSvd(&svd_d, &svd_u);

  KALDI_LOG << "LDA singular values are " << svd_d;
  KALDI_LOG << "Sum of all singular values is " << svd_d.Sum();
  KALDI_LOG << "Sum of selected singular values is "
            << SubVector<double>(svd_d, 0, target_dim).Sum();

  Matrix<double> lda_mat(dim, dim);
  lda_mat.AddMatMat(1.0, svd_u, kTrans, wc_covar_sqrt_mat, kNoTrans, 0.0);

  // Finally, copy the first target_dim rows to M.
  M->Resize(target_dim, dim);
  M->CopyFromMat(lda_mat.Range(0, target_dim, 0, dim));

  if (opts.within_class_factor != 1.0) {
    for (int32 i = 0; i < svd_d.Dim(); i++) {
      BaseFloat old_var = 1.0 + svd_d(i),  // the total variance of that dim..
          new_var = opts.within_class_factor + svd_d(i),  // the variance we want..
          scale = sqrt(new_var / old_var);
      if (i < M->NumRows())
        M->Row(i).Scale(scale);
    }
  }

  if (opts.max_singular_value > 0.0) {
    int32 rows = M->NumRows(), cols = M->NumCols(),
        min_dim = std::min(rows, cols);
    Matrix<BaseFloat> U(rows, min_dim), Vt(min_dim, cols);
    Vector<BaseFloat> s(min_dim);
    M->Svd(&s, &U, &Vt);  // decompose M = U diag(s) Vt.
    BaseFloat max_s = s.Max();
    int32 n;
    s.ApplyCeiling(opts.max_singular_value, &n);
    if (n > 0) {
      KALDI_LOG << "Applied ceiling to " << n << " out of " << s.Dim()
                << " singular values of transform using ceiling "
                << opts.max_singular_value << ", max is " << max_s;
      Vt.MulRowsVec(s);
      // Reconstruct M with the modified singular values.
      M->AddMatMat(1.0, U, kNoTrans, Vt, kNoTrans, 0.0);
    }
  }

  if (opts.remove_offset)
    AddMeanOffset(total_mean, M);
}
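
The normalization these two functions implement rests on a standard identity: if $ W = L L^T $ is the Cholesky factorization of the within-class covariance and $ B $ is the between-class covariance, then

$ L^{-1} W L^{-T} = I, \qquad L^{-1} B L^{-T} = U \, \mathrm{diag}(\lambda) \, U^T, $

so in the basis given by the rows of $ U^T L^{-1} $ (lda_mat in the code) the within-class covariance is unit and the between-class covariance is diagonal; svd_d holds the $ \lambda_i $ that drive the variance-scaling step.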

The documentation for this class was generated from the following files:
  • get-feature-transform.h
  • get-feature-transform.cc