Keywords for search: natural gradient, naturalgradient, NG-SGD.

#include <nnet-precondition-online.h>

Public Member Functions
	OnlinePreconditioner ()

void	SetRank (int32 rank)

void	SetUpdatePeriod (int32 update_period)

void	SetNumSamplesHistory (BaseFloat num_samples_history)

void	SetAlpha (BaseFloat alpha)

void	TurnOnDebug ()

BaseFloat	GetNumSamplesHistory () const

BaseFloat	GetAlpha () const

int32	GetRank () const

int32	GetUpdatePeriod () const

void	PreconditionDirections (CuMatrixBase< BaseFloat > *R, CuVectorBase< BaseFloat > *row_prod, BaseFloat *scale)

	OnlinePreconditioner (const OnlinePreconditioner &other)

OnlinePreconditioner &	operator= (const OnlinePreconditioner &other)

Private Member Functions
void	PreconditionDirectionsInternal (const int32 t, const BaseFloat rho_t, const Vector< BaseFloat > &d_t, CuMatrixBase< BaseFloat > *WJKL_t, CuMatrixBase< BaseFloat > *X_t, CuVectorBase< BaseFloat > *row_prod, BaseFloat *scale)

void	ComputeEt (const VectorBase< BaseFloat > &d_t, BaseFloat beta_t, VectorBase< BaseFloat > *e_t, VectorBase< BaseFloat > *sqrt_e_t, VectorBase< BaseFloat > *inv_sqrt_e_t) const

void	ComputeZt (int32 N, BaseFloat rho_t, const VectorBase< BaseFloat > &d_t, const VectorBase< BaseFloat > &inv_sqrt_e_t, const MatrixBase< BaseFloat > &K_t, const MatrixBase< BaseFloat > &L_t, SpMatrix< double > *Z_t) const

void	ComputeWt1 (int32 N, const VectorBase< BaseFloat > &d_t, const VectorBase< BaseFloat > &d_t1, BaseFloat rho_t, BaseFloat rho_t1, const MatrixBase< BaseFloat > &U_t, const VectorBase< BaseFloat > &sqrt_c_t, const VectorBase< BaseFloat > &inv_sqrt_e_t, const CuMatrixBase< BaseFloat > &W_t, CuMatrixBase< BaseFloat > *J_t, CuMatrixBase< BaseFloat > *W_t1) const

void	ReorthogonalizeXt1 (const VectorBase< BaseFloat > &d_t1, BaseFloat rho_t1, CuMatrixBase< BaseFloat > *W_t1, CuMatrixBase< BaseFloat > *temp_W, CuMatrixBase< BaseFloat > *temp_O)

void	Init (const CuMatrixBase< BaseFloat > &R0)

void	InitDefault (int32 D)

BaseFloat	Eta (int32 N) const

void	SelfTest () const

Static Private Member Functions
static void	InitOrthonormalSpecial (CuMatrixBase< BaseFloat > *R)
	This function creates a matrix with orthonormal rows that is like the following matrix, except with each row normalized to have unit 2-norm:
	  [ 1.1 0   1 0 1 0
	    0   1.1 0 1 0 1 ]
	The reason why the first element in each row is 1.1 and not 1 is for symmetry-breaking...

Private Attributes
int32	rank_

int32	update_period_

BaseFloat	num_samples_history_

BaseFloat	alpha_

BaseFloat	epsilon_

BaseFloat	delta_

int32	t_

int32	num_updates_skipped_

bool	self_debug_

CuMatrix< BaseFloat >	W_t_

BaseFloat	rho_t_

Vector< BaseFloat >	d_t_

std::mutex	read_write_mutex_

std::mutex	update_mutex_

Detailed Description

Keywords for search: natural gradient, naturalgradient, NG-SGD.

This method is explained in the paper "Parallel training of DNNs with Natural Gradient and Parameter Averaging" by D. Povey, X. Zhang and S. Khudanpur, ICLR Workshop, 2015, where it is referred to as online NG-SGD. Note that the method exported from this header is just the core of the algorithm, and some outer-level parts of it are implemented in class NaturalGradientAffineComponent.

The rest of this extended comment describes how we keep an estimate of the inverse of a scatter matrix up to date in an online way. This is the same as the estimation of one of the A or B quantities in the paper. This comment is slightly redundant with the paper (it actually predates the paper), but we keep it in case it is useful in understanding our method.

We consider the problem of doing online estimation of a (scaled-identity plus low-rank) approximation of a Fisher matrix... since the Fisher matrix is a scatter of vector-valued derivatives and we will be given the derivatives (or at least terms in a factorization of the derivatives which need not concern us right now), we can just think of the present task as being the online accumulation of a (low-rank plus scaled-identity) approximation to a variance of a distribution with mean zero.

Later on we'll think about how to get easy access to the inverse of this approximate variance, which is what we really need.

Our approximation to the Fisher matrix (the scatter of derivatives) will be of the following form (and just think of this as an approximate variance matrix of some arbitrary quantities).

  F_t =(def) R_t^T D_t R_t + \rho_t I

(t is the minibatch index), where R_t is an R by D matrix with orthonormal rows (1 <= R < D is our chosen rank), D_t is a positive-definite diagonal matrix, and \rho_t > 0. Suppose the dimension of F_t is D. Let the vectors whose variance we are approximating be provided in minibatches of size N (N can vary from iteration to iteration, but it won't vary in the normal case, so we omit the subscript t). The batch of gradients is given as X_t \in Re^{N \times D}, i.e. each row is one of the vectors whose scatter we're estimating. On the t'th iteration, define the scatter S_t of the input vectors X_t as:

S_t =(def) 1/N X_t^T X_t (eqn:St)

(where N is the minibatch size). Be careful not to confuse the rank R with the input X_t (we would typeface X_t in bold if this were not plain text, to make the distinction clearer). We want F_t to approach some kind of time-weighted average of the S_t quantities, to the extent permitted by the limitation of the rank R. We want the F_t quantities to stay "fresh" (since we'll be doing this in an SGD context and the parameters will be slowly changing). We use a constant 0 < \eta < 1 to control the updating rate. Our update for R_t is based on the power method. Define the smoothed scatter

  T_t =(def) \eta S_t + (1-\eta) F_t

we'll use this in place of the observed scatter S_t, to slow down the update. Defining

Y_t =(def) R_t T_t

which can be expanded as follows (using R_t R_t^T = I in the last step):
  Y_t = R_t ( \eta S_t + (1-\eta) F_t )
      = R_t ( \eta S_t + (1-\eta) (R_t^T D_t R_t + \rho_t I) )
      = \eta R_t S_t + (1-\eta) (D_t + \rho_t I) R_t

It is useful to think of Y_t as having each of the top eigenvectors of the scatter scaled by the corresponding eigenvalue \lambda_i. We compute the following R by R matrix:
  Z_t =(def) Y_t Y_t^T
and do the symmetric eigenvalue decomposition
  Z_t = U_t C_t U_t^T
where C_t is diagonal and U_t orthogonal; the diagonal elements of C_t will be positive (since \rho_t > 0, T_t is positive definite; since R_t has full row rank and T_t is positive definite, Y_t has full row rank; hence Z_t is positive definite). The diagonal elements of C_t can be thought of as corresponding to the squares of our current estimate of the top eigenvalues of the scatter matrix. [we should check that no element of C_t is <= 0.]

It is easy to show that C_t^{-0.5} U_t^T Z_t U_t C_t^{-0.5} = I, so (C_t^{-0.5} U_t^T Y_t) (Y_t^T U_t C_t^{-0.5}) = I. Define R_{t+1} =(def) C_t^{-0.5} U_t^T Y_t

and it's clear that R_{t+1} R_{t+1}^T = I. We will set
  D_{t+1} =(def) C_t^{0.5} - \rho_{t+1} I   (eqn:dt1)

which ensures that for each row r of R_{t+1}, the variance of our scatter matrix F_{t+1} will be the square root of the corresponding diagonal element of C_t. This makes sense because, as we have pointed out, the diagonal elements of C_t can be thought of as corresponding to squared eigenvalues. But a proper treatment of this would require convergence analysis that would get quite complicated. We will choose \rho_{t+1} in order to ensure that tr(F_{t+1}) = tr(T_t).

For any t,
  tr(F_t) = D \rho_t + tr(D_t)
  tr(T_t) = \eta tr(S_t) + (1-\eta) tr(F_t)
          = \eta tr(S_t) + (1-\eta) (D \rho_t + tr(D_t))
Expanding out D_{t+1} from (eqn:dt1) in the expression for tr(F_{t+1}) below:
  tr(F_{t+1}) = D \rho_{t+1} + tr(D_{t+1})
              = D \rho_{t+1} + tr(C_t^{0.5} - \rho_{t+1} I)
              = (D - R) \rho_{t+1} + tr(C_t^{0.5})
and equating tr(F_{t+1}) with tr(T_t) (since F_{t+1} is supposed to be a low-rank approximation to T_t), we have
  tr(F_{t+1}) = tr(T_t)
  (D - R) \rho_{t+1} + tr(C_t^{0.5}) = \eta tr(S_t) + (1-\eta) (D \rho_t + tr(D_t))

Solving for \rho_{t+1},
  \rho_{t+1} = 1/(D - R) ( \eta tr(S_t) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5}) ).   (eqn:rhot1)
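As a small illustration of (eqn:rhot1), the following sketch evaluates \rho_{t+1} from scalar traces. The function name NextRho and its argument list are hypothetical and are not part of Kaldi; d_t and c_t hold the diagonals of D_t and C_t.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not a Kaldi function): evaluates (eqn:rhot1).
//   D      dimension of F_t
//   eta    update rate, 0 < eta < 1
//   tr_S   tr(S_t)
//   rho_t  current value of rho
//   d_t    diagonal of D_t (length R)
//   c_t    diagonal of C_t (length R), entries assumed positive
double NextRho(int D, double eta, double tr_S, double rho_t,
               const std::vector<double> &d_t,
               const std::vector<double> &c_t) {
  const int R = static_cast<int>(d_t.size());
  assert(R >= 1 && R < D && c_t.size() == d_t.size());
  double tr_D = 0.0, tr_sqrt_C = 0.0;
  for (int i = 0; i < R; i++) {
    tr_D += d_t[i];
    tr_sqrt_C += std::sqrt(c_t[i]);
  }
  // tr(T_t) = eta tr(S_t) + (1 - eta) (D rho_t + tr(D_t))
  const double tr_T = eta * tr_S + (1.0 - eta) * (D * rho_t + tr_D);
  // (D - R) rho_{t+1} + tr(C_t^{0.5}) = tr(T_t)
  return (tr_T - tr_sqrt_C) / (D - R);
}
```

For example, with D = 3, R = 1, eta = 0.5, tr(S_t) = 10, rho_t = 1, D_t = [2] and C_t = [9], we get tr(T_t) = 7.5 and \rho_{t+1} = (7.5 - 3) / 2 = 2.25.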

Note that it is technically possible that diagonal elements of D_{t+1} may be negative, but we can still show that F_{t+1} is strictly positive definite if F_t was strictly positive definite.

If the quantities for which we are computing the Fisher matrix are all zero for some reason, the sequence of F_t will geometrically approach zero, which would cause problems with inversion; to prevent this happening, after setting D_{t+1} and \rho_{t+1} as above, we floor \rho_{t+1} to a small value (like 1.0e-10).

OK, we have described the updating of R_t, D_t and \rho_t. Next, we need to figure out how to efficiently multiply by the inverse of F_t. Our experience from working with the old preconditioning method was that it's best not to use the inverse of the Fisher matrix itself, but a version of the Fisher matrix that's smoothed with some constant times the identity. Below, \alpha is a configuration value (e.g. 4.0 seemed to work well). The following formula is designed to ensure that the smoothing varies proportionally with the scale of F_t:

  G_t =(def) F_t + \alpha/D tr(F_t) I
      = R_t^T D_t R_t + (\rho_t + \alpha/D tr(F_t)) I
      = R_t^T D_t R_t + \beta_t I
where
  \beta_t =(def) \rho_t + \alpha/D tr(F_t)
          = \rho_t (1+\alpha) + \alpha/D tr(D_t)   (eqn:betat2)

Define
  \hat{X}_t =(def) \beta_t X_t G_t^{-1}.
The factor of \beta_t is inserted arbitrarily as it just happens to be convenient to put unit scale on X_t in the formula for \hat{X}_t; it will anyway be canceled out in the next step. Then our final preconditioned minibatch of vectors is:
  \bar{X}_t = \gamma_t \hat{X}_t
where
  \gamma_t = sqrt(tr(X_t X_t^T) / tr(\hat{X}_t \hat{X}_t^T)).
The factor of \gamma_t ensures that \bar{X}_t is scaled to have the same overall 2-norm as the input X_t. We found in previous versions of this method that this rescaling was helpful, as otherwise there are certain situations (e.g. at the start of training) where the preconditioned derivatives can get very large. Note that this rescaling introduces a small bias into the training, because now the scale applied to a given sample depends on that sample itself, albeit in an increasingly diluted way as the minibatch size gets large.

To efficiently compute G_t^{-1}, we will use the Woodbury matrix identity. Writing the Woodbury formula for the symmetric case,
  (A + U D U^T)^{-1} = A^{-1} - A^{-1} U (D^{-1} + U^T A^{-1} U)^{-1} U^T A^{-1}
Substituting A = \beta_t I, D = D_t and U = R_t^T, this becomes
  G_t^{-1} = 1/\beta_t I - 1/\beta_t^2 R_t^T (D_t^{-1} + 1/\beta_t I)^{-1} R_t
           = 1/\beta_t (I - R_t^T E_t R_t)
where
  E_t =(def) 1/\beta_t (D_t^{-1} + 1/\beta_t I)^{-1},   (eqn:etdef)
so
  e_{tii} = 1/\beta_t * 1/(1/d_{tii} + 1/\beta_t)   (eqn:tii)
          = 1/(\beta_t/d_{tii} + 1)
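The two scalar formulas (eqn:betat2) and (eqn:tii) can be sketched as below. The helper names Beta and ComputeE are hypothetical (the class's own ComputeEt() works on Kaldi vector types); the sketch uses plain std::vector so it stands alone.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: \beta_t = \rho_t (1 + \alpha) + \alpha/D tr(D_t)  (eqn:betat2)
double Beta(double rho_t, double alpha, int D, const std::vector<double> &d_t) {
  assert(D > 0);
  double tr_D = 0.0;
  for (double d : d_t) tr_D += d;
  return rho_t * (1.0 + alpha) + alpha * tr_D / D;
}

// Hypothetical helper: e_{tii} = 1 / (\beta_t / d_{tii} + 1)  (eqn:tii)
std::vector<double> ComputeE(const std::vector<double> &d_t, double beta_t) {
  std::vector<double> e(d_t.size());
  for (std::size_t i = 0; i < d_t.size(); i++) {
    assert(d_t[i] != 0.0);  // the text floors D_t so this never happens
    e[i] = 1.0 / (beta_t / d_t[i] + 1.0);
  }
  return e;
}
```

As a sanity check on the Woodbury result: along a row of R_t with diagonal value d, G_t has eigenvalue d + \beta_t, and 1/\beta_t (1 - e) with e = 1/(\beta_t/d + 1) simplifies to 1/(d + \beta_t), as expected. With d = 3 and \beta_t = 9, e = 0.25 and (1 - 0.25)/9 = 1/12.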

We would like an efficient-to-compute expression for \hat{X}_t, without too many separate invocations of kernels on the GPU.
  \hat{X}_t = \beta_t X_t G_t^{-1}
            = X_t - X_t R_t^T E_t R_t
For efficient operation on the GPU, we want to reduce the number of high-dimensional operations that we do (defining "high-dimension" as anything involving D or N, but not R, since R is likely small, such as 20). We define
  W_t =(def) E_t^{0.5} R_t.
We will actually be storing W_t on the GPU rather than R_t, in order to reduce the number of operations on the GPU. We can now write:

  \hat{X}_t = X_t - X_t W_t^T W_t   (eqn:pt2)

The following, which we'll compute on the GPU, are going to be useful in computing quantities like Z_t:

  H_t =(def) X_t W_t^T   (dim is N by R)
  J_t =(def) H_t^T X_t   (dim is R by D)
      = W_t X_t^T X_t
  K_t =(def) J_t J_t^T   (dim is R by R, symmetric).. transfer this to CPU.
  L_t =(def) H_t^T H_t   (dim is R by R, symmetric).. transfer this to CPU.
      = W_t X_t^T X_t W_t^T
Note: L_t may also be computed as
  L_t = J_t W_t^T
which may be more efficient if D < N.

Note: after we have computed H_t we can directly compute
  \hat{X}_t = X_t - H_t W_t
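A minimal CPU sketch of this step, assuming row-major std::vector matrices instead of Kaldi's CuMatrix types, is shown below. The Matrix alias and the name PreconditionSketch are hypothetical. Each entry of H_t is formed as a dot product of a row of X_t with a row of W_t and is then used to subtract that multiple of the row of W_t.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major; hypothetical type

// Returns \hat{X}_t = X_t - H_t W_t, where H_t = X_t W_t^T.
// X is N by D and W is R by D.
Matrix PreconditionSketch(const Matrix &X, const Matrix &W) {
  const std::size_t N = X.size(), R = W.size();
  Matrix X_hat = X;
  for (std::size_t n = 0; n < N; n++) {
    const std::size_t D = X[n].size();
    for (std::size_t r = 0; r < R; r++) {
      assert(W[r].size() == D);
      double h = 0.0;  // h = H_t(n, r), computed from the original X
      for (std::size_t d = 0; d < D; d++) h += X[n][d] * W[r][d];
      for (std::size_t d = 0; d < D; d++) X_hat[n][d] -= h * W[r][d];
    }
  }
  return X_hat;
}
```

With a single input row [2, 4] and W_t = [0.5, 0] (that is, R_t = [1, 0] and e = 0.25), the first coordinate is scaled by 1 - e = 0.75 and the second is untouched, giving [1.5, 4].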

We need to determine how Y_t and Z_t relate to the quantities we just defined. First, we'll expand out H_t, J_t, K_t and L_t in terms of the more fundamental quantities.
  H_t = X_t R_t^T E_t^{0.5}
  J_t = E_t^{0.5} R_t X_t^T X_t
  K_t = E_t^{0.5} R_t X_t^T X_t X_t^T X_t R_t^T E_t^{0.5}
  L_t = E_t^{0.5} R_t X_t^T X_t R_t^T E_t^{0.5}

We wrote above that
  Y_t = \eta R_t S_t + (1-\eta) (D_t + \rho_t I) R_t
so
  Y_t = \eta/N R_t X_t^T X_t + (1-\eta) (D_t + \rho_t I) R_t
      = \eta/N E_t^{-0.5} J_t + (1-\eta) (D_t + \rho_t I) R_t   (eqn:yt)
We will expand Z_t using the expression for Y_t in the line above:
  Z_t = Y_t Y_t^T
      =  (\eta/N)^2 E_t^{-0.5} J_t J_t^T E_t^{-0.5}
        +(\eta/N)(1-\eta) E_t^{-0.5} J_t R_t^T (D_t + \rho_t I)
        +(\eta/N)(1-\eta) (D_t + \rho_t I) R_t J_t^T E_t^{-0.5}
        +(1-\eta)^2 (D_t + \rho_t I)^2
      =  (\eta/N)^2 E_t^{-0.5} K_t E_t^{-0.5}
        +(\eta/N)(1-\eta) E_t^{-0.5} L_t E_t^{-0.5} (D_t + \rho_t I)
        +(\eta/N)(1-\eta) (D_t + \rho_t I) E_t^{-0.5} L_t E_t^{-0.5}
        +(1-\eta)^2 (D_t + \rho_t I)^2   (eqn:Zt)
We compute Z_t on the CPU using the expression above, and then do the symmetric eigenvalue decomposition (also on the CPU):
  Z_t = U_t C_t U_t^T.
and we make sure the eigenvalues are sorted from largest to smallest, for reasons that will be mentioned later.

Mathematically, no diagonal element of C_t can be less than (1-\eta)^2 \rho_t^2, and since negative or zero elements of C_t would cause us a problem later, we floor C_t to this value. (see below regarding how we ensure R_{t+1} has orthonormal rows).

We will continue the discussion below regarding what we do with C_t and U_t. Next, we need to digress briefly and describe how to compute tr(\hat{X}_t \hat{X}_t^T) and tr(X_t X_t^T), since these appear in expressions for \gamma_t (needed to produce the output \bar{X}_t), and for \rho_{t+1}. It happens that we need, for purposes of applying "max_change" in the neural net code, the squared 2-norm of each row of the output \bar{X}_t. In order to be able to compute \gamma_t, it's most convenient to compute this squared row-norm for each row of \hat{X}_t, as a vector, to compute tr(\hat{X}_t \hat{X}_t^T) from this vector as its sum, and to then work back to compute tr(X_t X_t^T) from the relation between \hat{X}_t and X_t. We can then scale the row-norms we computed for \hat{X}_t, so they apply to \bar{X}_t.

For current purposes, you can imagine that we computed tr(\hat{X}_t \hat{X}_t^T) directly. Using (from eqn:pt2)
  \hat{X}_t = X_t - X_t W_t^T W_t,
we can expand tr(\hat{X}_t \hat{X}_t^T) as:
  tr(\hat{X}_t \hat{X}_t^T) = tr(X_t X_t^T) + tr(X_t W_t^T W_t W_t^T W_t X_t^T) - 2 tr(X_t W_t^T W_t X_t^T)
      = tr(X_t X_t^T) + tr(W_t X_t^T X_t W_t^T W_t W_t^T) - 2 tr(W_t X_t^T X_t W_t^T)
      = tr(X_t X_t^T) + tr(L_t W_t W_t^T) - 2 tr(L_t)
      = tr(X_t X_t^T) + tr(L_t E_t) - 2 tr(L_t)
and all quantities have already been computed (or are quick to compute, such as the small traces on the right), except tr(X_t X_t^T), so we can write

  tr(X_t X_t^T) = tr(\hat{X}_t \hat{X}_t^T) - tr(L_t E_t) + 2 tr(L_t)
and the above expression can be used to obtain tr(X_t X_t^T). We can then do
  \gamma_t <-- sqrt(tr(X_t X_t^T) / tr(\hat{X}_t \hat{X}_t^T))
(or one if the denominator is zero), and then
  \bar{X}_t <-- \gamma_t \hat{X}_t
We can then output the per-row squared-l2-norms of \bar{X}_t by scaling those we computed from \hat{X}_t by \gamma_t^2.
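The recovery of tr(X_t X_t^T) and the resulting scale \gamma_t can be sketched as follows. GammaSketch is a hypothetical name, and the sketch assumes E_t is passed as its diagonal, so that tr(L_t E_t) is the sum of L_t(i,i) e_i.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper. Inputs:
//   tr_Xhat  tr(\hat{X}_t \hat{X}_t^T), i.e. the sum of squared row norms of \hat{X}_t
//   L        the R by R matrix L_t (row-major)
//   e        the diagonal of E_t (length R)
// Returns \gamma_t, or 1 if the denominator is zero.
double GammaSketch(double tr_Xhat,
                   const std::vector<std::vector<double>> &L,
                   const std::vector<double> &e) {
  assert(L.size() == e.size());
  double tr_L = 0.0, tr_LE = 0.0;
  for (std::size_t i = 0; i < e.size(); i++) {
    assert(L[i].size() == e.size());
    tr_L += L[i][i];
    tr_LE += L[i][i] * e[i];  // E_t is diagonal
  }
  // tr(X X^T) = tr(Xhat Xhat^T) - tr(L E) + 2 tr(L)
  const double tr_X = tr_Xhat - tr_LE + 2.0 * tr_L;
  if (tr_Xhat == 0.0) return 1.0;
  return std::sqrt(tr_X / tr_Xhat);
}
```

For a single row X_t = [2, 4] with W_t = [0.5, 0], we have \hat{X}_t = [1.5, 4], L_t = [1] and e = 0.25, so tr(X_t X_t^T) = 18.25 - 0.25 + 2 = 20, which matches 2^2 + 4^2.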

OK, the digression on how to compute \gamma_t and tr(X_t X_t^T) is over. We now return to the computation of R_{t+1}, W_{t+1}, \rho_{t+1}, D_{t+1} and E_{t+1}.

We found above in (eqn:rhot1)
  \rho_{t+1} = 1/(D - R) ( \eta tr(S_t) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5}) ).
Expanding out S_t from its definition in (eqn:St),
  \rho_{t+1} = 1/(D - R) ( \eta/N tr(X_t X_t^T) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5}) ).
We can compute this directly as all the quantities involved are already known or easy to compute. Next, from (eqn:dt1), we compute
  D_{t+1} = C_t^{0.5} - \rho_{t+1} I
At this point if \rho_{t+1} is smaller than some small value \epsilon, e.g. 1.0e-10, we set it to \epsilon; as mentioned, we do this to stop F_t approaching zero if all inputs are zero. Next, if any diagonal element D_{t+1,i,i} has absolute value less than \epsilon, we set it to +\epsilon. This is to ensure that diagonal elements of E are never zero, which would cause problems.
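The order of operations in this paragraph (form D_{t+1} from the un-floored \rho_{t+1}, then floor \rho_{t+1}, then floor small entries of D_{t+1}) can be sketched as below. The struct and function names are hypothetical, and using a single \epsilon for both floors follows the text above rather than any particular version of the implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for the floored results.
struct FlooredUpdate {
  double rho_t1;             // floored \rho_{t+1}
  std::vector<double> d_t1;  // diagonal of the floored D_{t+1}
};

// sqrt_c_t holds the diagonal of C_t^{0.5}.
FlooredUpdate FloorSketch(double rho_t1, const std::vector<double> &sqrt_c_t,
                          double epsilon) {
  assert(epsilon > 0.0);
  FlooredUpdate out;
  out.d_t1.resize(sqrt_c_t.size());
  // D_{t+1} = C_t^{0.5} - \rho_{t+1} I  (eqn:dt1), before any flooring.
  for (std::size_t i = 0; i < sqrt_c_t.size(); i++)
    out.d_t1[i] = sqrt_c_t[i] - rho_t1;
  // Floor \rho_{t+1} so that F_t cannot shrink to zero on all-zero input.
  out.rho_t1 = (rho_t1 < epsilon) ? epsilon : rho_t1;
  // Entries of D_{t+1} that are too close to zero are set to +epsilon.
  for (double &d : out.d_t1)
    if (std::fabs(d) < epsilon) d = epsilon;
  return out;
}
```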

Next, we compute (from eqn:betat2, eqn:etdef, eqn:tii),
  \beta_{t+1} = \rho_{t+1} (1+\alpha) + \alpha/D tr(D_{t+1})
  E_{t+1} = 1/\beta_{t+1} (D_{t+1}^{-1} + 1/\beta_{t+1} I)^{-1},
i.e.:
  e_{t+1,ii} = 1/(\beta_{t+1}/d_{t+1,ii} + 1)

We'll want to store D_{t+1}. We next want to compute W_{t+1}.

Before computing W_{t+1}, we need to find an expression for
  R_{t+1} = C_t^{-0.5} U_t^T Y_t
Expanding out Y_t using the expression in (eqn:yt),
  R_{t+1} = C_t^{-0.5} U_t^T (\eta/N E_t^{-0.5} J_t + (1-\eta) (D_t + \rho_t I) R_t)
          = (\eta/N C_t^{-0.5} U_t^T E_t^{-0.5}) J_t
           +((1-\eta) C_t^{-0.5} U_t^T (D_t + \rho_t I) E_t^{-0.5}) W_t

What we actually want is W_{t+1} = E_{t+1}^{0.5} R_{t+1}:
  W_{t+1} = (\eta/N E_{t+1}^{0.5} C_t^{-0.5} U_t^T E_t^{-0.5}) J_t
           +((1-\eta) E_{t+1}^{0.5} C_t^{-0.5} U_t^T (D_t + \rho_t I) E_t^{-0.5}) W_t
and to minimize the number of matrix-matrix multiplies we can factorize this as:
  W_{t+1} = A_t B_t
  A_t = (\eta/N) E_{t+1}^{0.5} C_t^{-0.5} U_t^T E_t^{-0.5}
  B_t = J_t + (1-\eta)/(\eta/N) (D_t + \rho_t I) W_t
[note: we use the fact that (D_t + \rho_t I) and E_t^{-0.5} commute because they are diagonal].

A_t is computed on the CPU and transferred from there to the GPU, B_t is computed on the GPU, and the multiplication of A_t with B_t is done on the GPU.

Keeping R_t orthogonal

Our method requires the R_t matrices to be orthogonal (which we define to mean that R_t R_t^T = I). If roundoff error causes this equality to be significantly violated, it could cause a problem for the stability of our method. We now address our method for making sure that the R_t values stay orthogonal. We do this in the algorithm described above, after creating W_{t+1}. This extra step is only executed if the condition number of C_t (i.e. the ratio of its largest to smallest diagonal element) exceeds a specified threshold, such as 1.0e+06 [this is tested before applying the floor to C_t]. The threshold was determined empirically by finding the largest value needed to ensure a certain level of orthogonality in R_{t+1}. For purposes of the present discussion, since R_{t+1} is not actually stored, define it as E_{t+1}^{-0.5} W_{t+1}. Define the following (and we will just use t instead of t+1 below, as all quantities have the same subscript):

O_t =(def) R_t R_t^T = E_t^{-0.5} W_t W_t^T E_t^{-0.5}

(and we would compute this by computing W_t W_t^T on the GPU, transferring it to the CPU, and doing the rest there). If O_t is not sufficiently close to the unit matrix, we can re-orthogonalize as follows: Do the Cholesky decomposition
  O_t = C C^T
Clearly C^{-1} O_t C^{-T} = I, so if we correct R_t with:
  R_t <-- C^{-1} R_t
we can ensure orthogonality. If R_t's first k rows are orthogonal, this transform will not affect them, because of its lower-triangular structure... this is good because (thanks to the eigenvalue sorting), the larger eigenvectors are first and it is more critical to keep them pointing in the same direction. Any loss of orthogonality will be dealt with by modifying the smaller eigenvectors. As a modification to W_t, this would be:
  W_t <-- (E_t^{0.5} C^{-1} E_t^{-0.5}) W_t,
and the matrix in parentheses is computed on the CPU, transferred to the GPU, and the multiplication is done there.
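The Cholesky-based correction can be sketched on the CPU as follows. This is an illustration only: it operates directly on R_t (the class stores W_t instead and applies the equivalent transform given above), and the Matrix alias and function name are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major; hypothetical type

// Re-orthonormalizes the rows of Rm (R by D, assumed full row rank) in place:
//   O = Rm Rm^T,  O = C C^T (C lower triangular),  Rm <-- C^{-1} Rm.
void ReorthogonalizeSketch(Matrix *Rm) {
  const std::size_t R = Rm->size();
  if (R == 0) return;
  const std::size_t D = (*Rm)[0].size();
  // Lower triangle of O = Rm Rm^T, stored in C.
  Matrix C(R, std::vector<double>(R, 0.0));
  for (std::size_t i = 0; i < R; i++)
    for (std::size_t j = 0; j <= i; j++)
      for (std::size_t d = 0; d < D; d++)
        C[i][j] += (*Rm)[i][d] * (*Rm)[j][d];
  // In-place Cholesky factorization, one column at a time.
  for (std::size_t j = 0; j < R; j++) {
    for (std::size_t k = 0; k < j; k++) C[j][j] -= C[j][k] * C[j][k];
    assert(C[j][j] > 0.0);  // fails if the rows are linearly dependent
    C[j][j] = std::sqrt(C[j][j]);
    for (std::size_t i = j + 1; i < R; i++) {
      for (std::size_t k = 0; k < j; k++) C[i][j] -= C[i][k] * C[j][k];
      C[i][j] /= C[j][j];
    }
  }
  // Forward substitution: new row i depends only on rows 0..i, so rows that
  // were already orthonormal at the top are left unchanged.
  for (std::size_t i = 0; i < R; i++) {
    for (std::size_t d = 0; d < D; d++) {
      double v = (*Rm)[i][d];
      for (std::size_t k = 0; k < i; k++) v -= C[i][k] * (*Rm)[k][d];
      (*Rm)[i][d] = v / C[i][i];
    }
  }
}
```

For example, starting from rows [1, 0] and [1, 1], the first row is already unit-norm and is left alone, while the second row becomes [0, 1].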

Initialization

Now, a note on what we do on time t = 0, i.e. for the first minibatch. We initialize R_0 to the top R eigenvectors of 1/N X_0^T X_0, where N is the minibatch size (the number of rows of X_0, which is passed to Init() as R0). If L is the corresponding RxR diagonal matrix of eigenvalues, then we will set D_0 = L - \rho_0 I. We set \rho_0 to ensure that
  tr(F_0) = 1/N tr(X_0 X_0^T),
  tr(D_0) + \rho_0 D = 1/N tr(X_0 X_0^T),
  tr(L) - \rho_0 R + \rho_0 D = 1/N tr(X_0 X_0^T)
  \rho_0 = (1/N tr(X_0 X_0^T) - tr(L)) / (D - R)

We then floor \rho_0 to \epsilon (e.g. 1.0e-10) and also floor the diagonal elements of D_0 to \epsilon; this ensures that we won't crash for zero inputs.

A note on multi-threading. This technique was really designed for use with a GPU, where we won't have multi-threading, but we want it to work also on a CPU, where we may have multiple worker threads. Our approach is as follows (we do this when we're about to start updating the parameters R_t, D_t, and derived quantities):

For time t > 0 (where the matrices are already initialized), before starting the part of the computation that updates the parameters (R_t, D_t, and derived quantities), we try to lock a mutex that guards the OnlinePreconditioner. If we can lock it right away, we go ahead and do the update, but if not, we just abandon the attempt to update those quantities.

We will have another mutex to ensure that when we access quantities like W_t, they are all "in sync" (and we don't access them while they are being written by another thread). This mutex will only be locked for short periods of time.
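The two-mutex scheme described above can be sketched with the standard library as follows. The class UpdaterSketch and its members are hypothetical stand-ins, not the actual OnlinePreconditioner code; only the pattern (a non-blocking attempt on the update mutex, plus a short-lived lock when publishing or reading state) is taken from the text.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the state guarded by the two mutexes.
class UpdaterSketch {
 public:
  // Returns true if this call performed the update, false if the update was
  // abandoned because another thread already held update_mutex_.
  bool MaybeUpdate(double new_rho) {
    std::unique_lock<std::mutex> update_lock(update_mutex_, std::try_to_lock);
    if (!update_lock.owns_lock()) {
      std::lock_guard<std::mutex> rw_lock(read_write_mutex_);
      num_updates_skipped_++;
      return false;
    }
    // ... the expensive computation of the new parameters would go here ...
    {
      // Short critical section: publish the new values so that readers
      // always see a consistent set.
      std::lock_guard<std::mutex> rw_lock(read_write_mutex_);
      rho_t_ = new_rho;
    }
    return true;
  }

  double Rho() {
    std::lock_guard<std::mutex> rw_lock(read_write_mutex_);
    return rho_t_;
  }

  int NumSkipped() {
    std::lock_guard<std::mutex> rw_lock(read_write_mutex_);
    return num_updates_skipped_;
  }

 private:
  std::mutex update_mutex_;      // held for the whole update
  std::mutex read_write_mutex_;  // held only briefly
  double rho_t_ = 0.0;
  int num_updates_skipped_ = 0;
};
```

The design choice is that a worker thread never waits for an update in progress: a skipped update only means the estimate is slightly staler for that minibatch.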

Note: it might be a good idea to make sure that the R_t still retain orthonormal rows even in the presence of roundoff, without errors accumulating. My instinct is that this isn't going to be a problem.

Definition at line 413 of file nnet-precondition-online.h.

Constructor & Destructor Documentation

◆ OnlinePreconditioner() [1/2]

OnlinePreconditioner ( )

Definition at line 27 of file nnet-precondition-online.cc.

Referenced by OnlinePreconditioner::GetUpdatePeriod().

                                           :
     rank_(40), update_period_(1), num_samples_history_(2000.0), alpha_(4.0),
     epsilon_(1.0e-10), delta_(5.0e-04), t_(-1),
     num_updates_skipped_(0), self_debug_(false) { }

◆ OnlinePreconditioner() [2/2]

OnlinePreconditioner ( const OnlinePreconditioner & other )

explicit

Definition at line 595 of file nnet-precondition-online.cc.

                                                                            :
     rank_(other.rank_), update_period_(other.update_period_),
     num_samples_history_(other.num_samples_history_),
     alpha_(other.alpha_), epsilon_(other.epsilon_), delta_(other.delta_),
     t_(other.t_), num_updates_skipped_(other.num_updates_skipped_),
     self_debug_(other.self_debug_), W_t_(other.W_t_),
     rho_t_(other.rho_t_), d_t_(other.d_t_) {
   // use default constructor for the mutexes.
 }

Member Function Documentation

◆ ComputeEt()

void ComputeEt	(	const VectorBase< BaseFloat > &	d_t,
		BaseFloat	beta_t,
		VectorBase< BaseFloat > *	e_t,
		VectorBase< BaseFloat > *	sqrt_e_t,
		VectorBase< BaseFloat > *	inv_sqrt_e_t
	)		const

private

Definition at line 577 of file nnet-precondition-online.cc.

References VectorBase< Real >::ApplyPow(), VectorBase< Real >::CopyFromVec(), rnnlm::d, VectorBase< Real >::Data(), VectorBase< Real >::Dim(), rnnlm::i, and VectorBase< Real >::InvertElements().

Referenced by OnlinePreconditioner::ComputeWt1(), OnlinePreconditioner::GetUpdatePeriod(), OnlinePreconditioner::PreconditionDirectionsInternal(), OnlinePreconditioner::ReorthogonalizeXt1(), and OnlinePreconditioner::SelfTest().

                                                                                 {
   // e_{tii} = 1/(\beta_t/d_{tii} + 1)
   int32 D = d_t.Dim();
   const BaseFloat *d = d_t.Data();
   BaseFloat *e = e_t->Data();
   for (int32 i = 0; i < D; i++)
     e[i] = 1.0 / (beta_t / d[i]  +  1);
   sqrt_e_t->CopyFromVec(*e_t);
   sqrt_e_t->ApplyPow(0.5);
   inv_sqrt_e_t->CopyFromVec(*sqrt_e_t);
   inv_sqrt_e_t->InvertElements();
 }

◆ ComputeWt1()

void ComputeWt1	(	int32	N,
		const VectorBase< BaseFloat > &	d_t,
		const VectorBase< BaseFloat > &	d_t1,
		BaseFloat	rho_t,
		BaseFloat	rho_t1,
		const MatrixBase< BaseFloat > &	U_t,
		const VectorBase< BaseFloat > &	sqrt_c_t,
		const VectorBase< BaseFloat > &	inv_sqrt_e_t,
		const CuMatrixBase< BaseFloat > &	W_t,
		CuMatrixBase< BaseFloat > *	J_t,
		CuMatrixBase< BaseFloat > *	W_t1
	)		const

private

Definition at line 501 of file nnet-precondition-online.cc.

References CuMatrixBase< Real >::AddDiagVecMat(), CuMatrixBase< Real >::AddMatMat(), OnlinePreconditioner::alpha_, OnlinePreconditioner::ComputeEt(), VectorBase< Real >::Dim(), OnlinePreconditioner::Eta(), rnnlm::i, VectorBase< Real >::InvertElements(), rnnlm::j, KALDI_ASSERT, kaldi::kNoTrans, kaldi::kTrans, kaldi::kUndefined, CuMatrixBase< Real >::NumCols(), and VectorBase< Real >::Sum().

Referenced by OnlinePreconditioner::GetUpdatePeriod(), and OnlinePreconditioner::PreconditionDirectionsInternal().

                                                                             {
 
   int32 R = d_t.Dim(), D = W_t.NumCols();
   BaseFloat eta = Eta(N);
 
   // \beta_{t+1} = \rho_{t+1} (1+\alpha) + \alpha/D tr(D_{t+1})
   BaseFloat beta_t1 = rho_t1 * (1.0 + alpha_) + alpha_ * d_t1.Sum() / D;
   KALDI_ASSERT(beta_t1 > 0.0);
   Vector<BaseFloat> e_t1(R, kUndefined), sqrt_e_t1(R, kUndefined),
       inv_sqrt_e_t1(R, kUndefined);
   ComputeEt(d_t1, beta_t1, &e_t1, &sqrt_e_t1, &inv_sqrt_e_t1);
   Vector<BaseFloat> inv_sqrt_c_t(sqrt_c_t);
   inv_sqrt_c_t.InvertElements();
 
   Vector<BaseFloat> w_t_coeff(R);
   for (int32 i = 0; i < R; i++)
     w_t_coeff(i) = (1.0 - eta) / (eta/N) * (d_t(i) + rho_t);
   CuVector<BaseFloat> w_t_coeff_gpu(w_t_coeff);
   // B_t = J_t + (1-\eta)/(\eta/N) (D_t + \rho_t I) W_t
   J_t->AddDiagVecMat(1.0, w_t_coeff_gpu, W_t, kNoTrans, 1.0);
 
   // A_t = (\eta/N) E_{t+1}^{0.5} C_t^{-0.5} U_t^T E_t^{-0.5}
   Matrix<BaseFloat> A_t(U_t, kTrans);
   for (int32 i = 0; i < R; i++) {
     BaseFloat i_factor = (eta / N) * sqrt_e_t1(i) * inv_sqrt_c_t(i);
     for (int32 j = 0; j < R; j++) {
       BaseFloat j_factor = inv_sqrt_e_t(j);
       A_t(i, j) *= i_factor * j_factor;
     }
   }
   // W_{t+1} = A_t B_t
   CuMatrix<BaseFloat> A_t_gpu(A_t);
   W_t1->AddMatMat(1.0, A_t_gpu, kNoTrans, *J_t, kNoTrans, 0.0);
 }

◆ ComputeZt()

void ComputeZt	(	int32	N,
		BaseFloat	rho_t,
		const VectorBase< BaseFloat > &	d_t,
		const VectorBase< BaseFloat > &	inv_sqrt_e_t,
		const MatrixBase< BaseFloat > &	K_t,
		const MatrixBase< BaseFloat > &	L_t,
		SpMatrix< double > *	Z_t
	)		const

private

Definition at line 546 of file nnet-precondition-online.cc.

References VectorBase< Real >::Add(), VectorBase< Real >::Dim(), OnlinePreconditioner::Eta(), rnnlm::i, and rnnlm::j.

Referenced by OnlinePreconditioner::GetUpdatePeriod(), and OnlinePreconditioner::PreconditionDirectionsInternal().

                                                                   {
   // Use doubles because the range of quantities in Z_t can get large (fourth
   // power of data), and we want to avoid overflow.  This routine is fast.
   BaseFloat eta = Eta(N);
   Vector<BaseFloat> d_t_rho_t(d_t);
   d_t_rho_t.Add(rho_t);  // now d_t_rho_t is diag(D_t + \rho_t I).
   double etaN = eta / N, eta1 = 1.0 - eta,
       etaN_sq = etaN * etaN, eta1_sq = eta1 * eta1,
       etaN_eta1 = etaN * eta1;
   int32 R = d_t.Dim();
   for (int32 i = 0; i < R; i++) {
     double inv_sqrt_e_t_i = inv_sqrt_e_t(i), d_t_rho_t_i = d_t_rho_t(i);
     for (int32 j = 0; j <= i; j++) {
       double inv_sqrt_e_t_j = inv_sqrt_e_t(j), d_t_rho_t_j = d_t_rho_t(j),
           L_t_i_j = 0.5 * (L_t(i, j) + L_t(j, i)),
           K_t_i_j = 0.5 * (K_t(i, j) + K_t(j, i));
       // See (eqn:Zt) in header.
       (*Z_t)(i, j) = etaN_sq * inv_sqrt_e_t_i * K_t_i_j * inv_sqrt_e_t_j
           + etaN_eta1 * inv_sqrt_e_t_i * L_t_i_j * inv_sqrt_e_t_j * d_t_rho_t_j
           + etaN_eta1 * d_t_rho_t_i * inv_sqrt_e_t_i * L_t_i_j * inv_sqrt_e_t_j
           + (i == j ? eta1_sq * d_t_rho_t_i * d_t_rho_t_i : 0.0);
     }
   }
 }

◆ Eta()

BaseFloat Eta ( int32 N ) const

private

Definition at line 492 of file nnet-precondition-online.cc.

References KALDI_ASSERT, and OnlinePreconditioner::num_samples_history_.

Referenced by OnlinePreconditioner::ComputeWt1(), OnlinePreconditioner::ComputeZt(), OnlinePreconditioner::GetUpdatePeriod(), and OnlinePreconditioner::PreconditionDirectionsInternal().

                                                  {
   KALDI_ASSERT(num_samples_history_ > 0.0);
   BaseFloat ans = 1.0 - exp(-N / num_samples_history_);
   // Don't let eta approach 1 too closely, as it can lead to NaN's appearing if
   // the input is all zero.
   if (ans > 0.9) ans = 0.9;
   return ans;
 }
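To get a feel for the numbers, here is a standalone sketch of the same formula as a free function (EtaSketch is a hypothetical name; the member function above reads num_samples_history_ from the object). With the constructor default of 2000 samples of history and a minibatch of 128 rows, the update rate comes out at roughly 0.062; for very large minibatches the cap of 0.9 applies.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical free-function version of Eta(): the update rate for a
// minibatch of N rows, given the desired history length in samples.
double EtaSketch(int N, double num_samples_history) {
  assert(num_samples_history > 0.0);
  double ans = 1.0 - std::exp(-N / num_samples_history);
  // Same cap as in the member function, to keep eta away from 1.
  if (ans > 0.9) ans = 0.9;
  return ans;
}
```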

◆ GetAlpha()

BaseFloat GetAlpha ( ) const

inline

Definition at line 424 of file nnet-precondition-online.h.

References OnlinePreconditioner::alpha_.

{ return alpha_; }

◆ GetNumSamplesHistory()

BaseFloat GetNumSamplesHistory ( ) const

inline

Definition at line 423 of file nnet-precondition-online.h.

References OnlinePreconditioner::num_samples_history_.

{ return num_samples_history_; }

◆ GetRank()

int32 GetRank ( ) const

inline

Definition at line 425 of file nnet-precondition-online.h.

References OnlinePreconditioner::rank_.

{ return rank_; }

◆ GetUpdatePeriod()

int32 GetUpdatePeriod ( ) const

inline

◆ Init()

void Init ( const CuMatrixBase< BaseFloat > & R0 )

private

Definition at line 123 of file nnet-precondition-online.cc.

References OnlinePreconditioner::d_t_, rnnlm::i, OnlinePreconditioner::InitDefault(), kaldi::kUndefined, CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), OnlinePreconditioner::PreconditionDirections(), OnlinePreconditioner::rank_, OnlinePreconditioner::rho_t_, OnlinePreconditioner::t_, and OnlinePreconditioner::W_t_.

Referenced by OnlinePreconditioner::GetUpdatePeriod(), and OnlinePreconditioner::PreconditionDirections().

                                                                  {
   int32 D = R0.NumCols();
   // for locking reasons it's better to use a different object.
   OnlinePreconditioner this_copy(*this);
   this_copy.InitDefault(D);
 
   CuMatrix<BaseFloat> R0_copy(R0.NumRows(), R0.NumCols(), kUndefined);
   // number of iterations with the same data from a pseudorandom start.
   // this is a faster way of starting than doing eigenvalue decomposition.
   int32 num_init_iters = 3;
   for (int32 i = 0; i < num_init_iters; i++) {
     BaseFloat scale;
     R0_copy.CopyFromMat(R0);
     this_copy.PreconditionDirections(&R0_copy, NULL, &scale);
   }
   rank_ = this_copy.rank_;
   W_t_.Swap(&this_copy.W_t_);
   d_t_.Swap(&this_copy.d_t_);
   rho_t_ = this_copy.rho_t_;
   t_ = 0;
 }

◆ InitDefault()

void InitDefault ( int32 D )

private

Definition at line 75 of file nnet-precondition-online.cc.

References OnlinePreconditioner::alpha_, OnlinePreconditioner::d_t_, OnlinePreconditioner::delta_, OnlinePreconditioner::epsilon_, OnlinePreconditioner::InitOrthonormalSpecial(), KALDI_ASSERT, KALDI_WARN, kaldi::kUndefined, OnlinePreconditioner::num_samples_history_, OnlinePreconditioner::rank_, OnlinePreconditioner::rho_t_, OnlinePreconditioner::t_, and OnlinePreconditioner::W_t_.

Referenced by OnlinePreconditioner::GetUpdatePeriod(), and OnlinePreconditioner::Init().

                                               {
   if (rank_ >= D) {
     KALDI_WARN << "Rank " << rank_ << " of online preconditioner is >= dim " << D
                << ", setting it to "
                << (D - 1) << " (but this is probably still too high)";
     rank_ = D - 1;
   }
   if (rank_ == 0) {
     // Dimension of input data was 1, so the natural gradient preconditioner
     // would always be the unit matrix.
     // We'll handle this as a special case, for generality.
     return;
   }
   KALDI_ASSERT(num_samples_history_ > 0.0 && num_samples_history_ <= 1.0e+6);
   KALDI_ASSERT(alpha_ >= 0.0);
   KALDI_ASSERT(rank_ > 0);
   KALDI_ASSERT(epsilon_ > 0.0 && epsilon_ <= 1.0e-05);  // plausible values.
   KALDI_ASSERT(delta_ > 0.0 && delta_ <= 1.0e-02);  // plausible values.
 
   // to initialize, in the equation
   //   F_t =(def) R_t^T D_t R_t + \rho_t I
   // we will set the orthogonal R_t to a special orthogonal matrix with no zero
   // rows or columns (see the function), rho_t to epsilon,
   // and D_t to epsilon.  But we don't store R_t directly.  Instead, we store
   //   W_t =(def)  E_t^{0.5} R_t,
   // where E_t =(def)  1/\beta_t (D_t^{-1} + 1/\beta_t I)^{-1}
   // from (eqn:tii),
   //  e_{tii} =   1/(\beta_t/d_{tii} + 1),
   // where
   // \beta_t =(def) \rho_t + \alpha/D tr(F_t)
   //         =      epsilon + alpha/D * (epsilon * D + epsilon * rank)
   //         =     epsilon * (1 + alpha * (D + rank) / D)
   // And  d_{tii} is epsilon, so
   //  e_{tii} =   1/((1 + alpha * (D + rank) / D) + 1)  [for each i.]
   //          =   1/(2 + alpha * (D + rank) / D).
   BaseFloat epsilon = epsilon_;  // we could make this a bit more.
   rho_t_ = epsilon;
   d_t_.Resize(rank_, kUndefined);
   d_t_.Set(epsilon);
   W_t_.Resize(rank_, D, kUndefined);
   // after the next line, W_t_ will store the orthogonal matrix R_t.
   InitOrthonormalSpecial(&W_t_);
   BaseFloat E_tii = 1.0 / ( 2.0 + (D + rank_) * alpha_ / D );
   // W_t =(def) E_t^{0.5} R_t.
   W_t_.Scale(sqrt(E_tii));
   t_ = 0;
 }
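
The arithmetic in the comment above can be checked numerically. The following is a minimal sketch, not part of Kaldi: the helper names `InitialEtii` and `InitialEtiiFromDefinitions` are hypothetical, and it assumes, as the comment states, that rho_t and every d_{tii} are initialized to epsilon.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of Kaldi): the closed form for the initial
// e_{tii} used above, 1 / (2 + alpha * (D + rank) / D).
double InitialEtii(int D, int rank, double alpha) {
  assert(D > 0 && rank > 0 && rank < D && alpha >= 0.0);
  return 1.0 / (2.0 + (D + rank) * alpha / D);
}

// The same quantity computed step by step from the definitions in the comment:
//   tr(F_t) = epsilon * D + epsilon * rank
//   beta_t  = rho_t + alpha / D * tr(F_t), with rho_t = epsilon
//   e_{tii} = 1 / (beta_t / d_{tii} + 1),  with d_{tii} = epsilon
double InitialEtiiFromDefinitions(int D, int rank, double alpha,
                                  double epsilon) {
  double trace_F = epsilon * D + epsilon * rank;
  double beta = epsilon + alpha / D * trace_F;
  return 1.0 / (beta / epsilon + 1.0);
}
```

For example, with D = 10, rank = 5 and alpha = 4 the closed form gives 1 / (2 + 15 * 4 / 10) = 1/8, and epsilon cancels out of the step-by-step version.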

◆ InitOrthonormalSpecial()

void InitOrthonormalSpecial ( CuMatrixBase< BaseFloat > * R )

staticprivate

This function creates a matrix with orthonormal rows that is like the following matrix, except with each row normalized to have unit 2-norm:

[ 1.1 0   1 0 1 0
  0   1.1 0 1 0 1 ]

The first nonzero element in each row is 1.1 rather than 1 for symmetry-breaking: we don't want any weighted sum of all these rows to be all ones, because the derivative in that direction can be zero in some architectures and it causes us to have to do an inefficient CPU-based renormalization.

Definition at line 45 of file nnet-precondition-online.cc.

References CuMatrixBase< Real >::AddElements(), CuMatrixBase< Real >::AddMatMat(), rnnlm::i, CuMatrixBase< Real >::IsUnit(), KALDI_ASSERT, kaldi::kNoTrans, kaldi::kTrans, CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), and CuMatrixBase< Real >::SetZero().

Referenced by OnlinePreconditioner::GetUpdatePeriod(), and OnlinePreconditioner::InitDefault().

                                                                             {
   int32 num_rows = R->NumRows(), num_cols = R->NumCols();
   KALDI_ASSERT(num_cols >= num_rows);
   R->SetZero();
   std::vector<MatrixElement<BaseFloat> > elems;
   elems.reserve(num_cols);
   BaseFloat first_elem = 1.1;
   for (int32 r = 0; r < num_rows; r++) {
     std::vector<int32> cols;  // columns that have an entry for this row
     for (int32 c = r; c < num_cols; c += num_rows)
       cols.push_back(c);
     BaseFloat normalizer = 1.0 / sqrt(first_elem * first_elem +
                                       cols.size() - 1);
     for (size_t i = 0; i < cols.size(); i++) {
       int32 c = cols[i];
       MatrixElement<BaseFloat> e = { r, c,
                                      normalizer * (i == 0 ? first_elem :
                                                    BaseFloat(1.0)) };
       elems.push_back(e);
     }
   }
   R->AddElements(1.0, elems);
   { // TODO: remove this testing code.
     CuMatrix<BaseFloat> prod(num_rows, num_rows);
     prod.AddMatMat(1.0, *R, kNoTrans, *R, kTrans, 0.0);
     KALDI_ASSERT(prod.IsUnit());
   }
 }
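
The construction can be reproduced without the CUDA matrix classes. The sketch below is an illustration only: it uses plain `std::vector` storage, and the names `MakeOrthonormalSpecial` and `MaxDeviationFromOrthonormal` are hypothetical. Rows are orthogonal because their nonzero columns never overlap, and each row is scaled to unit 2-norm.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical CPU-only sketch of the pattern (not the Kaldi implementation).
Matrix MakeOrthonormalSpecial(std::size_t num_rows, std::size_t num_cols) {
  assert(num_rows > 0 && num_cols >= num_rows);
  const double first_elem = 1.1;
  Matrix R(num_rows, std::vector<double>(num_cols, 0.0));
  for (std::size_t r = 0; r < num_rows; r++) {
    // Row r is nonzero at columns r, r + num_rows, r + 2 * num_rows, ...
    std::size_t count = (num_cols - r + num_rows - 1) / num_rows;
    double normalizer = 1.0 / std::sqrt(first_elem * first_elem + (count - 1));
    for (std::size_t i = 0, c = r; c < num_cols; i++, c += num_rows)
      R[r][c] = normalizer * (i == 0 ? first_elem : 1.0);
  }
  return R;
}

// Largest absolute deviation of R R^T from the identity matrix.
double MaxDeviationFromOrthonormal(const Matrix &R) {
  double worst = 0.0;
  for (std::size_t i = 0; i < R.size(); i++) {
    for (std::size_t j = 0; j < R.size(); j++) {
      double dot = 0.0;
      for (std::size_t c = 0; c < R[i].size(); c++) dot += R[i][c] * R[j][c];
      worst = std::max(worst, std::fabs(dot - (i == j ? 1.0 : 0.0)));
    }
  }
  return worst;
}
```

Calling `MakeOrthonormalSpecial(2, 6)` yields the 2 x 6 pattern shown in the description, with each row divided by sqrt(1.1^2 + 2).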

◆ operator=()

OnlinePreconditioner & operator= ( const OnlinePreconditioner & other )

Definition at line 605 of file nnet-precondition-online.cc.

References OnlinePreconditioner::alpha_, OnlinePreconditioner::d_t_, OnlinePreconditioner::delta_, OnlinePreconditioner::epsilon_, OnlinePreconditioner::num_samples_history_, OnlinePreconditioner::rank_, OnlinePreconditioner::rho_t_, OnlinePreconditioner::self_debug_, OnlinePreconditioner::t_, OnlinePreconditioner::update_period_, and OnlinePreconditioner::W_t_.

Referenced by OnlinePreconditioner::GetUpdatePeriod().

                                        {
   rank_ = other.rank_;
   update_period_ = other.update_period_;
   num_samples_history_ = other.num_samples_history_;
   alpha_ = other.alpha_;
   epsilon_ = other.epsilon_;
   delta_ = other.delta_;
   t_ = other.t_;
   self_debug_ = other.self_debug_;
   W_t_ = other.W_t_;
   rho_t_ = other.rho_t_;
   d_t_ = other.d_t_;
   return *this;
 }

◆ PreconditionDirections()

void PreconditionDirections	(	CuMatrixBase< BaseFloat > *	R,
		CuVectorBase< BaseFloat > *	row_prod,
		BaseFloat *	scale
	)

Definition at line 145 of file nnet-precondition-online.cc.

References CuVectorBase< Real >::AddDiagMat2(), OnlinePreconditioner::d_t_, OnlinePreconditioner::Init(), kaldi::kNoTrans, CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), OnlinePreconditioner::PreconditionDirectionsInternal(), CuMatrixBase< Real >::Range(), OnlinePreconditioner::read_write_mutex_, OnlinePreconditioner::rho_t_, OnlinePreconditioner::t_, and OnlinePreconditioner::W_t_.

Referenced by OnlinePreconditioner::GetUpdatePeriod(), OnlinePreconditioner::Init(), kaldi::nnet2::UnitTestPreconditionDirectionsOnline(), and AffineComponentPreconditionedOnline::Update().

                       {
   if (X_t->NumCols() == 1) {
     // If the dimension of the space equals one then our natural gradient update
     // with rescaling becomes a no-op, but the code wouldn't naturally handle it
     // because rank would be zero.  Support this as a special case.
     if (row_prod)
       row_prod->AddDiagMat2(1.0, *X_t, kNoTrans, 0.0);
     *scale = 1.0;
     return;
   }
 
   if (row_prod == NULL) {
     CuVector<BaseFloat> row_prod_tmp(X_t->NumRows());
     PreconditionDirections(X_t, &row_prod_tmp, scale);
     return;
   }
 
   read_write_mutex_.lock();
   if (t_ == -1) // not initialized
     Init(*X_t);
 
   // Now t_ >= 0.
   // We create local copies of the class variables... this is intended for
   // multi-threaded safety so we can't read them in an inconsistent state,
   // but we don't really waste anything here (a copy of W_t is needed anyway,
   // if we're to update it).
   int32 t = t_, R = W_t_.NumRows(), D = W_t_.NumCols();
   // space for W_t, J_t, K_t, L_t.
   CuMatrix<BaseFloat> WJKL_t(2 * R, D + R);
   WJKL_t.Range(0, R, 0, D).CopyFromMat(W_t_);
   BaseFloat rho_t(rho_t_);
   Vector<BaseFloat> d_t(d_t_);
   read_write_mutex_.unlock();
   PreconditionDirectionsInternal(t, rho_t, d_t, &WJKL_t, X_t, row_prod, scale);
 }
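
The locking pattern above (copy the shared state while holding the mutex, release it, then do the expensive work on the copy) can be sketched in isolation. This is a simplified illustration under stated assumptions: the `State` and `SharedState` types are hypothetical, and unlike the Kaldi code, which asserts that `t_` is unchanged at commit time, the sketch simply rejects a stale commit.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the preconditioner state (names are illustrative).
struct State {
  int t = 0;
  double rho = 0.0;
  std::vector<double> d;
};

class SharedState {
 public:
  // Copy all fields while holding the lock, so the caller never sees a
  // half-updated state; the lock is released before any expensive work.
  State Snapshot() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return state_;
  }

  // Commit a new state only if nobody else advanced t in the meantime.
  bool CommitIfCurrent(int expected_t, const State &next) {
    std::lock_guard<std::mutex> guard(mutex_);
    if (state_.t != expected_t) return false;  // stale update, drop it
    state_ = next;
    return true;
  }

 private:
  mutable std::mutex mutex_;
  State state_;
};
```

A worker would call `Snapshot()`, compute the next state from the copy, and then call `CommitIfCurrent()` with the `t` value it started from.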

◆ PreconditionDirectionsInternal()

void PreconditionDirectionsInternal	(	const int32	t,
		const BaseFloat	rho_t,
		const Vector< BaseFloat > &	d_t,
		CuMatrixBase< BaseFloat > *	WJKL_t,
		CuMatrixBase< BaseFloat > *	X_t,
		CuVectorBase< BaseFloat > *	row_prod,
		BaseFloat *	scale
	)

private

Definition at line 300 of file nnet-precondition-online.cc.

Referenced by OnlinePreconditioner::GetUpdatePeriod(), and OnlinePreconditioner::PreconditionDirections().

                       {
   int32 N = X_t->NumRows(),  // Minibatch size.
       D = X_t->NumCols(),  // Dimensions of vectors we're preconditioning
       R = rank_;  // Rank of correction to unit matrix.
   KALDI_ASSERT(R > 0 && R < D);
   BaseFloat eta = Eta(N);
 
   CuMatrix<BaseFloat> H_t(N, R);
   const CuSubMatrix<BaseFloat> W_t(*WJKL_t, 0, R, 0, D);
   // Below, WJ_t and LK_t are combinations of two matrices,
   // which we define in order to combine two separate multiplications into one.
   CuSubMatrix<BaseFloat> J_t(*WJKL_t, R, R, 0, D),
       L_t(*WJKL_t, 0, R, D, R),
       K_t(*WJKL_t, R, R, D, R),
       WJ_t(*WJKL_t, 0, 2 * R, 0, D),
       LK_t(*WJKL_t, 0, 2 * R, D, R);
 
   H_t.AddMatMat(1.0, *X_t, kNoTrans, W_t, kTrans, 0.0);  // H_t = X_t W_t^T
 
   bool locked = update_mutex_.try_lock();
   if (locked) {
     // Just hard-code it here that we do 10 updates before skipping any.
     const int num_initial_updates = 10;
     if (t_ > t || (num_updates_skipped_ < update_period_ - 1 &&
                    t_ >= num_initial_updates)) {
       update_mutex_.unlock();
       // We got the lock but we were already beaten to it by another thread, or
       // we don't want to update yet due to update_period_ > 1 (this saves
       // compute), so release the lock.
       locked = false;
     }
   }
 
   if (!locked) {
     // We're not updating the parameters, either because another thread is
     // working on updating them, or because another thread already did so from
     // the same or later starting point (making our update stale), or because
     // update_period_ > 1.  We just apply the preconditioning and return.
 
     // note: we don't bother with any locks before incrementing
     // num_updates_skipped_ below, because the worst that could happen is that,
     // on very rare occasions, we could skip one or two more updates than we
     // intended.
     num_updates_skipped_++;
 
     BaseFloat tr_Xt_XtT = TraceMatMat(*X_t, *X_t, kTrans);
     // X_hat_t = X_t - H_t W_t
     X_t->AddMatMat(-1.0, H_t, kNoTrans, W_t, kNoTrans, 1.0);
     // each element i of row_prod will be inner product of row i of X_hat_t with
     // itself.
     row_prod->AddDiagMat2(1.0, *X_t, kNoTrans, 0.0);
     BaseFloat tr_Xhat_XhatT = row_prod->Sum();
     KALDI_ASSERT(tr_Xhat_XhatT == tr_Xhat_XhatT);  // Check for NaN.
     BaseFloat gamma_t = (tr_Xhat_XhatT == 0.0 ? 1.0 :
                          sqrt(tr_Xt_XtT / tr_Xhat_XhatT));
     *scale = gamma_t;
     return;
   }
   J_t.AddMatMat(1.0, H_t, kTrans, *X_t, kNoTrans, 0.0);  // J_t = H_t^T X_t
 
   bool compute_lk_together = (N > D);
 
   if (compute_lk_together) {
     // do the following two multiplies in one operation...
     // note
     // L_t = W_t J_t^T
     // K_t = J_t J_t^T
     // Note: L_t was defined as L_t = J_t W_t^T, but it's actually symmetric,
     // so we can compute it as L_t = W_t J_t^T.
     LK_t.AddMatMat(1.0, WJ_t, kNoTrans, J_t, kTrans, 0.0);
   } else {
     K_t.SymAddMat2(1.0, J_t, kNoTrans, 0.0);
     L_t.SymAddMat2(1.0, H_t, kTrans, 0.0);
   }
 
   Matrix<BaseFloat> LK_cpu(LK_t);  // contains L and K on the CPU.
   SubMatrix<BaseFloat> L_t_cpu(LK_cpu, 0, R, 0, R),
       K_t_cpu(LK_cpu, R, R, 0, R);
   if (!compute_lk_together) {
     // the SymAddMat2 operations only set the lower triangle and diagonal.
     L_t_cpu.CopyLowerToUpper();
     K_t_cpu.CopyLowerToUpper();
   }
 
   // beta_t = \rho_t(1+\alpha) + \alpha/D tr(D_t)
   BaseFloat beta_t = rho_t * (1.0 + alpha_) + alpha_ * d_t.Sum() / D;
   Vector<BaseFloat> e_t(R), sqrt_e_t(R), inv_sqrt_e_t(R);
   ComputeEt(d_t, beta_t, &e_t, &sqrt_e_t, &inv_sqrt_e_t);
   KALDI_VLOG(5) << "e_t = " << e_t;
 
   // The double-precision Z_t here, and the scaling, is to avoid potential
   // overflow, because Z_t is proportional to the fourth power of data.
   SpMatrix<double> Z_t_double(R);
   ComputeZt(N, rho_t, d_t, inv_sqrt_e_t, K_t_cpu, L_t_cpu, &Z_t_double);
   BaseFloat z_t_scale = std::max<double>(1.0, Z_t_double.Trace());
   Z_t_double.Scale(1.0 / z_t_scale);
   SpMatrix<BaseFloat> Z_t_scaled(Z_t_double);
 
   Matrix<BaseFloat> U_t(R, R);
   Vector<BaseFloat> c_t(R);
   // do the symmetric eigenvalue decomposition Z_t = U_t C_t U_t^T.
   Z_t_scaled.Eig(&c_t, &U_t);
   SortSvd(&c_t, &U_t);
   c_t.Scale(z_t_scale);
 
   const BaseFloat condition_threshold = 1.0e+06;
   // must_reorthogonalize will be true if the last diagonal element of c_t is
   // negative, since we don't take the absolute value, but this is the right
   // thing anyway.
   bool must_reorthogonalize = (c_t(0) > condition_threshold * c_t(R - 1));
 
   BaseFloat c_t_floor = pow(rho_t * (1 - eta), 2);
   int32 nf;
   c_t.ApplyFloor(c_t_floor, &nf);
   if (nf > 0)
     must_reorthogonalize = true;
   if (nf > 0 && self_debug_) {
     KALDI_WARN << "Floored " << nf << " elements of C_t.";
   }
   BaseFloat tr_Xt_XtT_check;
   if (self_debug_)
     tr_Xt_XtT_check = TraceMatMat(*X_t, *X_t, kTrans);
 
   X_t->AddMatMat(-1.0, H_t, kNoTrans, W_t, kNoTrans, 1.0);  // X_hat_t = X_t - H_t W_t
   // set *row_prod to inner products of each row of X_hat_t with itself.
   row_prod->AddDiagMat2(1.0, *X_t, kNoTrans, 0.0);
 
   BaseFloat tr_Xhat_XhatT = row_prod->Sum();
   //  tr(X_t X_t^T) = tr(X_hat_t X_hat_t^T) - tr(L_t E_t) + 2 tr(L_t)
   double tr_Xt_XtT = tr_Xhat_XhatT;
   for (int32 i = 0; i < R; i++)
     tr_Xt_XtT += L_t_cpu(i, i) * (2.0 - e_t(i));
   if (self_debug_) {
     KALDI_ASSERT(ApproxEqual(tr_Xt_XtT, tr_Xt_XtT_check));
   }
   BaseFloat gamma_t = (tr_Xhat_XhatT == 0.0 ? 1.0 :
                        sqrt(tr_Xt_XtT / tr_Xhat_XhatT));
   *scale = gamma_t;
 
   Vector<BaseFloat> sqrt_c_t(c_t);
   sqrt_c_t.ApplyPow(0.5);
 
   // \rho_{t+1} = 1/(D - R) (\eta/N tr(X_t X_t^T) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5})).
   BaseFloat rho_t1 = 1.0 / (D - R) * (eta / N * tr_Xt_XtT
                                       + (1-eta)*(D * rho_t + d_t.Sum())
                                       - sqrt_c_t.Sum());
   // D_{t+1} = C_t^{0.5} - \rho_{t+1} I
   Vector<BaseFloat> d_t1(sqrt_c_t);
   d_t1.Add(-rho_t1);
   BaseFloat floor_val = std::max(epsilon_, delta_ * sqrt_c_t.Max());
   if (rho_t1 < floor_val)
     rho_t1 = floor_val;
   d_t1.ApplyFloor(floor_val);
 
   CuMatrix<BaseFloat> W_t1(R, D);  // W_{t+1}
   ComputeWt1(N, d_t, d_t1, rho_t, rho_t1, U_t, sqrt_c_t, inv_sqrt_e_t,
              W_t, &J_t, &W_t1);
 
   if (must_reorthogonalize) {
     if (self_debug_) {
       KALDI_WARN << "Reorthogonalizing.";
     }
     ReorthogonalizeXt1(d_t1,
                        rho_t1,
                        &W_t1,
                        &J_t,
                        &L_t);
   }
 
   // Commit the new parameters.
   read_write_mutex_.lock();
   KALDI_ASSERT(t_ == t);  // we already ensured this.
   t_ = t + 1;
   num_updates_skipped_ = 0;
   W_t_.Swap(&W_t1);
   d_t_.CopyFromVec(d_t1);
   rho_t_ = rho_t1;
 
   if (self_debug_)
     SelfTest();
 
   read_write_mutex_.unlock();
   update_mutex_.unlock();
 }
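
The branch taken when no update is performed only applies the current preconditioner: it forms H_t = X_t W_t^T, replaces X_t with X_hat_t = X_t - H_t W_t, and returns gamma_t = sqrt(tr(X_t X_t^T) / tr(X_hat_t X_hat_t^T)). A minimal dense sketch of that step follows; the function name `ApplyPreconditioner` and the `std::vector` storage are assumptions made for illustration, not Kaldi's API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical dense sketch: X is N x D (modified in place), W is R x D.
// Returns gamma_t, the scale that restores the original Frobenius norm.
double ApplyPreconditioner(const Matrix &W, Matrix *X) {
  double tr_before = 0.0, tr_after = 0.0;
  for (std::vector<double> &x : *X) {
    for (double v : x) tr_before += v * v;
    // h = x W^T (one row of H_t), computed from the original row x.
    std::vector<double> h(W.size(), 0.0);
    for (std::size_t r = 0; r < W.size(); r++) {
      assert(W[r].size() == x.size());
      for (std::size_t c = 0; c < x.size(); c++) h[r] += x[c] * W[r][c];
    }
    // x_hat = x - h W
    for (std::size_t r = 0; r < W.size(); r++)
      for (std::size_t c = 0; c < x.size(); c++) x[c] -= h[r] * W[r][c];
    for (double v : x) tr_after += v * v;
  }
  // Same guard as the listing: avoid dividing by zero for an all-zero result.
  return tr_after == 0.0 ? 1.0 : std::sqrt(tr_before / tr_after);
}
```

Multiplying the modified `X` by the returned value gives a matrix with the same Frobenius norm as the input, which is how the `scale` output is meant to be used.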

◆ ReorthogonalizeXt1()

void ReorthogonalizeXt1	(	const VectorBase< BaseFloat > &	d_t1,
		BaseFloat	rho_t1,
		CuMatrixBase< BaseFloat > *	W_t1,
		CuMatrixBase< BaseFloat > *	temp_W,
		CuMatrixBase< BaseFloat > *	temp_O
	)

private

Definition at line 184 of file nnet-precondition-online.cc.

References CuMatrixBase< Real >::AddMatMat(), OnlinePreconditioner::alpha_, TpMatrix< Real >::Cholesky(), OnlinePreconditioner::ComputeEt(), CuMatrixBase< Real >::CopyFromMat(), MatrixBase< Real >::CopyFromTp(), rnnlm::i, TpMatrix< Real >::Invert(), SpMatrix< Real >::IsUnit(), rnnlm::j, KALDI_ERR, KALDI_WARN, kaldi::kNoTrans, kaldi::kTakeLower, kaldi::kUndefined, PackedMatrix< Real >::Max(), CuMatrixBase< Real >::MulRowsVec(), CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), MatrixBase< Real >::OrthogonalizeRows(), OnlinePreconditioner::self_debug_, VectorBase< Real >::Sum(), and CuMatrixBase< Real >::SymAddMat2().

Referenced by OnlinePreconditioner::GetUpdatePeriod(), and OnlinePreconditioner::PreconditionDirectionsInternal().

                                      {
   // threshold is a configuration value: a desired threshold on orthogonality,
   // below which we won't reorthogonalize.
   const BaseFloat threshold = 1.0e-03;
 
   int32 R = W_t1->NumRows(), D = W_t1->NumCols();
   BaseFloat beta_t1 = rho_t1 * (1.0 + alpha_) + alpha_ * d_t1.Sum() / D;
   Vector<BaseFloat> e_t1(R, kUndefined), sqrt_e_t1(R, kUndefined),
       inv_sqrt_e_t1(R, kUndefined);
   ComputeEt(d_t1, beta_t1, &e_t1, &sqrt_e_t1, &inv_sqrt_e_t1);
 
   temp_O->SymAddMat2(1.0, *W_t1, kNoTrans, 0.0);
   // O_t =  E_t^{-0.5} W_t W_t^T E_t^{-0.5}
   Matrix<BaseFloat> O_mat(*temp_O);
   SpMatrix<BaseFloat> O(O_mat, kTakeLower);
   for (int32 i = 0; i < R; i++) {
     BaseFloat i_factor = inv_sqrt_e_t1(i);
     for (int32 j = 0; j <= i; j++) {
       BaseFloat j_factor = inv_sqrt_e_t1(j);
       O(i, j) *= i_factor * j_factor;
     }
   }
   if (O.IsUnit(threshold)) {
     if (self_debug_) {
       KALDI_WARN << "Not reorthogonalizing since already orthogonal: " << O;
     }
     return;
   }
   TpMatrix<BaseFloat> C(R);
   try {
     C.Cholesky(O);
     C.Invert();  // Now it's C^{-1}.
     if (!(C.Max() < 100.0))
       KALDI_ERR << "Cholesky out of expected range, "
                 << "reorthogonalizing with Gram-Schmidt";
   } catch (...) {
     // We do a Gram-Schmidt orthogonalization, which is a bit less efficient but
     // more robust than the method using Cholesky.
     KALDI_WARN << "Cholesky or Invert() failed while re-orthogonalizing R_t. "
                << "Re-orthogonalizing on CPU.";
     Matrix<BaseFloat> cpu_W_t1(*W_t1);
     cpu_W_t1.OrthogonalizeRows();
     W_t1->CopyFromMat(cpu_W_t1);
     // at this point cpu_W_t1 represents R_{t+1}- it has orthonormal
     // rows.  Do: W_{t+1} = E_{t+1}^{0.5} R_{t+1}
     CuVector<BaseFloat> sqrt_e_t1_gpu(sqrt_e_t1);
     W_t1->MulRowsVec(sqrt_e_t1_gpu);
     return;
   }
   // Next, compute (E_{t+1}^{0.5} C^{-1} E_{t+1}^{-0.5}); the scaling uses the
   // time-(t+1) values e_t1, not the time-t ones.
   for (int32 i = 0; i < R; i++) {
     BaseFloat i_factor = sqrt_e_t1(i);
     for (int32 j = 0; j < i; j++) {
       // skip j == i because i_factor * j_factor == 1 for j == i.
       BaseFloat j_factor = inv_sqrt_e_t1(j);
       C(i, j) *= i_factor * j_factor;
     }
   }
   O_mat.CopyFromTp(C);
   temp_O->CopyFromMat(O_mat);
   temp_W->CopyFromMat(*W_t1);
   W_t1->AddMatMat(1.0, *temp_O, kNoTrans, *temp_W, kNoTrans, 0.0);
 }
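
The idea behind the Cholesky path is that if O = R R^T = C C^T with C lower triangular, then C^{-1} R has orthonormal rows. The sketch below illustrates this for the simplified case E = I, so the E^{0.5} and E^{-0.5} scalings are omitted. It is not the Kaldi code: the function names are hypothetical, and it uses forward substitution in place of the explicit `Invert()` followed by a matrix multiply.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

double Dot(const std::vector<double> &a, const std::vector<double> &b) {
  double sum = 0.0;
  for (std::size_t c = 0; c < a.size(); c++) sum += a[c] * b[c];
  return sum;
}

// Hypothetical sketch: make the rows of R (n x dim, full row rank) orthonormal.
void OrthonormalizeRowsCholesky(Matrix *R) {
  const std::size_t n = R->size(), dim = (*R)[0].size();
  // Gram matrix O = R R^T (lower triangle only).
  Matrix O(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; i++)
    for (std::size_t j = 0; j <= i; j++) O[i][j] = Dot((*R)[i], (*R)[j]);
  // In-place Cholesky: afterwards the lower triangle of O holds C, O = C C^T.
  for (std::size_t i = 0; i < n; i++) {
    for (std::size_t j = 0; j <= i; j++) {
      double sum = O[i][j];
      for (std::size_t k = 0; k < j; k++) sum -= O[i][k] * O[j][k];
      if (i == j) {
        assert(sum > 0.0);  // fails if the rows are linearly dependent
        O[i][i] = std::sqrt(sum);
      } else {
        O[i][j] = sum / O[j][j];
      }
    }
  }
  // Forward substitution: solve C Y = R row by row, overwriting R with Y.
  for (std::size_t i = 0; i < n; i++) {
    for (std::size_t k = 0; k < i; k++)
      for (std::size_t c = 0; c < dim; c++) (*R)[i][c] -= O[i][k] * (*R)[k][c];
    for (std::size_t c = 0; c < dim; c++) (*R)[i][c] /= O[i][i];
  }
}
```

When the factorization is ill-conditioned, the listing above falls back to Gram-Schmidt on the CPU; the sketch only asserts on a non-positive pivot.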

◆ SelfTest()

void SelfTest ( ) const

private

Definition at line 255 of file nnet-precondition-online.cc.

References CuSpMatrix< Real >::AddMat2(), OnlinePreconditioner::alpha_, OnlinePreconditioner::ComputeEt(), OnlinePreconditioner::d_t_, OnlinePreconditioner::delta_, OnlinePreconditioner::epsilon_, rnnlm::i, SpMatrix< Real >::IsUnit(), rnnlm::j, KALDI_ASSERT, KALDI_WARN, kaldi::kNoTrans, kaldi::kUndefined, OnlinePreconditioner::rho_t_, and OnlinePreconditioner::W_t_.

Referenced by OnlinePreconditioner::GetUpdatePeriod(), and OnlinePreconditioner::PreconditionDirectionsInternal().

                                           {
   KALDI_ASSERT(rho_t_ >= epsilon_);
   BaseFloat d_t_max = d_t_.Max(), d_t_min = d_t_.Min();
   KALDI_ASSERT(d_t_min >= epsilon_);
   KALDI_ASSERT(d_t_min > 0.9 * delta_ * d_t_max);
   KALDI_ASSERT(rho_t_ > 0.9 * delta_ * d_t_max);
 
   int32 D = W_t_.NumCols(), R = W_t_.NumRows();
   BaseFloat beta_t = rho_t_ * (1.0 + alpha_) + alpha_ * d_t_.Sum() / D;
   Vector<BaseFloat> e_t(R, kUndefined), sqrt_e_t(R, kUndefined),
       inv_sqrt_e_t(R, kUndefined);
   ComputeEt(d_t_, beta_t, &e_t, &sqrt_e_t, &inv_sqrt_e_t);
 
   CuSpMatrix<BaseFloat> S(R);
   S.AddMat2(1.0, W_t_, kNoTrans, 0.0);
   SpMatrix<BaseFloat> O(S);
   for (int32 i = 0; i < R; i++) {
     BaseFloat i_factor = inv_sqrt_e_t(i);
     for (int32 j = 0; j <= i; j++) {
       BaseFloat j_factor = inv_sqrt_e_t(j);
       O(i, j) *= i_factor * j_factor;
     }
   }
   if (!O.IsUnit(1.0e-04) || O(0, 0) != O(0, 0)) {
     BaseFloat worst_error = 0.0;
     int32 worst_i = 0, worst_j = 0;
     for (int32 i = 0; i < R; i++) {
       for (int32 j = 0; j < R; j++) {
         BaseFloat elem = O(i, j);
         BaseFloat error = fabs(elem - (i == j ? 1.0 : 0.0));
         if (error > worst_error || error != error) {
           worst_error = error;
           worst_i = i;
           worst_j = j;
         }
       }
     }
     if (worst_error > 1.0e-02 || worst_error != worst_error) {
       KALDI_WARN << "Failed to verify W_t (worst error: O[" << worst_i << ','
                  << worst_j << "] = " << O(worst_i, worst_j)
                  << ", d_t = " << d_t_;
     }
   }
 }
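
The `error != error` and `worst_error != worst_error` comparisons above rely on the fact that a NaN compares unequal to itself, so a NaN entry is always treated as the worst error. A small standalone sketch of that idiom (the helper `WorstError` is hypothetical, not part of Kaldi):

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper: largest |value - target| over a list, where a NaN
// error always replaces the current worst, as in the `error != error` test.
double WorstError(const std::vector<double> &values, double target) {
  double worst = 0.0;
  for (double v : values) {
    double error = std::fabs(v - target);
    // `error > worst` is false whenever either side is NaN, so the NaN case
    // needs its own check; once worst is NaN it stays NaN.
    if (error > worst || error != error) worst = error;
  }
  return worst;
}
```

A caller can then detect the NaN case with `w != w`, which mirrors the final check before the warning is printed. This assumes IEEE floating-point semantics, which aggressive compiler options such as fast-math modes may not preserve.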

◆ SetAlpha()

void SetAlpha ( BaseFloat alpha )

Definition at line 634 of file nnet-precondition-online.cc.

References OnlinePreconditioner::alpha_, and KALDI_ASSERT.

                                                    {
   KALDI_ASSERT(alpha >= 0.0);
   alpha_ = alpha;
 }

◆ SetNumSamplesHistory()

void SetNumSamplesHistory ( BaseFloat num_samples_history )

Definition at line 629 of file nnet-precondition-online.cc.

References KALDI_ASSERT, and OnlinePreconditioner::num_samples_history_.

                                                                              {
   KALDI_ASSERT(num_samples_history > 0.0 &&
                num_samples_history < 1.0e+6);
   num_samples_history_ = num_samples_history;
 }

◆ SetRank()

void SetRank ( int32 rank )

Definition at line 621 of file nnet-precondition-online.cc.

References KALDI_ASSERT, and OnlinePreconditioner::rank_.

Referenced by kaldi::nnet2::UnitTestPreconditionDirectionsOnline().

                                              {
   KALDI_ASSERT(rank > 0);
   rank_ = rank;
 }

◆ SetUpdatePeriod()

void SetUpdatePeriod ( int32 update_period )

Definition at line 625 of file nnet-precondition-online.cc.

References KALDI_ASSERT, and OnlinePreconditioner::update_period_.

                                                               {
   KALDI_ASSERT(update_period > 0);
   update_period_ = update_period;
 }

◆ TurnOnDebug()

void TurnOnDebug ( )

inline

Definition at line 422 of file nnet-precondition-online.h.

References OnlinePreconditioner::self_debug_.

Referenced by kaldi::nnet2::UnitTestPreconditionDirectionsOnline().

{ self_debug_ = true; }

Member Data Documentation

◆ alpha_

BaseFloat alpha_

private

Definition at line 531 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::ComputeWt1(), OnlinePreconditioner::GetAlpha(), OnlinePreconditioner::InitDefault(), OnlinePreconditioner::operator=(), OnlinePreconditioner::PreconditionDirectionsInternal(), OnlinePreconditioner::ReorthogonalizeXt1(), OnlinePreconditioner::SelfTest(), and OnlinePreconditioner::SetAlpha().

◆ d_t_

Vector<BaseFloat> d_t_

private

Definition at line 559 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::Init(), OnlinePreconditioner::InitDefault(), OnlinePreconditioner::operator=(), OnlinePreconditioner::PreconditionDirections(), OnlinePreconditioner::PreconditionDirectionsInternal(), and OnlinePreconditioner::SelfTest().

◆ delta_

BaseFloat delta_

private

Definition at line 543 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::InitDefault(), OnlinePreconditioner::operator=(), OnlinePreconditioner::PreconditionDirectionsInternal(), and OnlinePreconditioner::SelfTest().

◆ epsilon_

BaseFloat epsilon_

private

Definition at line 537 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::InitDefault(), OnlinePreconditioner::operator=(), OnlinePreconditioner::PreconditionDirectionsInternal(), and OnlinePreconditioner::SelfTest().

◆ num_samples_history_

BaseFloat num_samples_history_

private

Definition at line 527 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::Eta(), OnlinePreconditioner::GetNumSamplesHistory(), OnlinePreconditioner::InitDefault(), OnlinePreconditioner::operator=(), and OnlinePreconditioner::SetNumSamplesHistory().

◆ num_updates_skipped_

int32 num_updates_skipped_

private

Definition at line 552 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::PreconditionDirectionsInternal().

◆ rank_

int32 rank_

private

Definition at line 516 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::GetRank(), OnlinePreconditioner::Init(), OnlinePreconditioner::InitDefault(), OnlinePreconditioner::operator=(), OnlinePreconditioner::PreconditionDirectionsInternal(), and OnlinePreconditioner::SetRank().

◆ read_write_mutex_

std::mutex read_write_mutex_

private

Definition at line 563 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::PreconditionDirections(), and OnlinePreconditioner::PreconditionDirectionsInternal().

◆ rho_t_

BaseFloat rho_t_

private

Definition at line 558 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::Init(), OnlinePreconditioner::InitDefault(), OnlinePreconditioner::operator=(), OnlinePreconditioner::PreconditionDirections(), OnlinePreconditioner::PreconditionDirectionsInternal(), and OnlinePreconditioner::SelfTest().

◆ self_debug_

bool self_debug_

private

Definition at line 555 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::operator=(), OnlinePreconditioner::PreconditionDirectionsInternal(), OnlinePreconditioner::ReorthogonalizeXt1(), and OnlinePreconditioner::TurnOnDebug().

◆ t_

int32 t_

private

Definition at line 546 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::Init(), OnlinePreconditioner::InitDefault(), OnlinePreconditioner::operator=(), OnlinePreconditioner::PreconditionDirections(), and OnlinePreconditioner::PreconditionDirectionsInternal().

◆ update_mutex_

std::mutex update_mutex_

private

Definition at line 567 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::PreconditionDirectionsInternal().

◆ update_period_

int32 update_period_

private

Definition at line 521 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::GetUpdatePeriod(), OnlinePreconditioner::operator=(), OnlinePreconditioner::PreconditionDirectionsInternal(), and OnlinePreconditioner::SetUpdatePeriod().

◆ W_t_

CuMatrix<BaseFloat> W_t_

private

Definition at line 557 of file nnet-precondition-online.h.

Referenced by OnlinePreconditioner::Init(), OnlinePreconditioner::InitDefault(), OnlinePreconditioner::operator=(), OnlinePreconditioner::PreconditionDirections(), OnlinePreconditioner::PreconditionDirectionsInternal(), and OnlinePreconditioner::SelfTest().

The documentation for this class was generated from the following files:

nnet2/nnet-precondition-online.h
nnet2/nnet-precondition-online.cc
