Keywords for search: natural gradient, naturalgradient, NG-SGD.
#include <natural-gradient-online.h>
Public Member Functions

    OnlineNaturalGradient ()
    void SetRank (int32 rank)
    void SetUpdatePeriod (int32 update_period)
    void SetNumSamplesHistory (BaseFloat num_samples_history)
    void SetNumMinibatchesHistory (BaseFloat num_minibatches_history)
    void SetAlpha (BaseFloat alpha)
    void TurnOnDebug ()
    BaseFloat GetNumSamplesHistory () const
    BaseFloat GetNumMinibatchesHistory () const
    BaseFloat GetAlpha () const
    int32 GetRank () const
    int32 GetUpdatePeriod () const
    void Freeze (bool frozen)
    void PreconditionDirections (CuMatrixBase< BaseFloat > *X, BaseFloat *scale)
        This call implements the main functionality of this class.
    OnlineNaturalGradient (const OnlineNaturalGradient &other)
    OnlineNaturalGradient & operator= (const OnlineNaturalGradient &other)
    void Swap (OnlineNaturalGradient *other)
Static Private Member Functions

    static void InitOrthonormalSpecial (CuMatrixBase< BaseFloat > *R)
        This function creates a matrix with orthonormal rows that is like the following matrix, except with each row normalized to have unit 2-norm:
            [ 1.1 0   1   0   1   0
              0   1.1 0   1   0   1 ]
        The reason why the first element in each row is 1.1 and not 1 is for symmetry-breaking.
Detailed Description

Keywords for search: natural gradient, naturalgradient, NG-SGD.
This method is explained in the paper "Parallel training of DNNs with Natural Gradient and Parameter Averaging" by D. Povey, X. Zhang and S. Khudanpur, ICLR Workshop, 2015, where it is referred to as online NG-SGD. Note that the method exported from this header is just the core of the algorithm, and some outer-level parts of it are implemented in class NaturalGradientAffineComponent.
The rest of this extended comment describes the way we keep updated an estimate of the inverse of a scatter matrix, in an online way. This is the same as the estimation of one of the A or B quantities in the paper. This comment is slightly redundant with the paper (actually it precedes the paper), but we keep it in case it is useful in understanding our method.
We consider the problem of doing online estimation of a (scaled-identity plus low-rank) approximation of a Fisher matrix... since the Fisher matrix is a scatter of vector-valued derivatives and we will be given the derivatives (or at least terms in a factorization of the derivatives which need not concern us right now), we can just think of the present task as being the online accumulation of a (low-rank plus scaled-identity) approximation to a variance of a distribution with mean zero.
Later on we'll think about how to get easy access to the inverse of this approximate variance, which is what we really need.
Our approximation to the Fisher matrix (the scatter of derivatives) will be of the following form (and just think of this as an approximate variance matrix of some arbitrary quantities).
     F_t =(def) R_t^T D_t R_t + \rho_t I

(t is the minibatch index), where R_t is an R by D matrix with orthonormal rows (1 <= R < D is our chosen rank), D_t is a positive-definite diagonal matrix, and \rho_t > 0. Suppose the dimension of F_t is D. Let the vectors whose variance we are approximating be provided in minibatches of size N (N can vary from iteration to iteration, but it won't vary in the normal case, so we omit the subscript t). The batch of gradients is given as X_t \in \Re^{N x D}, i.e. each row is one of the vectors whose scatter we're estimating. On the t'th iteration, define the scatter S_t of the input vectors X_t as:
S_t =(def) 1/N X_t^T X_t (eqn:St)
(where N is the minibatch size). Be careful not to confuse the rank R with the matrix R_t (we would typeface the matrix in bold if this were not plain text, to make the distinction clearer). We want F_t to approach some kind of time-weighted average of the S_t quantities, to the extent permitted by the limitation of the rank R. We want the F_t quantities to stay "fresh" (since we'll be doing this in a SGD context and the parameters will be slowly changing). We use a constant 0 < \eta < 1 to control the updating rate. Our update for R_t is based on the power method. Define the smoothed scatter

     T_t =(def) \eta S_t + (1-\eta) F_t
[note: F_{t+1} will be set to a low-rank approximation of T_t, which is where the recursion comes in.]
We'll use this in place of the observed scatter S_t, to slow down the update. Defining
Y_t =(def) R_t T_t
which can be expanded as follows:
     Y_t = R_t (\eta S_t + (1-\eta) F_t)
         = R_t (\eta S_t + (1-\eta) (R_t^T D_t R_t + \rho_t I))
         = \eta R_t S_t + (1-\eta) (D_t + \rho_t I) R_t

It is useful to think of Y_t as having each of the top eigenvectors of the scatter scaled by the corresponding eigenvalue. We compute the following R by R matrix:
     Z_t =(def) Y_t Y_t^T
and do the symmetric eigenvalue decomposition
     Z_t = U_t C_t U_t^T
where C_t is diagonal and U_t orthogonal; the diagonal elements of C_t will be positive (since \rho_t > 0, T_t is positive definite; since R_t has full row rank and T_t is positive definite, Y_t has full row rank; hence Z_t is positive definite). The diagonal elements of C_t can be thought of as corresponding to the squares of our current estimate of the top eigenvalues of the scatter matrix. [we should check that no element of C_t is <= 0.]
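To make this concrete, here is a minimal NumPy sketch of one such power-method step (the real implementation is Kaldi C++/CUDA; the dimensions, constants and random data below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, eta, rho_t = 8, 3, 20, 0.1, 0.3
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T  # R x D, orthonormal rows
d_t = rng.uniform(0.5, 2.0, R)                        # diagonal of D_t
X_t = rng.standard_normal((N, D))                     # minibatch of vectors

S_t = X_t.T @ X_t / N                                 # (eqn:St)
F_t = R_t.T @ np.diag(d_t) @ R_t + rho_t * np.eye(D)  # current approximation
T_t = eta * S_t + (1 - eta) * F_t                     # smoothed scatter
Y_t = R_t @ T_t
Z_t = Y_t @ Y_t.T                                     # R x R
C_t, U_t = np.linalg.eigh(Z_t)                        # Z_t = U_t diag(C_t) U_t^T
assert (C_t > 0).all()                                # positive, since rho_t > 0
```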
It is easy to show that C_t^{-0.5} U_t^T Z_t U_t C_t^{-0.5} = I, so
     (C_t^{-0.5} U_t^T Y_t) (Y_t^T U_t C_t^{-0.5}) = I.
Define
     R_{t+1} =(def) C_t^{-0.5} U_t^T Y_t
and it's clear that R_{t+1} R_{t+1}^T = I. We will set
     D_{t+1} =(def) C_t^{0.5} - \rho_{t+1} I     (eqn:dt1)
which ensures that for each row of R_{t+1}, the variance of our scatter matrix F_{t+1} in that direction will be the square root of the corresponding diagonal element of C_t. This makes sense because, as we have pointed out, the diagonal elements of C_t can be thought of as corresponding to squared eigenvalues. But a proper treatment of this would require convergence analysis that would get quite complicated. We will choose \rho_{t+1} in order to ensure that tr(F_{t+1}) = tr(T_t).
For any t,
     tr(F_t) = D \rho_t + tr(D_t)
     tr(T_t) = \eta tr(S_t) + (1-\eta) tr(F_t)
             = \eta tr(S_t) + (1-\eta) (D \rho_t + tr(D_t))
Expanding out D_{t+1} from (eqn:dt1) in the expression for tr(F_{t+1}) below:
     tr(F_{t+1}) = D \rho_{t+1} + tr(D_{t+1})
                 = D \rho_{t+1} + tr(C_t^{0.5} - \rho_{t+1} I)
                 = (D - R) \rho_{t+1} + tr(C_t^{0.5})
and equating tr(F_{t+1}) with tr(T_t) (since F_{t+1} is supposed to be a low-rank approximation to T_t), we have
     (D - R) \rho_{t+1} + tr(C_t^{0.5}) = \eta tr(S_t) + (1-\eta) (D \rho_t + tr(D_t))
Solving for \rho_{t+1},
     \rho_{t+1} = 1/(D - R) (\eta tr(S_t) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5})).    (eqn:rhot1)
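Continuing the NumPy sketch (self-contained again, same arbitrary setup), one can check numerically that this choice of \rho_{t+1} makes tr(F_{t+1}) match tr(T_t), and that R_{t+1} has orthonormal rows:

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, eta, rho_t = 8, 3, 20, 0.1, 0.3
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T
d_t = rng.uniform(0.5, 2.0, R)
X_t = rng.standard_normal((N, D))
S_t = X_t.T @ X_t / N
F_t = R_t.T @ np.diag(d_t) @ R_t + rho_t * np.eye(D)
T_t = eta * S_t + (1 - eta) * F_t
Y_t = R_t @ T_t
C_t, U_t = np.linalg.eigh(Y_t @ Y_t.T)

rho_t1 = (eta * np.trace(S_t) + (1 - eta) * (D * rho_t + d_t.sum())
          - np.sqrt(C_t).sum()) / (D - R)             # (eqn:rhot1)
R_t1 = np.diag(C_t ** -0.5) @ U_t.T @ Y_t             # R_{t+1}
d_t1 = np.sqrt(C_t) - rho_t1                          # (eqn:dt1)
F_t1 = R_t1.T @ np.diag(d_t1) @ R_t1 + rho_t1 * np.eye(D)
assert np.allclose(R_t1 @ R_t1.T, np.eye(R))          # orthonormal rows
assert np.isclose(np.trace(F_t1), np.trace(T_t))      # trace preserved
```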
Note that it is technically possible that diagonal elements of D_{t+1} may be negative, but we can still show that F_{t+1} is strictly positive definite if F_t was strictly positive definite.
If the quantities for which we are computing the Fisher matrix are all zero for some reason, the sequence of F_t will geometrically approach zero, which would cause problems with inversion; to prevent this happening, after setting D_{t+1} and \rho_{t+1} as above, we floor \rho_{t+1} to a small value (like 1.0e-10).
OK, we have described the updating of R_t, D_t and \rho_t. Next, we need to figure out how to efficiently multiply by the inverse of F_t. Our experience from working with the old preconditioning method was that it's best not to use the inverse of the Fisher matrix itself, but a version of the Fisher matrix that's smoothed with some constant times the identity. Below, \alpha > 0 is a configuration value (e.g. 4.0 seemed to work well). The following formula is designed to ensure that the smoothing varies proportionally with the scale of F_t:

     G_t =(def) F_t + \alpha/D tr(F_t) I
          =     R_t^T D_t R_t + (\rho_t + \alpha/D tr(F_t)) I
          =     R_t^T D_t R_t + \beta_t I
where
     \beta_t =(def) \rho_t + \alpha/D tr(F_t)
             =      \rho_t (1+\alpha) + \alpha/D tr(D_t)     (eqn:betat2)
Define
     \hat{X}_t =(def) \beta_t X_t G_t^{-1}.
The factor of \beta_t is inserted arbitrarily, as it just happens to be convenient to put unit scale on X_t in the formula for \hat{X}_t; it will anyway be canceled out in the next step. Then our final preconditioned minibatch of vectors is:
     \bar{X}_t = \gamma_t \hat{X}_t
where
     \gamma_t = sqrt(tr(X_t X_t^T) / tr(\hat{X}_t \hat{X}_t^T)).
The factor of \gamma_t ensures that \bar{X}_t is scaled to have the same overall 2-norm as the input X_t. We found in previous versions of this method that this rescaling was helpful, as otherwise there are certain situations (e.g. at the start of training) where the preconditioned derivatives can get very large. Note that this rescaling introduces a small bias into the training, because now the scale applied to a given sample depends on that sample itself, albeit in an increasingly diluted way as the minibatch size gets large.
To efficiently compute G_t^{-1}, we will use the Woodbury matrix identity. Writing the Woodbury formula for the symmetric case,
     (A + U D U^T)^{-1} = A^{-1} - A^{-1} U (D^{-1} + U^T A^{-1} U)^{-1} U^T A^{-1}
Substituting A = \beta_t I, D = D_t and U = R_t^T, this becomes
     G_t^{-1} = 1/\beta_t I - 1/\beta_t^2 R_t^T (D_t^{-1} + 1/\beta_t I)^{-1} R_t
              = 1/\beta_t (I - R_t^T E_t R_t)
where
     E_t =(def) 1/\beta_t (D_t^{-1} + 1/\beta_t I)^{-1},     (eqn:etdef)
so
     e_{tii} = 1/\beta_t * 1/(1/d_{tii} + 1/\beta_t)         (eqn:tii)
             = 1/(\beta_t/d_{tii} + 1)
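As a sanity check on this inversion formula, here is a NumPy sketch (arbitrary constants, not the Kaldi code) comparing the Woodbury expression against a direct inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, alpha, rho_t = 8, 3, 4.0, 0.3
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T  # orthonormal rows
d_t = rng.uniform(0.5, 2.0, R)

F_t = R_t.T @ np.diag(d_t) @ R_t + rho_t * np.eye(D)
G_t = F_t + alpha / D * np.trace(F_t) * np.eye(D)     # smoothed Fisher matrix
beta_t = rho_t * (1 + alpha) + alpha / D * d_t.sum()  # (eqn:betat2)
e_t = 1.0 / (beta_t / d_t + 1.0)                      # (eqn:tii)
G_inv = (np.eye(D) - R_t.T @ np.diag(e_t) @ R_t) / beta_t
assert np.allclose(G_inv, np.linalg.inv(G_t))         # Woodbury agrees
```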
We would like an efficient-to-compute expression for \hat{X}_t, without too many separate invocations of kernels on the GPU:
     \hat{X}_t = \beta_t X_t G_t^{-1}
               = X_t - X_t R_t^T E_t R_t
For efficient operation on the GPU, we want to reduce the number of high-dimensional operations that we do (defining "high-dimensional" as anything involving D or N, but not R, since R is likely small, such as 20). We define
     W_t =(def) E_t^{0.5} R_t.
We will actually be storing W_t on the GPU rather than R_t, in order to reduce the number of operations on the GPU. We can now write:

     \hat{X}_t = X_t - X_t W_t^T W_t     (eqn:pt2)
The following, which we'll compute on the GPU, are going to be useful in computing quantities like Z_t:

     H_t =(def) X_t W_t^T     (dim is N by R)
     J_t =(def) H_t^T X_t     (dim is R by D)
         =      W_t X_t^T X_t
     K_t =(def) J_t J_t^T     (dim is R by R, symmetric).. transfer this to CPU.
     L_t =(def) H_t^T H_t     (dim is R by R, symmetric).. transfer this to CPU.
         =      W_t X_t^T X_t W_t^T
Note: L_t may also be computed as L_t = J_t W_t^T, which may be more efficient if D < N.

Note: after we have computed H_t we can directly compute
     \hat{X}_t = X_t - H_t W_t
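In NumPy terms (again a sketch with arbitrary data, not the Kaldi GPU kernels), these quantities and the identities just noted look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, beta_t = 8, 3, 20, 1.5
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T
d_t = rng.uniform(0.5, 2.0, R)
e_t = 1.0 / (beta_t / d_t + 1.0)
W_t = np.diag(np.sqrt(e_t)) @ R_t                    # W_t = E_t^{0.5} R_t (stored)
X_t = rng.standard_normal((N, D))

H_t = X_t @ W_t.T                                    # N x R
J_t = H_t.T @ X_t                                    # R x D
K_t = J_t @ J_t.T                                    # R x R -> would go to CPU
L_t = H_t.T @ H_t                                    # R x R -> would go to CPU
assert np.allclose(L_t, J_t @ W_t.T)                 # cheaper route when D < N
X_hat = X_t - H_t @ W_t                              # (eqn:pt2), reusing H_t
assert np.allclose(X_hat, X_t - X_t @ W_t.T @ W_t)
```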
We need to determine how Y_t and Z_t relate to the quantities we just defined. First, we'll expand out H_t, J_t, K_t and L_t in terms of the more fundamental quantities:
     H_t = X_t R_t^T E_t^{0.5}
     J_t = E_t^{0.5} R_t X_t^T X_t
     K_t = E_t^{0.5} R_t X_t^T X_t X_t^T X_t R_t^T E_t^{0.5}
     L_t = E_t^{0.5} R_t X_t^T X_t R_t^T E_t^{0.5}

We wrote above that
     Y_t = \eta R_t S_t + (1-\eta) (D_t + \rho_t I) R_t
so
     Y_t = \eta/N R_t X_t^T X_t + (1-\eta) (D_t + \rho_t I) R_t
         = \eta/N E_t^{-0.5} J_t + (1-\eta) (D_t + \rho_t I) R_t     (eqn:yt)
We will expand Z_t using the expression for Y_t in the line above:
     Z_t = Y_t Y_t^T
         =  (\eta/N)^2 E_t^{-0.5} J_t J_t^T E_t^{-0.5}
          + (\eta/N)(1-\eta) E_t^{-0.5} J_t R_t^T (D_t + \rho_t I)
          + (\eta/N)(1-\eta) (D_t + \rho_t I) R_t J_t^T E_t^{-0.5}
          + (1-\eta)^2 (D_t + \rho_t I)^2
         =  (\eta/N)^2 E_t^{-0.5} K_t E_t^{-0.5}
          + (\eta/N)(1-\eta) E_t^{-0.5} L_t E_t^{-0.5} (D_t + \rho_t I)
          + (\eta/N)(1-\eta) (D_t + \rho_t I) E_t^{-0.5} L_t E_t^{-0.5}
          + (1-\eta)^2 (D_t + \rho_t I)^2                            (eqn:Zt)
We compute Z_t on the CPU using the expression above, and then do the symmetric eigenvalue decomposition (also on the CPU):
     Z_t = U_t C_t U_t^T
and we make sure the eigenvalues are sorted from largest to smallest, for reasons that will be mentioned later.
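The following sketch verifies (eqn:yt) and (eqn:Zt) numerically: the R x R expression assembled from J_t, K_t and L_t (what would be computed on the CPU) agrees with the direct product Y_t Y_t^T. All data and constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, eta, rho_t, beta_t = 8, 3, 20, 0.1, 0.3, 1.5
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T
d_t = rng.uniform(0.5, 2.0, R)
e_t = 1.0 / (beta_t / d_t + 1.0)
W_t = np.diag(np.sqrt(e_t)) @ R_t
X_t = rng.standard_normal((N, D))

J_t = W_t @ X_t.T @ X_t
K_t = J_t @ J_t.T
L_t = J_t @ W_t.T
Ei = np.diag(e_t ** -0.5)                            # E_t^{-0.5}
P = np.diag(d_t + rho_t)                             # D_t + rho_t I
Z_t = ((eta / N) ** 2 * Ei @ K_t @ Ei
       + (eta / N) * (1 - eta) * Ei @ L_t @ Ei @ P
       + (eta / N) * (1 - eta) * P @ Ei @ L_t @ Ei
       + (1 - eta) ** 2 * P @ P)                     # (eqn:Zt)
Y_t = eta / N * Ei @ J_t + (1 - eta) * P @ R_t       # (eqn:yt)
assert np.allclose(Z_t, Y_t @ Y_t.T)
```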
Mathematically, no diagonal element of C_t can be less than (1-\eta)^2 \rho_t^2, and since negative or zero elements of C_t would cause us a problem later, we floor C_t to this value. (See below regarding how we ensure R_{t+1} has orthonormal rows.)
We will continue the discussion below regarding what we do with C_t and U_t. Next, we need to digress briefly and describe how to compute tr(\hat{X}_t \hat{X}_t^T) and tr(X_t X_t^T), since these appear in expressions for \gamma_t (needed to produce the output \bar{X}_t) and for \rho_{t+1}. It happens that we need, for purposes of applying "max_change" in the neural net code, the squared 2-norm of each row of the output \bar{X}_t. In order to be able to compute \gamma_t, it's most convenient to compute this squared row-norm for each row of \hat{X}_t, as a vector, to compute tr(\hat{X}_t \hat{X}_t^T) from this vector as its sum, and to then work back to compute tr(X_t X_t^T) from the relation between \hat{X}_t and X_t. We can then scale the row-norms we computed for \hat{X}_t, so they apply to \bar{X}_t.
For current purposes, you can imagine that we computed tr(\hat{X}_t \hat{X}_t^T) directly. Using (from eqn:pt2)
     \hat{X}_t = X_t - X_t W_t^T W_t,
we can expand tr(\hat{X}_t \hat{X}_t^T) as:
     tr(\hat{X}_t \hat{X}_t^T) = tr(X_t X_t^T) - 2 tr(X_t W_t^T W_t X_t^T)
                                 + tr(X_t W_t^T W_t W_t^T W_t X_t^T)
                               = tr(X_t X_t^T) - 2 tr(L_t) + tr(L_t E_t)
[using W_t W_t^T = E_t and the cyclic property of the trace]. Rearranging,
     tr(X_t X_t^T) = tr(\hat{X}_t \hat{X}_t^T) + 2 tr(L_t) - tr(L_t E_t)
and the above expression can be used to obtain tr(X_t X_t^T). We can then do
     \gamma_t <-- sqrt(tr(X_t X_t^T) / tr(\hat{X}_t \hat{X}_t^T))
(or one if the denominator is zero), and then
     \bar{X}_t <-- \gamma_t \hat{X}_t
We can then output the per-row squared 2-norms of \bar{X}_t by scaling those we computed from \hat{X}_t by \gamma_t^2.
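A sketch of this bookkeeping (per-row norms, the trace identity, and the rescaling; arbitrary data):

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, beta_t = 8, 3, 20, 1.5
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T
d_t = rng.uniform(0.5, 2.0, R)
e_t = 1.0 / (beta_t / d_t + 1.0)
W_t = np.diag(np.sqrt(e_t)) @ R_t
X_t = rng.standard_normal((N, D))

H_t = X_t @ W_t.T
L_t = H_t.T @ H_t
X_hat = X_t - H_t @ W_t
row_norms = (X_hat ** 2).sum(axis=1)                 # squared row norms of X_hat
tr_hat = row_norms.sum()                             # tr(X_hat X_hat^T)
tr_X = tr_hat + 2 * np.trace(L_t) - np.trace(L_t @ np.diag(e_t))
assert np.isclose(tr_X, (X_t ** 2).sum())            # recovered tr(X_t X_t^T)
gamma = np.sqrt(tr_X / tr_hat) if tr_hat > 0 else 1.0
X_bar = gamma * X_hat                                # final preconditioned output
row_norms_bar = gamma ** 2 * row_norms               # row norms for "max_change"
```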
OK, the digression on how to compute \gamma_t and tr(X_t X_t^T) is over. We now return to the computation of R_{t+1}, W_{t+1}, \rho_{t+1}, D_{t+1} and E_{t+1}.
We found above in (eqn:rhot1) that
     \rho_{t+1} = 1/(D - R) (\eta tr(S_t) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5})).
Expanding out S_t from its definition in (eqn:St),
     \rho_{t+1} = 1/(D - R) (\eta/N tr(X_t X_t^T) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5})).
We can compute this directly, as all the quantities involved are already known or easy to compute. Next, from (eqn:dt1), we compute
     D_{t+1} = C_t^{0.5} - \rho_{t+1} I
At this point if \rho_{t+1} is smaller than some small value \epsilon, e.g. 1.0e-10, we set it to \epsilon; as mentioned, we do this to stop F_t approaching zero if all inputs are zero. Next, if any diagonal element d_{t+1,ii} of D_{t+1} has absolute value less than \epsilon, we set it to +\epsilon. This is to ensure that diagonal elements of E are never zero, which would cause problems.
Next, we compute (from eqn:betat2, eqn:etdef, eqn:tii):
     \beta_{t+1} = \rho_{t+1} (1+\alpha) + \alpha/D tr(D_{t+1})
     E_{t+1}     = 1/\beta_{t+1} (D_{t+1}^{-1} + 1/\beta_{t+1} I)^{-1},
i.e.:
     e_{t+1,ii}  = 1/(\beta_{t+1}/d_{t+1,ii} + 1)
We'll want to store D_{t+1}. We next want to compute W_{t+1}.
Before computing W_{t+1}, we need to find an expression for
     R_{t+1} = C_t^{-0.5} U_t^T Y_t
Expanding out Y_t using the expression in (eqn:yt),
     R_{t+1} = C_t^{-0.5} U_t^T (\eta/N E_t^{-0.5} J_t + (1-\eta) (D_t + \rho_t I) R_t)
             =  (\eta/N C_t^{-0.5} U_t^T E_t^{-0.5}) J_t
              + ((1-\eta) C_t^{-0.5} U_t^T (D_t + \rho_t I) E_t^{-0.5}) W_t

What we actually want is W_{t+1} = E_{t+1}^{0.5} R_{t+1}:
     W_{t+1} =  (\eta/N E_{t+1}^{0.5} C_t^{-0.5} U_t^T E_t^{-0.5}) J_t
              + ((1-\eta) E_{t+1}^{0.5} C_t^{-0.5} U_t^T (D_t + \rho_t I) E_t^{-0.5}) W_t
and to minimize the number of matrix-matrix multiplies we can factorize this as:
     W_{t+1} = A_t B_t
         A_t = (\eta/N) E_{t+1}^{0.5} C_t^{-0.5} U_t^T E_t^{-0.5}
         B_t = J_t + (1-\eta)/(\eta/N) (D_t + \rho_t I) W_t
[note: we use the fact that (D_t + \rho_t I) and E_t^{-0.5} commute because they are diagonal].
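Putting the pieces together, this sketch runs one full update and checks that the factorized W_{t+1} = A_t B_t yields an R_{t+1} with orthonormal rows (the epsilon flooring is simplified relative to the text, which floors by absolute value; the orthonormality check is unaffected because the e_{t+1} factors cancel):

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, eta, rho_t, alpha = 8, 3, 20, 0.1, 0.3, 4.0
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T
d_t = rng.uniform(0.5, 2.0, R)
X_t = rng.standard_normal((N, D))
beta_t = rho_t * (1 + alpha) + alpha / D * d_t.sum()
e_t = 1.0 / (beta_t / d_t + 1.0)
W_t = np.diag(np.sqrt(e_t)) @ R_t
J_t = W_t @ X_t.T @ X_t

S_t = X_t.T @ X_t / N
F_t = R_t.T @ np.diag(d_t) @ R_t + rho_t * np.eye(D)
Y_t = R_t @ (eta * S_t + (1 - eta) * F_t)
C_t, U_t = np.linalg.eigh(Y_t @ Y_t.T)
C_t, U_t = C_t[::-1], U_t[:, ::-1]                   # sort largest first
rho_t1 = max((eta * np.trace(S_t) + (1 - eta) * (D * rho_t + d_t.sum())
              - np.sqrt(C_t).sum()) / (D - R), 1e-10)
d_t1 = np.maximum(np.sqrt(C_t) - rho_t1, 1e-10)      # simplified epsilon floor
beta_t1 = rho_t1 * (1 + alpha) + alpha / D * d_t1.sum()
e_t1 = 1.0 / (beta_t1 / d_t1 + 1.0)

A_t = (eta / N) * np.diag(np.sqrt(e_t1 / C_t)) @ U_t.T @ np.diag(e_t ** -0.5)
B_t = J_t + (1 - eta) / (eta / N) * np.diag(d_t + rho_t) @ W_t
W_t1 = A_t @ B_t                                     # new W_{t+1}
R_t1 = np.diag(e_t1 ** -0.5) @ W_t1                  # implied R_{t+1}
assert np.allclose(R_t1 @ R_t1.T, np.eye(R))
```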
A_t is computed on the CPU and transferred from there to the GPU, B_t is computed on the GPU, and the multiplication of A_t with B_t is done on the GPU.
Keeping R_t orthogonal
Our method requires the R_t matrices to be orthogonal (which we define to mean that R_t R_t^T = I). If roundoff error causes this equality to be significantly violated, it could cause a problem for the stability of our method. We now address our method for making sure that the R_t values stay orthogonal. We do this in the algorithm described above, after creating W_{t+1}. This extra step is only executed if the condition number of C_t (i.e. the ratio of its largest to smallest diagonal element) exceeds a specified threshold, such as 1.0e+06 [this is tested before applying the floor to C_t]. The threshold was determined empirically by finding the largest value needed to ensure a certain level of orthogonality in R_{t+1}. For purposes of the present discussion, since R_{t+1} is not actually stored, define it as E_{t+1}^{-0.5} W_{t+1}. Define the following (and we will just use t instead of t+1 below, as all quantities have the same subscript):
O_t =(def) R_t R_t^T = E_t^{-0.5} W_t W_t^T E_t^{-0.5}
(and we would compute this by computing W_t W_t^T on the GPU, transferring it to the CPU, and doing the rest there). If O_t is not sufficiently close to the unit matrix, we can re-orthogonalize as follows: Do the Cholesky decomposition O_t = C C^T Clearly C^{-1} O_t C^{-T} = I, so if we correct R_t with: R_t <– C^{-1} R_t we can ensure orthogonality. If R_t's first k rows are orthogonal, this transform will not affect them, because of its lower-triangular structure... this is good because (thanks to the eigenvalue sorting), the larger eigenvectors are first and it is more critical to keep them pointing in the same direction. Any loss of orthogonality will be dealt with by modifying the smaller eigenvectors. As a modification to W_t, this would be: W_t <– (E_t^{0.5} C^{-1} E_t^{-0.5}) W_t, and the matrix in parentheses is computed on the CPU, transferred to the GPU, and the multiplication is done there.
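A sketch of this re-orthogonalization (here we perturb W_t slightly so that the implied R_t has drifted, then restore orthonormality via the Cholesky factor; the perturbation size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 8, 3
e_t = rng.uniform(0.1, 0.9, R)
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T
R_t = R_t + 1e-3 * rng.standard_normal((R, D))       # simulate roundoff drift
W_t = np.diag(np.sqrt(e_t)) @ R_t

Ei = np.diag(e_t ** -0.5)
O_t = Ei @ (W_t @ W_t.T) @ Ei                        # = R_t R_t^T, done on CPU
C = np.linalg.cholesky(O_t)                          # O_t = C C^T
M = np.diag(np.sqrt(e_t)) @ np.linalg.inv(C) @ Ei    # E^{0.5} C^{-1} E^{-0.5}
W_t = M @ W_t                                        # corrected W_t
assert np.allclose(Ei @ W_t @ W_t.T @ Ei, np.eye(R), atol=1e-8)
```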
Initialization
Now, a note on what we do on time t = 0, i.e. for the first minibatch. We initialize R_0 to the top R eigenvectors of 1/N X_0^T X_0, where N is the minibatch size (num-rows of X_0). If L is the corresponding R by R diagonal matrix of eigenvalues, then we will set D_0 = L - \rho_0 I. We set \rho_0 to ensure that tr(F_0) = 1/N tr(X_0 X_0^T), i.e.
     D \rho_0 + tr(D_0) = 1/N tr(X_0 X_0^T)
     D \rho_0 + tr(L) - R \rho_0 = 1/N tr(X_0 X_0^T)
     \rho_0 = (1/N tr(X_0 X_0^T) - tr(L)) / (D - R)
We then floor \rho_0 to \epsilon (e.g. 1.0e-10) and also floor the diagonal elements of D_0 to \epsilon; this ensures that we won't crash for zero inputs.
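A sketch of this initialization (arbitrary data; the flooring does not trigger here, so the trace identity holds exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, eps = 8, 3, 20, 1e-10
X0 = rng.standard_normal((N, D))

lam, V = np.linalg.eigh(X0.T @ X0 / N)        # eigendecomposition of the scatter
lam, V = lam[::-1][:R], V[:, ::-1][:, :R]     # top R eigenvalues / eigenvectors
R0 = V.T                                      # R x D, orthonormal rows
rho0 = max(((X0 ** 2).sum() / N - lam.sum()) / (D - R), eps)
d0 = np.maximum(lam - rho0, eps)              # diagonal of D_0
F0 = R0.T @ np.diag(d0) @ R0 + rho0 * np.eye(D)
assert np.isclose(np.trace(F0), (X0 ** 2).sum() / N)
```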
A note on multi-threading. This technique was really designed for use with a GPU, where we won't have multi-threading, but we want it to work also on a CPU, where we may have multiple worker threads. Our approach is as follows (we do this when we're about to start updating the parameters R_t, D_t, and derived quantities):
For time t > 0 (where the matrices are already initialized), before starting the part of the computation that updates the parameters (R_t, D_t, and derived quantities), we try to lock a mutex that guards the OnlineNaturalGradient object. If we can lock it right away, we go ahead and do the update, but if not, we just abandon the attempt to update those quantities.
We will have another mutex to ensure that when we access quantities like W_t, they are all "in sync" (and we don't access them while they are being written by another thread). This mutex will only be locked for short periods of time.
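A sketch of the try-lock pattern in Python's threading module (the Kaldi code uses a C++ mutex; the function and variable names here are illustrative only):

```python
import threading

update_mutex = threading.Lock()   # guards R_t, D_t, rho_t and derived state

def precondition_directions(X):
    # ... preconditioning of X happens unconditionally ...
    if update_mutex.acquire(blocking=False):     # try-lock: never wait
        try:
            pass   # update R_t, D_t, rho_t and derived quantities here
        finally:
            update_mutex.release()
    # else: another thread is mid-update; skip updating this time.
```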
Note: it might be a good idea to make sure that the R_t still retain orthonormal rows even in the presence of roundoff, without errors accumulating. My instinct is that this isn't going to be a problem.
Definition at line 414 of file natural-gradient-online.h.
Constructor & Destructor Documentation

OnlineNaturalGradient ()

Definition at line 27 of file natural-gradient-online.cc.

Referenced by OnlineNaturalGradient::Freeze().
OnlineNaturalGradient (const OnlineNaturalGradient &other) [explicit]
Definition at line 578 of file natural-gradient-online.cc.
Member Function Documentation

ComputeEt() [private]
Definition at line 560 of file natural-gradient-online.cc.
References VectorBase< Real >::ApplyPow(), VectorBase< Real >::CopyFromVec(), rnnlm::d, VectorBase< Real >::Data(), VectorBase< Real >::Dim(), rnnlm::i, and VectorBase< Real >::InvertElements().
Referenced by OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::Freeze(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::ReorthogonalizeRt1(), and OnlineNaturalGradient::SelfTest().
ComputeWt1() [private]
Definition at line 484 of file natural-gradient-online.cc.
References CuMatrixBase< Real >::AddDiagVecMat(), CuMatrixBase< Real >::AddMatMat(), OnlineNaturalGradient::alpha_, OnlineNaturalGradient::ComputeEt(), VectorBase< Real >::Dim(), OnlineNaturalGradient::Eta(), rnnlm::i, VectorBase< Real >::InvertElements(), rnnlm::j, KALDI_ASSERT, kaldi::kNoTrans, kaldi::kTrans, kaldi::kUndefined, CuMatrixBase< Real >::NumCols(), and VectorBase< Real >::Sum().
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().
ComputeZt() [private]
Definition at line 529 of file natural-gradient-online.cc.
References VectorBase< Real >::Add(), VectorBase< Real >::Dim(), OnlineNaturalGradient::Eta(), rnnlm::i, and rnnlm::j.
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().
Eta() [private]

Definition at line 470 of file natural-gradient-online.cc.
References KALDI_ASSERT, OnlineNaturalGradient::num_minibatches_history_, and OnlineNaturalGradient::num_samples_history_.
Referenced by OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::ComputeZt(), OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().
void Freeze (bool frozen) [inline]
Definition at line 437 of file natural-gradient-online.h.
References OnlineNaturalGradient::ComputeEt(), OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::ComputeZt(), OnlineNaturalGradient::Eta(), OnlineNaturalGradient::frozen_, OnlineNaturalGradient::Init(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::InitOrthonormalSpecial(), OnlineNaturalGradient::OnlineNaturalGradient(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::ReorthogonalizeRt1(), OnlineNaturalGradient::SelfTest(), OnlineNaturalGradient::Swap(), and OnlineNaturalGradient::Updating().
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TimeHeightConvolutionComponent::FreezeNaturalGradient(), LstmNonlinearityComponent::FreezeNaturalGradient(), TdnnComponent::FreezeNaturalGradient(), NaturalGradientAffineComponent::FreezeNaturalGradient(), LinearComponent::FreezeNaturalGradient(), and NaturalGradientPerElementScaleComponent::FreezeNaturalGradient().
BaseFloat GetAlpha () const [inline]
Definition at line 432 of file natural-gradient-online.h.
References OnlineNaturalGradient::alpha_.
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TimeHeightConvolutionComponent::Info(), TdnnComponent::Info(), NaturalGradientAffineComponent::Info(), LinearComponent::Info(), TimeHeightConvolutionComponent::Write(), TdnnComponent::Write(), NaturalGradientAffineComponent::Write(), and LinearComponent::Write().
BaseFloat GetNumMinibatchesHistory () const [inline]
Definition at line 431 of file natural-gradient-online.h.
References OnlineNaturalGradient::num_minibatches_history_.
Referenced by TimeHeightConvolutionComponent::Info(), and TimeHeightConvolutionComponent::Write().
BaseFloat GetNumSamplesHistory () const [inline]
Definition at line 430 of file natural-gradient-online.h.
References OnlineNaturalGradient::num_samples_history_.
Referenced by TdnnComponent::Info(), NaturalGradientAffineComponent::Info(), LinearComponent::Info(), TdnnComponent::Write(), NaturalGradientAffineComponent::Write(), and LinearComponent::Write().
int32 GetRank () const [inline]
Definition at line 433 of file natural-gradient-online.h.
References OnlineNaturalGradient::rank_.
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TimeHeightConvolutionComponent::Info(), TdnnComponent::Info(), NaturalGradientAffineComponent::Info(), LinearComponent::Info(), TimeHeightConvolutionComponent::Write(), TdnnComponent::Write(), NaturalGradientAffineComponent::Write(), and LinearComponent::Write().
int32 GetUpdatePeriod () const [inline]
Definition at line 434 of file natural-gradient-online.h.
References OnlineNaturalGradient::update_period_.
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TdnnComponent::Info(), NaturalGradientAffineComponent::Info(), LinearComponent::Info(), NaturalGradientAffineComponent::Write(), and LinearComponent::Write().
Init() [private]
Definition at line 122 of file natural-gradient-online.cc.
References OnlineNaturalGradient::d_t_, OnlineNaturalGradient::frozen_, rnnlm::i, OnlineNaturalGradient::InitDefault(), kaldi::kUndefined, CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::rank_, OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::t_, and OnlineNaturalGradient::W_t_.
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirections().
InitDefault() [private]
Definition at line 71 of file natural-gradient-online.cc.
References OnlineNaturalGradient::alpha_, OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, OnlineNaturalGradient::epsilon_, OnlineNaturalGradient::InitOrthonormalSpecial(), KALDI_ASSERT, KALDI_WARN, kaldi::kUndefined, OnlineNaturalGradient::num_minibatches_history_, OnlineNaturalGradient::num_samples_history_, OnlineNaturalGradient::rank_, OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::t_, and OnlineNaturalGradient::W_t_.
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::Init().
static void InitOrthonormalSpecial (CuMatrixBase< BaseFloat > *R) [static, private]

This function creates a matrix with orthonormal rows that is like the following matrix, except with each row normalized to have unit 2-norm:
    [ 1.1 0   1   0   1   0
      0   1.1 0   1   0   1 ]
The reason why the first element in each row is 1.1 and not 1 is for symmetry-breaking:
we don't want any weighted sum of all these rows to be all ones, because the derivative in that direction can be zero in some architectures and it causes us to have to do an inefficient CPU-based renormalization.
Definition at line 46 of file natural-gradient-online.cc.
References CuMatrixBase< Real >::AddElements(), rnnlm::i, KALDI_ASSERT, CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), and CuMatrixBase< Real >::SetZero().
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::InitDefault().
OnlineNaturalGradient & operator= (const OnlineNaturalGradient &other)
Definition at line 588 of file natural-gradient-online.cc.
References OnlineNaturalGradient::alpha_, OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, OnlineNaturalGradient::epsilon_, OnlineNaturalGradient::num_samples_history_, OnlineNaturalGradient::rank_, OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::self_debug_, OnlineNaturalGradient::t_, OnlineNaturalGradient::update_period_, and OnlineNaturalGradient::W_t_.
Referenced by OnlineNaturalGradient::Freeze().
void PreconditionDirections (CuMatrixBase< BaseFloat > *X, BaseFloat *scale)
This call implements the main functionality of this class.
Parameters
    [in,out] X    This is both the input (X_t in the comment above, X in the paper) and the output (\hat{X}_t in the comment, X with a hat on it in the paper). Each row of X is viewed as a vector in some space, where we're estimating a smoothed Fisher matrix and then multiplying by the inverse of that smoothed Fisher matrix.
    [out] scale   If non-NULL, a scaling factor is written to here, and the output X should be multiplied by this factor by the user (we don't do it internally, to save an operation). The factor is chosen so that the vector 2-norm of X is the same after the natural gradient as it was before. (The pointer being NULL or non-NULL doesn't affect the magnitude of X; in any case the user will probably want to do this rescaling, the question being whether they want to do so manually or not.)
Definition at line 159 of file natural-gradient-online.cc.
References OnlineNaturalGradient::d_t_, OnlineNaturalGradient::Init(), kaldi::kTrans, CuMatrixBase< Real >::NumCols(), NVTX_RANGE, OnlineNaturalGradient::PreconditionDirectionsInternal(), CuMatrixBase< Real >::Range(), OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::t_, kaldi::TraceMatMat(), OnlineNaturalGradient::Updating(), and OnlineNaturalGradient::W_t_.
Referenced by LstmNonlinearityComponent::Backprop(), ConstantComponent::Backprop(), LinearComponent::Backprop(), PerElementOffsetComponent::Backprop(), ConstantFunctionComponent::Backprop(), ScaleAndOffsetComponent::BackpropInternal(), LstmNonlinearityComponent::ConsolidateMemory(), OnlineNaturalGradient::Freeze(), OnlineNaturalGradient::Init(), kaldi::nnet3::UnitTestPreconditionDirectionsOnline(), NaturalGradientRepeatedAffineComponent::Update(), NaturalGradientAffineComponent::Update(), NaturalGradientPerElementScaleComponent::Update(), TimeHeightConvolutionComponent::UpdateNaturalGradient(), and TdnnComponent::UpdateNaturalGradient().
PreconditionDirectionsInternal() [private]
Definition at line 324 of file natural-gradient-online.cc.
References VectorBase< Real >::Add(), CuMatrixBase< Real >::AddMatMat(), OnlineNaturalGradient::alpha_, VectorBase< Real >::ApplyFloor(), VectorBase< Real >::ApplyPow(), OnlineNaturalGradient::ComputeEt(), OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::ComputeZt(), MatrixBase< Real >::CopyLowerToUpper(), OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, SpMatrix< Real >::Eig(), OnlineNaturalGradient::epsilon_, OnlineNaturalGradient::Eta(), KALDI_ASSERT, KALDI_VLOG, KALDI_WARN, kaldi::kNoTrans, kaldi::kTrans, VectorBase< Real >::Max(), CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), NVTX_RANGE, OnlineNaturalGradient::rank_, OnlineNaturalGradient::ReorthogonalizeRt1(), OnlineNaturalGradient::rho_t_, PackedMatrix< Real >::Scale(), VectorBase< Real >::Scale(), OnlineNaturalGradient::self_debug_, OnlineNaturalGradient::SelfTest(), kaldi::SortSvd(), VectorBase< Real >::Sum(), SpMatrix< Real >::Trace(), and OnlineNaturalGradient::W_t_.
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirections().
ReorthogonalizeRt1() [private]
Definition at line 201 of file natural-gradient-online.cc.
References CuMatrixBase< Real >::AddMatMat(), OnlineNaturalGradient::alpha_, TpMatrix< Real >::Cholesky(), OnlineNaturalGradient::ComputeEt(), CuMatrixBase< Real >::CopyFromMat(), MatrixBase< Real >::CopyFromTp(), rnnlm::i, TpMatrix< Real >::Invert(), SpMatrix< Real >::IsUnit(), rnnlm::j, KALDI_WARN, kaldi::kNoTrans, kaldi::kTakeLower, kaldi::kUndefined, PackedMatrix< Real >::Max(), CuMatrixBase< Real >::MulRowsVec(), CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), MatrixBase< Real >::OrthogonalizeRows(), OnlineNaturalGradient::self_debug_, VectorBase< Real >::Sum(), and CuMatrixBase< Real >::SymAddMat2().
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().
SelfTest() [private]
Definition at line 279 of file natural-gradient-online.cc.
References CuSpMatrix< Real >::AddMat2(), OnlineNaturalGradient::alpha_, OnlineNaturalGradient::ComputeEt(), OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, OnlineNaturalGradient::epsilon_, rnnlm::i, SpMatrix< Real >::IsUnit(), rnnlm::j, KALDI_ASSERT, KALDI_WARN, kaldi::kNoTrans, kaldi::kUndefined, OnlineNaturalGradient::rho_t_, and OnlineNaturalGradient::W_t_.
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().
void SetAlpha (BaseFloat alpha)
Definition at line 623 of file natural-gradient-online.cc.
References OnlineNaturalGradient::alpha_, and KALDI_ASSERT.
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TimeHeightConvolutionComponent::InitFromConfig(), TdnnComponent::InitFromConfig(), NaturalGradientAffineComponent::InitFromConfig(), LinearComponent::InitFromConfig(), TimeHeightConvolutionComponent::Read(), TdnnComponent::Read(), and LinearComponent::Read().
void SetNumMinibatchesHistory (BaseFloat num_minibatches_history)
Definition at line 617 of file natural-gradient-online.cc.
References KALDI_ASSERT, and OnlineNaturalGradient::num_minibatches_history_.
Referenced by TimeHeightConvolutionComponent::InitFromConfig(), and TimeHeightConvolutionComponent::Read().
void SetNumSamplesHistory (BaseFloat num_samples_history)
Definition at line 612 of file natural-gradient-online.cc.
References KALDI_ASSERT, and OnlineNaturalGradient::num_samples_history_.
Referenced by TdnnComponent::InitFromConfig(), NaturalGradientAffineComponent::InitFromConfig(), LinearComponent::InitFromConfig(), LstmNonlinearityComponent::InitNaturalGradient(), TdnnComponent::Read(), and LinearComponent::Read().
void SetRank (int32 rank)
Definition at line 604 of file natural-gradient-online.cc.
References KALDI_ASSERT, and OnlineNaturalGradient::rank_.
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TimeHeightConvolutionComponent::InitFromConfig(), TdnnComponent::InitFromConfig(), NaturalGradientAffineComponent::InitFromConfig(), LinearComponent::InitFromConfig(), PerElementOffsetComponent::InitFromConfig(), LstmNonlinearityComponent::InitNaturalGradient(), LinearComponent::LinearComponent(), NaturalGradientAffineComponent::NaturalGradientAffineComponent(), TimeHeightConvolutionComponent::Read(), TdnnComponent::Read(), LinearComponent::Read(), PerElementOffsetComponent::Read(), and kaldi::nnet3::UnitTestPreconditionDirectionsOnline().
void SetUpdatePeriod (int32 update_period)
Definition at line 608 of file natural-gradient-online.cc.
References KALDI_ASSERT, and OnlineNaturalGradient::update_period_.
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TdnnComponent::InitFromConfig(), NaturalGradientAffineComponent::InitFromConfig(), LinearComponent::InitFromConfig(), PerElementOffsetComponent::InitFromConfig(), LstmNonlinearityComponent::InitNaturalGradient(), LinearComponent::LinearComponent(), NaturalGradientAffineComponent::NaturalGradientAffineComponent(), TdnnComponent::Read(), LinearComponent::Read(), and PerElementOffsetComponent::Read().
void Swap (OnlineNaturalGradient *other)
Definition at line 628 of file natural-gradient-online.cc.
References OnlineNaturalGradient::alpha_, OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, OnlineNaturalGradient::epsilon_, OnlineNaturalGradient::frozen_, OnlineNaturalGradient::num_minibatches_history_, OnlineNaturalGradient::num_samples_history_, OnlineNaturalGradient::rank_, OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::self_debug_, kaldi::swap(), OnlineNaturalGradient::t_, OnlineNaturalGradient::update_period_, and OnlineNaturalGradient::W_t_.
Referenced by TimeHeightConvolutionComponent::ConsolidateMemory(), LstmNonlinearityComponent::ConsolidateMemory(), TdnnComponent::ConsolidateMemory(), NaturalGradientRepeatedAffineComponent::ConsolidateMemory(), ConstantComponent::ConsolidateMemory(), NaturalGradientAffineComponent::ConsolidateMemory(), LinearComponent::ConsolidateMemory(), ConstantFunctionComponent::ConsolidateMemory(), NaturalGradientPerElementScaleComponent::ConsolidateMemory(), ScaleAndOffsetComponent::ConsolidateMemory(), and OnlineNaturalGradient::Freeze().
void TurnOnDebug () [inline]
Definition at line 429 of file natural-gradient-online.h.
References OnlineNaturalGradient::self_debug_.
Referenced by kaldi::nnet3::UnitTestPreconditionDirectionsOnline().
Updating() [private]
Definition at line 459 of file natural-gradient-online.cc.
References OnlineNaturalGradient::frozen_, OnlineNaturalGradient::t_, and OnlineNaturalGradient::update_period_.
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirections().
Member Data Documentation

alpha_ [private]
Definition at line 583 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::GetAlpha(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::ReorthogonalizeRt1(), OnlineNaturalGradient::SelfTest(), OnlineNaturalGradient::SetAlpha(), and OnlineNaturalGradient::Swap().
d_t_ [private]

Definition at line 614 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Init(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::SelfTest(), and OnlineNaturalGradient::Swap().
delta_ [private]
Definition at line 595 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::SelfTest(), and OnlineNaturalGradient::Swap().
epsilon_ [private]
Definition at line 589 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::SelfTest(), and OnlineNaturalGradient::Swap().
frozen_ [private]
Definition at line 603 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Freeze(), OnlineNaturalGradient::Init(), OnlineNaturalGradient::Swap(), and OnlineNaturalGradient::Updating().
num_minibatches_history_ [private]
Definition at line 578 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Eta(), OnlineNaturalGradient::GetNumMinibatchesHistory(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::SetNumMinibatchesHistory(), and OnlineNaturalGradient::Swap().
num_samples_history_ [private]
Definition at line 566 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Eta(), OnlineNaturalGradient::GetNumSamplesHistory(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::SetNumSamplesHistory(), and OnlineNaturalGradient::Swap().
rank_ [private]
Definition at line 553 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::GetRank(), OnlineNaturalGradient::Init(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::SetRank(), and OnlineNaturalGradient::Swap().
rho_t_ [private]
Definition at line 613 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Init(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::SelfTest(), and OnlineNaturalGradient::Swap().
self_debug_ [private]
Definition at line 610 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::ReorthogonalizeRt1(), OnlineNaturalGradient::Swap(), and OnlineNaturalGradient::TurnOnDebug().
t_ [private]
Definition at line 607 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Init(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::Swap(), and OnlineNaturalGradient::Updating().
update_period_ [private]
Definition at line 558 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::GetUpdatePeriod(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::SetUpdatePeriod(), OnlineNaturalGradient::Swap(), and OnlineNaturalGradient::Updating().
W_t_ [private]

Definition at line 612 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Init(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::SelfTest(), and OnlineNaturalGradient::Swap().