OnlineNaturalGradient Class Reference

Keywords for search: natural gradient, naturalgradient, NG-SGD. More...

#include <natural-gradient-online.h>

Collaboration diagram for OnlineNaturalGradient:

Public Member Functions

 OnlineNaturalGradient ()
 
void SetRank (int32 rank)
 
void SetUpdatePeriod (int32 update_period)
 
void SetNumSamplesHistory (BaseFloat num_samples_history)
 
void SetNumMinibatchesHistory (BaseFloat num_minibatches_history)
 
void SetAlpha (BaseFloat alpha)
 
void TurnOnDebug ()
 
BaseFloat GetNumSamplesHistory () const
 
BaseFloat GetNumMinibatchesHistory () const
 
BaseFloat GetAlpha () const
 
int32 GetRank () const
 
int32 GetUpdatePeriod () const
 
void Freeze (bool frozen)
 
void PreconditionDirections (CuMatrixBase< BaseFloat > *X, BaseFloat *scale)
 This call implements the main functionality of this class. More...
 
 OnlineNaturalGradient (const OnlineNaturalGradient &other)
 
OnlineNaturalGradient & operator= (const OnlineNaturalGradient &other)
 
void Swap (OnlineNaturalGradient *other)
 

Private Member Functions

void PreconditionDirectionsInternal (const BaseFloat rho_t, const BaseFloat tr_X_Xt, bool updating, const Vector< BaseFloat > &d_t, CuMatrixBase< BaseFloat > *WJKL_t, CuMatrixBase< BaseFloat > *X_t)
 
bool Updating () const
 
void ComputeEt (const VectorBase< BaseFloat > &d_t, BaseFloat beta_t, VectorBase< BaseFloat > *e_t, VectorBase< BaseFloat > *sqrt_e_t, VectorBase< BaseFloat > *inv_sqrt_e_t) const
 
void ComputeZt (int32 N, BaseFloat rho_t, const VectorBase< BaseFloat > &d_t, const VectorBase< BaseFloat > &inv_sqrt_e_t, const MatrixBase< BaseFloat > &K_t, const MatrixBase< BaseFloat > &L_t, SpMatrix< double > *Z_t) const
 
void ComputeWt1 (int32 N, const VectorBase< BaseFloat > &d_t, const VectorBase< BaseFloat > &d_t1, BaseFloat rho_t, BaseFloat rho_t1, const MatrixBase< BaseFloat > &U_t, const VectorBase< BaseFloat > &sqrt_c_t, const VectorBase< BaseFloat > &inv_sqrt_e_t, const CuMatrixBase< BaseFloat > &W_t, CuMatrixBase< BaseFloat > *J_t, CuMatrixBase< BaseFloat > *W_t1) const
 
void ReorthogonalizeRt1 (const VectorBase< BaseFloat > &d_t1, BaseFloat rho_t1, CuMatrixBase< BaseFloat > *W_t1, CuMatrixBase< BaseFloat > *temp_W, CuMatrixBase< BaseFloat > *temp_O)
 
void Init (const CuMatrixBase< BaseFloat > &R0)
 
void InitDefault (int32 D)
 
BaseFloat Eta (int32 N) const
 
void SelfTest () const
 

Static Private Member Functions

static void InitOrthonormalSpecial (CuMatrixBase< BaseFloat > *R)
 This function creates a matrix with orthonormal rows that is like the following matrix, except with each row normalized to have unit 2-norm: [ 1.1 0 1 0 1 0 ; 0 1.1 0 1 0 1 ]. The reason why the first element in each row is 1.1 and not 1 is for symmetry-breaking... More...
 

Private Attributes

int32 rank_
 
int32 update_period_
 
BaseFloat num_samples_history_
 
BaseFloat num_minibatches_history_
 
BaseFloat alpha_
 
BaseFloat epsilon_
 
BaseFloat delta_
 
bool frozen_
 
int32 t_
 
bool self_debug_
 
CuMatrix< BaseFloat > W_t_
 
BaseFloat rho_t_
 
Vector< BaseFloat > d_t_
 

Detailed Description

Keywords for search: natural gradient, naturalgradient, NG-SGD.

This method is explained in the paper "Parallel training of DNNs with Natural Gradient and Parameter Averaging" by D. Povey, X. Zhang and S. Khudanpur, ICLR Workshop, 2015, where it is referred to as online NG-SGD. Note that the method exported from this header is just the core of the algorithm, and some outer-level parts of it are implemented in class NaturalGradientAffineComponent.

The rest of this extended comment describes the way we keep updated an estimate of the inverse of a scatter matrix, in an online way. This is the same as the estimation of one of the A or B quantities in the paper. This comment is slightly redundant with the paper (actually it precedes the paper), but we keep it in case it is useful in understanding our method.

We consider the problem of doing online estimation of a (scaled-identity plus low-rank) approximation of a Fisher matrix... since the Fisher matrix is a scatter of vector-valued derivatives and we will be given the derivatives (or at least terms in a factorization of the derivatives which need not concern us right now), we can just think of the present task as being the online accumulation of a (low-rank plus scaled-identity) approximation to a variance of a distribution with mean zero.

Later on we'll think about how to get easy access to the inverse of this approximate variance, which is what we really need.

Our approximation to the Fisher matrix (the scatter of derivatives) will be of the following form (and just think of this as an approximate variance matrix of some arbitrary quantities).

F_t =(def) R_t^T D_t R_t + \rho_t I

(t is the minibatch index), where R_t is an R by D matrix with orthonormal rows (1 <= R < D is our chosen rank), D_t is a positive-definite diagonal matrix, and \rho_t > 0. Suppose the dimension of F_t is D. Let the vectors whose variance we are approximating be provided in minibatches of size N (N can vary from iteration to iteration, but it won't vary in the normal case, so we omit the subscript t). The batch of gradients is given as X_t \in \Re^{N x D}, i.e. each row is one of the vectors whose scatter we're estimating. On the t'th iteration, define the scatter S_t of the input vectors X_t as:

S_t =(def) 1/N X_t^T X_t (eqn:St)

(where N is the minibatch size). Be careful not to confuse the rank R with the input X_t (we would typeface X_t in bold if this were not plain text, to make the distinction clearer). We want F_t to approach some kind of time-weighted average of the S_t quantities, to the extent permitted by the limitation of the rank R. We want the F_t quantities to stay "fresh" (since we'll be doing this in an SGD context and the parameters will be slowly changing). We use a constant 0 < \eta < 1 to control the updating rate. Our update for R_t is based on the power method. Define the smoothed scatter

T_t =(def) \eta S_t + (1-\eta) F_t

[note: F_{t+1} will be set to a low-rank approximation of T_t, which is where the recursion comes in.]
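
For readers who prefer standard notation, the same definitions written out in LaTeX (this is purely a restatement of the formulas above):

   \begin{aligned}
   F_t &\;\overset{\mathrm{def}}{=}\; R_t^\top D_t R_t + \rho_t I
       \quad\text{(rank-$R$ plus scaled-identity estimate; $R_t$ has orthonormal rows, $D_t$ diagonal p.d., $\rho_t > 0$)}\\
   S_t &\;\overset{\mathrm{def}}{=}\; \tfrac{1}{N}\, X_t^\top X_t
       \quad\text{(scatter of the minibatch $X_t \in \mathbb{R}^{N \times D}$)}\\
   T_t &\;\overset{\mathrm{def}}{=}\; \eta\, S_t + (1-\eta)\, F_t
       \quad\text{(smoothed scatter; $F_{t+1}$ will be a rank-$R$ approximation of $T_t$)}
   \end{aligned}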

We'll use this in place of the observed scatter S_t, to slow down the update. Defining

Y_t =(def) R_t T_t

which can be expanded as follows:
   Y_t = R_t ( \eta S_t + (1-\eta) F_t )
       = R_t ( \eta S_t + (1-\eta) (R_t^T D_t R_t + \rho_t I) )
       = \eta R_t S_t + (1-\eta) (D_t + \rho_t I) R_t

It is useful to think of Y_t as having each of the top eigenvectors of the scatter scaled by the corresponding eigenvalue. We compute the following R by R matrix:
   Z_t =(def) Y_t Y_t^T
and do the symmetric eigenvalue decomposition
   Z_t = U_t C_t U_t^T
where C_t is diagonal and U_t orthogonal; the diagonal elements of C_t will be positive (since \rho_t > 0, T_t is positive definite; since R_t has full row rank and T_t is positive definite, Y_t has full row rank; hence Z_t is positive definite). The diagonal elements of C_t can be thought of as corresponding to the squares of our current estimate of the top eigenvalues of the scatter matrix. [we should check that no element of C_t is <= 0.]

It is easy to show that C_t^{-0.5} U_t^T Z_t U_t C_t^{-0.5} = I, so (C_t^{-0.5} U_t^T Y_t) (Y_t^T U_t C_t^{-0.5}) = I. Define R_{t+1} =(def) C_t^{-0.5} U_t^T Y_t

and it's clear that R_{t+1} R_{t+1}^T = I. We will set
   D_{t+1} =(def) C_t^{0.5} - \rho_{t+1} I      (eqn:dt1)

which ensures that for each row r of R_{t+1}, the variance of our scatter matrix F_{t+1} will be the square root of the corresponding diagonal element of C_t. This makes sense because, as we have pointed out, the diagonal elements of C_t can be thought of as corresponding to squared eigenvalues. But a proper treatment of this would require convergence analysis that would get quite complicated. We will choose \rho_{t+1} in order to ensure that tr(F_{t+1}) = tr(T_t).

For any t,
   tr(F_t) = D \rho_t + tr(D_t)
   tr(T_t) = \eta tr(S_t) + (1-\eta) tr(F_t)
           = \eta tr(S_t) + (1-\eta) (D \rho_t + tr(D_t))
Expanding out D_{t+1} from (eqn:dt1) in the expression for tr(F_{t+1}) below:
   tr(F_{t+1}) = D \rho_{t+1} + tr(D_{t+1})
               = D \rho_{t+1} + tr(C_t^{0.5} - \rho_{t+1} I)
               = (D - R) \rho_{t+1} + tr(C_t^{0.5})
and equating tr(F_{t+1}) with tr(T_t) (since F_{t+1} is supposed to be a low-rank approximation to T_t), we have
   (D - R) \rho_{t+1} + tr(C_t^{0.5}) = \eta tr(S_t) + (1-\eta) (D \rho_t + tr(D_t))

Solving for \rho_{t+1},
   \rho_{t+1} = 1/(D - R) (\eta tr(S_t) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5})).      (eqn:rhot1)
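
Collecting the steps derived so far, one refresh of the factored estimate can be summarized as follows (again just restating the formulas above, in LaTeX):

   \begin{aligned}
   Y_t &= R_t T_t = \eta\, R_t S_t + (1-\eta)(D_t + \rho_t I) R_t\\
   Z_t &= Y_t Y_t^\top = U_t C_t U_t^\top \quad\text{(symmetric $R \times R$ eigendecomposition)}\\
   R_{t+1} &= C_t^{-1/2} U_t^\top Y_t\\
   \rho_{t+1} &= \frac{\eta\,\mathrm{tr}(S_t) + (1-\eta)\bigl(D\rho_t + \mathrm{tr}(D_t)\bigr) - \mathrm{tr}(C_t^{1/2})}{D - R}\\
   D_{t+1} &= C_t^{1/2} - \rho_{t+1} I
   \end{aligned}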

Note that it is technically possible that diagonal elements of D_{t+1} may be negative, but we can still show that F_{t+1} is strictly positive definite if F_t was strictly positive definite.

If the quantities for which we are computing the Fisher matrix are all zero for some reason, the sequence of F_t will geometrically approach zero, which would cause problems with inversion; to prevent this happening, after setting D_{t+1} and \rho_{t+1} as above, we floor \rho_{t+1} to a small value (like 1.0e-10).

OK, we have described the updating of R_t, D_t and \rho_t. Next, we need to figure out how to efficiently multiply by the inverse of F_t. Our experience from working with the old preconditioning method was that it's best not to use the inverse of the Fisher matrix itself, but a version of the Fisher matrix that's smoothed with some constant times the identity. Below, \alpha > 0 is a configuration value (e.g. 4.0 seemed to work well). The following formula is designed to ensure that the smoothing varies proportionally with the scale of F_t:

G_t =(def) F_t + \alpha/D tr(F_t) I
         = R_t^T D_t R_t + (\rho_t + \alpha/D tr(F_t)) I
         = R_t^T D_t R_t + \beta_t I
where
   \beta_t =(def) \rho_t + \alpha/D tr(F_t)
           = \rho_t (1+\alpha) + \alpha/D tr(D_t)      (eqn:betat2)

Define
   \hat{X}_t =(def) \beta_t X_t G_t^{-1}.
The factor of \beta_t is inserted arbitrarily, as it just happens to be convenient to put unit scale on X_t in the formula for \hat{X}_t; it will anyway be canceled out in the next step. Then our final preconditioned minibatch of vectors is:
   \bar{X}_t = \gamma_t \hat{X}_t
where
   \gamma_t = sqrt(tr(X_t X_t^T) / tr(\hat{X}_t \hat{X}_t^T)).
The factor of \gamma_t ensures that \bar{X}_t is scaled to have the same overall 2-norm as the input X_t. We found in previous versions of this method that this rescaling was helpful, as otherwise there are certain situations (e.g. at the start of training) where the preconditioned derivatives can get very large. Note that this rescaling introduces a small bias into the training, because now the scale applied to a given sample depends on that sample itself, albeit in an increasingly diluted way as the minibatch size gets large.

To efficiently compute G_t^{-1}, we will use the Woodbury matrix identity. Writing the Woodbury formula for the symmetric case,
   (A + U D U^T)^{-1} = A^{-1} - A^{-1} U (D^{-1} + U^T A^{-1} U)^{-1} U^T A^{-1}
Substituting A = \beta_t I, D = D_t and U = R_t^T, this becomes
   G_t^{-1} = 1/\beta_t I - 1/\beta_t^2 R_t^T (D_t^{-1} + 1/\beta_t I)^{-1} R_t
            = 1/\beta_t (I - R_t^T E_t R_t)
where
   E_t =(def) 1/\beta_t (D_t^{-1} + 1/\beta_t I)^{-1},      (eqn:etdef)
so
   e_{tii} = 1/\beta_t * 1/(1/d_{tii} + 1/\beta_t)           (eqn:tii)
           = 1/(\beta_t/d_{tii} + 1)
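
As a concrete illustration of (eqn:betat2) and (eqn:tii), here is a minimal standalone C++ sketch of the E_t computation; it uses plain std::vector rather than Kaldi's vector types and is not the library code (the real implementation is ComputeEt(), shown further down):

#include <cmath>
#include <vector>

// Given the diagonal d of D_t, rho_t and the smoothing constant alpha,
// compute beta_t and the diagonals of E_t, E_t^{0.5} and E_t^{-0.5},
// following (eqn:betat2) and (eqn:tii).  D is the full dimension.
void ComputeEtSketch(const std::vector<float> &d, float rho, float alpha, int D,
                     float *beta, std::vector<float> *e,
                     std::vector<float> *sqrt_e, std::vector<float> *inv_sqrt_e) {
  float trace_d = 0.0f;
  for (float di : d) trace_d += di;
  // beta_t = rho_t (1 + alpha) + alpha/D tr(D_t)
  *beta = rho * (1.0f + alpha) + alpha * trace_d / D;
  e->resize(d.size());
  sqrt_e->resize(d.size());
  inv_sqrt_e->resize(d.size());
  for (size_t i = 0; i < d.size(); i++) {
    // e_{tii} = 1 / (beta_t / d_{tii} + 1)
    (*e)[i] = 1.0f / (*beta / d[i] + 1.0f);
    (*sqrt_e)[i] = std::sqrt((*e)[i]);
    (*inv_sqrt_e)[i] = 1.0f / (*sqrt_e)[i];
  }
}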

We would like an efficient-to-compute expression for \hat{X}_t, without too many separate invocations of kernels on the GPU.
   \hat{X}_t = \beta_t X_t G_t^{-1}
             = X_t - X_t R_t^T E_t R_t
For efficient operation on the GPU, we want to reduce the number of high-dimensional operations that we do (defining "high-dimension" as anything involving D or N, but not R, since R is likely small, such as 20). We define
   W_t =(def) E_t^{0.5} R_t.
We will actually be storing W_t on the GPU rather than R_t, in order to reduce the number of operations on the GPU. We can now write:

\hat{X}_t = X_t - X_t W_t^T W_t      (eqn:pt2)

The following, which we'll compute on the GPU, are going to be useful in computing quantities like Z_t:

H_t =(def) X_t W_t^T      (dim is N by R)
J_t =(def) H_t^T X_t      (dim is R by D)
    =      W_t X_t^T X_t
K_t =(def) J_t J_t^T      (dim is R by R, symmetric).. transfer this to CPU.
L_t =(def) H_t^T H_t      (dim is R by R, symmetric).. transfer this to CPU.
    =      W_t X_t^T X_t W_t^T
Note: L_t may also be computed as L_t = J_t W_t^T, which may be more efficient if D < N.

Note: after we have computed H_t we can directly compute
   \hat{X}_t = X_t - H_t W_t
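
The preconditioning step itself therefore needs only two matrix products per minibatch. The following standalone CPU sketch (plain row-major arrays; the real code does this with CuMatrixBase::AddMatMat on the GPU) illustrates (eqn:pt2) via the H_t intermediate:

#include <vector>

// X is N x D (row-major), W is R x D (row-major).  Overwrites X with
// X_hat = X - (X W^T) W, i.e. (eqn:pt2), using the N x R intermediate H.
void PreconditionSketch(std::vector<float> *X, const std::vector<float> &W,
                        int N, int D, int R) {
  std::vector<float> H(N * R, 0.0f);   // H = X W^T
  for (int n = 0; n < N; n++)
    for (int r = 0; r < R; r++) {
      float sum = 0.0f;
      for (int d = 0; d < D; d++)
        sum += (*X)[n * D + d] * W[r * D + d];
      H[n * R + r] = sum;
    }
  // X_hat = X - H W
  for (int n = 0; n < N; n++)
    for (int d = 0; d < D; d++) {
      float sum = 0.0f;
      for (int r = 0; r < R; r++)
        sum += H[n * R + r] * W[r * D + d];
      (*X)[n * D + d] -= sum;
    }
}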

We need to determine how Y_t and Z_t relate to the quantities we just defined. First, we'll expand out H_t, J_t, K_t and L_t in terms of the more fundamental quantities:
   H_t = X_t R_t^T E_t^{0.5}
   J_t = E_t^{0.5} R_t X_t^T X_t
   K_t = E_t^{0.5} R_t X_t^T X_t X_t^T X_t R_t^T E_t^{0.5}
   L_t = E_t^{0.5} R_t X_t^T X_t R_t^T E_t^{0.5}

We wrote above that
   Y_t = \eta R_t S_t + (1-\eta) (D_t + \rho_t I) R_t
so
   Y_t = \eta/N R_t X_t^T X_t + (1-\eta) (D_t + \rho_t I) R_t
       = \eta/N E_t^{-0.5} J_t + (1-\eta) (D_t + \rho_t I) R_t        (eqn:yt)
We will expand Z_t using the expression for Y_t in the line above:
   Z_t = Y_t Y_t^T
       =  (\eta/N)^2 E_t^{-0.5} J_t J_t^T E_t^{-0.5}
         +(\eta/N)(1-\eta) E_t^{-0.5} J_t R_t^T (D_t + \rho_t I)
         +(\eta/N)(1-\eta) (D_t + \rho_t I) R_t J_t^T E_t^{-0.5}
         +(1-\eta)^2 (D_t + \rho_t I)^2
       =  (\eta/N)^2 E_t^{-0.5} K_t E_t^{-0.5}
         +(\eta/N)(1-\eta) E_t^{-0.5} L_t E_t^{-0.5} (D_t + \rho_t I)
         +(\eta/N)(1-\eta) (D_t + \rho_t I) E_t^{-0.5} L_t E_t^{-0.5}
         +(1-\eta)^2 (D_t + \rho_t I)^2                                (eqn:Zt)
We compute Z_t on the CPU using the expression above, and then do the symmetric eigenvalue decomposition (also on the CPU):
   Z_t = U_t C_t U_t^T
and we make sure the eigenvalues are sorted from largest to smallest, for reasons that will be mentioned later.

Mathematically, no diagonal element of C_t can be less than (1-\eta)^2 \rho_t^2, and since negative or zero elements of C_t would cause us a problem later, we floor C_t to this value. (See below regarding how we ensure R_{t+1} has orthonormal rows.)

We will continue the discussion below regarding what we do with C_t and U_t. Next, we need to digress briefly and describe how to compute tr(\hat{X}_t \hat{X}_t^T) and tr(X_t X_t^T), since these appear in expressions for \gamma_t (needed to produce the output \bar{X}_t), and for \rho_{t+1}. It happens that we need, for purposes of applying "max_change" in the neural net code, the squared 2-norm of each row of the output \bar{X}_t. In order to be able to compute \gamma_t, it's most convenient to compute this squared row-norm for each row of \hat{X}_t, as a vector, to compute tr(\hat{X}_t \hat{X}_t^T) from this vector as its sum, and to then work back to compute tr(X_t X_t^T) from the relation between \hat{X}_t and X_t. We can then scale the row-norms we computed for \hat{X}_t, so they apply to \bar{X}_t.

For current purposes, you can imagine that we computed tr(\hat{X}_t \hat{X}_t^T) directly. Using (from eqn:pt2)
   \hat{X}_t = X_t - X_t W_t^T W_t,
we can expand tr(\hat{X}_t \hat{X}_t^T) as:
   tr(\hat{X}_t \hat{X}_t^T) = tr(X_t X_t^T) + tr(X_t W_t^T W_t W_t^T W_t X_t^T)
                               - 2 tr(X_t W_t^T W_t X_t^T)
                             = tr(X_t X_t^T) + tr(W_t X_t^T X_t W_t^T W_t W_t^T)
                               - 2 tr(W_t X_t^T X_t W_t^T)
                             = tr(X_t X_t^T) + tr(L_t W_t W_t^T) - 2 tr(L_t)
                             = tr(X_t X_t^T) + tr(L_t E_t) - 2 tr(L_t)
and all quantities have already been computed (or are quick to compute, such as the small traces on the right), except tr(X_t X_t^T), so we can write

   tr(X_t X_t^T) = tr(\hat{X}_t \hat{X}_t^T) - tr(L_t E_t) + 2 tr(L_t).

We can then do
   \gamma_t <-- sqrt(tr(X_t X_t^T) / tr(\hat{X}_t \hat{X}_t^T))
(or one if the denominator is zero), and then
   \bar{X}_t <-- \gamma_t \hat{X}_t.
We can then output the per-row squared-l2-norms of \bar{X}_t by scaling those we computed from \hat{X}_t by \gamma_t^2.
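
A minimal sketch of the scale recovery just described, assuming tr(\hat{X}_t \hat{X}_t^T), tr(L_t E_t) and tr(L_t) have already been computed (standalone code, not the library's):

#include <cmath>

// Recover tr(X_t X_t^T) from tr(X_hat X_hat^T), tr(L_t E_t) and tr(L_t),
// then form the rescaling factor gamma_t (1.0 if the denominator is zero).
float ComputeGammaSketch(float tr_Xhat_XhatT, float tr_Lt_Et, float tr_Lt) {
  float tr_Xt_XtT = tr_Xhat_XhatT - tr_Lt_Et + 2.0f * tr_Lt;
  if (tr_Xhat_XhatT <= 0.0f)
    return 1.0f;
  return std::sqrt(tr_Xt_XtT / tr_Xhat_XhatT);
}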

OK, the digression on how to compute \gamma_t and tr(X_t X_t^T) is over. We now return to the computation of R_{t+1}, W_{t+1}, \rho_{t+1}, D_{t+1} and E_{t+1}.

We found above in (eqn:rhot1)
   \rho_{t+1} = 1/(D - R) (\eta tr(S_t) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5})).
Expanding out S_t from its definition in (eqn:St),
   \rho_{t+1} = 1/(D - R) (\eta/N tr(X_t X_t^T) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5})).
We can compute this directly as all the quantities involved are already known or easy to compute. Next, from (eqn:dt1), we compute
   D_{t+1} = C_t^{0.5} - \rho_{t+1} I
At this point, if \rho_{t+1} is smaller than some small value \epsilon (e.g. 1.0e-10), we set it to \epsilon; as mentioned, we do this to stop F_t approaching zero if all inputs are zero. Next, if any diagonal element d_{t+1,ii} of D_{t+1} has absolute value less than \epsilon, we set it to +\epsilon. This is to ensure that the diagonal elements of E_{t+1} are never zero, which would cause problems.

Next, we compute (from eqn:betat2, eqn:etdef, eqn:tii),
   \beta_{t+1} = \rho_{t+1} (1+\alpha) + \alpha/D tr(D_{t+1})
   E_{t+1} = 1/\beta_{t+1} (D_{t+1}^{-1} + 1/\beta_{t+1} I)^{-1},
i.e.:
   e_{t+1,ii} = 1/(\beta_{t+1}/d_{t+1,ii} + 1)

We'll want to store D_{t+1}. We next want to compute W_{t+1}.

Before computing W_{t+1}, we need to find an expression for
   R_{t+1} = C_t^{-0.5} U_t^T Y_t
Expanding out Y_t using the expression in (eqn:yt),
   R_{t+1} = C_t^{-0.5} U_t^T (\eta/N E_t^{-0.5} J_t + (1-\eta) (D_t + \rho_t I) R_t)
           = (\eta/N C_t^{-0.5} U_t^T E_t^{-0.5}) J_t
            +((1-\eta) C_t^{-0.5} U_t^T (D_t + \rho_t I) E_t^{-0.5}) W_t

What we actually want is W_{t+1} = E_{t+1}^{0.5} R_{t+1}:
   W_{t+1} = (\eta/N E_{t+1}^{0.5} C_t^{-0.5} U_t^T E_t^{-0.5}) J_t
            +((1-\eta) E_{t+1}^{0.5} C_t^{-0.5} U_t^T (D_t + \rho_t I) E_t^{-0.5}) W_t
and to minimize the number of matrix-matrix multiplies we can factorize this as:
   W_{t+1} = A_t B_t
   A_t = (\eta/N) E_{t+1}^{0.5} C_t^{-0.5} U_t^T E_t^{-0.5}
   B_t = J_t + (1-\eta)/(\eta/N) (D_t + \rho_t I) W_t
[note: we use the fact that (D_t + \rho_t I) and E_t^{-0.5} commute because they are diagonal].

A_t is computed on the CPU and transferred from there to the GPU, B_t is computed on the GPU, and the multiplication of A_t with B_t is done on the GPU.

Keeping R_t orthogonal

Our method requires the R_t matrices to be orthogonal (which we define to mean that R_t R_t^T = I). If roundoff error causes this equality to be significantly violated, it could cause a problem for the stability of our method. We now address our method for making sure that the R_t values stay orthogonal. We do this in the algorithm described above, after creating W_{t+1}. This extra step is only executed if the condition number of C_t (i.e. the ratio of its largest to smallest diagonal element) exceeds a specified threshold, such as 1.0e+06 [this is tested before applying the floor to C_t]. The threshold was determined empirically by finding the largest value needed to ensure a certain level of orthogonality in R_{t+1}. For purposes of the present discussion, since R_{t+1} is not actually stored, define it as E_{t+1}^{-0.5} W_{t+1}. Define the following (and we will just use t instead of t+1 below, as all quantities have the same subscript):

O_t =(def) R_t R_t^T = E_t^{-0.5} W_t W_t^T E_t^{-0.5}

(and we would compute this by computing W_t W_t^T on the GPU, transferring it to the CPU, and doing the rest there). If O_t is not sufficiently close to the unit matrix, we can re-orthogonalize as follows: do the Cholesky decomposition
   O_t = C C^T.
Clearly C^{-1} O_t C^{-T} = I, so if we correct R_t with
   R_t <-- C^{-1} R_t
we can ensure orthogonality. If R_t's first k rows are orthogonal, this transform will not affect them, because of its lower-triangular structure... this is good because (thanks to the eigenvalue sorting), the larger eigenvectors are first and it is more critical to keep them pointing in the same direction. Any loss of orthogonality will be dealt with by modifying the smaller eigenvectors. As a modification to W_t, this would be:
   W_t <-- (E_t^{0.5} C^{-1} E_t^{-0.5}) W_t,
and the matrix in parentheses is computed on the CPU, transferred to the GPU, and the multiplication is done there.
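
For illustration, here is a standalone sketch of the correction R_t <-- C^{-1} R_t using a textbook Cholesky factorization and forward substitution (plain row-major arrays; the library instead works on W_t, see ReorthogonalizeRt1() below, and falls back to Gram-Schmidt if the Cholesky fails):

#include <cmath>
#include <vector>

// O is R x R symmetric positive definite (O = R_mat R_mat^T), R_mat is R x D,
// both row-major.  Computes the Cholesky factor O = C C^T and overwrites
// R_mat with C^{-1} R_mat by forward substitution, so that afterwards
// R_mat R_mat^T = I (up to roundoff).
void ReorthogonalizeSketch(const std::vector<float> &O, std::vector<float> *R_mat,
                           int R, int D) {
  std::vector<float> C(R * R, 0.0f);          // lower-triangular Cholesky factor
  for (int i = 0; i < R; i++) {
    for (int j = 0; j <= i; j++) {
      float sum = O[i * R + j];
      for (int k = 0; k < j; k++)
        sum -= C[i * R + k] * C[j * R + k];
      if (i == j)
        C[i * R + i] = std::sqrt(sum);
      else
        C[i * R + j] = sum / C[j * R + j];
    }
  }
  // Solve C Y = R_mat column by column (forward substitution); Y replaces R_mat.
  for (int d = 0; d < D; d++) {
    for (int i = 0; i < R; i++) {
      float sum = (*R_mat)[i * D + d];
      for (int k = 0; k < i; k++)
        sum -= C[i * R + k] * (*R_mat)[k * D + d];
      (*R_mat)[i * D + d] = sum / C[i * R + i];
    }
  }
}

Because C^{-1} is lower triangular, the earlier (larger-eigenvalue) rows are left unchanged when they are already orthonormal, which is exactly the property the paragraph above relies on.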

Initialization

Now, a note on what we do on time t = 0, i.e. for the first minibatch. We initialize R_0 to the top R eigenvectors of 1/N X_0 X_0^T, where N is the minibatch size (the number of rows of X_0). If L is the corresponding R by R diagonal matrix of eigenvalues, then we will set D_0 = L - \rho_0 I. We set \rho_0 to ensure that
   tr(F_0) = 1/N tr(X_0 X_0^T), i.e.
   tr(D_0) + D \rho_0 = 1/N tr(X_0 X_0^T)
   tr(L) - R \rho_0 + D \rho_0 = 1/N tr(X_0 X_0^T)
   \rho_0 = (1/N tr(X_0 X_0^T) - tr(L)) / (D - R)

We then floor \rho_0 to \epsilon (e.g. 1.0e-10) and also floor the diagonal elements of D_0 to \epsilon; this ensures that we won't crash for zero inputs.
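
As a small worked example of the formula for \rho_0 (with made-up numbers): if D = 512, R = 40, 1/N tr(X_0 X_0^T) = 100 and tr(L) = 80, then \rho_0 = (100 - 80) / (512 - 40) ≈ 0.042 and D_0 = L - 0.042 I; \rho_0 and the diagonal of D_0 would only be changed by the flooring if they fell below \epsilon.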

A note on multi-threading. This technique was really designed for use with a GPU, where we won't have multi-threading, but we want it to work also on a CPU, where we may have multiple worker threads. Our approach is as follows (we do this when we're about to start updating the parameters R_t, D_t, and derived quantities):

For time t > 0 (where the matrices are already initialized), before starting the part of the computation that updates the parameters (R_t, D_t, \rho_t, and derived quantities), we try to lock a mutex that guards the OnlineNaturalGradient object. If we can lock it right away, we go ahead and do the update, but if not, we just abandon the attempt to update those quantities.

We will have another mutex to ensure that when we access quantities like W_t, they are all "in sync" (and we don't access them while they are being written by another thread). This mutex will only be locked for short periods of time.
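
A minimal sketch of the try-lock policy described above (the member and function names here are hypothetical, purely for illustration; this is not the actual Kaldi code):

#include <mutex>

// Hypothetical holder for the preconditioner state, illustrating the
// locking policy described in the two paragraphs above.
struct NaturalGradientState {
  std::mutex update_mutex;  // guards the update of R_t, D_t, rho_t
  std::mutex read_mutex;    // guards brief, consistent reads of W_t etc.
  // ... W_t, d_t, rho_t would live here ...

  void MaybeUpdate(/* const minibatch X_t */) {
    // Try to take the update lock; if another thread already holds it,
    // simply skip refreshing the Fisher estimate for this minibatch.
    std::unique_lock<std::mutex> lock(update_mutex, std::try_to_lock);
    if (!lock.owns_lock())
      return;  // another thread is updating; abandon this update
    // ... recompute W_{t+1}, D_{t+1}, rho_{t+1} here ...
    {
      std::lock_guard<std::mutex> read_lock(read_mutex);
      // ... swap the freshly computed quantities into place ...
    }
  }
};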

Note: it might be a good idea to make sure that the R_t still retain orthonormal rows even in the presence of roundoff, without errors accumulating. My instinct is that this isn't going to be a problem.

Definition at line 414 of file natural-gradient-online.h.

Constructor & Destructor Documentation

◆ OnlineNaturalGradient() [1/2]

Definition at line 27 of file natural-gradient-online.cc.

Referenced by OnlineNaturalGradient::Freeze().

◆ OnlineNaturalGradient() [2/2]

OnlineNaturalGradient ( const OnlineNaturalGradient &  other)
explicit

Definition at line 578 of file natural-gradient-online.cc.

578  :
579  rank_(other.rank_), update_period_(other.update_period_),
580  num_samples_history_(other.num_samples_history_),
581  num_minibatches_history_(other.num_minibatches_history_),
582  alpha_(other.alpha_), epsilon_(other.epsilon_), delta_(other.delta_),
583  frozen_(other.frozen_), t_(other.t_),
584  self_debug_(other.self_debug_), W_t_(other.W_t_),
585  rho_t_(other.rho_t_), d_t_(other.d_t_) { }

Member Function Documentation

◆ ComputeEt()

void ComputeEt ( const VectorBase< BaseFloat > &  d_t,
BaseFloat  beta_t,
VectorBase< BaseFloat > *  e_t,
VectorBase< BaseFloat > *  sqrt_e_t,
VectorBase< BaseFloat > *  inv_sqrt_e_t 
) const
private

Definition at line 560 of file natural-gradient-online.cc.

References VectorBase< Real >::ApplyPow(), VectorBase< Real >::CopyFromVec(), rnnlm::d, VectorBase< Real >::Data(), VectorBase< Real >::Dim(), rnnlm::i, and VectorBase< Real >::InvertElements().

Referenced by OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::Freeze(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::ReorthogonalizeRt1(), and OnlineNaturalGradient::SelfTest().

564  {
565  // e_{tii} = 1/(\beta_t/d_{tii} + 1)
566  int32 D = d_t.Dim();
567  const BaseFloat *d = d_t.Data();
568  BaseFloat *e = e_t->Data();
569  for (int32 i = 0; i < D; i++)
570  e[i] = 1.0 / (beta_t / d[i] + 1);
571  sqrt_e_t->CopyFromVec(*e_t);
572  sqrt_e_t->ApplyPow(0.5);
573  inv_sqrt_e_t->CopyFromVec(*sqrt_e_t);
574  inv_sqrt_e_t->InvertElements();
575 }

◆ ComputeWt1()

void ComputeWt1 ( int32  N,
const VectorBase< BaseFloat > &  d_t,
const VectorBase< BaseFloat > &  d_t1,
BaseFloat  rho_t,
BaseFloat  rho_t1,
const MatrixBase< BaseFloat > &  U_t,
const VectorBase< BaseFloat > &  sqrt_c_t,
const VectorBase< BaseFloat > &  inv_sqrt_e_t,
const CuMatrixBase< BaseFloat > &  W_t,
CuMatrixBase< BaseFloat > *  J_t,
CuMatrixBase< BaseFloat > *  W_t1 
) const
private

Definition at line 484 of file natural-gradient-online.cc.

References CuMatrixBase< Real >::AddDiagVecMat(), CuMatrixBase< Real >::AddMatMat(), OnlineNaturalGradient::alpha_, OnlineNaturalGradient::ComputeEt(), VectorBase< Real >::Dim(), OnlineNaturalGradient::Eta(), rnnlm::i, VectorBase< Real >::InvertElements(), rnnlm::j, KALDI_ASSERT, kaldi::kNoTrans, kaldi::kTrans, kaldi::kUndefined, CuMatrixBase< Real >::NumCols(), and VectorBase< Real >::Sum().

Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().

494  {
495 
496  int32 R = d_t.Dim(), D = W_t.NumCols();
497  BaseFloat eta = Eta(N);
498 
499  // \beta_{t+1} = \rho_{t+1} (1+\alpha) + \alpha/D tr(D_{t+1})
500  BaseFloat beta_t1 = rho_t1 * (1.0 + alpha_) + alpha_ * d_t1.Sum() / D;
501  KALDI_ASSERT(beta_t1 > 0.0);
502  Vector<BaseFloat> e_t1(R, kUndefined), sqrt_e_t1(R, kUndefined),
503  inv_sqrt_e_t1(R, kUndefined);
504  ComputeEt(d_t1, beta_t1, &e_t1, &sqrt_e_t1, &inv_sqrt_e_t1);
505  Vector<BaseFloat> inv_sqrt_c_t(sqrt_c_t);
506  inv_sqrt_c_t.InvertElements();
507 
508  Vector<BaseFloat> w_t_coeff(R);
509  for (int32 i = 0; i < R; i++)
510  w_t_coeff(i) = (1.0 - eta) / (eta/N) * (d_t(i) + rho_t);
511  CuVector<BaseFloat> w_t_coeff_gpu(w_t_coeff);
512  // B_t = J_t + (1-\eta)/(\eta/N) (D_t + \rho_t I) W_t
513  J_t->AddDiagVecMat(1.0, w_t_coeff_gpu, W_t, kNoTrans, 1.0);
514 
515  // A_t = (\eta/N) E_{t+1}^{0.5} C_t^{-0.5} U_t^T E_t^{-0.5}
516  Matrix<BaseFloat> A_t(U_t, kTrans);
517  for (int32 i = 0; i < R; i++) {
518  BaseFloat i_factor = (eta / N) * sqrt_e_t1(i) * inv_sqrt_c_t(i);
519  for (int32 j = 0; j < R; j++) {
520  BaseFloat j_factor = inv_sqrt_e_t(j);
521  A_t(i, j) *= i_factor * j_factor;
522  }
523  }
524  // W_{t+1} = A_t B_t
525  CuMatrix<BaseFloat> A_t_gpu(A_t);
526  W_t1->AddMatMat(1.0, A_t_gpu, kNoTrans, *J_t, kNoTrans, 0.0);
527 }

◆ ComputeZt()

void ComputeZt ( int32  N,
BaseFloat  rho_t,
const VectorBase< BaseFloat > &  d_t,
const VectorBase< BaseFloat > &  inv_sqrt_e_t,
const MatrixBase< BaseFloat > &  K_t,
const MatrixBase< BaseFloat > &  L_t,
SpMatrix< double > *  Z_t 
) const
private

Definition at line 529 of file natural-gradient-online.cc.

References VectorBase< Real >::Add(), VectorBase< Real >::Dim(), OnlineNaturalGradient::Eta(), rnnlm::i, and rnnlm::j.

Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().

535  {
536  // Use doubles because the range of quantities in Z_t can get large (fourth
537  // power of data), and we want to avoid overflow. This routine is fast.
538  BaseFloat eta = Eta(N);
539  Vector<BaseFloat> d_t_rho_t(d_t);
540  d_t_rho_t.Add(rho_t); // now d_t_rho_t is diag(D_t + \rho_t I).
541  double etaN = eta / N, eta1 = 1.0 - eta,
542  etaN_sq = etaN * etaN, eta1_sq = eta1 * eta1,
543  etaN_eta1 = etaN * eta1;
544  int32 R = d_t.Dim();
545  for (int32 i = 0; i < R; i++) {
546  double inv_sqrt_e_t_i = inv_sqrt_e_t(i), d_t_rho_t_i = d_t_rho_t(i);
547  for (int32 j = 0; j <= i; j++) {
548  double inv_sqrt_e_t_j = inv_sqrt_e_t(j), d_t_rho_t_j = d_t_rho_t(j),
549  L_t_i_j = 0.5 * (L_t(i, j) + L_t(j, i)),
550  K_t_i_j = 0.5 * (K_t(i, j) + K_t(j, i));
551  // See (eqn:Zt) in header.
552  (*Z_t)(i, j) = etaN_sq * inv_sqrt_e_t_i * K_t_i_j * inv_sqrt_e_t_j
553  + etaN_eta1 * inv_sqrt_e_t_i * L_t_i_j * inv_sqrt_e_t_j * d_t_rho_t_j
554  + etaN_eta1 * d_t_rho_t_i * inv_sqrt_e_t_i * L_t_i_j * inv_sqrt_e_t_j
555  + (i == j ? eta1_sq * d_t_rho_t_i * d_t_rho_t_i : 0.0);
556  }
557  }
558 }

◆ Eta()

BaseFloat Eta ( int32  N) const
private

Definition at line 470 of file natural-gradient-online.cc.

References KALDI_ASSERT, OnlineNaturalGradient::num_minibatches_history_, and OnlineNaturalGradient::num_samples_history_.

Referenced by OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::ComputeZt(), OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().

470  {
471  if (num_minibatches_history_ > 0.0) {
 472  KALDI_ASSERT(num_minibatches_history_ > 1.0);
 473  return 1.0 / num_minibatches_history_;
474  } else {
 475  KALDI_ASSERT(num_samples_history_ > 0.0);
 476  BaseFloat ans = 1.0 - exp(-N / num_samples_history_);
477  // Don't let eta approach 1 too closely, as it can lead to NaN's appearing if
478  // the input is all zero.
479  if (ans > 0.9) ans = 0.9;
480  return ans;
481  }
482 }
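
As a worked example of how the configuration maps to \eta: with num_samples_history_ = 2000 and a minibatch of N = 128 rows, \eta = 1 - exp(-128/2000) ≈ 0.062, i.e. each minibatch contributes roughly 6% of the updated scatter estimate; if num_minibatches_history_ = 4 were set instead, \eta would simply be 1/4 = 0.25.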

◆ Freeze()

◆ GetAlpha()

◆ GetNumMinibatchesHistory()

◆ GetNumSamplesHistory()

◆ GetRank()

◆ GetUpdatePeriod()

◆ Init()

void Init ( const CuMatrixBase< BaseFloat > &  R0)
private

Definition at line 122 of file natural-gradient-online.cc.

References OnlineNaturalGradient::d_t_, OnlineNaturalGradient::frozen_, rnnlm::i, OnlineNaturalGradient::InitDefault(), kaldi::kUndefined, CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::rank_, OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::t_, and OnlineNaturalGradient::W_t_.

Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirections().

122  {
123  int32 D = X0.NumCols();
124  // for locking reasons it's better to use a different object.
125  OnlineNaturalGradient this_copy(*this);
126  this_copy.InitDefault(D);
127  this_copy.t_ = 1; // Prevent recursion to Init() again.
128 
129  CuMatrix<BaseFloat> X0_copy(X0.NumRows(), X0.NumCols(), kUndefined);
130  // 'num_iters' is number of iterations with the same data from a pseudorandom
131  // start. this is a faster way of starting than doing eigenvalue
132  // decomposition.
133  //
134  // Note: we only do three iterations of initialization if we have enough data
135  // that it's reasonably possible to estimate the subspace of dimension
136  // this_copy.rank_. If we don't have more than that many rows in our initial
137  // minibatch X0, we just do one iteration... this gives us almost exactly
138  // (barring small effects due to epsilon_ > 0) the row subspace of X0 after
139  // one iteration anyway.
140  int32 num_init_iters;
141  if (X0.NumRows() <= this_copy.rank_)
142  num_init_iters = 1;
143  else
144  num_init_iters = 3;
145 
146  this_copy.frozen_ = false; // un-freeze if it was frozen, so we can
147  // initialize.
148  for (int32 i = 0; i < num_init_iters; i++) {
149  BaseFloat scale;
150  X0_copy.CopyFromMat(X0);
151  this_copy.PreconditionDirections(&X0_copy, &scale);
152  }
153  rank_ = this_copy.rank_;
154  W_t_.Swap(&this_copy.W_t_);
155  d_t_.Swap(&this_copy.d_t_);
156  rho_t_ = this_copy.rho_t_;
157 }

◆ InitDefault()

void InitDefault ( int32  D)
private

Definition at line 71 of file natural-gradient-online.cc.

References OnlineNaturalGradient::alpha_, OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, OnlineNaturalGradient::epsilon_, OnlineNaturalGradient::InitOrthonormalSpecial(), KALDI_ASSERT, KALDI_WARN, kaldi::kUndefined, OnlineNaturalGradient::num_minibatches_history_, OnlineNaturalGradient::num_samples_history_, OnlineNaturalGradient::rank_, OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::t_, and OnlineNaturalGradient::W_t_.

Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::Init().

71  {
72  if (rank_ >= D) {
73  KALDI_WARN << "Rank " << rank_ << " of online preconditioner is >= dim " << D
74  << ", setting it to "
75  << (D - 1) << " (but this is probably still too high)";
76  rank_ = D - 1;
77  }
78  if (rank_ == 0) {
79  // Dimension of input data was 1, so the natural gradient preconditioner
80  // would always be the unit matrix.
81  // We'll handle this as a special case, for generality.
82  return;
83  }
 84  KALDI_ASSERT(num_samples_history_ > 0.0 && num_samples_history_ <= 1.0e+06);
 85  KALDI_ASSERT((num_minibatches_history_ == 0.0 ||
 86                num_minibatches_history_ > 1.0) &&
 87               num_minibatches_history_ < 1.0e+06);
88  KALDI_ASSERT(alpha_ >= 0.0);
89  KALDI_ASSERT(rank_ > 0);
90  KALDI_ASSERT(epsilon_ > 0.0 && epsilon_ <= 1.0e-05); // plausible values.
91  KALDI_ASSERT(delta_ > 0.0 && delta_ <= 1.0e-02); // plausible values.
92 
93  // to initialize, in the equation
94  // F_t =(def) R_t^T D_t R_t + \rho_t I
95  // we will set the orthogonal R_t to a special orthogonal matrix with no zero
96  // rows or columns (see the function), rho_t to epsilon,
97  // and D_t to epsilon. But we don't store R_t directly. Instead, we store
98  // W_t =(def) E_t^{0.5} R_t,
99  // where E_t =(def) 1/\beta_t (D_t^{-1} + 1/\beta_t I)^{-1}
100  // from (eqn:tii),
101  // e_{tii} = 1/(\beta_t/d_{tii} + 1),
102  // where
103  // \beta_t =(def) \rho_t + \alpha/D tr(F_t)
104  // = epsilon + alpha/D * (epsilon * D + epsilon * rank)
105  // = epsilon * (1 + alpha * (D + rank) / D)
106  // And d_{tii} is epsilon, so
107  // e_{tii} = 1/((1 + alpha * (D + rank) / D) + 1) [for each i.]
108  // = 1/(2 + alpha * (D + rank) / D)).
109  BaseFloat epsilon = epsilon_; // we could make this a bit more.
110  rho_t_ = epsilon;
111  d_t_.Resize(rank_, kUndefined);
112  d_t_.Set(epsilon);
113  W_t_.Resize(rank_, D, kUndefined);
114  // after the next line, W_ will store the orthogonal matrix R_t.
 115  InitOrthonormalSpecial(&W_t_);
 116  BaseFloat E_tii = 1.0 / ( 2.0 + (D + rank_) * alpha_ / D );
117  // W_t =(def) E_t^{0.5} R_t.
118  W_t_.Scale(sqrt(E_tii));
119  t_ = 0;
120 }

◆ InitOrthonormalSpecial()

void InitOrthonormalSpecial ( CuMatrixBase< BaseFloat > *  R)
staticprivate

This function creates a matrix with orthonormal rows that is like the following matrix, except with each row normalized to have unit 2-norm: [ 1.1 0 1 0 1 0 ; 0 1.1 0 1 0 1 ]. The reason why the first element in each row is 1.1 and not 1 is for symmetry-breaking...

we don't want any weighted sum of all these rows to be all ones, because the derivative in that direction can be zero in some architectures and it causes us to have to do an inefficient CPU-based renormalization.

Definition at line 46 of file natural-gradient-online.cc.

References CuMatrixBase< Real >::AddElements(), rnnlm::i, KALDI_ASSERT, CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), and CuMatrixBase< Real >::SetZero().

Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::InitDefault().

46  {
47  int32 num_rows = R->NumRows(), num_cols = R->NumCols();
48  KALDI_ASSERT(num_cols >= num_rows);
49  R->SetZero();
50  std::vector<MatrixElement<BaseFloat> > elems;
51  elems.reserve(num_cols);
52  BaseFloat first_elem = 1.1;
53  for (int32 r = 0; r < num_rows; r++) {
54  std::vector<int32> cols; // columns that have an entry for this row
55  for (int32 c = r; c < num_cols; c += num_rows)
56  cols.push_back(c);
57  BaseFloat normalizer = 1.0 / sqrt(first_elem * first_elem +
58  cols.size() - 1);
59  for (size_t i = 0; i < cols.size(); i++) {
60  int32 c = cols[i];
61  MatrixElement<BaseFloat> e = { r, c,
62  normalizer * (i == 0 ? first_elem :
63  BaseFloat(1.0)) };
64  elems.push_back(e);
65  }
66  }
67  R->AddElements(1.0, elems);
68 }

◆ operator=()

OnlineNaturalGradient & operator= ( const OnlineNaturalGradient &  other)

Definition at line 588 of file natural-gradient-online.cc.

References OnlineNaturalGradient::alpha_, OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, OnlineNaturalGradient::epsilon_, OnlineNaturalGradient::num_samples_history_, OnlineNaturalGradient::rank_, OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::self_debug_, OnlineNaturalGradient::t_, OnlineNaturalGradient::update_period_, and OnlineNaturalGradient::W_t_.

Referenced by OnlineNaturalGradient::Freeze().

589  {
590  rank_ = other.rank_;
591  update_period_ = other.update_period_;
592  num_samples_history_ = other.num_samples_history_;
593  alpha_ = other.alpha_;
594  epsilon_ = other.epsilon_;
595  delta_ = other.delta_;
596  t_ = other.t_;
597  self_debug_ = other.self_debug_;
598  W_t_ = other.W_t_;
599  rho_t_ = other.rho_t_;
600  d_t_ = other.d_t_;
601  return *this;
602 }

◆ PreconditionDirections()

void PreconditionDirections ( CuMatrixBase< BaseFloat > *  X,
BaseFloat *  scale 
)

This call implements the main functionality of this class.

Parameters
[in,out] X  The "X" pointer is both the input (R in the comment above, X in the paper), and the output (P in the comment, X with a hat on it in the paper). Each row of X is viewed as a vector in some space, where we're estimating a smoothed Fisher matrix and then multiplying by the inverse of that smoothed Fisher matrix.
[out] scale  If non-NULL, a scaling factor is written to here, and the output 'X' should be multiplied by this factor by the user (we don't do it internally, to save an operation). The factor is chosen so that the vector 2-norm of X is the same after the natural gradient as it was before. (The pointer being NULL or non-NULL doesn't affect the magnitude of X; in any case the user will probably want to do this rescaling, the question being whether they want to do so manually or not.)
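
A typical calling pattern, roughly how the Update() functions of the natural-gradient components use this class (the include path and the commented configuration values are illustrative assumptions, not prescribed defaults):

#include "nnet3/natural-gradient-online.h"

namespace kaldi {
namespace nnet3 {

// 'deriv' is an N x D matrix of per-sample derivative terms; 'ng' is a
// long-lived preconditioner object owned by the component.
void ExamplePrecondition(CuMatrixBase<BaseFloat> *deriv,
                         OnlineNaturalGradient *ng) {
  // One-time configuration (illustrative values):
  //   ng->SetRank(40);         // rank R of the low-rank Fisher term
  //   ng->SetUpdatePeriod(4);  // refresh the estimate every 4th minibatch
  //   ng->SetAlpha(4.0);       // identity smoothing, alpha/D tr(F_t) I
  BaseFloat scale;
  ng->PreconditionDirections(deriv, &scale);
  deriv->Scale(scale);  // the caller applies the returned rescaling factor
  // ... 'deriv' now replaces the original derivative in the SGD update ...
}

}  // namespace nnet3
}  // namespace kaldi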

Definition at line 159 of file natural-gradient-online.cc.

References OnlineNaturalGradient::d_t_, OnlineNaturalGradient::Init(), kaldi::kTrans, CuMatrixBase< Real >::NumCols(), NVTX_RANGE, OnlineNaturalGradient::PreconditionDirectionsInternal(), CuMatrixBase< Real >::Range(), OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::t_, kaldi::TraceMatMat(), OnlineNaturalGradient::Updating(), and OnlineNaturalGradient::W_t_.

Referenced by LstmNonlinearityComponent::Backprop(), ConstantComponent::Backprop(), LinearComponent::Backprop(), PerElementOffsetComponent::Backprop(), ConstantFunctionComponent::Backprop(), ScaleAndOffsetComponent::BackpropInternal(), LstmNonlinearityComponent::ConsolidateMemory(), OnlineNaturalGradient::Freeze(), OnlineNaturalGradient::Init(), kaldi::nnet3::UnitTestPreconditionDirectionsOnline(), NaturalGradientRepeatedAffineComponent::Update(), NaturalGradientAffineComponent::Update(), NaturalGradientPerElementScaleComponent::Update(), TimeHeightConvolutionComponent::UpdateNaturalGradient(), and TdnnComponent::UpdateNaturalGradient().

161  {
162  NVTX_RANGE(__func__);
163  if (X_t->NumCols() == 1) {
164  // If the dimension of the space equals one then our natural gradient update
165  // with rescaling becomes a no-op, but the code wouldn't naturally handle it
166  // because rank would be zero. Support this as a special case.
167  if (scale)
168  *scale = 1.0;
169  return;
170  }
171 
172  if (t_ == 0) // not initialized
173  Init(*X_t);
174 
175  int32 R = W_t_.NumRows(), D = W_t_.NumCols();
176  // space for W_t, J_t, K_t, L_t.
177  CuMatrix<BaseFloat> WJKL_t(2 * R, D + R);
178  WJKL_t.Range(0, R, 0, D).CopyFromMat(W_t_);
179  BaseFloat rho_t(rho_t_);
180  Vector<BaseFloat> d_t(d_t_);
181 
182  bool updating = Updating();
183 
184  BaseFloat initial_product;
185  initial_product = TraceMatMat(*X_t, *X_t, kTrans);
186 
187  PreconditionDirectionsInternal(rho_t, initial_product,
188  updating, d_t, &WJKL_t, X_t);
189 
190  if (scale) {
191  if (initial_product <= 0.0) {
192  *scale = 1.0;
193  } else {
194  BaseFloat final_product = TraceMatMat(*X_t, *X_t, kTrans);
195  *scale = sqrt(initial_product / final_product);
196  }
197  }
198  t_ += 1;
199 }

◆ PreconditionDirectionsInternal()

void PreconditionDirectionsInternal ( const BaseFloat  rho_t,
const BaseFloat  tr_X_Xt,
bool  updating,
const Vector< BaseFloat > &  d_t,
CuMatrixBase< BaseFloat > *  WJKL_t,
CuMatrixBase< BaseFloat > *  X_t 
)
private

Definition at line 324 of file natural-gradient-online.cc.

References VectorBase< Real >::Add(), CuMatrixBase< Real >::AddMatMat(), OnlineNaturalGradient::alpha_, VectorBase< Real >::ApplyFloor(), VectorBase< Real >::ApplyPow(), OnlineNaturalGradient::ComputeEt(), OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::ComputeZt(), MatrixBase< Real >::CopyLowerToUpper(), OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, SpMatrix< Real >::Eig(), OnlineNaturalGradient::epsilon_, OnlineNaturalGradient::Eta(), KALDI_ASSERT, KALDI_VLOG, KALDI_WARN, kaldi::kNoTrans, kaldi::kTrans, VectorBase< Real >::Max(), CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), NVTX_RANGE, OnlineNaturalGradient::rank_, OnlineNaturalGradient::ReorthogonalizeRt1(), OnlineNaturalGradient::rho_t_, PackedMatrix< Real >::Scale(), VectorBase< Real >::Scale(), OnlineNaturalGradient::self_debug_, OnlineNaturalGradient::SelfTest(), kaldi::SortSvd(), VectorBase< Real >::Sum(), SpMatrix< Real >::Trace(), and OnlineNaturalGradient::W_t_.

Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirections().

330  {
331  NVTX_RANGE(__func__);
332  int32 N = X_t->NumRows(), // Minibatch size.
333  D = X_t->NumCols(), // Dimensions of vectors we're preconditioning
334  R = rank_; // Rank of correction to unit matrix.
335  KALDI_ASSERT(R > 0 && R < D);
336  BaseFloat eta = Eta(N);
337 
338  CuMatrix<BaseFloat> H_t(N, R);
339  const CuSubMatrix<BaseFloat> W_t(*WJKL_t, 0, R, 0, D);
340  // Below, WJ_t and LK_t are combinations of two matrices,
341  // which we define in order to combine two separate multiplications into one.
342  CuSubMatrix<BaseFloat> J_t(*WJKL_t, R, R, 0, D),
343  L_t(*WJKL_t, 0, R, D, R),
344  K_t(*WJKL_t, R, R, D, R),
345  WJ_t(*WJKL_t, 0, 2 * R, 0, D),
346  LK_t(*WJKL_t, 0, 2 * R, D, R);
347 
348  H_t.AddMatMat(1.0, *X_t, kNoTrans, W_t, kTrans, 0.0); // H_t = X_t W_t^T
349 
350  if (!updating) {
351  // We're not updating the estimate of the Fisher matrix; we just apply the
352  // preconditioning and return.
353  // X_hat_t = X_t - H_t W_t
354  X_t->AddMatMat(-1.0, H_t, kNoTrans, W_t, kNoTrans, 1.0);
355  return;
356  }
357  J_t.AddMatMat(1.0, H_t, kTrans, *X_t, kNoTrans, 0.0); // J_t = H_t^T X_t
358 
359  bool compute_lk_together = (N > D);
360 
361  if (compute_lk_together) {
362  // do the following two multiplies in one operation...
363  // note
364  // L_t = W_t J_t^T
365  // K_t = J_t J_t^T
366  // Note: L_t was defined as L_t = J_t W_t^T, but it's actually symmetric,
367  // so we can compute it as L_t = W_t J_t^T.
368  LK_t.AddMatMat(1.0, WJ_t, kNoTrans, J_t, kTrans, 0.0);
369  } else {
370  K_t.SymAddMat2(1.0, J_t, kNoTrans, 0.0);
371  L_t.SymAddMat2(1.0, H_t, kTrans, 0.0);
372  }
373 
374  Matrix<BaseFloat> LK_cpu(LK_t); // contains L and K on the CPU.
375  SubMatrix<BaseFloat> L_t_cpu(LK_cpu, 0, R, 0, R),
376  K_t_cpu(LK_cpu, R, R, 0, R);
377  if (!compute_lk_together) {
378  // the SymAddMat2 operations only set the lower triangle and diagonal.
379  L_t_cpu.CopyLowerToUpper();
380  K_t_cpu.CopyLowerToUpper();
381  }
382 
383  // beta_t = \rho_t(1+\alpha) + \alpha/D tr(D_t)
384  BaseFloat beta_t = rho_t * (1.0 + alpha_) + alpha_ * d_t.Sum() / D;
385  Vector<BaseFloat> e_t(R), sqrt_e_t(R), inv_sqrt_e_t(R);
386  ComputeEt(d_t, beta_t, &e_t, &sqrt_e_t, &inv_sqrt_e_t);
387  KALDI_VLOG(5) << "e_t = " << e_t;
388 
389  // The double-precision Z_t here, and the scaling, is to avoid potential
390  // overflow, because Z_t is proportional to the fourth power of data.
391  SpMatrix<double> Z_t_double(R);
392  ComputeZt(N, rho_t, d_t, inv_sqrt_e_t, K_t_cpu, L_t_cpu, &Z_t_double);
393  BaseFloat z_t_scale = std::max<double>(1.0, Z_t_double.Trace());
394  Z_t_double.Scale(1.0 / z_t_scale);
395  SpMatrix<BaseFloat> Z_t_scaled(Z_t_double);
396 
397  Matrix<BaseFloat> U_t(R, R);
398  Vector<BaseFloat> c_t(R);
399  // do the symmetric eigenvalue decomposition Z_t = U_t C_t U_t^T.
400  Z_t_scaled.Eig(&c_t, &U_t);
401  SortSvd(&c_t, &U_t);
402  c_t.Scale(z_t_scale);
403 
404  const BaseFloat condition_threshold = 1.0e+06;
405  // must_reorthogonalize will be true if the last diagonal element of c_t is
406  // negative, since we don't take the absolute value, but this is the right
407  // thing anyway.
408  bool must_reorthogonalize = (c_t(0) > condition_threshold * c_t(R - 1));
409 
410  BaseFloat c_t_floor = pow(rho_t * (1 - eta), 2);
411  int32 nf;
412  c_t.ApplyFloor(c_t_floor, &nf);
413  if (nf > 0)
414  must_reorthogonalize = true;
415  if (nf > 0 && self_debug_) {
416  KALDI_WARN << "Floored " << nf << " elements of C_t.";
417  }
418 
419  X_t->AddMatMat(-1.0, H_t, kNoTrans, W_t, kNoTrans, 1.0); // X_hat_t = X_t - H_t W_t
420 
421  Vector<BaseFloat> sqrt_c_t(c_t);
422  sqrt_c_t.ApplyPow(0.5);
423 
424  // \rho_{t+1} = 1/(D - R) (\eta/N tr(X_t X_t^T) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5})).
425  BaseFloat rho_t1 = 1.0 / (D - R) * (eta / N * tr_X_Xt
426  + (1-eta)*(D * rho_t + d_t.Sum())
427  - sqrt_c_t.Sum());
428  // D_{t+1} = C_t^{0.5} - \rho_{t+1} I
429  Vector<BaseFloat> d_t1(sqrt_c_t);
430  d_t1.Add(-rho_t1);
431  BaseFloat floor_val = std::max(epsilon_, delta_ * sqrt_c_t.Max());
432  if (rho_t1 < floor_val)
433  rho_t1 = floor_val;
434  d_t1.ApplyFloor(floor_val);
435 
436  CuMatrix<BaseFloat> W_t1(R, D); // W_{t+1}
437  ComputeWt1(N, d_t, d_t1, rho_t, rho_t1, U_t, sqrt_c_t, inv_sqrt_e_t,
438  W_t, &J_t, &W_t1);
439 
440  if (must_reorthogonalize) {
441  if (self_debug_) {
442  KALDI_WARN << "Reorthogonalizing.";
443  }
444  ReorthogonalizeRt1(d_t1,
445  rho_t1,
446  &W_t1,
447  &J_t,
448  &L_t);
449  }
450 
451  W_t_.Swap(&W_t1);
452  d_t_.CopyFromVec(d_t1);
453  rho_t_ = rho_t1;
454 
455  if (self_debug_)
456  SelfTest();
457 }

◆ ReorthogonalizeRt1()

void ReorthogonalizeRt1 ( const VectorBase< BaseFloat > &  d_t1,
BaseFloat  rho_t1,
CuMatrixBase< BaseFloat > *  W_t1,
CuMatrixBase< BaseFloat > *  temp_W,
CuMatrixBase< BaseFloat > *  temp_O 
)
private

Definition at line 201 of file natural-gradient-online.cc.

References CuMatrixBase< Real >::AddMatMat(), OnlineNaturalGradient::alpha_, TpMatrix< Real >::Cholesky(), OnlineNaturalGradient::ComputeEt(), CuMatrixBase< Real >::CopyFromMat(), MatrixBase< Real >::CopyFromTp(), rnnlm::i, TpMatrix< Real >::Invert(), SpMatrix< Real >::IsUnit(), rnnlm::j, KALDI_WARN, kaldi::kNoTrans, kaldi::kTakeLower, kaldi::kUndefined, PackedMatrix< Real >::Max(), CuMatrixBase< Real >::MulRowsVec(), CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), MatrixBase< Real >::OrthogonalizeRows(), OnlineNaturalGradient::self_debug_, VectorBase< Real >::Sum(), and CuMatrixBase< Real >::SymAddMat2().

Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().

206  {
207  // threshold is a configuration value: a desired threshold on orthogonality,
208  // below which we won't reorthogonalize.
209  const BaseFloat threshold = 1.0e-03;
210 
211  int32 R = W_t1->NumRows(), D = W_t1->NumCols();
212  BaseFloat beta_t1 = rho_t1 * (1.0 + alpha_) + alpha_ * d_t1.Sum() / D;
213  Vector<BaseFloat> e_t1(R, kUndefined), sqrt_e_t1(R, kUndefined),
214  inv_sqrt_e_t1(R, kUndefined);
215  ComputeEt(d_t1, beta_t1, &e_t1, &sqrt_e_t1, &inv_sqrt_e_t1);
216 
217  temp_O->SymAddMat2(1.0, *W_t1, kNoTrans, 0.0);
218  // O_{t+1} = E_{t+1}^{-0.5} W_{t+1} W_{t+1}^T E_{t+1}^{-0.5}
219  Matrix<BaseFloat> O_mat(*temp_O);
220  SpMatrix<BaseFloat> O(O_mat, kTakeLower);
221  for (int32 i = 0; i < R; i++) {
222  BaseFloat i_factor = inv_sqrt_e_t1(i);
223  for (int32 j = 0; j <= i; j++) {
224  BaseFloat j_factor = inv_sqrt_e_t1(j);
225  O(i, j) *= i_factor * j_factor;
226  }
227  }
228  if (O.IsUnit(threshold)) {
229  if (self_debug_) {
230  KALDI_WARN << "Not reorthogonalizing since already orthognoal: " << O;
231  }
232  return;
233  }
234  TpMatrix<BaseFloat> C(R);
235  bool cholesky_ok = true;
236  try {
237  // one of the following two calls may throw an exception.
238  C.Cholesky(O);
239  C.Invert(); // Now it's C^{-1}.
240  if (!(C.Max() < 100.0)) {
241  KALDI_WARN << "Cholesky out of expected range, "
242  << "reorthogonalizing with Gram-Schmidt";
243  cholesky_ok = false;
244  }
245  } catch (...) {
246  // We do a Gram-Schmidt orthogonalization, which is a bit less efficient but
247  // more robust than the method using Cholesky.
248  KALDI_WARN << "Cholesky or Invert() failed while re-orthogonalizing R_t. "
249  << "Re-orthogonalizing on CPU.";
250  cholesky_ok = false;
251  }
252  if (!cholesky_ok) {
253  Matrix<BaseFloat> cpu_W_t1(*W_t1);
254  cpu_W_t1.OrthogonalizeRows();
255  W_t1->CopyFromMat(cpu_W_t1);
256  // at this point cpu_W_t1 represents R_{t+1}- it has orthonormal
257  // rows. Do: W_{t+1} = E_{t+1}^{0.5} R_{t+1}
258  CuVector<BaseFloat> sqrt_e_t1_gpu(sqrt_e_t1);
259  W_t1->MulRowsVec(sqrt_e_t1_gpu);
260  return;
261  }
262  // Next, compute (E_t^{0.5} C^{-1} E_t^{-0.5})
263  // but it's really t+1, not t.
264  for (int32 i = 0; i < R; i++) {
265  BaseFloat i_factor = sqrt_e_t1(i);
266  for (int32 j = 0; j < i; j++) {
267  // skip j == i because i_factor * j_factor == 1 for j == i.
268  BaseFloat j_factor = inv_sqrt_e_t1(j);
269  C(i, j) *= i_factor * j_factor;
270  }
271  }
272  O_mat.CopyFromTp(C);
273  temp_O->CopyFromMat(O_mat);
274  temp_W->CopyFromMat(*W_t1);
275  W_t1->AddMatMat(1.0, *temp_O, kNoTrans, *temp_W, kNoTrans, 0.0);
276 }

◆ SelfTest()

void SelfTest ( ) const
private

Definition at line 279 of file natural-gradient-online.cc.

References CuSpMatrix< Real >::AddMat2(), OnlineNaturalGradient::alpha_, OnlineNaturalGradient::ComputeEt(), OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, OnlineNaturalGradient::epsilon_, rnnlm::i, SpMatrix< Real >::IsUnit(), rnnlm::j, KALDI_ASSERT, KALDI_WARN, kaldi::kNoTrans, kaldi::kUndefined, OnlineNaturalGradient::rho_t_, and OnlineNaturalGradient::W_t_.

Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().

279  {
 280  KALDI_ASSERT(rho_t_ >= epsilon_);
 281  BaseFloat d_t_max = d_t_.Max(), d_t_min = d_t_.Min();
282  KALDI_ASSERT(d_t_min >= epsilon_);
283  KALDI_ASSERT(d_t_min > 0.9 * delta_ * d_t_max);
284  KALDI_ASSERT(rho_t_ > 0.9 * delta_ * d_t_max);
285 
286  int32 D = W_t_.NumCols(), R = W_t_.NumRows();
287  BaseFloat beta_t = rho_t_ * (1.0 + alpha_) + alpha_ * d_t_.Sum() / D;
288  Vector<BaseFloat> e_t(R, kUndefined), sqrt_e_t(R, kUndefined),
289  inv_sqrt_e_t(R, kUndefined);
290  ComputeEt(d_t_, beta_t, &e_t, &sqrt_e_t, &inv_sqrt_e_t);
291 
292  CuSpMatrix<BaseFloat> S(R);
293  S.AddMat2(1.0, W_t_, kNoTrans, 0.0);
294  SpMatrix<BaseFloat> O(S);
295  for (int32 i = 0; i < R; i++) {
296  BaseFloat i_factor = inv_sqrt_e_t(i);
297  for (int32 j = 0; j <= i; j++) {
298  BaseFloat j_factor = inv_sqrt_e_t(j);
299  O(i, j) *= i_factor * j_factor;
300  }
301  }
302  if (!O.IsUnit(1.0e-04) || O(0, 0) != O(0, 0)) {
303  BaseFloat worst_error = 0.0;
304  int32 worst_i = 0, worst_j = 0;
305  for (int32 i = 0; i < R; i++) {
306  for (int32 j = 0; j < R; j++) {
307  BaseFloat elem = O(i, j);
308  BaseFloat error = fabs(elem - (i == j ? 1.0 : 0.0));
309  if (error > worst_error || error != error) {
310  worst_error = error;
311  worst_i = i;
312  worst_j = j;
313  }
314  }
315  }
316  if (worst_error > 1.0e-02 || worst_error != worst_error) {
317  KALDI_WARN << "Failed to verify W_t (worst error: O[" << worst_i << ','
318  << worst_j << "] = " << O(worst_i, worst_j)
319  << ", d_t = " << d_t_;
320  }
321  }
322 }

◆ SetAlpha()

◆ SetNumMinibatchesHistory()

void SetNumMinibatchesHistory ( BaseFloat  num_minibatches_history)

Definition at line 617 of file natural-gradient-online.cc.

References KALDI_ASSERT, and OnlineNaturalGradient::num_minibatches_history_.

Referenced by TimeHeightConvolutionComponent::InitFromConfig(), and TimeHeightConvolutionComponent::Read().

618  {
619  KALDI_ASSERT(num_minibatches_history > 1.0);
620  num_minibatches_history_ = num_minibatches_history;
621 }

◆ SetNumSamplesHistory()

void SetNumSamplesHistory ( BaseFloat  num_samples_history)

◆ SetRank()

◆ SetUpdatePeriod()

◆ Swap()

void Swap ( OnlineNaturalGradient *  other)

Definition at line 628 of file natural-gradient-online.cc.

References OnlineNaturalGradient::alpha_, OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, OnlineNaturalGradient::epsilon_, OnlineNaturalGradient::frozen_, OnlineNaturalGradient::num_minibatches_history_, OnlineNaturalGradient::num_samples_history_, OnlineNaturalGradient::rank_, OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::self_debug_, kaldi::swap(), OnlineNaturalGradient::t_, OnlineNaturalGradient::update_period_, and OnlineNaturalGradient::W_t_.

Referenced by TimeHeightConvolutionComponent::ConsolidateMemory(), LstmNonlinearityComponent::ConsolidateMemory(), TdnnComponent::ConsolidateMemory(), NaturalGradientRepeatedAffineComponent::ConsolidateMemory(), ConstantComponent::ConsolidateMemory(), NaturalGradientAffineComponent::ConsolidateMemory(), LinearComponent::ConsolidateMemory(), ConstantFunctionComponent::ConsolidateMemory(), NaturalGradientPerElementScaleComponent::ConsolidateMemory(), ScaleAndOffsetComponent::ConsolidateMemory(), and OnlineNaturalGradient::Freeze().

628  {
629  std::swap(rank_, other->rank_);
630  std::swap(update_period_, other->update_period_);
631  std::swap(num_samples_history_, other->num_samples_history_);
632  std::swap(num_minibatches_history_, other->num_minibatches_history_);
633  std::swap(alpha_, other->alpha_);
634  std::swap(epsilon_, other->epsilon_);
635  std::swap(delta_, other->delta_);
636  std::swap(frozen_, other->frozen_);
637  std::swap(t_, other->t_);
638  std::swap(self_debug_, other->self_debug_);
639  W_t_.Swap(&(other->W_t_));
640  std::swap(rho_t_, other->rho_t_);
641  d_t_.Swap(&(other->d_t_));
642 }

◆ TurnOnDebug()

◆ Updating()

bool Updating ( ) const
private

Definition at line 459 of file natural-gradient-online.cc.

References OnlineNaturalGradient::frozen_, OnlineNaturalGradient::t_, and OnlineNaturalGradient::update_period_.

Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirections().

459  {
460  // Just hard-code it here that we do 10 initial updates before skipping any.
461  // This must be > 'num_init_iters = 3' from Init().
462  const int num_initial_updates = 10;
463 
464  return (!frozen_ &&
465  (t_ <= num_initial_updates ||
466  (t_ - num_initial_updates) % update_period_ == 0));
467 }
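
For example, with update_period_ = 4 and frozen_ = false, Updating() returns true for every minibatch with t_ <= 10 and afterwards only for t_ = 14, 18, 22, and so on; on the other minibatches the preconditioning is still applied, only the refresh of the Fisher-matrix estimate is skipped.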

Member Data Documentation

◆ alpha_

◆ d_t_

◆ delta_

◆ epsilon_

◆ frozen_

◆ num_minibatches_history_

◆ num_samples_history_

◆ rank_

◆ rho_t_

◆ self_debug_

◆ t_

◆ update_period_

◆ W_t_


The documentation for this class was generated from the following files: