Keywords for search: natural gradient, naturalgradient, NG-SGD.
#include <natural-gradient-online.h>
Public Member Functions

    OnlineNaturalGradient ()
    void SetRank (int32 rank)
    void SetUpdatePeriod (int32 update_period)
    void SetNumSamplesHistory (BaseFloat num_samples_history)
    void SetNumMinibatchesHistory (BaseFloat num_minibatches_history)
    void SetAlpha (BaseFloat alpha)
    void TurnOnDebug ()
    BaseFloat GetNumSamplesHistory () const
    BaseFloat GetNumMinibatchesHistory () const
    BaseFloat GetAlpha () const
    int32 GetRank () const
    int32 GetUpdatePeriod () const
    void Freeze (bool frozen)
    void PreconditionDirections (CuMatrixBase< BaseFloat > *X, BaseFloat *scale)
        This call implements the main functionality of this class.
    OnlineNaturalGradient (const OnlineNaturalGradient &other)
    OnlineNaturalGradient & operator= (const OnlineNaturalGradient &other)
    void Swap (OnlineNaturalGradient *other)
Static Private Member Functions

    static void InitOrthonormalSpecial (CuMatrixBase< BaseFloat > *R)
        This function creates a matrix with orthonormal rows that is like the following matrix, except with each row normalized to have unit 2-norm:
            [ 1.1 0   1   0   1   0
              0   1.1 0   1   0   1 ]
        The reason why the first element in each row is 1.1 and not 1 is for symmetry-breaking.
Detailed Description

Keywords for search: natural gradient, naturalgradient, NG-SGD.
This method is explained in the paper "Parallel training of DNNs with Natural Gradient and Parameter Averaging" by D. Povey, X. Zhang and S. Khudanpur, ICLR Workshop, 2015, where it is referred to as online NG-SGD. Note that the method exported from this header is just the core of the algorithm, and some outer-level parts of it are implemented in class NaturalGradientAffineComponent.
The rest of this extended comment describes the way we keep updated an estimate of the inverse of a scatter matrix, in an online way. This is the same as the estimation of one of the A or B quantities in the paper. This comment is slightly redundant with the paper (actually it precedes the paper), but we keep it in case it is useful in understanding our method.
We consider the problem of doing online estimation of a (scaled-identity plus low-rank) approximation of a Fisher matrix... since the Fisher matrix is a scatter of vector-valued derivatives and we will be given the derivatives (or at least terms in a factorization of the derivatives which need not concern us right now), we can just think of the present task as being the online accumulation of a (low-rank plus scaled-identity) approximation to a variance of a distribution with mean zero.
Later on we'll think about how to get easy access to the inverse of this approximate variance, which is what we really need.
Our approximation to the Fisher matrix (the scatter of derivatives) will be of the following form (and just think of this as an approximate variance matrix of some arbitrary quantities).
     F_t =(def) R_t^T D_t R_t + \rho_t I

(t is the minibatch index), where R_t is an R by D matrix with orthonormal rows (1 <= R < D is our chosen rank), D_t is a positive-definite diagonal matrix, and \rho_t > 0. Suppose the dimension of F_t is D. Let the vectors whose variance we are approximating be provided in minibatches of size N (N can vary from iteration to iteration, but it won't vary in the normal case, so we omit the subscript t). The batch of gradients is given as X_t \in \Re^{N x D}, i.e. each row is one of the vectors whose scatter we're estimating. On the t'th iteration, define the scatter S_t of the input vectors X_t as:
S_t =(def) 1/N X_t^T X_t (eqn:St)
(where N is the minibatch size). Be careful not to confuse the rank R with the matrix R_t (we would typeface the matrix in bold if this were not plain text, to make the distinction clearer). We want F_t to approach some kind of time-weighted average of the S_t quantities, to the extent permitted by the limitation of the rank R. We want the F_t quantities to stay "fresh" (since we'll be doing this in a SGD context and the parameters will be slowly changing). We use a constant 0 < \eta < 1 to control the updating rate. Our update for R_t is based on the power method. Define the smoothed scatter

     T_t =(def) \eta S_t + (1-\eta) F_t
[note: F_{t+1} will be set to a low-rank approximation of T_t, which is where the recursion comes in.]
We'll use this in place of the observed scatter S_t, to slow down the update. Defining
Y_t =(def) R_t T_t
which can be expanded as follows:
     Y_t = R_t (\eta S_t + (1-\eta) F_t)
         = R_t (\eta S_t + (1-\eta) (R_t^T D_t R_t + \rho_t I))
         = \eta R_t S_t + (1-\eta) (D_t + \rho_t I) R_t

It is useful to think of Y_t as having each of the top eigenvectors of the scatter scaled by the corresponding eigenvalue. We compute the following R by R matrix:
     Z_t =(def) Y_t Y_t^T
and do the symmetric eigenvalue decomposition
     Z_t = U_t C_t U_t^T
where C_t is diagonal and U_t orthogonal; the diagonal elements of C_t will be positive (since \rho_t > 0, T_t is positive definite; since R_t has full row rank and T_t is positive definite, Y_t has full row rank; hence Z_t is positive definite). The diagonal elements of C_t can be thought of as corresponding to the squares of our current estimate of the top eigenvalues of the scatter matrix. [we should check that no element of C_t is <= 0.]
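To make this concrete, here is a minimal NumPy sketch of one such power-method step (the real implementation is Kaldi C++/CUDA; the dimensions, constants and random data below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, eta, rho_t = 8, 3, 20, 0.1, 0.3
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T  # R x D, orthonormal rows
d_t = rng.uniform(0.5, 2.0, R)                        # diagonal of D_t
X_t = rng.standard_normal((N, D))                     # minibatch of vectors

S_t = X_t.T @ X_t / N                                 # (eqn:St)
F_t = R_t.T @ np.diag(d_t) @ R_t + rho_t * np.eye(D)  # current approximation
T_t = eta * S_t + (1 - eta) * F_t                     # smoothed scatter
Y_t = R_t @ T_t
Z_t = Y_t @ Y_t.T                                     # R x R
C_t, U_t = np.linalg.eigh(Z_t)                        # Z_t = U_t diag(C_t) U_t^T
assert (C_t > 0).all()                                # positive, since rho_t > 0
```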
It is easy to show that C_t^{-0.5} U_t^T Z_t U_t C_t^{-0.5} = I, so
     (C_t^{-0.5} U_t^T Y_t) (Y_t^T U_t C_t^{-0.5}) = I.
Define
     R_{t+1} =(def) C_t^{-0.5} U_t^T Y_t
and it's clear that R_{t+1} R_{t+1}^T = I. We will set
     D_{t+1} =(def) C_t^{0.5} - \rho_{t+1} I     (eqn:dt1)
which ensures that for each row of R_{t+1}, the variance of our scatter matrix F_{t+1} in that direction will be the square root of the corresponding diagonal element of C_t. This makes sense because, as we have pointed out, the diagonal elements of C_t can be thought of as corresponding to squared eigenvalues. But a proper treatment of this would require convergence analysis that would get quite complicated. We will choose \rho_{t+1} in order to ensure that tr(F_{t+1}) = tr(T_t).
For any t,
     tr(F_t) = D \rho_t + tr(D_t)
     tr(T_t) = \eta tr(S_t) + (1-\eta) tr(F_t)
             = \eta tr(S_t) + (1-\eta) (D \rho_t + tr(D_t))
Expanding out D_{t+1} from (eqn:dt1) in the expression for tr(F_{t+1}) below:
     tr(F_{t+1}) = D \rho_{t+1} + tr(D_{t+1})
                 = D \rho_{t+1} + tr(C_t^{0.5} - \rho_{t+1} I)
                 = (D - R) \rho_{t+1} + tr(C_t^{0.5})
and equating tr(F_{t+1}) with tr(T_t) (since F_{t+1} is supposed to be a low-rank approximation to T_t), we have
     (D - R) \rho_{t+1} + tr(C_t^{0.5}) = \eta tr(S_t) + (1-\eta) (D \rho_t + tr(D_t))
Solving for \rho_{t+1},
     \rho_{t+1} = 1/(D - R) (\eta tr(S_t) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5})).    (eqn:rhot1)
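Continuing the NumPy sketch (self-contained again, same arbitrary setup), one can check numerically that this choice of \rho_{t+1} makes tr(F_{t+1}) match tr(T_t), and that R_{t+1} has orthonormal rows:

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, eta, rho_t = 8, 3, 20, 0.1, 0.3
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T
d_t = rng.uniform(0.5, 2.0, R)
X_t = rng.standard_normal((N, D))
S_t = X_t.T @ X_t / N
F_t = R_t.T @ np.diag(d_t) @ R_t + rho_t * np.eye(D)
T_t = eta * S_t + (1 - eta) * F_t
Y_t = R_t @ T_t
C_t, U_t = np.linalg.eigh(Y_t @ Y_t.T)

rho_t1 = (eta * np.trace(S_t) + (1 - eta) * (D * rho_t + d_t.sum())
          - np.sqrt(C_t).sum()) / (D - R)             # (eqn:rhot1)
R_t1 = np.diag(C_t ** -0.5) @ U_t.T @ Y_t             # R_{t+1}
d_t1 = np.sqrt(C_t) - rho_t1                          # (eqn:dt1)
F_t1 = R_t1.T @ np.diag(d_t1) @ R_t1 + rho_t1 * np.eye(D)
assert np.allclose(R_t1 @ R_t1.T, np.eye(R))          # orthonormal rows
assert np.isclose(np.trace(F_t1), np.trace(T_t))      # trace preserved
```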
Note that it is technically possible that diagonal elements of D_{t+1} may be negative, but we can still show that F_{t+1} is strictly positive definite if F_t was strictly positive definite.
If the quantities for which we are computing the Fisher matrix are all zero for some reason, the sequence of F_t will geometrically approach zero, which would cause problems with inversion; to prevent this happening, after setting D_{t+1} and \rho_{t+1} as above, we floor \rho_{t+1} to a small value (like 1.0e-10).
OK, we have described the updating of R_t, D_t and \rho_t. Next, we need to figure out how to efficiently multiply by the inverse of F_t. Our experience from working with the old preconditioning method was that it's best not to use the inverse of the Fisher matrix itself, but a version of the Fisher matrix that's smoothed with some constant times the identity. Below, \alpha > 0 is a configuration value (e.g. 4.0 seemed to work well). The following formula is designed to ensure that the smoothing varies proportionally with the scale of F_t:

     G_t =(def) F_t + \alpha/D tr(F_t) I
          =     R_t^T D_t R_t + (\rho_t + \alpha/D tr(F_t)) I
          =     R_t^T D_t R_t + \beta_t I
where
     \beta_t =(def) \rho_t + \alpha/D tr(F_t)
             =      \rho_t (1+\alpha) + \alpha/D tr(D_t)     (eqn:betat2)
Define
     \hat{X}_t =(def) \beta_t X_t G_t^{-1}.
The factor of \beta_t is inserted arbitrarily, as it just happens to be convenient to put unit scale on X_t in the formula for \hat{X}_t; it will anyway be canceled out in the next step. Then our final preconditioned minibatch of vectors is:
     \bar{X}_t = \gamma_t \hat{X}_t
where
     \gamma_t = sqrt(tr(X_t X_t^T) / tr(\hat{X}_t \hat{X}_t^T)).
The factor of \gamma_t ensures that \bar{X}_t is scaled to have the same overall 2-norm as the input X_t. We found in previous versions of this method that this rescaling was helpful, as otherwise there are certain situations (e.g. at the start of training) where the preconditioned derivatives can get very large. Note that this rescaling introduces a small bias into the training, because now the scale applied to a given sample depends on that sample itself, albeit in an increasingly diluted way as the minibatch size gets large.
To efficiently compute G_t^{-1}, we will use the Woodbury matrix identity. Writing the Woodbury formula for the symmetric case,
     (A + U D U^T)^{-1} = A^{-1} - A^{-1} U (D^{-1} + U^T A^{-1} U)^{-1} U^T A^{-1}
Substituting A = \beta_t I, D = D_t and U = R_t^T, this becomes
     G_t^{-1} = 1/\beta_t I - 1/\beta_t^2 R_t^T (D_t^{-1} + 1/\beta_t I)^{-1} R_t
              = 1/\beta_t (I - R_t^T E_t R_t)
where
     E_t =(def) 1/\beta_t (D_t^{-1} + 1/\beta_t I)^{-1},     (eqn:etdef)
so
     e_{tii} = 1/\beta_t * 1/(1/d_{tii} + 1/\beta_t)         (eqn:tii)
             = 1/(\beta_t/d_{tii} + 1)
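As a sanity check on this inversion formula, here is a NumPy sketch (arbitrary constants, not the Kaldi code) comparing the Woodbury expression against a direct inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, alpha, rho_t = 8, 3, 4.0, 0.3
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T  # orthonormal rows
d_t = rng.uniform(0.5, 2.0, R)

F_t = R_t.T @ np.diag(d_t) @ R_t + rho_t * np.eye(D)
G_t = F_t + alpha / D * np.trace(F_t) * np.eye(D)     # smoothed Fisher matrix
beta_t = rho_t * (1 + alpha) + alpha / D * d_t.sum()  # (eqn:betat2)
e_t = 1.0 / (beta_t / d_t + 1.0)                      # (eqn:tii)
G_inv = (np.eye(D) - R_t.T @ np.diag(e_t) @ R_t) / beta_t
assert np.allclose(G_inv, np.linalg.inv(G_t))         # Woodbury agrees
```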
We would like an efficient-to-compute expression for \hat{X}_t, without too many separate invocations of kernels on the GPU:
     \hat{X}_t = \beta_t X_t G_t^{-1}
               = X_t - X_t R_t^T E_t R_t
For efficient operation on the GPU, we want to reduce the number of high-dimensional operations that we do (defining "high-dimensional" as anything involving D or N, but not R, since R is likely small, such as 20). We define
     W_t =(def) E_t^{0.5} R_t.
We will actually be storing W_t on the GPU rather than R_t, in order to reduce the number of operations on the GPU. We can now write:

     \hat{X}_t = X_t - X_t W_t^T W_t     (eqn:pt2)
The following, which we'll compute on the GPU, are going to be useful in computing quantities like Z_t:

     H_t =(def) X_t W_t^T     (dim is N by R)
     J_t =(def) H_t^T X_t     (dim is R by D)
         =      W_t X_t^T X_t
     K_t =(def) J_t J_t^T     (dim is R by R, symmetric).. transfer this to CPU.
     L_t =(def) H_t^T H_t     (dim is R by R, symmetric).. transfer this to CPU.
         =      W_t X_t^T X_t W_t^T
Note: L_t may also be computed as L_t = J_t W_t^T, which may be more efficient if D < N.

Note: after we have computed H_t we can directly compute
     \hat{X}_t = X_t - H_t W_t
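In NumPy terms (again a sketch with arbitrary data, not the Kaldi GPU kernels), these quantities and the identities just noted look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, beta_t = 8, 3, 20, 1.5
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T
d_t = rng.uniform(0.5, 2.0, R)
e_t = 1.0 / (beta_t / d_t + 1.0)
W_t = np.diag(np.sqrt(e_t)) @ R_t                    # W_t = E_t^{0.5} R_t (stored)
X_t = rng.standard_normal((N, D))

H_t = X_t @ W_t.T                                    # N x R
J_t = H_t.T @ X_t                                    # R x D
K_t = J_t @ J_t.T                                    # R x R -> would go to CPU
L_t = H_t.T @ H_t                                    # R x R -> would go to CPU
assert np.allclose(L_t, J_t @ W_t.T)                 # cheaper route when D < N
X_hat = X_t - H_t @ W_t                              # (eqn:pt2), reusing H_t
assert np.allclose(X_hat, X_t - X_t @ W_t.T @ W_t)
```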
We need to determine how Y_t and Z_t relate to the quantities we just defined. First, we'll expand out H_t, J_t, K_t and L_t in terms of the more fundamental quantities:
     H_t = X_t R_t^T E_t^{0.5}
     J_t = E_t^{0.5} R_t X_t^T X_t
     K_t = E_t^{0.5} R_t X_t^T X_t X_t^T X_t R_t^T E_t^{0.5}
     L_t = E_t^{0.5} R_t X_t^T X_t R_t^T E_t^{0.5}

We wrote above that
     Y_t = \eta R_t S_t + (1-\eta) (D_t + \rho_t I) R_t
so
     Y_t = \eta/N R_t X_t^T X_t + (1-\eta) (D_t + \rho_t I) R_t
         = \eta/N E_t^{-0.5} J_t + (1-\eta) (D_t + \rho_t I) R_t     (eqn:yt)
We will expand Z_t using the expression for Y_t in the line above:
     Z_t = Y_t Y_t^T
         =  (\eta/N)^2 E_t^{-0.5} J_t J_t^T E_t^{-0.5}
          + (\eta/N)(1-\eta) E_t^{-0.5} J_t R_t^T (D_t + \rho_t I)
          + (\eta/N)(1-\eta) (D_t + \rho_t I) R_t J_t^T E_t^{-0.5}
          + (1-\eta)^2 (D_t + \rho_t I)^2
         =  (\eta/N)^2 E_t^{-0.5} K_t E_t^{-0.5}
          + (\eta/N)(1-\eta) E_t^{-0.5} L_t E_t^{-0.5} (D_t + \rho_t I)
          + (\eta/N)(1-\eta) (D_t + \rho_t I) E_t^{-0.5} L_t E_t^{-0.5}
          + (1-\eta)^2 (D_t + \rho_t I)^2                            (eqn:Zt)
We compute Z_t on the CPU using the expression above, and then do the symmetric eigenvalue decomposition (also on the CPU):
     Z_t = U_t C_t U_t^T
and we make sure the eigenvalues are sorted from largest to smallest, for reasons that will be mentioned later.
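The following sketch verifies (eqn:yt) and (eqn:Zt) numerically: the R x R expression assembled from J_t, K_t and L_t (what would be computed on the CPU) agrees with the direct product Y_t Y_t^T. All data and constants are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, eta, rho_t, beta_t = 8, 3, 20, 0.1, 0.3, 1.5
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T
d_t = rng.uniform(0.5, 2.0, R)
e_t = 1.0 / (beta_t / d_t + 1.0)
W_t = np.diag(np.sqrt(e_t)) @ R_t
X_t = rng.standard_normal((N, D))

J_t = W_t @ X_t.T @ X_t
K_t = J_t @ J_t.T
L_t = J_t @ W_t.T
Ei = np.diag(e_t ** -0.5)                            # E_t^{-0.5}
P = np.diag(d_t + rho_t)                             # D_t + rho_t I
Z_t = ((eta / N) ** 2 * Ei @ K_t @ Ei
       + (eta / N) * (1 - eta) * Ei @ L_t @ Ei @ P
       + (eta / N) * (1 - eta) * P @ Ei @ L_t @ Ei
       + (1 - eta) ** 2 * P @ P)                     # (eqn:Zt)
Y_t = eta / N * Ei @ J_t + (1 - eta) * P @ R_t       # (eqn:yt)
assert np.allclose(Z_t, Y_t @ Y_t.T)
```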
Mathematically, no diagonal element of C_t can be less than (1-\eta)^2 \rho_t^2, and since negative or zero elements of C_t would cause us a problem later, we floor C_t to this value. (See below regarding how we ensure R_{t+1} has orthonormal rows.)
We will continue the discussion below regarding what we do with C_t and U_t. Next, we need to digress briefly and describe how to compute tr(\hat{X}_t \hat{X}_t^T) and tr(X_t X_t^T), since these appear in expressions for \gamma_t (needed to produce the output \bar{X}_t) and for \rho_{t+1}. It happens that we need, for purposes of applying "max_change" in the neural net code, the squared 2-norm of each row of the output \bar{X}_t. In order to be able to compute \gamma_t, it's most convenient to compute this squared row-norm for each row of \hat{X}_t, as a vector, to compute tr(\hat{X}_t \hat{X}_t^T) from this vector as its sum, and to then work back to compute tr(X_t X_t^T) from the relation between \hat{X}_t and X_t. We can then scale the row-norms we computed for \hat{X}_t, so they apply to \bar{X}_t.
For current purposes, you can imagine that we computed tr(\hat{X}_t \hat{X}_t^T) directly. Using (from eqn:pt2)
     \hat{X}_t = X_t - X_t W_t^T W_t,
we can expand tr(\hat{X}_t \hat{X}_t^T) as:
     tr(\hat{X}_t \hat{X}_t^T) = tr(X_t X_t^T) - 2 tr(X_t W_t^T W_t X_t^T)
                                 + tr(X_t W_t^T W_t W_t^T W_t X_t^T)
                               = tr(X_t X_t^T) - 2 tr(L_t) + tr(L_t E_t)
[using W_t W_t^T = E_t and the cyclic property of the trace]. Rearranging,
     tr(X_t X_t^T) = tr(\hat{X}_t \hat{X}_t^T) + 2 tr(L_t) - tr(L_t E_t)
and the above expression can be used to obtain tr(X_t X_t^T). We can then do
     \gamma_t <-- sqrt(tr(X_t X_t^T) / tr(\hat{X}_t \hat{X}_t^T))
(or one if the denominator is zero), and then
     \bar{X}_t <-- \gamma_t \hat{X}_t
We can then output the per-row squared 2-norms of \bar{X}_t by scaling those we computed from \hat{X}_t by \gamma_t^2.
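A sketch of this bookkeeping (per-row norms, the trace identity, and the rescaling; arbitrary data):

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, beta_t = 8, 3, 20, 1.5
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T
d_t = rng.uniform(0.5, 2.0, R)
e_t = 1.0 / (beta_t / d_t + 1.0)
W_t = np.diag(np.sqrt(e_t)) @ R_t
X_t = rng.standard_normal((N, D))

H_t = X_t @ W_t.T
L_t = H_t.T @ H_t
X_hat = X_t - H_t @ W_t
row_norms = (X_hat ** 2).sum(axis=1)                 # squared row norms of X_hat
tr_hat = row_norms.sum()                             # tr(X_hat X_hat^T)
tr_X = tr_hat + 2 * np.trace(L_t) - np.trace(L_t @ np.diag(e_t))
assert np.isclose(tr_X, (X_t ** 2).sum())            # recovered tr(X_t X_t^T)
gamma = np.sqrt(tr_X / tr_hat) if tr_hat > 0 else 1.0
X_bar = gamma * X_hat                                # final preconditioned output
row_norms_bar = gamma ** 2 * row_norms               # row norms for "max_change"
```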
OK, the digression on how to compute \gamma_t and tr(X_t X_t^T) is over. We now return to the computation of R_{t+1}, W_{t+1}, \rho_{t+1}, D_{t+1} and E_{t+1}.
We found above in (eqn:rhot1) that
     \rho_{t+1} = 1/(D - R) (\eta tr(S_t) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5})).
Expanding out S_t from its definition in (eqn:St),
     \rho_{t+1} = 1/(D - R) (\eta/N tr(X_t X_t^T) + (1-\eta)(D \rho_t + tr(D_t)) - tr(C_t^{0.5})).
We can compute this directly, as all the quantities involved are already known or easy to compute. Next, from (eqn:dt1), we compute
     D_{t+1} = C_t^{0.5} - \rho_{t+1} I
At this point if \rho_{t+1} is smaller than some small value \epsilon, e.g. 1.0e-10, we set it to \epsilon; as mentioned, we do this to stop F_t approaching zero if all inputs are zero. Next, if any diagonal element d_{t+1,ii} of D_{t+1} has absolute value less than \epsilon, we set it to +\epsilon. This is to ensure that diagonal elements of E are never zero, which would cause problems.
Next, we compute (from eqn:betat2, eqn:etdef, eqn:tii):
     \beta_{t+1} = \rho_{t+1} (1+\alpha) + \alpha/D tr(D_{t+1})
     E_{t+1}     = 1/\beta_{t+1} (D_{t+1}^{-1} + 1/\beta_{t+1} I)^{-1},
i.e.:
     e_{t+1,ii}  = 1/(\beta_{t+1}/d_{t+1,ii} + 1)
We'll want to store D_{t+1}. We next want to compute W_{t+1}.
Before computing W_{t+1}, we need to find an expression for
     R_{t+1} = C_t^{-0.5} U_t^T Y_t
Expanding out Y_t using the expression in (eqn:yt),
     R_{t+1} = C_t^{-0.5} U_t^T (\eta/N E_t^{-0.5} J_t + (1-\eta) (D_t + \rho_t I) R_t)
             =  (\eta/N C_t^{-0.5} U_t^T E_t^{-0.5}) J_t
              + ((1-\eta) C_t^{-0.5} U_t^T (D_t + \rho_t I) E_t^{-0.5}) W_t

What we actually want is W_{t+1} = E_{t+1}^{0.5} R_{t+1}:
     W_{t+1} =  (\eta/N E_{t+1}^{0.5} C_t^{-0.5} U_t^T E_t^{-0.5}) J_t
              + ((1-\eta) E_{t+1}^{0.5} C_t^{-0.5} U_t^T (D_t + \rho_t I) E_t^{-0.5}) W_t
and to minimize the number of matrix-matrix multiplies we can factorize this as:
     W_{t+1} = A_t B_t
         A_t = (\eta/N) E_{t+1}^{0.5} C_t^{-0.5} U_t^T E_t^{-0.5}
         B_t = J_t + (1-\eta)/(\eta/N) (D_t + \rho_t I) W_t
[note: we use the fact that (D_t + \rho_t I) and E_t^{-0.5} commute because they are diagonal].
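Putting the pieces together, this sketch runs one full update and checks that the factorized W_{t+1} = A_t B_t yields an R_{t+1} with orthonormal rows (the epsilon flooring is simplified relative to the text, which floors by absolute value; the orthonormality check is unaffected because the e_{t+1} factors cancel):

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, eta, rho_t, alpha = 8, 3, 20, 0.1, 0.3, 4.0
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T
d_t = rng.uniform(0.5, 2.0, R)
X_t = rng.standard_normal((N, D))
beta_t = rho_t * (1 + alpha) + alpha / D * d_t.sum()
e_t = 1.0 / (beta_t / d_t + 1.0)
W_t = np.diag(np.sqrt(e_t)) @ R_t
J_t = W_t @ X_t.T @ X_t

S_t = X_t.T @ X_t / N
F_t = R_t.T @ np.diag(d_t) @ R_t + rho_t * np.eye(D)
Y_t = R_t @ (eta * S_t + (1 - eta) * F_t)
C_t, U_t = np.linalg.eigh(Y_t @ Y_t.T)
C_t, U_t = C_t[::-1], U_t[:, ::-1]                   # sort largest first
rho_t1 = max((eta * np.trace(S_t) + (1 - eta) * (D * rho_t + d_t.sum())
              - np.sqrt(C_t).sum()) / (D - R), 1e-10)
d_t1 = np.maximum(np.sqrt(C_t) - rho_t1, 1e-10)      # simplified epsilon floor
beta_t1 = rho_t1 * (1 + alpha) + alpha / D * d_t1.sum()
e_t1 = 1.0 / (beta_t1 / d_t1 + 1.0)

A_t = (eta / N) * np.diag(np.sqrt(e_t1 / C_t)) @ U_t.T @ np.diag(e_t ** -0.5)
B_t = J_t + (1 - eta) / (eta / N) * np.diag(d_t + rho_t) @ W_t
W_t1 = A_t @ B_t                                     # new W_{t+1}
R_t1 = np.diag(e_t1 ** -0.5) @ W_t1                  # implied R_{t+1}
assert np.allclose(R_t1 @ R_t1.T, np.eye(R))
```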
A_t is computed on the CPU and transferred from there to the GPU, B_t is computed on the GPU, and the multiplication of A_t with B_t is done on the GPU.
Keeping R_t orthogonal
Our method requires the R_t matrices to be orthogonal (which we define to mean that R_t R_t^T = I). If roundoff error causes this equality to be significantly violated, it could cause a problem for the stability of our method. We now address our method for making sure that the R_t values stay orthogonal. We do this in the algorithm described above, after creating W_{t+1}. This extra step is only executed if the condition number of C_t (i.e. the ratio of its largest to smallest diagonal element) exceeds a specified threshold, such as 1.0e+06 [this is tested before applying the floor to C_t]. The threshold was determined empirically by finding the largest value needed to ensure a certain level of orthogonality in R_{t+1}. For purposes of the present discussion, since R_{t+1} is not actually stored, define it as E_{t+1}^{-0.5} W_{t+1}. Define the following (and we will just use t instead of t+1 below, as all quantities have the same subscript):
O_t =(def) R_t R_t^T = E_t^{-0.5} W_t W_t^T E_t^{-0.5}
(and we would compute this by computing W_t W_t^T on the GPU, transferring it to the CPU, and doing the rest there). If O_t is not sufficiently close to the unit matrix, we can re-orthogonalize as follows: Do the Cholesky decomposition O_t = C C^T Clearly C^{-1} O_t C^{-T} = I, so if we correct R_t with: R_t <– C^{-1} R_t we can ensure orthogonality. If R_t's first k rows are orthogonal, this transform will not affect them, because of its lower-triangular structure... this is good because (thanks to the eigenvalue sorting), the larger eigenvectors are first and it is more critical to keep them pointing in the same direction. Any loss of orthogonality will be dealt with by modifying the smaller eigenvectors. As a modification to W_t, this would be: W_t <– (E_t^{0.5} C^{-1} E_t^{-0.5}) W_t, and the matrix in parentheses is computed on the CPU, transferred to the GPU, and the multiplication is done there.
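A sketch of this re-orthogonalization (here we perturb W_t slightly so that the implied R_t has drifted, then restore orthonormality via the Cholesky factor; the perturbation size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 8, 3
e_t = rng.uniform(0.1, 0.9, R)
R_t = np.linalg.qr(rng.standard_normal((D, R)))[0].T
R_t = R_t + 1e-3 * rng.standard_normal((R, D))       # simulate roundoff drift
W_t = np.diag(np.sqrt(e_t)) @ R_t

Ei = np.diag(e_t ** -0.5)
O_t = Ei @ (W_t @ W_t.T) @ Ei                        # = R_t R_t^T, done on CPU
C = np.linalg.cholesky(O_t)                          # O_t = C C^T
M = np.diag(np.sqrt(e_t)) @ np.linalg.inv(C) @ Ei    # E^{0.5} C^{-1} E^{-0.5}
W_t = M @ W_t                                        # corrected W_t
assert np.allclose(Ei @ W_t @ W_t.T @ Ei, np.eye(R), atol=1e-8)
```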
Initialization
Now, a note on what we do on time t = 0, i.e. for the first minibatch. We initialize R_0 to the top R eigenvectors of 1/N X_0^T X_0, where N is the minibatch size (num-rows of X_0). If L is the corresponding R by R diagonal matrix of eigenvalues, then we will set D_0 = L - \rho_0 I. We set \rho_0 to ensure that tr(F_0) = 1/N tr(X_0 X_0^T), i.e.
     D \rho_0 + tr(D_0) = 1/N tr(X_0 X_0^T)
     D \rho_0 + tr(L) - R \rho_0 = 1/N tr(X_0 X_0^T)
     \rho_0 = (1/N tr(X_0 X_0^T) - tr(L)) / (D - R)
We then floor \rho_0 to \epsilon (e.g. 1.0e-10) and also floor the diagonal elements of D_0 to \epsilon; this ensures that we won't crash for zero inputs.
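A sketch of this initialization (arbitrary data; the flooring does not trigger here, so the trace identity holds exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
D, R, N, eps = 8, 3, 20, 1e-10
X0 = rng.standard_normal((N, D))

lam, V = np.linalg.eigh(X0.T @ X0 / N)        # eigendecomposition of the scatter
lam, V = lam[::-1][:R], V[:, ::-1][:, :R]     # top R eigenvalues / eigenvectors
R0 = V.T                                      # R x D, orthonormal rows
rho0 = max(((X0 ** 2).sum() / N - lam.sum()) / (D - R), eps)
d0 = np.maximum(lam - rho0, eps)              # diagonal of D_0
F0 = R0.T @ np.diag(d0) @ R0 + rho0 * np.eye(D)
assert np.isclose(np.trace(F0), (X0 ** 2).sum() / N)
```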
A note on multi-threading. This technique was really designed for use with a GPU, where we won't have multi-threading, but we want it to work also on a CPU, where we may have multiple worker threads. Our approach is as follows (we do this when we're about to start updating the parameters R_t, D_t, and derived quantities):
For time t > 0 (where the matrices are already initialized), before starting the part of the computation that updates the parameters (R_t, D_t, and derived quantities), we try to lock a mutex that guards the OnlineNaturalGradient object. If we can lock it right away, we go ahead and do the update, but if not, we just abandon the attempt to update those quantities.
We will have another mutex to ensure that when we access quantities like W_t, they are all "in sync" (and we don't access them while they are being written by another thread). This mutex will only be locked for short periods of time.
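A sketch of the try-lock pattern in Python's threading module (the Kaldi code uses a C++ mutex; the function and variable names here are illustrative only):

```python
import threading

update_mutex = threading.Lock()   # guards R_t, D_t, rho_t and derived state

def precondition_directions(X):
    # ... preconditioning of X happens unconditionally ...
    if update_mutex.acquire(blocking=False):     # try-lock: never wait
        try:
            pass   # update R_t, D_t, rho_t and derived quantities here
        finally:
            update_mutex.release()
    # else: another thread is mid-update; skip updating this time.
```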
Note: it might be a good idea to make sure that the R_t still retain orthonormal rows even in the presence of roundoff, without errors accumulating. My instinct is that this isn't going to be a problem.
Definition at line 414 of file natural-gradient-online.h.
Constructor & Destructor Documentation

OnlineNaturalGradient ()

Definition at line 27 of file natural-gradient-online.cc.

Referenced by OnlineNaturalGradient::Freeze().
OnlineNaturalGradient (const OnlineNaturalGradient &other) [explicit]
Definition at line 578 of file natural-gradient-online.cc.
Member Function Documentation

ComputeEt() [private]
Definition at line 560 of file natural-gradient-online.cc.
References VectorBase< Real >::ApplyPow(), VectorBase< Real >::CopyFromVec(), rnnlm::d, VectorBase< Real >::Data(), VectorBase< Real >::Dim(), rnnlm::i, and VectorBase< Real >::InvertElements().
Referenced by OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::Freeze(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::ReorthogonalizeRt1(), and OnlineNaturalGradient::SelfTest().
ComputeWt1() [private]
Definition at line 484 of file natural-gradient-online.cc.
References CuMatrixBase< Real >::AddDiagVecMat(), CuMatrixBase< Real >::AddMatMat(), OnlineNaturalGradient::alpha_, OnlineNaturalGradient::ComputeEt(), VectorBase< Real >::Dim(), OnlineNaturalGradient::Eta(), rnnlm::i, VectorBase< Real >::InvertElements(), rnnlm::j, KALDI_ASSERT, kaldi::kNoTrans, kaldi::kTrans, kaldi::kUndefined, CuMatrixBase< Real >::NumCols(), and VectorBase< Real >::Sum().
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().
ComputeZt() [private]
Definition at line 529 of file natural-gradient-online.cc.
References VectorBase< Real >::Add(), VectorBase< Real >::Dim(), OnlineNaturalGradient::Eta(), rnnlm::i, and rnnlm::j.
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().
Eta() [private]

Definition at line 470 of file natural-gradient-online.cc.
References KALDI_ASSERT, OnlineNaturalGradient::num_minibatches_history_, and OnlineNaturalGradient::num_samples_history_.
Referenced by OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::ComputeZt(), OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().
void Freeze (bool frozen) [inline]
Definition at line 437 of file natural-gradient-online.h.
References OnlineNaturalGradient::ComputeEt(), OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::ComputeZt(), OnlineNaturalGradient::Eta(), OnlineNaturalGradient::frozen_, OnlineNaturalGradient::Init(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::InitOrthonormalSpecial(), OnlineNaturalGradient::OnlineNaturalGradient(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::ReorthogonalizeRt1(), OnlineNaturalGradient::SelfTest(), OnlineNaturalGradient::Swap(), and OnlineNaturalGradient::Updating().
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TimeHeightConvolutionComponent::FreezeNaturalGradient(), LstmNonlinearityComponent::FreezeNaturalGradient(), TdnnComponent::FreezeNaturalGradient(), NaturalGradientAffineComponent::FreezeNaturalGradient(), LinearComponent::FreezeNaturalGradient(), and NaturalGradientPerElementScaleComponent::FreezeNaturalGradient().
BaseFloat GetAlpha () const [inline]
Definition at line 432 of file natural-gradient-online.h.
References OnlineNaturalGradient::alpha_.
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TimeHeightConvolutionComponent::Info(), TdnnComponent::Info(), NaturalGradientAffineComponent::Info(), LinearComponent::Info(), TimeHeightConvolutionComponent::Write(), TdnnComponent::Write(), NaturalGradientAffineComponent::Write(), and LinearComponent::Write().
BaseFloat GetNumMinibatchesHistory () const [inline]
Definition at line 431 of file natural-gradient-online.h.
References OnlineNaturalGradient::num_minibatches_history_.
Referenced by TimeHeightConvolutionComponent::Info(), and TimeHeightConvolutionComponent::Write().
BaseFloat GetNumSamplesHistory () const [inline]
Definition at line 430 of file natural-gradient-online.h.
References OnlineNaturalGradient::num_samples_history_.
Referenced by TdnnComponent::Info(), NaturalGradientAffineComponent::Info(), LinearComponent::Info(), TdnnComponent::Write(), NaturalGradientAffineComponent::Write(), and LinearComponent::Write().
int32 GetRank () const [inline]
Definition at line 433 of file natural-gradient-online.h.
References OnlineNaturalGradient::rank_.
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TimeHeightConvolutionComponent::Info(), TdnnComponent::Info(), NaturalGradientAffineComponent::Info(), LinearComponent::Info(), TimeHeightConvolutionComponent::Write(), TdnnComponent::Write(), NaturalGradientAffineComponent::Write(), and LinearComponent::Write().
int32 GetUpdatePeriod () const [inline]
Definition at line 434 of file natural-gradient-online.h.
References OnlineNaturalGradient::update_period_.
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TdnnComponent::Info(), NaturalGradientAffineComponent::Info(), LinearComponent::Info(), NaturalGradientAffineComponent::Write(), and LinearComponent::Write().
Init() [private]
Definition at line 122 of file natural-gradient-online.cc.
References OnlineNaturalGradient::d_t_, OnlineNaturalGradient::frozen_, rnnlm::i, OnlineNaturalGradient::InitDefault(), kaldi::kUndefined, CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::rank_, OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::t_, and OnlineNaturalGradient::W_t_.
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirections().
InitDefault() [private]
Definition at line 71 of file natural-gradient-online.cc.
References OnlineNaturalGradient::alpha_, OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, OnlineNaturalGradient::epsilon_, OnlineNaturalGradient::InitOrthonormalSpecial(), KALDI_ASSERT, KALDI_WARN, kaldi::kUndefined, OnlineNaturalGradient::num_minibatches_history_, OnlineNaturalGradient::num_samples_history_, OnlineNaturalGradient::rank_, OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::t_, and OnlineNaturalGradient::W_t_.
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::Init().
static void InitOrthonormalSpecial (CuMatrixBase< BaseFloat > *R) [static, private]

This function creates a matrix with orthonormal rows that is like the following matrix, except with each row normalized to have unit 2-norm:
    [ 1.1 0   1   0   1   0
      0   1.1 0   1   0   1 ]
The reason why the first element in each row is 1.1 and not 1 is for symmetry-breaking:
we don't want any weighted sum of all these rows to be all ones, because the derivative in that direction can be zero in some architectures and it causes us to have to do an inefficient CPU-based renormalization.
Definition at line 46 of file natural-gradient-online.cc.
References CuMatrixBase< Real >::AddElements(), rnnlm::i, KALDI_ASSERT, CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), and CuMatrixBase< Real >::SetZero().
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::InitDefault().
OnlineNaturalGradient & operator= (const OnlineNaturalGradient &other)
Definition at line 588 of file natural-gradient-online.cc.
References OnlineNaturalGradient::alpha_, OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, OnlineNaturalGradient::epsilon_, OnlineNaturalGradient::num_samples_history_, OnlineNaturalGradient::rank_, OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::self_debug_, OnlineNaturalGradient::t_, OnlineNaturalGradient::update_period_, and OnlineNaturalGradient::W_t_.
Referenced by OnlineNaturalGradient::Freeze().
void PreconditionDirections (CuMatrixBase< BaseFloat > *X, BaseFloat *scale)
This call implements the main functionality of this class.
Parameters
    [in,out] X    This is both the input (X_t in the comment above, X in the paper) and the output (\hat{X}_t in the comment, X with a hat on it in the paper). Each row of X is viewed as a vector in some space, where we're estimating a smoothed Fisher matrix and then multiplying by the inverse of that smoothed Fisher matrix.
    [out] scale   If non-NULL, a scaling factor is written to here, and the output X should be multiplied by this factor by the user (we don't do it internally, to save an operation). The factor is chosen so that the vector 2-norm of X is the same after the natural gradient as it was before. (The pointer being NULL or non-NULL doesn't affect the magnitude of X; in any case the user will probably want to do this rescaling, the question being whether they want to do so manually or not.)
Definition at line 159 of file natural-gradient-online.cc.
References OnlineNaturalGradient::d_t_, OnlineNaturalGradient::Init(), kaldi::kTrans, CuMatrixBase< Real >::NumCols(), NVTX_RANGE, OnlineNaturalGradient::PreconditionDirectionsInternal(), CuMatrixBase< Real >::Range(), OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::t_, kaldi::TraceMatMat(), OnlineNaturalGradient::Updating(), and OnlineNaturalGradient::W_t_.
Referenced by LstmNonlinearityComponent::Backprop(), ConstantComponent::Backprop(), LinearComponent::Backprop(), PerElementOffsetComponent::Backprop(), ConstantFunctionComponent::Backprop(), ScaleAndOffsetComponent::BackpropInternal(), LstmNonlinearityComponent::ConsolidateMemory(), OnlineNaturalGradient::Freeze(), OnlineNaturalGradient::Init(), kaldi::nnet3::UnitTestPreconditionDirectionsOnline(), NaturalGradientRepeatedAffineComponent::Update(), NaturalGradientAffineComponent::Update(), NaturalGradientPerElementScaleComponent::Update(), TimeHeightConvolutionComponent::UpdateNaturalGradient(), and TdnnComponent::UpdateNaturalGradient().
PreconditionDirectionsInternal() [private]
Definition at line 324 of file natural-gradient-online.cc.
References VectorBase< Real >::Add(), CuMatrixBase< Real >::AddMatMat(), OnlineNaturalGradient::alpha_, VectorBase< Real >::ApplyFloor(), VectorBase< Real >::ApplyPow(), OnlineNaturalGradient::ComputeEt(), OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::ComputeZt(), MatrixBase< Real >::CopyLowerToUpper(), OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, SpMatrix< Real >::Eig(), OnlineNaturalGradient::epsilon_, OnlineNaturalGradient::Eta(), KALDI_ASSERT, KALDI_VLOG, KALDI_WARN, kaldi::kNoTrans, kaldi::kTrans, VectorBase< Real >::Max(), CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), NVTX_RANGE, OnlineNaturalGradient::rank_, OnlineNaturalGradient::ReorthogonalizeRt1(), OnlineNaturalGradient::rho_t_, PackedMatrix< Real >::Scale(), VectorBase< Real >::Scale(), OnlineNaturalGradient::self_debug_, OnlineNaturalGradient::SelfTest(), kaldi::SortSvd(), VectorBase< Real >::Sum(), SpMatrix< Real >::Trace(), and OnlineNaturalGradient::W_t_.
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirections().
ReorthogonalizeRt1() [private]
Definition at line 201 of file natural-gradient-online.cc.
References CuMatrixBase< Real >::AddMatMat(), OnlineNaturalGradient::alpha_, TpMatrix< Real >::Cholesky(), OnlineNaturalGradient::ComputeEt(), CuMatrixBase< Real >::CopyFromMat(), MatrixBase< Real >::CopyFromTp(), rnnlm::i, TpMatrix< Real >::Invert(), SpMatrix< Real >::IsUnit(), rnnlm::j, KALDI_WARN, kaldi::kNoTrans, kaldi::kTakeLower, kaldi::kUndefined, PackedMatrix< Real >::Max(), CuMatrixBase< Real >::MulRowsVec(), CuMatrixBase< Real >::NumCols(), CuMatrixBase< Real >::NumRows(), MatrixBase< Real >::OrthogonalizeRows(), OnlineNaturalGradient::self_debug_, VectorBase< Real >::Sum(), and CuMatrixBase< Real >::SymAddMat2().
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().
SelfTest() [private]
Definition at line 279 of file natural-gradient-online.cc.
References CuSpMatrix< Real >::AddMat2(), OnlineNaturalGradient::alpha_, OnlineNaturalGradient::ComputeEt(), OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, OnlineNaturalGradient::epsilon_, rnnlm::i, SpMatrix< Real >::IsUnit(), rnnlm::j, KALDI_ASSERT, KALDI_WARN, kaldi::kNoTrans, kaldi::kUndefined, OnlineNaturalGradient::rho_t_, and OnlineNaturalGradient::W_t_.
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirectionsInternal().
void SetAlpha (BaseFloat alpha)
Definition at line 623 of file natural-gradient-online.cc.
References OnlineNaturalGradient::alpha_, and KALDI_ASSERT.
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TimeHeightConvolutionComponent::InitFromConfig(), TdnnComponent::InitFromConfig(), NaturalGradientAffineComponent::InitFromConfig(), LinearComponent::InitFromConfig(), TimeHeightConvolutionComponent::Read(), TdnnComponent::Read(), and LinearComponent::Read().
void SetNumMinibatchesHistory (BaseFloat num_minibatches_history)
Definition at line 617 of file natural-gradient-online.cc.
References KALDI_ASSERT, and OnlineNaturalGradient::num_minibatches_history_.
Referenced by TimeHeightConvolutionComponent::InitFromConfig(), and TimeHeightConvolutionComponent::Read().
void SetNumSamplesHistory (BaseFloat num_samples_history)
Definition at line 612 of file natural-gradient-online.cc.
References KALDI_ASSERT, and OnlineNaturalGradient::num_samples_history_.
Referenced by TdnnComponent::InitFromConfig(), NaturalGradientAffineComponent::InitFromConfig(), LinearComponent::InitFromConfig(), LstmNonlinearityComponent::InitNaturalGradient(), TdnnComponent::Read(), and LinearComponent::Read().
void SetRank (int32 rank)
Definition at line 604 of file natural-gradient-online.cc.
References KALDI_ASSERT, and OnlineNaturalGradient::rank_.
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TimeHeightConvolutionComponent::InitFromConfig(), TdnnComponent::InitFromConfig(), NaturalGradientAffineComponent::InitFromConfig(), LinearComponent::InitFromConfig(), PerElementOffsetComponent::InitFromConfig(), LstmNonlinearityComponent::InitNaturalGradient(), LinearComponent::LinearComponent(), NaturalGradientAffineComponent::NaturalGradientAffineComponent(), TimeHeightConvolutionComponent::Read(), TdnnComponent::Read(), LinearComponent::Read(), PerElementOffsetComponent::Read(), and kaldi::nnet3::UnitTestPreconditionDirectionsOnline().
void SetUpdatePeriod (int32 update_period)
Definition at line 608 of file natural-gradient-online.cc.
References KALDI_ASSERT, and OnlineNaturalGradient::update_period_.
Referenced by LstmNonlinearityComponent::ConsolidateMemory(), TdnnComponent::InitFromConfig(), NaturalGradientAffineComponent::InitFromConfig(), LinearComponent::InitFromConfig(), PerElementOffsetComponent::InitFromConfig(), LstmNonlinearityComponent::InitNaturalGradient(), LinearComponent::LinearComponent(), NaturalGradientAffineComponent::NaturalGradientAffineComponent(), TdnnComponent::Read(), LinearComponent::Read(), and PerElementOffsetComponent::Read().
void Swap (OnlineNaturalGradient *other)
Definition at line 628 of file natural-gradient-online.cc.
References OnlineNaturalGradient::alpha_, OnlineNaturalGradient::d_t_, OnlineNaturalGradient::delta_, OnlineNaturalGradient::epsilon_, OnlineNaturalGradient::frozen_, OnlineNaturalGradient::num_minibatches_history_, OnlineNaturalGradient::num_samples_history_, OnlineNaturalGradient::rank_, OnlineNaturalGradient::rho_t_, OnlineNaturalGradient::self_debug_, kaldi::swap(), OnlineNaturalGradient::t_, OnlineNaturalGradient::update_period_, and OnlineNaturalGradient::W_t_.
Referenced by TimeHeightConvolutionComponent::ConsolidateMemory(), LstmNonlinearityComponent::ConsolidateMemory(), TdnnComponent::ConsolidateMemory(), NaturalGradientRepeatedAffineComponent::ConsolidateMemory(), ConstantComponent::ConsolidateMemory(), NaturalGradientAffineComponent::ConsolidateMemory(), LinearComponent::ConsolidateMemory(), ConstantFunctionComponent::ConsolidateMemory(), NaturalGradientPerElementScaleComponent::ConsolidateMemory(), ScaleAndOffsetComponent::ConsolidateMemory(), and OnlineNaturalGradient::Freeze().
void TurnOnDebug () [inline]
Definition at line 429 of file natural-gradient-online.h.
References OnlineNaturalGradient::self_debug_.
Referenced by kaldi::nnet3::UnitTestPreconditionDirectionsOnline().
Updating() [private]
Definition at line 459 of file natural-gradient-online.cc.
References OnlineNaturalGradient::frozen_, OnlineNaturalGradient::t_, and OnlineNaturalGradient::update_period_.
Referenced by OnlineNaturalGradient::Freeze(), and OnlineNaturalGradient::PreconditionDirections().
Member Data Documentation

alpha_ [private]
Definition at line 583 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::ComputeWt1(), OnlineNaturalGradient::GetAlpha(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::ReorthogonalizeRt1(), OnlineNaturalGradient::SelfTest(), OnlineNaturalGradient::SetAlpha(), and OnlineNaturalGradient::Swap().
d_t_ [private]

Definition at line 614 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Init(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::SelfTest(), and OnlineNaturalGradient::Swap().
delta_ [private]
Definition at line 595 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::SelfTest(), and OnlineNaturalGradient::Swap().
epsilon_ [private]
Definition at line 589 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::SelfTest(), and OnlineNaturalGradient::Swap().
frozen_ [private]
Definition at line 603 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Freeze(), OnlineNaturalGradient::Init(), OnlineNaturalGradient::Swap(), and OnlineNaturalGradient::Updating().
num_minibatches_history_ [private]
Definition at line 578 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Eta(), OnlineNaturalGradient::GetNumMinibatchesHistory(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::SetNumMinibatchesHistory(), and OnlineNaturalGradient::Swap().
num_samples_history_ [private]
Definition at line 566 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Eta(), OnlineNaturalGradient::GetNumSamplesHistory(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::SetNumSamplesHistory(), and OnlineNaturalGradient::Swap().
rank_ [private]
Definition at line 553 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::GetRank(), OnlineNaturalGradient::Init(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::SetRank(), and OnlineNaturalGradient::Swap().
rho_t_ [private]
Definition at line 613 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Init(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::SelfTest(), and OnlineNaturalGradient::Swap().
self_debug_ [private]
Definition at line 610 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::ReorthogonalizeRt1(), OnlineNaturalGradient::Swap(), and OnlineNaturalGradient::TurnOnDebug().
t_ [private]
Definition at line 607 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Init(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::Swap(), and OnlineNaturalGradient::Updating().
update_period_ [private]
Definition at line 558 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::GetUpdatePeriod(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::SetUpdatePeriod(), OnlineNaturalGradient::Swap(), and OnlineNaturalGradient::Updating().
W_t_ [private]

Definition at line 612 of file natural-gradient-online.h.
Referenced by OnlineNaturalGradient::Init(), OnlineNaturalGradient::InitDefault(), OnlineNaturalGradient::operator=(), OnlineNaturalGradient::PreconditionDirections(), OnlineNaturalGradient::PreconditionDirectionsInternal(), OnlineNaturalGradient::SelfTest(), and OnlineNaturalGradient::Swap().