This comment explains the basic framework used for everything related to time-height convolution.

#include <convolution.h>

Classes
struct	Offset

Public Member Functions
void	ComputeDerived ()

int32	InputDim () const

int32	OutputDim () const

int32	ParamRows () const

int32	ParamCols () const

	ConvolutionModel ()

bool	operator== (const ConvolutionModel &other) const

bool	Check (bool check_heights_used=true, bool allow_height_padding=true) const

std::string	Info () const

void	Write (std::ostream &os, bool binary) const

void	Read (std::istream &is, bool binary)

Public Attributes
int32	num_filters_in

int32	num_filters_out

int32	height_in

int32	height_out

int32	height_subsample_out

std::vector< Offset >	offsets

std::set< int32 >	required_time_offsets

std::set< int32 >	all_time_offsets

int32	time_offsets_modulus

Detailed Description

This comment explains the basic framework used for everything related to time-height convolution.

We are doing convolution in two dimensions; these would normally be width and height, but in the nnet3 framework we identify the width with the 'time' dimension (the 't' element of an Index). This enables us to use this framework in the normal way for speech tasks, and it turns out to have other advantages too, giving us a very efficient and easy implementation of CNNs (basically, the nnet3 framework takes care of certain reorderings for us). As mentioned, the 't' index will correspond to the width, and the vectors we operate on will be of dimension height * num-filters, where the filter-index has a stride of 1.
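As a small illustration of that layout, the sketch below computes which column of a row holds the value for height-index h and filter-index f, assuming the row dimension is height * num-filters with the filter-index at stride 1, as stated above. `ColumnIndex` is a hypothetical helper, not a Kaldi function, and `int32_t` stands in for Kaldi's `int32`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Kaldi): column of (height h, filter f)
// inside one row of dimension height * num_filters, where the filter index
// has stride 1 and the height index has stride num_filters.
inline int32_t ColumnIndex(int32_t h, int32_t f, int32_t num_filters,
                           int32_t height) {
  assert(h >= 0 && h < height && f >= 0 && f < num_filters);
  return h * num_filters + f;
}
```

So the num-filters values for one height are contiguous, and successive heights start num-filters columns apart.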

We will use the GeneralComponent interface, and its function ReorderIndexes(), to ensure that the input and output Indexes of the component have a specified regular structure; we'll pad with 'blank' Indexes (t=kNoTime) on the input and output of the component, as needed to ensure that it's an evenly spaced grid over n and t, with x always zero and the t values evenly spaced. (A caveat on even spacing: for computations with downsampling, the ordering of the 't' values is a bit more complicated; search for 'blocks' in the rest of this header for more information.)

First consider the simplest case, which we call "same-t-stride": there is no downsampling on the time index, i.e. the input and output 't' values have the same stride (such as 1, 2 or 4). The input and output matrices have num-t-values * num-images rows, with the t index having the larger stride. The computation involves copying a row-range of the input matrix to a temporary matrix with a column mapping (the temporary matrix will typically have more columns than the input matrix), and then doing a matrix multiply between the reshaped temporary matrix and a block of the parameters; the block corresponds to a particular time-offset. We may then need to repeat the whole process with a different, shifted row-range of the input matrix and a different column map. The rest of this header explains this in more detail.

This struct represents a convolutional model from a structural point of view (it doesn't contain the actual parameters). Note: the parameters are to be stored in a matrix of dimension (num_filters_out) by (offsets.size() * num_filters_in), where the offset-index has a larger stride than the filter-index.
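To pin down what the same-t-stride computation produces (as opposed to how it is implemented efficiently), here is a naive reference loop. It is a sketch under several assumptions: one image, input rows indexed directly by consecutive t values, no time subsampling, and plain `std::vector` matrices; `OffsetPair`, `Matrix` and `NaiveConvolve` are hypothetical names. It uses the parameter layout from the note above (column = offset-index * num_filters_in + filter-index) and the height mapping h_in = h_out * height_subsample_out + height_offset that Check() uses, treating out-of-range heights as zero.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for Kaldi types, for illustration only.
struct OffsetPair { int time_offset; int height_offset; };
using Matrix = std::vector<std::vector<float>>;  // [row][col]

// input: row r holds frame t = first_t_in + r, dim height_in * num_filters_in.
// params: num_filters_out rows, offsets.size() * num_filters_in columns,
//         with the offset index having the larger stride.
// Returns num_t_out rows for frames first_t_out, first_t_out + 1, ...
Matrix NaiveConvolve(const Matrix &input, int first_t_in, const Matrix &params,
                     const std::vector<OffsetPair> &offsets,
                     int num_filters_in, int num_filters_out, int height_in,
                     int height_out, int height_subsample_out,
                     int first_t_out, int num_t_out) {
  Matrix output(num_t_out,
                std::vector<float>(height_out * num_filters_out, 0.0f));
  const int num_offsets = static_cast<int>(offsets.size());
  for (int r = 0; r < num_t_out; r++) {
    int t_out = first_t_out + r;
    for (int h_out = 0; h_out < height_out; h_out++) {
      for (int i = 0; i < num_offsets; i++) {
        int h_in = h_out * height_subsample_out + offsets[i].height_offset;
        if (h_in < 0 || h_in >= height_in) continue;  // zero-padding in height
        int row_in = t_out + offsets[i].time_offset - first_t_in;
        assert(row_in >= 0 && row_in < static_cast<int>(input.size()));
        for (int f_out = 0; f_out < num_filters_out; f_out++) {
          float sum = 0.0f;
          for (int f_in = 0; f_in < num_filters_in; f_in++)
            sum += params[f_out][i * num_filters_in + f_in] *
                   input[row_in][h_in * num_filters_in + f_in];
          output[r][h_out * num_filters_out + f_out] += sum;
        }
      }
    }
  }
  return output;
}
```

An efficient implementation would instead gather each shifted row-range into a temporary matrix and do one matrix multiply per time-offset, as described above; a loop like this is mainly useful as a test oracle.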

Partly out of a desire for generality, but also for convenience in implementation and integration with nnet3, at this level we don't represent the patch size in the normal way like '1x1' or '3x3', but as a list of pairs (time-offset, height-offset). E.g. a 1x1 patch would normally be the single pair (0,0), and a 3x3 patch might be represented as

offsets={ (0,0),(0,1),(0,2), (1,0),(1,1),(1,2), (2,0),(2,1),(2,2) }
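A sketch of generating such a patch as (time-offset, height-offset) pairs, using `std::pair<int, int>` in place of ConvolutionModel::Offset. `MakePatch` and its `t_step` parameter (the spacing between consecutive time offsets) are hypothetical; looping over time in the outer loop yields the lexicographic order directly.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// (time-offset, height-offset); std::pair compares lexicographically.
using OffsetPair = std::pair<int, int>;

// Hypothetical helper: all offsets of a num_t x num_h patch starting at
// (t_start, h_start), with time offsets t_step apart and height offsets 1 apart.
std::vector<OffsetPair> MakePatch(int num_t, int num_h, int t_start = 0,
                                  int h_start = 0, int t_step = 1) {
  assert(num_t > 0 && num_h > 0 && t_step > 0);
  std::vector<OffsetPair> offsets;
  for (int i = 0; i < num_t; i++)      // time in the outer loop, so the
    for (int j = 0; j < num_h; j++)    // result is already lexicographically sorted
      offsets.emplace_back(t_start + i * t_step, h_start + j);
  return offsets;
}
```

MakePatch(3, 3) gives the nine pairs listed above, from (0,0) to (2,2).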

However (and you have to be a bit careful here), the time indexes are on an absolute numbering scheme, so that if you had downsampled the time axis on a previous layer, the time-offsets might all be multiples of (e.g.) 2 or 4, while the height-offsets would normally still be separated by 1. [Note: we always normalize the list of (time-offset, height-offset) pairs with the lexicographical ordering that you see above.] This asymmetry between time and height may not be very aesthetic, but the absolute numbering of time is at the core of how the framework works. Note: the offsets don't have to start from zero; they can be negative, just like the offsets in TDNNs, which are often lists like (-3,0,3). Don't be surprised to see things like:

offsets={ (-3,-1),(-3,0),(-3,1), (0,-1),(0,0),(0,2), (3,-1),(3,0),(3,1) }
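Since the list is kept in lexicographic order (Check() asserts IsSortedAndUniq(offsets)), a hand-assembled list would need normalizing first. A minimal sketch with standard algorithms; `NormalizeOffsets` and `SortedAndUniq` are hypothetical helpers, and `std::pair` ordering stands in for the comparison operators of ConvolutionModel::Offset, assumed here to be lexicographic as the text describes.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using OffsetPair = std::pair<int, int>;  // (time-offset, height-offset)

// Sort lexicographically (time first, then height) and drop duplicates.
void NormalizeOffsets(std::vector<OffsetPair> *offsets) {
  assert(offsets != nullptr);
  std::sort(offsets->begin(), offsets->end());
  offsets->erase(std::unique(offsets->begin(), offsets->end()),
                 offsets->end());
}

// True if the list is strictly increasing, i.e. sorted with no duplicates.
bool SortedAndUniq(const std::vector<OffsetPair> &offsets) {
  return std::adjacent_find(offsets.begin(), offsets.end(),
                            [](const OffsetPair &a, const OffsetPair &b) {
                              return !(a < b);  // a >= b breaks the order
                            }) == offsets.end();
}
```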

If there are negative offsets in the height dimension (as above), it means that there is zero-padding in the height dimension. This is because the first height-index at both the input and the output is 0, so a height-offset of -1 means that to compute the output at height-index 0 we need the input at height-index -1, which doesn't exist; this implies zero-padding on the bottom of the image.
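The amount of zero-padding implied by a set of offsets can be worked out from the extreme height-offsets. The sketch below assumes, as in Check(), that output height-index j reads input height j * height_subsample_out + height_offset; `RequiredPadding` and `HeightPadding` are hypothetical names. "Bottom" is the padding below height-index 0 and "top" the padding above height_in - 1.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using OffsetPair = std::pair<int, int>;  // (time-offset, height-offset)

struct HeightPadding { int bottom; int top; };

// Number of missing input heights below 0 and above height_in - 1 that the
// offsets would touch, i.e. how much zero-padding they imply.
HeightPadding RequiredPadding(const std::vector<OffsetPair> &offsets,
                              int height_in, int height_out, int subsample) {
  assert(!offsets.empty() && height_in > 0 && height_out > 0 && subsample > 0);
  int min_h = offsets.front().second, max_h = offsets.front().second;
  for (const OffsetPair &o : offsets) {
    min_h = std::min(min_h, o.second);
    max_h = std::max(max_h, o.second);
  }
  int lowest_in = min_h;                                  // output j = 0
  int highest_in = (height_out - 1) * subsample + max_h;  // last output j
  return {std::max(0, -lowest_in), std::max(0, highest_in - (height_in - 1))};
}
```

Assuming the other checks pass, a model with nonzero padding at either end is one that Check() rejects when called with allow_height_padding=false.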

Definition at line 125 of file convolution.h.

Constructor & Destructor Documentation

◆ ConvolutionModel()

ConvolutionModel ( )

inline

Definition at line 210 of file convolution.h.

References ConvolutionModel::Check(), ConvolutionModel::Info(), ConvolutionModel::Offset::operator==(), ConvolutionModel::Read(), and ConvolutionModel::Write().

ConvolutionModel() { }

Member Function Documentation

◆ Check()

bool Check (bool check_heights_used = true, bool allow_height_padding = true) const

Definition at line 130 of file convolution.cc.

References ConvolutionModel::all_time_offsets, ConvolutionModel::ComputeDerived(), ConvolutionModel::height_in, ConvolutionModel::Offset::height_offset, ConvolutionModel::height_out, ConvolutionModel::height_subsample_out, kaldi::IsSortedAndUniq(), KALDI_ASSERT, KALDI_WARN, ConvolutionModel::num_filters_in, ConvolutionModel::num_filters_out, ConvolutionModel::offsets, ConvolutionModel::required_time_offsets, ConvolutionModel::Offset::time_offset, and ConvolutionModel::Write().

Referenced by kaldi::nnet3::time_height_convolution::AppendInputFrames(), TimeHeightConvolutionComponent::Check(), ConvolutionModel::ConvolutionModel(), kaldi::nnet3::time_height_convolution::GetRandomConvolutionIndexes(), kaldi::nnet3::time_height_convolution::GetRandomConvolutionModel(), TimeHeightConvolutionComponent::InitFromConfig(), kaldi::nnet3::time_height_convolution::PadModelHeight(), ConvolutionModel::Read(), and ConvolutionComputation::Read().

 bool ConvolutionModel::Check(bool check_heights_used,
                              bool allow_height_padding) const {
   if (num_filters_in <= 0 || num_filters_out <= 0 ||
       height_in <= 0 || height_out <= 0 ||
       height_subsample_out <=  0  || offsets.empty() ||
       required_time_offsets.empty()) {
     KALDI_WARN << "Convolution model fails basic check.";
     return false;
   }
   ConvolutionModel temp(*this);
   temp.ComputeDerived();
   if (!(temp == *this)) {
     KALDI_WARN << "Derived variables are incorrect.";
     return false;
   }
   // check that required_time_offsets is included in all_time_offsets.
   for (std::set<int32>::iterator iter = required_time_offsets.begin();
        iter != required_time_offsets.end(); ++iter) {
     if (all_time_offsets.count(*iter) == 0) {
       KALDI_WARN << "Required time offsets not a subset of all_time_offsets.";
       return false;
     }
   }
   KALDI_ASSERT(IsSortedAndUniq(offsets));
   std::vector<bool> h_in_used(height_in, false);
   std::vector<bool> offsets_used(offsets.size(), false);
 
   // check that in cases where we only have the minimum
   // required input (from required_time_offsets), each
   // height in the output is potentially nonzero.
   for (int32 h_out = 0; h_out < height_out * height_subsample_out;
        h_out += height_subsample_out) {
     bool some_input_available = false;
     for (size_t i = 0; i < offsets.size(); i++) {
       const Offset &offset = offsets[i];
       int32 h_in = h_out + offset.height_offset;
       if (h_in >= 0 && h_in < height_in) {
         offsets_used[i] = true;
         h_in_used[h_in] = true;
         if (required_time_offsets.count(offset.time_offset) != 0)
           some_input_available = true;
       } else {
         if (!allow_height_padding) {
           KALDI_WARN << "height padding not allowed but is required.";
           return false;
         }
       }
     }
     if (!some_input_available) {
       // None of the input pixels for this output pixel were available (at
       // least in the case where we only have the 'required' inputs on the
       // time dimension).
       std::ostringstream os;
       Write(os, false);
       // h_out advances in steps of height_subsample_out, so dividing by that
       // (not by height_out) gives the index of the output height.
       KALDI_WARN << "for the " << (h_out / height_subsample_out)
                  << "'th output height, no input is available, if only "
                  "required time-indexes are available. Model is: "
                  << os.str();
       // We could later change this part of the validation code to accept
       // such models, if there is a legitimate use-case.
       return false;
     }
   }
   if (check_heights_used) {
     for (int32 h = 0; h < height_in; h++) {
       if (!h_in_used[h]) {
         KALDI_WARN << "The input at the " << h << "'th height is never used.";
         return false;
       }
     }
   }
   for (size_t i = 0; i < offsets_used.size(); i++) {
     if (!offsets_used[i]) {
       KALDI_WARN << "(time,height) offset (" << offsets[i].time_offset
                  << "," << offsets[i].height_offset
                  << ") of this computation is never used.";
       return false;
     }
   }
   return true;
 }
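The central loop of Check() can be isolated as a standalone sketch: for each output height it asks whether at least one offset lands inside the input image and uses a time offset from required_time_offsets. `FirstUncoveredOutputHeight` is a hypothetical helper that returns the first failing output index (counting output heights 0, 1, 2, ... rather than the scaled h_out values the loop iterates over), or -1 if every output height is covered.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

using OffsetPair = std::pair<int, int>;  // (time-offset, height-offset)

// Mirrors the "some_input_available" test in Check(): an output height is
// covered if some offset maps it to a valid input height and uses a required
// time offset.
int FirstUncoveredOutputHeight(const std::vector<OffsetPair> &offsets,
                               const std::set<int> &required_time_offsets,
                               int height_in, int height_out, int subsample) {
  assert(height_in > 0 && height_out > 0 && subsample > 0);
  for (int j = 0; j < height_out; j++) {
    int h_out = j * subsample;  // Check() steps h_out by the subsampling factor
    bool some_input_available = false;
    for (const OffsetPair &o : offsets) {
      int h_in = h_out + o.second;
      if (h_in >= 0 && h_in < height_in &&
          required_time_offsets.count(o.first) != 0)
        some_input_available = true;
    }
    if (!some_input_available) return j;
  }
  return -1;
}
```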

◆ ComputeDerived()

void ComputeDerived ( )

Definition at line 109 of file convolution.cc.

References ConvolutionModel::all_time_offsets, kaldi::Gcd(), ConvolutionModel::offsets, and ConvolutionModel::time_offsets_modulus.

Referenced by kaldi::nnet3::time_height_convolution::AppendInputFrames(), ConvolutionModel::Check(), kaldi::nnet3::time_height_convolution::GetRandomConvolutionModel(), TimeHeightConvolutionComponent::InitFromConfig(), ConvolutionModel::Read(), and ConvolutionComputation::Read().

 void ConvolutionModel::ComputeDerived() {
   { // compute all_time_offsets
     all_time_offsets.clear();
     for (std::vector<Offset>::const_iterator iter = offsets.begin();
          iter != offsets.end(); ++iter)
       all_time_offsets.insert(iter->time_offset);
   }
   { // compute time_offsets_modulus
     time_offsets_modulus = 0;
     if (all_time_offsets.empty())
       return;  // no offsets: avoid dereferencing begin() of an empty set
     std::set<int32>::iterator iter = all_time_offsets.begin();
     int32 cur_offset = *iter;
     for (++iter; iter != all_time_offsets.end(); ++iter) {
       int32 this_offset = *iter;
       time_offsets_modulus = Gcd(time_offsets_modulus,
                                  this_offset - cur_offset);
       cur_offset = this_offset;
     }
   }
 }
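The modulus step of ComputeDerived() reduces to a gcd over consecutive differences of the sorted distinct time offsets. A sketch using C++17's `std::gcd` in place of `kaldi::Gcd()`; `TimeOffsetsModulus` is a hypothetical name. With a single distinct time offset the loop body never runs, so the result stays 0, matching the code above.

```cpp
#include <cassert>
#include <numeric>  // std::gcd (C++17)
#include <set>

// gcd of the differences between consecutive distinct time offsets.
int TimeOffsetsModulus(const std::set<int> &all_time_offsets) {
  assert(!all_time_offsets.empty());
  int modulus = 0;
  std::set<int>::const_iterator iter = all_time_offsets.begin();
  int prev = *iter;
  for (++iter; iter != all_time_offsets.end(); ++iter) {
    modulus = std::gcd(modulus, *iter - prev);  // gcd(0, d) == d
    prev = *iter;
  }
  return modulus;
}
```

For the TDNN-style time offsets (-3,0,3) the modulus is 3.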

◆ Info()

std::string Info ( ) const

Definition at line 87 of file convolution.cc.

References ConvolutionModel::height_in, ConvolutionModel::height_out, ConvolutionModel::height_subsample_out, ConvolutionModel::InputDim(), ConvolutionModel::num_filters_in, ConvolutionModel::num_filters_out, ConvolutionModel::offsets, ConvolutionModel::OutputDim(), and ConvolutionModel::required_time_offsets.

Referenced by ConvolutionModel::ConvolutionModel(), kaldi::nnet3::time_height_convolution::GetRandomConvolutionModel(), TimeHeightConvolutionComponent::Info(), and kaldi::nnet3::time_height_convolution::TestRunningComputation().

 std::string ConvolutionModel::Info() const {
   std::ostringstream os;
   os << "num-filters-in=" << num_filters_in
      << ", num-filters-out=" << num_filters_out
      << ", height-in=" << height_in
      << ", height-out=" << height_out
      << ", height-subsample-out=" << height_subsample_out
      << ", {time,height}-offsets=[";
   for (size_t i = 0; i < offsets.size(); i++) {
     if (i > 0) os << ' ';
     os << offsets[i].time_offset << ',' << offsets[i].height_offset;
   }
   os << "], required-time-offsets=[";
   for (std::set<int32>::const_iterator iter = required_time_offsets.begin();
        iter != required_time_offsets.end(); ++iter) {
     if (iter != required_time_offsets.begin()) os << ',';
     os << *iter;
   }
   os << "], input-dim=" << InputDim() << ", output-dim=" << OutputDim();
   return os.str();
 }
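To show the shape of the string Info() returns, here is a sketch that reproduces its formatting from plain fields; `InfoString` is a hypothetical function, not part of Kaldi. Note the two separators: offset pairs are separated by spaces (with a comma inside each pair), while required time offsets are separated by commas.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using OffsetPair = std::pair<int, int>;  // (time-offset, height-offset)

// Same field order and punctuation as ConvolutionModel::Info().
std::string InfoString(int num_filters_in, int num_filters_out, int height_in,
                       int height_out, int height_subsample_out,
                       const std::vector<OffsetPair> &offsets,
                       const std::set<int> &required_time_offsets) {
  std::ostringstream os;
  os << "num-filters-in=" << num_filters_in
     << ", num-filters-out=" << num_filters_out
     << ", height-in=" << height_in << ", height-out=" << height_out
     << ", height-subsample-out=" << height_subsample_out
     << ", {time,height}-offsets=[";
  for (std::size_t i = 0; i < offsets.size(); i++) {
    if (i > 0) os << ' ';  // pairs are space-separated
    os << offsets[i].first << ',' << offsets[i].second;
  }
  os << "], required-time-offsets=[";
  for (std::set<int>::const_iterator it = required_time_offsets.begin();
       it != required_time_offsets.end(); ++it) {
    if (it != required_time_offsets.begin()) os << ',';  // comma-separated
    os << *it;
  }
  os << "], input-dim=" << num_filters_in * height_in
     << ", output-dim=" << num_filters_out * height_out;
  return os.str();
}
```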

◆ InputDim()

int32 InputDim ( ) const

inline

Definition at line 203 of file convolution.h.

References ConvolutionModel::height_in.

Referenced by ConvolutionModel::Info(), TimeHeightConvolutionComponent::InputDim(), kaldi::nnet3::time_height_convolution::TestDataBackprop(), kaldi::nnet3::time_height_convolution::TestParamsBackprop(), and kaldi::nnet3::time_height_convolution::TestRunningComputation().

int32 InputDim() const { return num_filters_in * height_in; }

◆ operator==()

bool operator== ( const ConvolutionModel & other ) const

Definition at line 212 of file convolution.cc.

References ConvolutionModel::all_time_offsets, ConvolutionModel::height_in, ConvolutionModel::height_out, ConvolutionModel::height_subsample_out, ConvolutionModel::num_filters_in, ConvolutionModel::num_filters_out, ConvolutionModel::offsets, ConvolutionModel::required_time_offsets, and ConvolutionModel::time_offsets_modulus.

 bool ConvolutionModel::operator==(const ConvolutionModel &other) const {
   return num_filters_in == other.num_filters_in &&
       num_filters_out == other.num_filters_out &&
       height_in == other.height_in &&
       height_out == other.height_out &&
       height_subsample_out == other.height_subsample_out &&
       offsets == other.offsets &&
       required_time_offsets == other.required_time_offsets &&
       all_time_offsets == other.all_time_offsets &&
       time_offsets_modulus == other.time_offsets_modulus;
 }

◆ OutputDim()

int32 OutputDim ( ) const

inline

Definition at line 204 of file convolution.h.

References ConvolutionModel::height_out.

Referenced by ConvolutionModel::Info(), TimeHeightConvolutionComponent::OutputDim(), kaldi::nnet3::time_height_convolution::TestDataBackprop(), kaldi::nnet3::time_height_convolution::TestParamsBackprop(), and kaldi::nnet3::time_height_convolution::TestRunningComputation().

int32 OutputDim() const { return num_filters_out * height_out; }

◆ ParamCols()

int32 ParamCols ( ) const

inline

Definition at line 208 of file convolution.h.

Referenced by TimeHeightConvolutionComponent::Check(), TimeHeightConvolutionComponent::InitFromConfig(), kaldi::nnet3::time_height_convolution::TestDataBackprop(), kaldi::nnet3::time_height_convolution::TestParamsBackprop(), and kaldi::nnet3::time_height_convolution::TestRunningComputation().

int32 ParamCols() const { return num_filters_in * static_cast<int32>(offsets.size()); }

◆ ParamRows()

int32 ParamRows ( ) const

inline

Definition at line 206 of file convolution.h.

References ConvolutionModel::num_filters_out.

Referenced by TimeHeightConvolutionComponent::Check(), TimeHeightConvolutionComponent::InitFromConfig(), kaldi::nnet3::time_height_convolution::TestDataBackprop(), kaldi::nnet3::time_height_convolution::TestParamsBackprop(), and kaldi::nnet3::time_height_convolution::TestRunningComputation().

int32 ParamRows() const { return num_filters_out; }
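A worked example of the four size functions, for a hypothetical model with 32 input filters, 64 output filters, height 40 at both input and output, and a 3x3 patch (9 offsets). `ModelSizes` is a hypothetical stand-in that copies the one-line bodies of InputDim(), OutputDim(), ParamRows() and ParamCols(), with `int32_t` in place of Kaldi's `int32`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical plain struct holding only the size-related fields.
struct ModelSizes {
  int32_t num_filters_in, num_filters_out, height_in, height_out, num_offsets;
  int32_t InputDim() const { return num_filters_in * height_in; }
  int32_t OutputDim() const { return num_filters_out * height_out; }
  int32_t ParamRows() const { return num_filters_out; }
  int32_t ParamCols() const { return num_filters_in * num_offsets; }
};
```

For these sizes the parameter matrix has 64 x 288 = 18432 entries.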

◆ Read()

void Read (std::istream &is, bool binary)

Definition at line 252 of file convolution.cc.

References ConvolutionModel::Check(), ConvolutionModel::ComputeDerived(), kaldi::ExpectOneOrTwoTokens(), kaldi::nnet3::ExpectToken(), ConvolutionModel::height_in, ConvolutionModel::height_out, ConvolutionModel::height_subsample_out, KALDI_ASSERT, ConvolutionModel::num_filters_in, ConvolutionModel::num_filters_out, ConvolutionModel::offsets, kaldi::ReadBasicType(), kaldi::ReadIntegerPairVector(), kaldi::ReadIntegerVector(), and ConvolutionModel::required_time_offsets.

Referenced by ConvolutionModel::ConvolutionModel(), TimeHeightConvolutionComponent::Read(), and kaldi::nnet3::time_height_convolution::UnitTestTimeHeightConvolutionIo().

 void ConvolutionModel::Read(std::istream &is, bool binary) {
   ExpectOneOrTwoTokens(is, binary, "<ConvolutionModel>", "<NumFiltersIn>");
   ReadBasicType(is, binary, &num_filters_in);
   ExpectToken(is, binary, "<NumFiltersOut>");
   ReadBasicType(is, binary, &num_filters_out);
   ExpectToken(is, binary, "<HeightIn>");
   ReadBasicType(is, binary, &height_in);
   ExpectToken(is, binary, "<HeightOut>");
   ReadBasicType(is, binary, &height_out);
   ExpectToken(is, binary, "<HeightSubsampleOut>");
   ReadBasicType(is, binary, &height_subsample_out);
   ExpectToken(is, binary, "<Offsets>");
   std::vector<std::pair<int32, int32> > pairs;
   ReadIntegerPairVector(is, binary, &pairs);
   offsets.resize(pairs.size());
   for (size_t i = 0; i < offsets.size(); i++) {
     offsets[i].time_offset = pairs[i].first;
     offsets[i].height_offset = pairs[i].second;
   }
   std::vector<int32> required_time_offsets_list;
   ExpectToken(is, binary, "<RequiredTimeOffsets>");
   ReadIntegerVector(is, binary, &required_time_offsets_list);
   required_time_offsets.clear();
   required_time_offsets.insert(required_time_offsets_list.begin(),
                                required_time_offsets_list.end());
   ExpectToken(is, binary, "</ConvolutionModel>");
   ComputeDerived();
   KALDI_ASSERT(Check(false, true));
 }

◆ Write()

void Write (std::ostream &os, bool binary) const

Definition at line 225 of file convolution.cc.

References ConvolutionModel::height_in, ConvolutionModel::height_out, ConvolutionModel::height_subsample_out, ConvolutionModel::num_filters_in, ConvolutionModel::num_filters_out, ConvolutionModel::offsets, ConvolutionModel::required_time_offsets, kaldi::WriteBasicType(), kaldi::WriteIntegerPairVector(), kaldi::WriteIntegerVector(), and kaldi::WriteToken().

Referenced by ConvolutionModel::Check(), ConvolutionModel::ConvolutionModel(), kaldi::nnet3::time_height_convolution::UnitTestTimeHeightConvolutionIo(), and TimeHeightConvolutionComponent::Write().

 void ConvolutionModel::Write(std::ostream &os, bool binary) const {
   WriteToken(os, binary, "<ConvolutionModel>");
   WriteToken(os, binary, "<NumFiltersIn>");
   WriteBasicType(os, binary, num_filters_in);
   WriteToken(os, binary, "<NumFiltersOut>");
   WriteBasicType(os, binary, num_filters_out);
   WriteToken(os, binary, "<HeightIn>");
   WriteBasicType(os, binary, height_in);
   WriteToken(os, binary, "<HeightOut>");
   WriteBasicType(os, binary, height_out);
   WriteToken(os, binary, "<HeightSubsampleOut>");
   WriteBasicType(os, binary, height_subsample_out);
   WriteToken(os, binary, "<Offsets>");
   std::vector<std::pair<int32, int32> > pairs(offsets.size());
   for (size_t i = 0; i < offsets.size(); i++) {
     pairs[i].first = offsets[i].time_offset;
     pairs[i].second = offsets[i].height_offset;
   }
   WriteIntegerPairVector(os, binary, pairs);
   std::vector<int32> required_time_offsets_list(required_time_offsets.begin(),
                                                 required_time_offsets.end());
   WriteToken(os, binary, "<RequiredTimeOffsets>");
   WriteIntegerVector(os, binary, required_time_offsets_list);
   WriteToken(os, binary, "</ConvolutionModel>");
 }
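Write() and Read() must agree on token order. The sketch below writes the same fields in the same order and parses them back, to make that order explicit. It is a simplification under stated assumptions: `ModelFields`, `WriteText` and `ReadText` are hypothetical, plain `int` replaces `int32`, and the "[ ... ]" list syntax is invented for the sketch, not Kaldi's exact text format from WriteIntegerPairVector() and WriteIntegerVector().

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical plain struct holding the fields that Write() serializes.
struct ModelFields {
  int num_filters_in = 0, num_filters_out = 0, height_in = 0, height_out = 0,
      height_subsample_out = 0;
  std::vector<std::pair<int, int>> offsets;  // (time-offset, height-offset)
  std::set<int> required_time_offsets;
};

// Emits the fields in the same token order as Write().
std::string WriteText(const ModelFields &m) {
  std::ostringstream os;
  os << "<ConvolutionModel> <NumFiltersIn> " << m.num_filters_in
     << " <NumFiltersOut> " << m.num_filters_out
     << " <HeightIn> " << m.height_in << " <HeightOut> " << m.height_out
     << " <HeightSubsampleOut> " << m.height_subsample_out << " <Offsets> [";
  for (const auto &p : m.offsets) os << ' ' << p.first << ' ' << p.second;
  os << " ] <RequiredTimeOffsets> [";
  for (int t : m.required_time_offsets) os << ' ' << t;
  os << " ] </ConvolutionModel>";
  return os.str();
}

// Parses WriteText() output, checking tokens in the order Read() expects.
// Returns false on a token mismatch or truncated input (std::stoi throws on
// a non-numeric list entry).
bool ReadText(const std::string &text, ModelFields *m) {
  assert(m != nullptr);
  std::istringstream is(text);
  std::string tok;
  auto expect = [&is, &tok](const char *want) {
    return static_cast<bool>(is >> tok) && tok == want;
  };
  if (!expect("<ConvolutionModel>") || !expect("<NumFiltersIn>") ||
      !(is >> m->num_filters_in) || !expect("<NumFiltersOut>") ||
      !(is >> m->num_filters_out) || !expect("<HeightIn>") ||
      !(is >> m->height_in) || !expect("<HeightOut>") ||
      !(is >> m->height_out) || !expect("<HeightSubsampleOut>") ||
      !(is >> m->height_subsample_out) || !expect("<Offsets>") ||
      !expect("["))
    return false;
  m->offsets.clear();
  while (is >> tok && tok != "]") {  // (time, height) pairs until "]"
    int height_offset;
    if (!(is >> height_offset)) return false;
    m->offsets.emplace_back(std::stoi(tok), height_offset);
  }
  if (tok != "]" || !expect("<RequiredTimeOffsets>") || !expect("["))
    return false;
  m->required_time_offsets.clear();
  while (is >> tok && tok != "]")
    m->required_time_offsets.insert(std::stoi(tok));
  return tok == "]" && expect("</ConvolutionModel>");
}
```

Unlike this sketch, Kaldi's Read() accepts the leading <ConvolutionModel> token as optional (that is the purpose of its ExpectOneOrTwoTokens() call) and calls ComputeDerived() and Check() after parsing.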