ExampleMergingConfig Class Reference

#include <nnet-example-utils.h>


Classes

struct  IntSet
 

Public Member Functions

 ExampleMergingConfig (const char *default_minibatch_size="256")
 
void Register (OptionsItf *po)
 
void ComputeDerived ()
 
int32 MinibatchSize (int32 size_of_eg, int32 num_available_egs, bool input_ended) const
 This function tells you what minibatch size should be used for this eg.
 

Public Attributes

bool compress
 
std::string measure_output_frames
 
std::string minibatch_size
 
std::string discard_partial_minibatches
 

Static Private Member Functions

static bool ParseIntSet (const std::string &str, IntSet *int_set)
 

Private Attributes

std::vector< std::pair< int32, IntSet > > rules
 

Detailed Description

Definition at line 321 of file nnet-example-utils.h.

Constructor & Destructor Documentation

◆ ExampleMergingConfig()

ExampleMergingConfig ( const char *  default_minibatch_size = "256")
inline

Definition at line 329 of file nnet-example-utils.h.

ExampleMergingConfig(const char *default_minibatch_size = "256"):
    compress(false),
    measure_output_frames("deprecated"),
    minibatch_size(default_minibatch_size),
    discard_partial_minibatches("deprecated") { }

Member Function Documentation

◆ ComputeDerived()

void ComputeDerived ( )

Definition at line 958 of file nnet-example-utils.cc.

References kaldi::ConvertStringToInteger(), rnnlm::i, kaldi::IsSortedAndUniq(), KALDI_ERR, KALDI_WARN, and kaldi::SplitStringToVector().

Referenced by main().

void ExampleMergingConfig::ComputeDerived() {
  if (measure_output_frames != "deprecated") {
    KALDI_WARN << "The --measure-output-frames option is deprecated "
        "and will be ignored.";
  }
  if (discard_partial_minibatches != "deprecated") {
    KALDI_WARN << "The --discard-partial-minibatches option is deprecated "
        "and will be ignored.";
  }
  std::vector<std::string> minibatch_size_split;
  SplitStringToVector(minibatch_size, "/", false, &minibatch_size_split);
  if (minibatch_size_split.empty()) {
    KALDI_ERR << "Invalid option --minibatch-size=" << minibatch_size;
  }

  rules.resize(minibatch_size_split.size());
  for (size_t i = 0; i < minibatch_size_split.size(); i++) {
    int32 &eg_size = rules[i].first;
    IntSet &int_set = rules[i].second;
    // 'this_rule' will be either something like "256" or like "64:128,256"
    // (but these two only if minibatch_size_split.size() == 1), or something
    // with an example-size specified, like "256=64:128,256".
    std::string &this_rule = minibatch_size_split[i];
    if (this_rule.find('=') != std::string::npos) {
      std::vector<std::string> rule_split;  // split on '='
      SplitStringToVector(this_rule, "=", false, &rule_split);
      if (rule_split.size() != 2) {
        KALDI_ERR << "Could not parse option --minibatch-size="
                  << minibatch_size;
      }
      if (!ConvertStringToInteger(rule_split[0], &eg_size) ||
          !ParseIntSet(rule_split[1], &int_set))
        KALDI_ERR << "Could not parse option --minibatch-size="
                  << minibatch_size;

    } else {
      if (minibatch_size_split.size() != 1) {
        KALDI_ERR << "Could not parse option --minibatch-size="
                  << minibatch_size << " (all rules must have "
                  << "eg-size specified if >1 rule)";
      }
      if (!ParseIntSet(this_rule, &int_set))
        KALDI_ERR << "Could not parse option --minibatch-size="
                  << minibatch_size;
    }
  }
  {
    // check that no size is repeated.
    std::vector<int32> all_sizes(minibatch_size_split.size());
    for (size_t i = 0; i < minibatch_size_split.size(); i++)
      all_sizes[i] = rules[i].first;
    std::sort(all_sizes.begin(), all_sizes.end());
    if (!IsSortedAndUniq(all_sizes)) {
      KALDI_ERR << "Invalid --minibatch-size=" << minibatch_size
                << " (repeated example-sizes)";
    }
  }
}
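
The rule-splitting logic above can be tried out without Kaldi. The following is a minimal standalone sketch, not the library's code: `SplitOn`, `ParseInt` and `SplitRules` are hypothetical helpers introduced for illustration, `std::stoi` stands in for `ConvertStringToInteger()`, and each size specification is kept as an unparsed string. The sketch returns false where the original raises `KALDI_ERR`, and it stores an eg-size of 0 for a single rule that has no '=' (an assumption of this sketch).

```cpp
#include <algorithm>
#include <cassert>
#include <exception>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split 's' on one delimiter character, keeping
// empty fields.
std::vector<std::string> SplitOn(const std::string &s, char delim) {
  std::vector<std::string> out;
  std::string::size_type start = 0, pos;
  while ((pos = s.find(delim, start)) != std::string::npos) {
    out.push_back(s.substr(start, pos - start));
    start = pos + 1;
  }
  out.push_back(s.substr(start));
  return out;
}

// Hypothetical helper: parse the whole string as an int, false on failure.
bool ParseInt(const std::string &s, int *out) {
  try {
    std::size_t pos = 0;
    int value = std::stoi(s, &pos);
    if (pos != s.size()) return false;  // trailing characters
    *out = value;
    return true;
  } catch (const std::exception &) {
    return false;  // not a number, or out of range
  }
}

// Splits e.g. "128=64:128,256/256=32:64,128" into (eg_size, size_spec)
// pairs. Returns false on malformed input or repeated eg sizes.
bool SplitRules(const std::string &opt,
                std::vector<std::pair<int, std::string> > *rules) {
  assert(rules != nullptr);
  rules->clear();
  std::vector<std::string> parts = SplitOn(opt, '/');
  for (const std::string &rule : parts) {
    if (rule.find('=') != std::string::npos) {
      std::vector<std::string> kv = SplitOn(rule, '=');
      int eg_size = 0;
      if (kv.size() != 2 || kv[1].empty() || !ParseInt(kv[0], &eg_size))
        return false;
      rules->push_back(std::make_pair(eg_size, kv[1]));
    } else {
      // A rule without an eg size is only allowed when it is the only rule.
      if (parts.size() != 1 || rule.empty()) return false;
      rules->push_back(std::make_pair(0, rule));
    }
  }
  // Reject repeated eg sizes, like the final check in ComputeDerived().
  std::vector<int> sizes;
  for (const auto &r : *rules) sizes.push_back(r.first);
  std::sort(sizes.begin(), sizes.end());
  return std::adjacent_find(sizes.begin(), sizes.end()) == sizes.end();
}
```

With this sketch, `SplitRules("128=64:128,256/256=32:64,128", &rules)` yields two pairs, (128, "64:128,256") and (256, "32:64,128"), while "64/128=32" is rejected because one of its two rules lacks an eg size.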

◆ MinibatchSize()

int32 MinibatchSize ( int32  size_of_eg,
int32  num_available_egs,
bool  input_ended 
) const

This function tells you what minibatch size should be used for this eg.

Parameters
[in] size_of_eg: The "size" of the eg, as obtained by GetNnetExampleSize() or a similar function (up to the caller).
[in] num_available_egs: The number of egs of this size that are currently available; should be >0. The value returned will be <= this value, possibly zero.
[in] input_ended: True if the input has ended, false otherwise. This is important because before the input has ended, we will only batch egs into the largest possible minibatch size among the range allowed for that size of eg.
Returns
Returns the minibatch size to use in this situation, as specified by the configuration.

Definition at line 1017 of file nnet-example-utils.cc.

References rnnlm::i, KALDI_ASSERT, and KALDI_ERR.

Referenced by DiscriminativeExampleMerger::AcceptExample(), ChainExampleMerger::AcceptExample(), ExampleMerger::AcceptExample(), DiscriminativeExampleMerger::Finish(), ChainExampleMerger::Finish(), and ExampleMerger::Finish().

int32 ExampleMergingConfig::MinibatchSize(int32 size_of_eg,
                                          int32 num_available_egs,
                                          bool input_ended) const {
  KALDI_ASSERT(num_available_egs > 0 && size_of_eg > 0);
  int32 num_rules = rules.size();
  if (num_rules == 0)
    KALDI_ERR << "You need to call ComputeDerived() before calling "
        "MinibatchSize().";
  int32 min_distance = std::numeric_limits<int32>::max(),
      closest_rule_index = 0;
  for (int32 i = 0; i < num_rules; i++) {
    int32 distance = std::abs(size_of_eg - rules[i].first);
    if (distance < min_distance) {
      min_distance = distance;
      closest_rule_index = i;
    }
  }
  if (!input_ended) {
    // until the input ends, we can only use the largest available
    // minibatch-size (otherwise, we could expect more later).
    int32 largest_size = rules[closest_rule_index].second.largest_size;
    if (largest_size <= num_available_egs)
      return largest_size;
    else
      return 0;
  } else {
    int32 s = rules[closest_rule_index].second.LargestValueInRange(
        num_available_egs);
    KALDI_ASSERT(s <= num_available_egs);
    return s;
  }
}
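
To make the selection logic concrete, here is a minimal standalone sketch that mirrors the listing above without depending on Kaldi. `IntSetSketch`, `MinibatchSizeSketch` and `ExampleRules` are hypothetical names introduced for illustration. The body of `LargestValueInRange()` is not shown on this page, so the version below is an assumption: it returns the largest allowed value that does not exceed `max_value`, or 0 if there is none.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <limits>
#include <utility>
#include <vector>

// Stand-in for ExampleMergingConfig::IntSet.
struct IntSetSketch {
  int largest_size;
  std::vector<std::pair<int, int> > ranges;  // inclusive [first, second]

  // ASSUMED behavior: largest allowed value <= max_value, or 0 if none.
  int LargestValueInRange(int max_value) const {
    int ans = 0;
    for (const auto &r : ranges) {
      if (max_value >= r.first)
        ans = std::max(ans, std::min(max_value, r.second));
    }
    return ans;
  }
};

typedef std::vector<std::pair<int, IntSetSketch> > Rules;

int MinibatchSizeSketch(const Rules &rules, int size_of_eg,
                        int num_available_egs, bool input_ended) {
  assert(!rules.empty() && num_available_egs > 0 && size_of_eg > 0);
  // Pick the rule whose eg-size is closest to the actual size
  // (the first such rule wins on ties).
  int min_distance = std::numeric_limits<int>::max();
  std::size_t closest = 0;
  for (std::size_t i = 0; i < rules.size(); i++) {
    int distance = std::abs(size_of_eg - rules[i].first);
    if (distance < min_distance) {
      min_distance = distance;
      closest = i;
    }
  }
  const IntSetSketch &set = rules[closest].second;
  if (!input_ended) {
    // Before the input ends, only the largest size is acceptable.
    return set.largest_size <= num_available_egs ? set.largest_size : 0;
  }
  return set.LargestValueInRange(num_available_egs);
}

// Rules corresponding to --minibatch-size=128=64:128,256/256=32:64,128.
Rules ExampleRules() {
  IntSetSketch for_size_128{256, {{64, 128}, {256, 256}}};
  IntSetSketch for_size_256{128, {{32, 64}, {128, 128}}};
  return Rules{{128, for_size_128}, {256, for_size_256}};
}
```

Under these assumptions, an eg of size 140 maps to the rule for eg-size 128. With 100 egs available the sketch returns 0 while input is still arriving (the largest size, 256, cannot be filled yet) and 100 once the input has ended, because 100 lies in the range 64:128. An eg of size 250 maps to the rule for eg-size 256, and 100 available egs at end of input give 64.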

◆ ParseIntSet()

bool ParseIntSet ( const std::string &  str,
ExampleMergingConfig::IntSet *  int_set 
)
static private

Definition at line 936 of file nnet-example-utils.cc.

References rnnlm::i, ExampleMergingConfig::IntSet::largest_size, ExampleMergingConfig::IntSet::ranges, kaldi::SplitStringToIntegers(), and kaldi::SplitStringToVector().

bool ExampleMergingConfig::ParseIntSet(const std::string &str,
                                       ExampleMergingConfig::IntSet *int_set) {
  std::vector<std::string> split_str;
  SplitStringToVector(str, ",", false, &split_str);
  if (split_str.empty())
    return false;
  int_set->largest_size = 0;
  int_set->ranges.resize(split_str.size());
  for (size_t i = 0; i < split_str.size(); i++) {
    std::vector<int32> split_range;
    SplitStringToIntegers(split_str[i], ":", false, &split_range);
    if (split_range.size() < 1 || split_range.size() > 2 ||
        split_range[0] > split_range.back() || split_range[0] <= 0)
      return false;
    int_set->ranges[i].first = split_range[0];
    int_set->ranges[i].second = split_range.back();
    int_set->largest_size = std::max<int32>(int_set->largest_size,
                                            split_range.back());
  }
  return true;
}
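
The accepted syntax is a comma-separated list in which each item is either a single positive integer or an inclusive range written as `lo:hi`. A minimal standalone sketch of the same parsing, assuming nothing from Kaldi, might look as follows. `IntSetSketch`, `ToInt` and `ParseIntSetSketch` are hypothetical names, and `std::stoi` stands in for `SplitStringToIntegers()`, so behavior on unusual input (such as surrounding whitespace) may differ from the original.

```cpp
#include <algorithm>
#include <cassert>
#include <exception>
#include <string>
#include <utility>
#include <vector>

// Stand-in for ExampleMergingConfig::IntSet (only the fields used here).
struct IntSetSketch {
  int largest_size = 0;
  std::vector<std::pair<int, int> > ranges;  // inclusive [first, second]
};

// Hypothetical helper: parse the whole string as an int, false on failure.
static bool ToInt(const std::string &s, int *out) {
  try {
    std::size_t pos = 0;
    int value = std::stoi(s, &pos);
    if (pos != s.size()) return false;  // trailing characters
    *out = value;
    return true;
  } catch (const std::exception &) {
    return false;  // not a number, or out of range
  }
}

// Parses e.g. "16:32,64,128" into ranges [16,32], [64,64], [128,128].
bool ParseIntSetSketch(const std::string &str, IntSetSketch *set) {
  assert(set != nullptr);
  set->largest_size = 0;
  set->ranges.clear();
  std::string::size_type start = 0;
  while (true) {
    std::string::size_type comma = str.find(',', start);
    std::string item = (comma == std::string::npos)
                           ? str.substr(start)
                           : str.substr(start, comma - start);
    int lo = 0, hi = 0;
    std::string::size_type colon = item.find(':');
    if (colon == std::string::npos) {
      if (!ToInt(item, &lo)) return false;
      hi = lo;  // a single value is a range of length one
    } else if (!ToInt(item.substr(0, colon), &lo) ||
               !ToInt(item.substr(colon + 1), &hi)) {
      return false;  // also rejects items with more than one ':'
    }
    if (lo <= 0 || lo > hi) return false;
    set->ranges.push_back(std::make_pair(lo, hi));
    set->largest_size = std::max(set->largest_size, hi);
    if (comma == std::string::npos) break;
    start = comma + 1;
  }
  return true;
}
```

For the input "16:32,64,128" the sketch produces three ranges and a `largest_size` of 128. Inputs such as "32:16" (reversed range) or "0:16" (non-positive lower bound) are rejected, matching the checks in the listing above.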

◆ Register()

void Register ( OptionsItf *  po)
inline

Definition at line 335 of file nnet-example-utils.h.

References ExampleGenerationConfig::ComputeDerived(), and OptionsItf::Register().

Referenced by main().

void Register(OptionsItf *po) {
  po->Register("compress", &compress, "If true, compress the output examples "
               "(not recommended unless you are writing to disk)");
  po->Register("measure-output-frames", &measure_output_frames, "This "
               "value will be ignored (included for back-compatibility)");
  po->Register("discard-partial-minibatches", &discard_partial_minibatches,
               "This value will be ignored (included for back-compatibility)");
  po->Register("minibatch-size", &minibatch_size,
               "String controlling the minibatch size.  May be just an integer, "
               "meaning a fixed minibatch size (e.g. --minibatch-size=128). "
               "May be a list of ranges and values, e.g. --minibatch-size=32,64 "
               "or --minibatch-size=16:32,64,128.  All minibatches will be of "
               "the largest size until the end of the input is reached; "
               "then, increasingly smaller sizes will be allowed.  Only egs "
               "with the same structure (e.g num-frames) are merged.  You may "
               "specify different minibatch sizes for different sizes of eg "
               "(defined as the maximum number of Indexes on any input), in "
               "the format "
               "--minibatch-size='eg_size1=mb_sizes1/eg_size2=mb_sizes2', e.g. "
               "--minibatch-size=128=64:128,256/256=32:64,128.  Egs are given "
               "minibatch-sizes based on the specified eg-size closest to "
               "their actual size.");
}

Member Data Documentation

◆ compress

bool compress

◆ discard_partial_minibatches

std::string discard_partial_minibatches

Definition at line 327 of file nnet-example-utils.h.

◆ measure_output_frames

std::string measure_output_frames

Definition at line 325 of file nnet-example-utils.h.

◆ minibatch_size

std::string minibatch_size

Definition at line 326 of file nnet-example-utils.h.

◆ rules

std::vector<std::pair<int32, IntSet> > rules
private

Definition at line 410 of file nnet-example-utils.h.


The documentation for this class was generated from the following files: