
t_sigmoid overflow when nan / remove new, delete from Model

Summary:
When running
```
fasttext supervised -input dbpedia.train -output /tmp/trash -lr 10.1 -loss ns -thread 1
```
the program quickly segfaults. This is because the argument to sigmoid becomes NaN and the lookup-table index overflows.

This diff simply documents this fact. The check might be too costly for this to be worth it.

It might still be good to add this to our FAQ.

Reviewed By: wickedfoo

Differential Revision: D5739447

fbshipit-source-id: bdeca744e011a9f390ff157e4a205aeb37e38ebe
Christian Puhrsch 8 years ago
parent
commit
4697db0425
6 changed files with 47 additions and 37 deletions
  1. docs/faqs.md +5 -0
  2. docs/supervised-tutorial.md +2 -2
  3. docs/unsupervised-tutorials.md +3 -3
  4. src/matrix.cc +8 -0
  5. src/model.cc +26 -28
  6. src/model.h +3 -4

+ 5 - 0
docs/faqs.md

@@ -43,8 +43,13 @@ There are few unofficial wrappers for python or lua available on github.
 FastText works on discrete tokens and thus cannot be directly used on continuous tokens. However, one can discretize continuous tokens to use fastText on them, for example by rounding values to a specific digit ("12.3" becomes "12").
 
 ## There are misspellings in the dictionary. Should we improve text normalization?
+
 If the words are infrequent, there is no need to worry.
 
+## I'm encountering a NaN, why could this be?
+
+You'll likely see this behavior because your learning rate is too high. Try reducing it until you don't see this error anymore.
+
 ## My compiler / architecture can't build fastText. What should I do?
 Try a newer version of your compiler. We try to maintain compatibility with older versions of gcc and many platforms, however sometimes maintaining backwards compatibility becomes very hard. In general, compilers and tool chains that ship with LTS versions of major linux distributions should be fair game. In any case, create an issue with your compiler version and architecture and we'll try to implement compatibility.
 

+ 2 - 2
docs/supervised-tutorial.md

@@ -126,7 +126,7 @@ R@5  0.146
 Number of examples: 3000
 ```
 
-### Advanced reader: precision and recall
+## Advanced readers: precision and recall
 
 The precision is the number of correct labels among the labels predicted by fastText. The recall is the number of labels that were successfully predicted, among all the real labels. Let's take an example to make this more clear:
 
@@ -257,7 +257,7 @@ With a few steps, we were able to go from a precision at one of 12.4% to 59.9%.
 * changing the learning rate (using the option `-lr`, standard range `[0.1 - 1.0]`) ;
 * using word n-grams (using the option `-wordNgrams`, standard range `[1 - 5]`).
 
-### Advanced readers: What is a Bigram?
+## Advanced readers: What is a Bigram?
 
 A 'unigram' refers to a single undivided unit, or token, usually used as an input to a model. For example, a unigram can be a word or a letter depending on the model. In fastText, we work at the word level and thus unigrams are words.
 

+ 3 - 3
docs/unsupervised-tutorials.md

@@ -68,7 +68,7 @@ one 0.32731 0.044409 -0.46484 0.14716 0.7431 0.24684 -0.11301 0.51721 0.73262 ..
 
 The first line is a header containing the number of words and the dimensionality of the vectors. The subsequent lines are the word vectors for all words in the vocabulary, sorted by decreasing frequency.
 
-### Advanced readers: skipgram versus cbow
+## Advanced readers: skipgram versus cbow
 
 fastText provides two models for computing word representations: skipgram and cbow ('**c**ontinuous-**b**ag-**o**f-**w**ords').
 
@@ -86,7 +86,7 @@ To train a cbow model with fastText, you run the following command:
 
 In practice, we observe that skipgram models work better with subword information than cbow.
 
-### Advanced readers: playing with the parameters
+## Advanced readers: playing with the parameters
 
 So far, we have run fastText with the default parameters, but depending on the data, these parameters may not be optimal. Let us give an introduction to some of the key parameters for word vectors.
 
@@ -196,7 +196,7 @@ ecotourism 0.697081
 
 Thanks to the information contained within the word, the vector of our misspelled word matches reasonable words! It is not perfect, but the main information has been captured.
 
-### Advanced reader: measure of similarity
+## Advanced reader: measure of similarity
 
 In order to find nearest neighbors, we need to compute a similarity score between words. Our words are represented by continuous word vectors and we can thus apply simple similarities to them. In particular we use the cosine of the angles between two vectors. This similarity is computed for all words in the vocabulary, and the 10 most similar words are shown.  Of course, if the word appears in the vocabulary, it will appear on top, with a similarity of 1.
 

+ 8 - 0
src/matrix.cc

@@ -12,6 +12,8 @@
 #include <assert.h>
 
 #include <random>
+#include <exception>
+#include <stdexcept>
 
 #include "utils.h"
 #include "vector.h"
@@ -73,6 +75,9 @@ real Matrix::dotRow(const Vector& vec, int64_t i) const {
   for (int64_t j = 0; j < n_; j++) {
     d += at(i, j) * vec.data_[j];
   }
+  if (std::isnan(d)) {
+    throw std::runtime_error("Encountered NaN.");
+  }
   return d;
 }
 
@@ -117,6 +122,9 @@ real Matrix::l2NormRow(int64_t i) const {
     const real v = at(i,j);
     norm += v * v;
   }
+  if (std::isnan(norm)) {
+    throw std::runtime_error("Encountered NaN.");
+  }
   return std::sqrt(norm);
 }
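The guard added above can be exercised on its own. A self-contained sketch of the same fail-fast pattern (a hypothetical free function, not the library's API):

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sketch of the check this diff adds to Matrix::dotRow: accumulate the
// dot product, then throw a clear error if the result is NaN instead of
// letting the value reach the sigmoid table lookup.
double dot_row(const std::vector<double>& row, const std::vector<double>& vec) {
  double d = 0.0;
  for (std::size_t j = 0; j < row.size(); j++) {
    d += row[j] * vec[j];
  }
  if (std::isnan(d)) {
    throw std::runtime_error("Encountered NaN.");
  }
  return d;
}
```

With a diverging model, partial products overflow to +inf and -inf and their sum is NaN, so the throw fires: for example, `dot_row({1e308, 1e308}, {10.0, -10.0})` throws, while well-scaled inputs return normally. Throwing here turns a silent segfault into a catchable error that points back at the diverging run.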
 

+ 26 - 28
src/model.cc

@@ -15,17 +15,20 @@
 
 namespace fasttext {
 
-constexpr int32_t SIGMOID_TABLE_SIZE = 512;
-constexpr int32_t MAX_SIGMOID = 8;
-constexpr int32_t LOG_TABLE_SIZE = 512;
-
-Model::Model(std::shared_ptr<Matrix> wi,
-             std::shared_ptr<Matrix> wo,
-             std::shared_ptr<Args> args,
-             int32_t seed)
-  : hidden_(args->dim), output_(wo->m_),
-  grad_(args->dim), rng(seed), quant_(false)
-{
+constexpr int64_t SIGMOID_TABLE_SIZE = 512;
+constexpr int64_t MAX_SIGMOID = 8;
+constexpr int64_t LOG_TABLE_SIZE = 512;
+
+Model::Model(
+    std::shared_ptr<Matrix> wi,
+    std::shared_ptr<Matrix> wo,
+    std::shared_ptr<Args> args,
+    int32_t seed)
+    : hidden_(args->dim),
+      output_(wo->m_),
+      grad_(args->dim),
+      rng(seed),
+      quant_(false) {
   wi_ = wi;
   wo_ = wo;
   args_ = args;
@@ -34,15 +37,12 @@ Model::Model(std::shared_ptr<Matrix> wi,
   negpos = 0;
   loss_ = 0.0;
   nexamples_ = 1;
+  t_sigmoid_.reserve(SIGMOID_TABLE_SIZE + 1);
+  t_log_.reserve(LOG_TABLE_SIZE + 1);
   initSigmoid();
   initLog();
 }
 
-Model::~Model() {
-  delete[] t_sigmoid;
-  delete[] t_log;
-}
-
 void Model::setQuantizePointer(std::shared_ptr<QMatrix> qwi,
                                std::shared_ptr<QMatrix> qwo, bool qout) {
   qwi_ = qwi;
@@ -250,17 +250,17 @@ void Model::initTableNegatives(const std::vector<int64_t>& counts) {
   for (size_t i = 0; i < counts.size(); i++) {
     real c = pow(counts[i], 0.5);
     for (size_t j = 0; j < c * NEGATIVE_TABLE_SIZE / z; j++) {
-      negatives.push_back(i);
+      negatives_.push_back(i);
     }
   }
-  std::shuffle(negatives.begin(), negatives.end(), rng);
+  std::shuffle(negatives_.begin(), negatives_.end(), rng);
 }
 
 int32_t Model::getNegative(int32_t target) {
   int32_t negative;
   do {
-    negative = negatives[negpos];
-    negpos = (negpos + 1) % negatives.size();
+    negative = negatives_[negpos];
+    negpos = (negpos + 1) % negatives_.size();
   } while (target == negative);
   return negative;
 }
@@ -314,18 +314,16 @@ real Model::getLoss() const {
 }
 
 void Model::initSigmoid() {
-  t_sigmoid = new real[SIGMOID_TABLE_SIZE + 1];
   for (int i = 0; i < SIGMOID_TABLE_SIZE + 1; i++) {
     real x = real(i * 2 * MAX_SIGMOID) / SIGMOID_TABLE_SIZE - MAX_SIGMOID;
-    t_sigmoid[i] = 1.0 / (1.0 + std::exp(-x));
+    t_sigmoid_.push_back(1.0 / (1.0 + std::exp(-x)));
   }
 }
 
 void Model::initLog() {
-  t_log = new real[LOG_TABLE_SIZE + 1];
   for (int i = 0; i < LOG_TABLE_SIZE + 1; i++) {
     real x = (real(i) + 1e-5) / LOG_TABLE_SIZE;
-    t_log[i] = std::log(x);
+    t_log_.push_back(std::log(x));
   }
 }
 
@@ -333,8 +331,8 @@ real Model::log(real x) const {
   if (x > 1.0) {
     return 0.0;
   }
-  int i = int(x * LOG_TABLE_SIZE);
-  return t_log[i];
+  int64_t i = int64_t(x * LOG_TABLE_SIZE);
+  return t_log_[i];
 }
 
 real Model::std_log(real x) const {
@@ -347,8 +345,8 @@ real Model::sigmoid(real x) const {
   } else if (x > MAX_SIGMOID) {
     return 1.0;
   } else {
-    int i = int((x + MAX_SIGMOID) * SIGMOID_TABLE_SIZE / MAX_SIGMOID / 2);
-    return t_sigmoid[i];
+    int64_t i = int64_t((x + MAX_SIGMOID) * SIGMOID_TABLE_SIZE / MAX_SIGMOID / 2);
+    return t_sigmoid_[i];
   }
 }
 

+ 3 - 4
src/model.h

@@ -44,10 +44,10 @@ class Model {
     int32_t osz_;
     real loss_;
     int64_t nexamples_;
-    real* t_sigmoid;
-    real* t_log;
+    std::vector<real> t_sigmoid_;
+    std::vector<real> t_log_;
     // used for negative sampling:
-    std::vector<int32_t> negatives;
+    std::vector<int32_t> negatives_;
     size_t negpos;
     // used for hierarchical softmax:
     std::vector< std::vector<int32_t> > paths;
@@ -66,7 +66,6 @@ class Model {
   public:
     Model(std::shared_ptr<Matrix>, std::shared_ptr<Matrix>,
           std::shared_ptr<Args>, int32_t);
-    ~Model();
 
     real binaryLogistic(int32_t, bool, real);
     real negativeSampling(int32_t, real);