# fastText

fastText is a library for efficient learning of word representations and sentence classification.

## Requirements

fastText should compile on all modern platforms, including Mac OS and Linux. Because it uses C++11 features, it requires a C++11-compatible compiler, such as:

* (gcc-4.6.3 or newer) or (clang-3.3 or newer)
* make

For the word-similarity evaluation script you will need:

* python 2.6 or newer
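
To check which compiler you have installed, you can run one of the following standard toolchain commands:

```
$ gcc --version
$ clang --version
```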

## Building fastText

Use the provided Makefile. At the command prompt, type:

```
$ make
```

This will produce object files for all the classes, as well as the main binary `fasttext`.
If you do not plan on using the default system-wide compiler, please update the two macros defined at the beginning of the Makefile (`CC` and `INCLUDES`).
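
If you would rather not edit the Makefile, note that `make` also lets you override macros from the command line (a generic `make` feature, assuming the Makefile does not force its own values with `override`):

```
$ make CC=clang++ INCLUDES=-I/usr/local/include
```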

## Example use cases

This library has two main use cases, word representation learning and text classification, corresponding to [1] and [2] respectively; both are described below.

### Text classification

In order to train a text classifier following [2], complete the compilation steps above and then issue:

```
$ ./fasttext supervised -input train.txt -output model
```

where `train.txt` is a text file containing one training sentence per line, along with its labels. By default, we assume that labels are words in the sentence prefixed by the string `__label__`. This will output two files: `model.bin` and `model.vec`.
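
For instance, a couple of lines of a hypothetical `train.txt` for sentiment classification might look like this (the labels and sentences are purely illustrative):

```
__label__positive the film was a delight from start to finish
__label__negative dull plot and wooden acting throughout
```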

Once the model has been trained, you can compute the precision at 1 (P@1) on a test set using:

```
$ ./fasttext test model.bin test.txt
```

### Predicting labels

If you want to obtain the most likely label for a piece of text, use:

```
$ ./fasttext predict model.bin test.txt
```

where `test.txt` contains one piece of text to classify per line. This will print the most likely label for each line to standard output. See `classification.sh` for an example use case. To reproduce the results from the paper [2], run `classification-results.sh`; this will download all the datasets and reproduce the results from Table 1.
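
As a quick illustration, you could classify a single sentence by writing it to a file first; the sentence and predicted label below are made up, and the actual label will depend on your training data:

```
$ echo "the film was a delight from start to finish" > sample.txt
$ ./fasttext predict model.bin sample.txt
__label__positive
```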

### Word representation

In order to compute word vectors as described in [1], compile the executables as described above. Then, given a training file `data.txt`, do:

```
$ ./fasttext skipgram -input data.txt -output model
```

This will launch the optimization and save two files: `model.bin` and `model.vec`.
`model.vec` is a text file containing the word vectors, one per line. `model.bin` is a binary file containing all the parameters of the model; it can be used later to compute word vectors or to restart the optimization.
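
The `.vec` file follows the usual word2vec text format (an assumption worth verifying for your version): a header line with the vocabulary size and the vector dimension, followed by one word and its vector per line. The numbers below are invented and truncated for display:

```
$ head -2 model.vec
218316 100
the -0.031 0.148 -0.007 0.112 0.056 ...
```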

### Obtaining word vectors for out-of-vocabulary words

Provided you have a text file `queries.txt` containing the words for which you want to compute vectors, issue the following command:

```
$ ./fasttext print-vectors model.bin < queries.txt
```

This will print each word and its vector to standard output, one word per line.
Note that this can also be used with pipes:

```
$ cat queries.txt | ./fasttext print-vectors model.bin
```
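
Because `print-vectors` reads words from standard input, you can also query a single word directly, including one that never appeared in the training data (the word here is just an example):

```
$ echo "unbelievableness" | ./fasttext print-vectors model.bin
```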

See the provided scripts for an example. For instance, running:

```
$ ./get-vectors.sh
```

will compile the code, download data, compute the word vectors and evaluate them on the rare words similarity dataset RW [Luong et al., 2013].

## Full documentation

Most options are passed to `fasttext` on the command line as `-option value`, as in the examples above; a combined example is shown after the list.

* `input`: text file used for training the model
* `test`: text file used for testing the model (only works in the classification setup)
* `output`: prefix of the files saved at the end of optimization
* `lr`: learning rate (default: 0.05)
* `dim`: dimension of the vectors
* `ws`: size of the context window considered around each word
* `epoch`: number of iterations over the training file
* `minCount`: minimal word occurrence count in the training file
* `neg`: number of negatives sampled for the negative sampling approximation
* `wordNgrams`: number of word n-grams considered in the sentence classification setup
* `sampling`: word distribution used to sample negatives
    * `log`: use the logarithm of the unigram frequency
    * `sqrt`: use the square root of the unigram frequency
    * `uni`: uniform distribution over words
* `loss`: approximation of the softmax loss
    * `hs`: hierarchical softmax
    * `ns`: negative sampling
    * `softmax`: full softmax computation (slow)
* `model`: model used
    * `cbow`: continuous bag-of-words
    * `sg`: skip-gram
    * `supervised`: sentence classification
* `bucket`: number of n-gram hash buckets in use
* `minn`: shortest character n-gram used
* `maxn`: longest character n-gram used
* `onlyWord`: number of most frequent words for which subword information is not used
* `thread`: number of threads used for optimization
* `verbose`: print info to stdout every `verbose` samples
* `t`: threshold for randomly discarding words based on unigram frequency
* `label`: string used as a prefix for labels
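
For example, assuming each option above maps to a `-option value` flag as in the earlier commands, a skip-gram invocation setting several of them explicitly might look like this (the values are arbitrary and shown only to illustrate the syntax):

```
$ ./fasttext skipgram -input data.txt -output model \
    -dim 100 -ws 5 -epoch 5 -minCount 5 -neg 5 \
    -loss ns -minn 3 -maxn 6 -thread 4
```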

## References

[1] Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, arXiv:1607.04606, 2016

[2] Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov, Bag of Tricks for Efficient Text Classification, arXiv:1607.01759, 2016

## Join the fastText community

* Facebook page: https://www.facebook.com/groups/fasttextusers
* Contact: [[email protected]](mailto:[email protected]) [[email protected]](mailto:[email protected]) [[email protected]](mailto:[email protected]) [[email protected]](mailto:[email protected])

See the CONTRIBUTING file for information about how to help out.

## License

fastText is BSD-licensed. We also provide an additional patent grant.