fastText |CircleCI|
===================

`fastText <https://fasttext.cc/>`__ is a library for efficient learning
of word representations and sentence classification.

In this document we present how to use fastText in python.

Table of contents
-----------------

- `Requirements <#requirements>`__
- `Installation <#installation>`__
- `Usage overview <#usage-overview>`__
- `Word representation model <#word-representation-model>`__
- `Text classification model <#text-classification-model>`__
- `IMPORTANT: Preprocessing data / encoding
  conventions <#important-preprocessing-data-encoding-conventions>`__
- `More examples <#more-examples>`__
- `API <#api>`__
- `train_unsupervised parameters <#train_unsupervised-parameters>`__
- `train_supervised parameters <#train_supervised-parameters>`__
- `model object <#model-object>`__
Requirements
============

`fastText <https://fasttext.cc/>`__ builds on modern Mac OS and Linux
distributions. Since it uses C++11 features, it requires a compiler with
good C++11 support. You will need `Python <https://www.python.org/>`__
(version 2.7 or ≥ 3.4), `NumPy <http://www.numpy.org/>`__ &
`SciPy <https://www.scipy.org/>`__ and
`pybind11 <https://github.com/pybind/pybind11>`__.
Installation
============

To install the latest release, you can do:

.. code:: bash

    $ pip install fasttext

or, to get the latest development version of fasttext, you can install
from our github repository:

.. code:: bash

    $ git clone https://github.com/facebookresearch/fastText.git
    $ cd fastText
    $ sudo pip install .
    $ # or :
    $ sudo python setup.py install
Usage overview
==============

Word representation model
-------------------------

In order to learn word vectors, as `described
here <https://fasttext.cc/docs/en/references.html#enriching-word-vectors-with-subword-information>`__,
we can use the ``fasttext.train_unsupervised`` function like this:

.. code:: py

    import fasttext

    # Skipgram model :
    model = fasttext.train_unsupervised('data.txt', model='skipgram')

    # or, cbow model :
    model = fasttext.train_unsupervised('data.txt', model='cbow')

where ``data.txt`` is a training file containing utf-8 encoded text.

The returned ``model`` object represents your learned model, and you can
use it to retrieve information.

.. code:: py

    print(model.words)   # list of words in dictionary
    print(model['king']) # get the vector of the word 'king'
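One common use of these vectors is measuring how close two words are. The following is a minimal sketch using NumPy; the two vectors are hard-coded stand-ins for ``model['king']``-style lookups, not output from a trained model:

.. code:: python

    import numpy as np

    def cosine_similarity(u, v):
        # cos(u, v) = (u . v) / (|u| * |v|)
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Stand-in vectors; in practice you would use model['king'], model['queen'].
    king = np.array([0.5, 1.0, 0.2])
    queen = np.array([0.45, 0.9, 0.3])
    print(cosine_similarity(king, queen))  # close to 1.0 for similar vectors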
Saving and loading a model object
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can save your trained model object by calling the function
``save_model``.

.. code:: py

    model.save_model("model_filename.bin")

and retrieve it later thanks to the function ``load_model``:

.. code:: py

    model = fasttext.load_model("model_filename.bin")

For more information about word representation usage of fastText, you
can refer to our `word representations
tutorial <https://fasttext.cc/docs/en/unsupervised-tutorial.html>`__.
Text classification model
-------------------------

In order to train a text classifier using the method `described
here <https://fasttext.cc/docs/en/references.html#bag-of-tricks-for-efficient-text-classification>`__,
we can use the ``fasttext.train_supervised`` function like this:

.. code:: py

    import fasttext

    model = fasttext.train_supervised('data.train.txt')

where ``data.train.txt`` is a text file containing one training sentence
per line, along with its labels. By default, we assume that labels are
words prefixed by the string ``__label__``.
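As a concrete illustration of this convention, here is a minimal, hypothetical sketch (not part of the fastText API) that splits one training line into its labels and its text:

.. code:: python

    LABEL_PREFIX = "__label__"

    def split_line(line):
        """Separate __label__ tokens from the text of one training line."""
        tokens = line.strip().split()
        labels = [t for t in tokens if t.startswith(LABEL_PREFIX)]
        words = [t for t in tokens if not t.startswith(LABEL_PREFIX)]
        return labels, " ".join(words)

    labels, text = split_line("__label__baking __label__equipment Which dish is best ?")
    print(labels)  # ['__label__baking', '__label__equipment']
    print(text)    # 'Which dish is best ?'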
Once the model is trained, we can retrieve the lists of words and labels:

.. code:: py

    print(model.words)
    print(model.labels)

To evaluate our model by computing the precision at 1 (P@1) and the
recall on a test set, we use the ``test`` function:

.. code:: py

    def print_results(N, p, r):
        print("N\t" + str(N))
        print("P@{}\t{:.3f}".format(1, p))
        print("R@{}\t{:.3f}".format(1, r))

    print_results(*model.test('test.txt'))
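To make these metrics concrete, here is a hedged, self-contained sketch of precision@1 and recall computed over toy predictions (a hypothetical helper, not fastText's implementation, which does this over a test file):

.. code:: python

    def precision_recall_at_1(predictions, gold):
        """predictions: one predicted label per example;
        gold: one set of true labels per example."""
        correct = sum(1 for pred, truth in zip(predictions, gold) if pred in truth)
        n_examples = len(predictions)
        n_gold_labels = sum(len(truth) for truth in gold)
        precision = correct / n_examples   # fraction of predictions that are right
        recall = correct / n_gold_labels   # fraction of true labels retrieved
        return n_examples, precision, recall

    N, p, r = precision_recall_at_1(
        ["__label__a", "__label__b", "__label__c"],
        [{"__label__a"}, {"__label__b", "__label__c"}, {"__label__a"}],
    )
    print(N, p, r)  # 3 examples; 2 correct, 4 gold labels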
We can also predict labels for a specific text:

.. code:: py

    model.predict("Which baking dish is best to bake a banana bread ?")

By default, ``predict`` returns only one label: the one with the
highest probability. You can also predict more than one label by
specifying the parameter ``k``:

.. code:: py

    model.predict("Which baking dish is best to bake a banana bread ?", k=3)

If you want to predict more than one sentence you can pass an array of
strings:

.. code:: py

    model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)

Of course, you can also save and load a model to/from a file as `in the
word representation usage <#saving-and-loading-a-model-object>`__.

For more information about text classification usage of fastText, you
can refer to our `text classification
tutorial <https://fasttext.cc/docs/en/supervised-tutorial.html>`__.
Compress model files with quantization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When you want to save a supervised model file, fastText can compress it
into a much smaller model file at the cost of only a little bit of
performance.

.. code:: py

    # with the previously trained `model` object, call :
    model.quantize(input='data.train.txt', retrain=True)

    # then display results and save the new model :
    print_results(*model.test(valid_data))
    model.save_model("model_filename.ftz")

``model_filename.ftz`` will have a much smaller size than
``model_filename.bin``.

For further reading on quantization, you can refer to `this paragraph
from our blog
post <https://fasttext.cc/blog/2017/10/02/blog-post.html#model-compression>`__.
IMPORTANT: Preprocessing data / encoding conventions
----------------------------------------------------

In general it is important to properly preprocess your data. In
particular, our example scripts in the `root
folder <https://github.com/facebookresearch/fastText>`__ do this.

fastText assumes UTF-8 encoded text. All text must be `unicode for
Python2 <https://docs.python.org/2/library/functions.html#unicode>`__
and `str for
Python3 <https://docs.python.org/3.5/library/stdtypes.html#textseq>`__.
The passed text will be `encoded as UTF-8 by
pybind11 <https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions>`__
before being passed to the fastText C++ library. This means it is
important to use UTF-8 encoded text when building a model. On Unix-like
systems you can convert text using `iconv <https://en.wikipedia.org/wiki/Iconv>`__.

fastText will tokenize (split text into pieces) based on the following
ASCII characters (bytes). In particular, it is not aware of UTF-8
whitespace. We advise the user to convert UTF-8 whitespace / word
boundaries into one of the following symbols as appropriate.

- space
- tab
- vertical tab
- carriage return
- formfeed
- the null character
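For example, a minimal, hypothetical preprocessing step (not part of fastText) could replace non-ASCII Unicode whitespace, such as the no-break space, with a plain ASCII space before training:

.. code:: python

    import re

    # In Python 3, \s in a str pattern matches Unicode whitespace
    # (e.g. U+00A0 NO-BREAK SPACE), which fastText's tokenizer would
    # otherwise treat as part of a token.
    UNICODE_WS = re.compile(r"\s+")

    def normalize_whitespace(text):
        """Collapse any run of Unicode whitespace into a single ASCII space."""
        return UNICODE_WS.sub(" ", text).strip()

    print(normalize_whitespace("hello\u00a0world\u2003!"))  # 'hello world !'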
The newline character is used to delimit lines of text. In particular,
the EOS token is appended to a line of text if a newline character is
encountered. The only exception is if the number of tokens exceeds the
MAX\_LINE\_SIZE constant as defined in the `Dictionary
header <https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h>`__.
This means if you have text that is not separated by newlines, such as
the `fil9 dataset <http://mattmahoney.net/dc/textdata>`__, it will be
broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token is
not appended.

The length of a token is the number of UTF-8 characters, obtained by
considering the `leading two bits of a
byte <https://en.wikipedia.org/wiki/UTF-8#Description>`__ to identify
`subsequent bytes of a multi-byte
sequence <https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc>`__.
Knowing this is especially important when choosing the minimum and
maximum length of subwords. Further, the EOS token (as specified in the
`Dictionary
header <https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h>`__)
is considered a character and will not be broken into subwords.
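This character-counting rule can be sketched in pure Python: a byte whose leading two bits are ``10`` is a continuation byte of a multi-byte sequence and does not start a new character. This is a restatement of the UTF-8 byte layout, not fastText's actual code:

.. code:: python

    def utf8_length(token):
        """Count UTF-8 characters by skipping continuation bytes (0b10xxxxxx)."""
        data = token.encode("utf-8")
        # A byte starts a new character unless its leading two bits are 10.
        return sum(1 for b in data if (b & 0xC0) != 0x80)

    print(utf8_length("cafe"))    # 4
    print(utf8_length("café"))    # 4 ('é' is two bytes but one character)
    print(utf8_length("日本語"))  # 3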
More examples
-------------

In order to have a better knowledge of fastText models, please consider
the main
`README <https://github.com/facebookresearch/fastText/blob/master/README.md>`__
and in particular `the tutorials on our
website <https://fasttext.cc/docs/en/supervised-tutorial.html>`__.

You can find further python examples in `the doc
folder <https://github.com/facebookresearch/fastText/tree/master/python/doc/examples>`__.

As with any package you can get help on any Python function using the
help function.

For example

::

    >>> import fasttext
    >>> help(fasttext.FastText)

    Help on module fasttext.FastText in fasttext:

    NAME
        fasttext.FastText

    DESCRIPTION
        # Copyright (c) 2017-present, Facebook, Inc.
        # All rights reserved.
        #
        # This source code is licensed under the MIT license found in the
        # LICENSE file in the root directory of this source tree.

    FUNCTIONS
        load_model(path)
            Load a model given a filepath and return a model object.

        tokenize(text)
            Given a string of text, tokenize it and return a list of tokens
    [...]
API
===

``train_unsupervised`` parameters
---------------------------------

.. code:: python

    input            # training file path (required)
    model            # unsupervised fasttext model {cbow, skipgram} [skipgram]
    lr               # learning rate [0.05]
    dim              # size of word vectors [100]
    ws               # size of the context window [5]
    epoch            # number of epochs [5]
    minCount         # minimal number of word occurrences [5]
    minn             # min length of char ngram [3]
    maxn             # max length of char ngram [6]
    neg              # number of negatives sampled [5]
    wordNgrams       # max length of word ngram [1]
    loss             # loss function {ns, hs, softmax, ova} [ns]
    bucket           # number of buckets [2000000]
    thread           # number of threads [number of cpus]
    lrUpdateRate     # change the rate of updates for the learning rate [100]
    t                # sampling threshold [0.0001]
    verbose          # verbose [2]
``train_supervised`` parameters
-------------------------------

.. code:: python

    input              # training file path (required)
    lr                 # learning rate [0.1]
    dim                # size of word vectors [100]
    ws                 # size of the context window [5]
    epoch              # number of epochs [5]
    minCount           # minimal number of word occurrences [1]
    minCountLabel      # minimal number of label occurrences [1]
    minn               # min length of char ngram [0]
    maxn               # max length of char ngram [0]
    neg                # number of negatives sampled [5]
    wordNgrams         # max length of word ngram [1]
    loss               # loss function {ns, hs, softmax, ova} [softmax]
    bucket             # number of buckets [2000000]
    thread             # number of threads [number of cpus]
    lrUpdateRate       # change the rate of updates for the learning rate [100]
    t                  # sampling threshold [0.0001]
    label              # label prefix ['__label__']
    verbose            # verbose [2]
    pretrainedVectors  # pretrained word vectors (.vec file) for supervised learning []
``model`` object
----------------

The ``train_supervised``, ``train_unsupervised`` and ``load_model``
functions return an instance of the ``_FastText`` class, which we
generally name the ``model`` object.

This object exposes those training arguments as properties: ``lr``,
``dim``, ``ws``, ``epoch``, ``minCount``, ``minCountLabel``, ``minn``,
``maxn``, ``neg``, ``wordNgrams``, ``loss``, ``bucket``, ``thread``,
``lrUpdateRate``, ``t``, ``label``, ``verbose``, ``pretrainedVectors``.
So ``model.wordNgrams`` will give you the max length of word ngram used
for training this model.

In addition, the object exposes several functions:

.. code:: python

    get_dimension        # Get the dimension (size) of a lookup vector (hidden layer).
                         # This is equivalent to the `dim` property.
    get_input_vector     # Given an index, get the corresponding vector of the Input Matrix.
    get_input_matrix     # Get a copy of the full input matrix of a Model.
    get_labels           # Get the entire list of labels of the dictionary.
                         # This is equivalent to the `labels` property.
    get_line             # Split a line of text into words and labels.
    get_output_matrix    # Get a copy of the full output matrix of a Model.
    get_sentence_vector  # Given a string, get a single vector representation. This function
                         # assumes to be given a single line of text. We split words on
                         # whitespace (space, newline, tab, vertical tab) and the control
                         # characters carriage return, formfeed and the null character.
    get_subword_id       # Given a subword, return the index (within input matrix) it hashes to.
    get_subwords         # Given a word, get the subwords and their indices.
    get_word_id          # Given a word, get the word id within the dictionary.
    get_word_vector      # Get the vector representation of a word.
    get_words            # Get the entire list of words of the dictionary.
                         # This is equivalent to the `words` property.
    is_quantized         # Whether the model has been quantized.
    predict              # Given a string, get a list of labels and a list of corresponding probabilities.
    quantize             # Quantize the model, reducing its size and memory footprint.
    save_model           # Save the model to the given path.
    test                 # Evaluate the supervised model using the file given by path.
    test_label           # Return the precision and recall score for each label.

The properties ``words`` and ``labels`` return the words and labels from
the dictionary:

.. code:: py

    model.words   # equivalent to model.get_words()
    model.labels  # equivalent to model.get_labels()

The object overrides the ``__getitem__`` and ``__contains__`` functions
in order to return the representation of a word and to check if a word
is in the vocabulary.

.. code:: py

    model['king']    # equivalent to model.get_word_vector('king')
    'king' in model  # equivalent to `'king' in model.get_words()`
Join the fastText community
---------------------------

- `Facebook page <https://www.facebook.com/groups/1174547215919768>`__
- `Stack
  overflow <https://stackoverflow.com/questions/tagged/fasttext>`__
- `Google
  group <https://groups.google.com/forum/#!forum/fasttext-library>`__
- `GitHub <https://github.com/facebookresearch/fastText>`__

.. |CircleCI| image:: https://circleci.com/gh/facebookresearch/fastText/tree/master.svg?style=svg
   :target: https://circleci.com/gh/facebookresearch/fastText/tree/master