fastText
========

`fastText <https://fasttext.cc/>`__ is a library for efficient learning
of word representations and sentence classification.

Requirements
------------

`fastText <https://fasttext.cc/>`__ builds on modern Mac OS and Linux
distributions. Since it uses C++11 features, it requires a compiler with
good C++11 support, such as:

-  gcc-4.8 or newer, or clang-3.3 or newer

You will also need:

-  `Python <https://www.python.org/>`__ version 2.7 or >=3.4
-  `NumPy <http://www.numpy.org/>`__ and
   `SciPy <https://www.scipy.org/>`__
-  `pybind11 <https://github.com/pybind/pybind11>`__

Building fastText
-----------------

The easiest way to get the latest version of fastText is to use
`pip <https://pypi.python.org/pypi/fasttext>`__.

::

    $ pip install fasttext

If you want to use the latest unstable release, you will need to build
from source using setup.py.

Now you can import this library with

::

    import fastText

Examples
--------

In general it is assumed that the reader already has good knowledge of
fastText. For this, consider the main
`README <https://github.com/facebookresearch/fastText/blob/master/README.md>`__
and in particular `the tutorials on our
website <https://fasttext.cc/docs/en/supervised-tutorial.html>`__.

We recommend you look at the `examples within the doc
folder <https://github.com/facebookresearch/fastText/tree/master/python/doc/examples>`__.

As with any package, you can get help on any Python function using the
help function.

For example

::

    >>> import fastText
    >>> help(fastText.FastText)

    Help on module fastText.FastText in fastText:

    NAME
        fastText.FastText

    DESCRIPTION
        # Copyright (c) 2017-present, Facebook, Inc.
        # All rights reserved.
        #
        # This source code is licensed under the BSD-style license found in the
        # LICENSE file in the root directory of this source tree. An additional grant
        # of patent rights can be found in the PATENTS file in the same directory.

    FUNCTIONS
        load_model(path)
            Load a model given a filepath and return a model object.

        tokenize(text)
            Given a string of text, tokenize it and return a list of tokens
    [...]

IMPORTANT: Preprocessing data / encoding conventions
----------------------------------------------------

In general it is important to properly preprocess your data. In
particular, our example scripts in the `root
folder <https://github.com/facebookresearch/fastText>`__ do this.

fastText assumes UTF-8 encoded text. All text must be `unicode for
Python2 <https://docs.python.org/2/library/functions.html#unicode>`__
and `str for
Python3 <https://docs.python.org/3.5/library/stdtypes.html#textseq>`__.
The passed text will be `encoded as UTF-8 by
pybind11 <https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions>`__
before being passed to the fastText C++ library. This means it is
important to use UTF-8 encoded text when building a model. On Unix-like
systems you can convert text using
`iconv <https://en.wikipedia.org/wiki/Iconv>`__.
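
Where iconv is not at hand, the same conversion can be sketched in
Python. The helper name ``to_utf8`` and the use of Latin-1 as the source
encoding are assumptions made for illustration; neither is part of
fastText, and the source encoding of your data has to be known in
advance.

```python
def to_utf8(raw, src_encoding):
    """Decode raw bytes from src_encoding and re-encode them as UTF-8.

    Hypothetical helper for illustration; not part of the fastText API.
    Raises UnicodeDecodeError if raw is not valid in src_encoding.
    """
    return raw.decode(src_encoding).encode("utf-8")


# Example: the single Latin-1 byte 0xE9 ("é") becomes a two-byte UTF-8 sequence.
latin1_bytes = u"caf\u00e9".encode("latin-1")
utf8_bytes = to_utf8(latin1_bytes, "latin-1")
```

For whole files, the same idea applies when reading with the source
encoding and writing with ``encoding="utf-8"``.
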

fastText tokenizes (splits text into pieces) on the ASCII characters
(bytes) listed below. In particular, it is not aware of UTF-8
whitespace. We advise the user to convert UTF-8 whitespace / word
boundaries into one of these symbols as appropriate.

-  space
-  tab
-  vertical tab
-  carriage return
-  formfeed
-  the null character
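
One way to do this conversion before training is sketched below. It
assumes Python 3, where ``\s`` in a ``str`` pattern also matches
non-ASCII whitespace; the helper name ``normalize_whitespace`` is
hypothetical and not part of fastText.

```python
import re

# Any run of Unicode whitespace except "\n", which fastText uses to delimit lines.
_NON_NEWLINE_WHITESPACE = re.compile(r"[^\S\n]+")


def normalize_whitespace(text):
    """Replace each run of non-newline whitespace with one ASCII space.

    Hypothetical helper for illustration; not part of the fastText API.
    """
    return _NON_NEWLINE_WHITESPACE.sub(" ", text)


# U+00A0 (no-break space) and U+3000 (ideographic space) are not among the
# ASCII delimiters listed above, so they are rewritten to plain spaces here.
cleaned = normalize_whitespace(u"new\u00a0york\u3000city\nnext line")
```

Newlines are deliberately left untouched, since they carry meaning as
line delimiters.
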

The newline character is used to delimit lines of text. In particular,
the EOS token is appended to a line of text if a newline character is
encountered. The only exception is if the number of tokens exceeds the
MAX\_LINE\_SIZE constant as defined in the `Dictionary
header <https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h>`__.
This means if you have text that is not separated by newlines, such as
the `fil9 dataset <http://mattmahoney.net/dc/textdata>`__, it will be
broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token is
not appended.
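
If explicit line boundaries are wanted for such text, newlines can be
inserted during preprocessing. The sketch below is one possible
approach; ``wrap_tokens`` and its ``max_tokens`` parameter are
hypothetical, and ``max_tokens`` is a value you choose yourself rather
than one read from the fastText headers.

```python
def wrap_tokens(text, max_tokens):
    """Re-join tokens with a newline after every max_tokens tokens.

    Hypothetical helper for illustration; not part of the fastText API.
    Note that str.split() splits on any whitespace, which is broader than
    the ASCII delimiter set fastText itself uses.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    tokens = text.split()
    lines = [
        " ".join(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]
    return "\n".join(lines)


wrapped = wrap_tokens("a b c d e", 2)
```

Whether fixed-size lines are appropriate depends on the data; for
corpora with natural sentence or document boundaries, splitting on
those is usually the more meaningful choice.
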

The length of a token is measured in UTF-8 characters rather than
bytes: the `leading two bits of a
byte <https://en.wikipedia.org/wiki/UTF-8#Description>`__ are used to
identify `subsequent bytes of a multi-byte
sequence <https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc>`__.
Knowing this is especially important when choosing the minimum and
maximum length of subwords. Further, the EOS token (as specified in the
`Dictionary
header <https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h>`__)
is considered a character and will not be broken into subwords.
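
The counting rule can be illustrated in a few lines of Python. This is
a sketch of the rule as described above, not fastText's own code; the
helper name ``utf8_char_count`` is hypothetical.

```python
def utf8_char_count(token):
    """Count characters by skipping UTF-8 continuation bytes (bit pattern 10xxxxxx).

    Hypothetical helper for illustration; not part of the fastText API.
    """
    data = token.encode("utf-8")
    # A byte whose leading two bits are "10" continues a multi-byte sequence.
    return sum(1 for byte in bytearray(data) if (byte & 0xC0) != 0x80)


word = u"h\u00e9llo"                   # "héllo": "é" takes two bytes in UTF-8
n_bytes = len(word.encode("utf-8"))    # length in bytes
n_chars = utf8_char_count(word)        # length in characters
```

Here the byte length and the character length differ by one, which is
the difference that matters when setting subword length bounds for
non-ASCII text.
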