fastText
========

`fastText <https://fasttext.cc/>`__ is a library for efficient learning
of word representations and sentence classification.

Requirements
------------

`fastText <https://fasttext.cc/>`__ builds on modern Mac OS and Linux
distributions. Since it uses C++11 features, it requires a compiler with
good C++11 support, such as:

-  gcc-4.8 or newer, or clang-3.3 or newer

You will also need:

-  `Python <https://www.python.org/>`__ version 2.7 or >=3.4
-  `NumPy <http://www.numpy.org/>`__ and
   `SciPy <https://www.scipy.org/>`__
-  `pybind11 <https://github.com/pybind/pybind11>`__

Building fastText
-----------------

The easiest way to get the latest version of fastText is to use
`pip <https://pypi.python.org/pypi/fasttext>`__.

::

    $ pip install fasttext

If you want to use the latest unstable release you will need to build
from source using setup.py.

Now you can import this library with

::

    import fastText

Examples
--------

These examples assume that the reader is already familiar with
fastText. If you are not, consult the main
`README <https://github.com/facebookresearch/fastText/blob/master/README.md>`__
and in particular `the tutorials on our
website <https://fasttext.cc/docs/en/supervised-tutorial.html>`__.

We recommend you look at the `examples within the doc
folder <https://github.com/facebookresearch/fastText/tree/master/python/doc/examples>`__.

As with any package, you can get help on any Python function using the
help function.

For example

::

    >>> import fastText
    >>> help(fastText.FastText)

    Help on module fastText.FastText in fastText:

    NAME
        fastText.FastText

    DESCRIPTION
        # Copyright (c) 2017-present, Facebook, Inc.
        # All rights reserved.
        #
        # This source code is licensed under the MIT license found in the
        # LICENSE file in the root directory of this source tree.

    FUNCTIONS
        load_model(path)
            Load a model given a filepath and return a model object.

        tokenize(text)
            Given a string of text, tokenize it and return a list of tokens
    [...]

IMPORTANT: Preprocessing data / encoding conventions
----------------------------------------------------

In general it is important to properly preprocess your data. In
particular our example scripts in the `root
folder <https://github.com/facebookresearch/fastText>`__ do this.

fastText assumes UTF-8 encoded text. All text must be `unicode for
Python2 <https://docs.python.org/2/library/functions.html#unicode>`__
and `str for
Python3 <https://docs.python.org/3.5/library/stdtypes.html#textseq>`__.
The passed text will be `encoded as UTF-8 by
pybind11 <https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions>`__
before being passed to the fastText C++ library. This means it is
important to use UTF-8 encoded text when building a model. On Unix-like
systems you can convert text using
`iconv <https://en.wikipedia.org/wiki/Iconv>`__.

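If iconv is not available, the same conversion can be sketched in plain
Python. The helper name ``convert_to_utf8`` and its parameters are
hypothetical and not part of fastText; the source encoding has to be
known in advance.

```python
import io


def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file as UTF-8 (hypothetical helper, not a fastText API)."""
    # newline="" disables newline translation, so line endings are preserved.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
            for line in src:
                dst.write(line)
```

For instance, ``convert_to_utf8("corpus.latin1.txt", "corpus.txt", "latin-1")``
would write a UTF-8 copy of a Latin-1 file; both file names are
placeholders.
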
fastText will tokenize (split text into pieces) based on the following
ASCII characters (bytes). In particular, it is not aware of UTF-8
whitespace. We advise the user to convert UTF-8 whitespace / word
boundaries into one of the following symbols as appropriate.

-  space
-  tab
-  vertical tab
-  carriage return
-  formfeed
-  the null character

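One possible way to do this conversion before handing text to fastText
is sketched below. The function name ``normalize_whitespace`` is
hypothetical, and the choice to map every other whitespace character to
a plain space is an assumption; it relies on Python's ``str.isspace``
to decide what counts as whitespace.

```python
# The separators listed above, which fastText already recognizes.
ASCII_SEPARATORS = frozenset(" \t\v\r\f\0")


def normalize_whitespace(text):
    """Map non-ASCII whitespace to a space (hypothetical helper)."""
    out = []
    for ch in text:
        if ch == "\n" or ch in ASCII_SEPARATORS:
            # Newlines delimit lines; ASCII separators are kept unchanged.
            out.append(ch)
        elif ch.isspace():
            # Any other whitespace (e.g. a no-break space) becomes a space.
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)
```
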
The newline character is used to delimit lines of text. In particular,
the EOS token is appended to a line of text if a newline character is
encountered. The only exception is if the number of tokens exceeds the
MAX\_LINE\_SIZE constant as defined in the `Dictionary
header <https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h>`__.
This means if you have text that is not separated by newlines, such as
the `fil9 dataset <http://mattmahoney.net/dc/textdata>`__, it will be
broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token is
not appended.

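The behaviour described above can be modelled roughly as follows. This
is a simplified illustration, not the library's implementation: the
function name ``chunk_lines`` is hypothetical, the EOS spelling
``</s>`` is an assumption, and the chunk size is passed in explicitly
rather than read from the Dictionary header. Edge cases in the C++ code
may differ.

```python
import re

EOS = "</s>"  # assumed spelling of the EOS token
_SEPARATORS = re.compile("[ \t\v\r\f\0]+")


def chunk_lines(text, max_line_size):
    """Split text into token lists, mimicking the rules described above."""
    chunks = []
    pieces = text.split("\n")
    for i, piece in enumerate(pieces):
        tokens = [t for t in _SEPARATORS.split(piece) if t]
        # Every piece except the last one was terminated by a newline.
        ended_by_newline = i < len(pieces) - 1
        # Overlong lines are cut into full chunks without an EOS token.
        while len(tokens) > max_line_size:
            chunks.append(tokens[:max_line_size])
            tokens = tokens[max_line_size:]
        if ended_by_newline:
            tokens.append(EOS)
        if tokens:
            chunks.append(tokens)
    return chunks
```

With a deliberately small ``max_line_size`` of 2, the text
``"a b c d e\nf g\n"`` yields ``[a, b]``, ``[c, d]``, ``[e, </s>]`` and
``[f, g, </s>]`` in this model: only chunks that end at a newline carry
the EOS token.
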
The length of a token is measured in UTF-8 characters, not bytes.
fastText considers the `leading two bits of a
byte <https://en.wikipedia.org/wiki/UTF-8#Description>`__ to identify
`subsequent bytes of a multi-byte
sequence <https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc>`__.
Knowing this is especially important when choosing the minimum and
maximum length of subwords. Further, the EOS token (as specified in the
`Dictionary
header <https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h>`__)
is considered a character and will not be broken into subwords.
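
The counting rule can be illustrated in Python. In UTF-8, every
continuation byte of a multi-byte sequence has the leading bits ``10``,
so counting the bytes that do not match that pattern gives the number
of characters. The function name ``utf8_char_count`` is hypothetical
and only mirrors the idea; it is not a fastText function.

```python
def utf8_char_count(token):
    """Count UTF-8 characters by skipping continuation bytes (10xxxxxx)."""
    data = bytearray(token.encode("utf-8"))
    # (b & 0xC0) == 0x80 holds exactly for continuation bytes.
    return sum(1 for b in data if (b & 0xC0) != 0x80)
```

A token such as ``naïve`` occupies six bytes in UTF-8 but counts as
five characters, which is the length that matters when choosing subword
sizes.
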