fastText
========

`fastText `__ is a library for efficient learning of word representations and sentence classification.

Requirements
------------

`fastText `__ builds on modern Mac OS and Linux distributions. Since it uses C++11 features, it requires a compiler with good C++11 support. These include:

-  (gcc-4.8 or newer) or (clang-3.3 or newer)

You will need

-  `Python `__ version 2.7 or >=3.4
-  `NumPy `__ & `SciPy `__
-  `pybind11 `__

Building fastText
-----------------

The easiest way to get the latest version of `fastText is to use pip `__.

::

    $ pip install fasttext

If you want to use the latest unstable release you will need to build from source using setup.py.

Now you can import this library with

::

    import fastText

Examples
--------

In general it is assumed that the reader already has good knowledge of fastText. For this consider the main `README `__ and in particular `the tutorials on our website `__.

We recommend you look at the `examples within the doc folder `__.

As with any package you can get help on any Python function using the help function.

For example

::

    >>> import fastText
    >>> help(fastText.FastText)

    Help on module fastText.FastText in fastText:

    NAME
        fastText.FastText

    DESCRIPTION
        # Copyright (c) 2017-present, Facebook, Inc.
        # All rights reserved.
        #
        # This source code is licensed under the MIT license found in the
        # LICENSE file in the root directory of this source tree.

    FUNCTIONS
        load_model(path)
            Load a model given a filepath and return a model object.

        tokenize(text)
            Given a string of text, tokenize it and return a list of tokens
    [...]

IMPORTANT: Preprocessing data / encoding conventions
-----------------------------------------------------

In general it is important to properly preprocess your data. In particular our example scripts in the `root folder `__ do this.

fastText assumes UTF-8 encoded text. All text must be `unicode for Python2 `__ and `str for Python3 `__.
The passed text will be `encoded as UTF-8 by pybind11 `__ before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using `iconv `__.

fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.

-  space
-  tab
-  vertical tab
-  carriage return
-  formfeed
-  the null character

The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX\_LINE\_SIZE constant as defined in the `Dictionary header `__. This means if you have text that is not separated by newlines, such as the `fil9 dataset `__, it will be broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token is not appended.

The length of a token is the number of UTF-8 characters, determined by considering the `leading two bits of a byte `__ to identify `subsequent bytes of a multi-byte sequence `__. Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the `Dictionary header `__) is considered a character and will not be broken into subwords.
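
The two conventions above (ASCII-only token separators and character counting by UTF-8 lead bytes) can be sketched in plain Python. This is only an illustration under stated assumptions, not fastText's own code: the helper names ``normalize_whitespace`` and ``utf8_char_count`` are hypothetical, and the sketch assumes Python 3, where iterating over ``bytes`` yields integers.

```python
import re

# In a Python 3 str pattern, \s also matches Unicode whitespace such as
# U+00A0 (no-break space) or U+3000 (ideographic space), which a tokenizer
# working on ASCII bytes would not treat as a separator.
# "[^\S\n]" means "whitespace, but not newline", so line boundaries
# (and therefore where EOS would be appended) are left untouched.
_NON_NEWLINE_WS = re.compile(r"[^\S\n]+")


def normalize_whitespace(text):
    """Hypothetical helper: replace each run of non-newline whitespace
    with a single ASCII space."""
    return _NON_NEWLINE_WS.sub(" ", text)


def utf8_char_count(token):
    """Hypothetical helper: count characters the way the text describes,
    by skipping UTF-8 continuation bytes (those of the form 10xxxxxx)."""
    data = token.encode("utf-8")
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


if __name__ == "__main__":
    raw = "hello\u00a0world\u3000foo\tbar\nbaz"
    print(repr(normalize_whitespace(raw)))
    # "na\u00efve" is 5 characters but 6 bytes in UTF-8.
    print(utf8_char_count("na\u00efve"), len("na\u00efve".encode("utf-8")))
```

Counting characters rather than bytes matters when picking subword lengths: a three-character CJK token occupies nine bytes in UTF-8, but under the rule described above it counts as three characters.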