fastText
========
`fastText `__ is a library for efficient learning
of word representations and sentence classification.
Requirements
------------
`fastText `__ builds on modern Mac OS and Linux
distributions. Since it uses C++11 features, it requires a compiler with
good C++11 support. These include:
- (gcc-4.8 or newer) or (clang-3.3 or newer)
You will also need:
- `Python `__ version 2.7 or >=3.4
- `NumPy `__ and
`SciPy `__
- `pybind11 `__
Building fastText
-----------------
The easiest way to get the latest version of `fastText is to use
pip `__.
::

    $ pip install fasttext

If you want to use the latest unstable release you will need to build
from source using setup.py.
Now you can import this library with

::

    import fastText

Examples
--------
This section assumes that the reader already has good knowledge of
fastText. For background, see the main
`README `__
and in particular `the tutorials on our
website `__.
We recommend you look at the `examples within the doc
folder `__.
As with any package, you can get help on any Python function using the
built-in ``help`` function.
For example:
::

    >>> import fastText
    >>> help(fastText.FastText)
    Help on module fastText.FastText in fastText:

    NAME
        fastText.FastText

    DESCRIPTION
        # Copyright (c) 2017-present, Facebook, Inc.
        # All rights reserved.
        #
        # This source code is licensed under the MIT license found in the
        # LICENSE file in the root directory of this source tree.

    FUNCTIONS
        load_model(path)
            Load a model given a filepath and return a model object.

        tokenize(text)
            Given a string of text, tokenize it and return a list of tokens
    [...]

IMPORTANT: Preprocessing data / encoding conventions
-----------------------------------------------------
In general it is important to properly preprocess your data. In
particular our example scripts in the `root
folder `__ do this.
fastText assumes UTF-8 encoded text. All text must be `unicode for
Python2 `__
and `str for
Python3 `__.
The passed text will be `encoded as UTF-8 by
pybind11 `__
before being passed to the fastText C++ library. This means it is important to
use UTF-8 encoded text when building a model. On Unix-like systems you
can convert text using `iconv `__.
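If ``iconv`` is not available, the same conversion can be done in Python. The following is a minimal sketch for Python 3, assuming you already know the source encoding; the helper name ``reencode_to_utf8`` and the Latin-1 sample are illustrative and not part of fastText.

```python
def reencode_to_utf8(raw_bytes, source_encoding):
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    text = raw_bytes.decode(source_encoding)
    return text.encode("utf-8")


# Example: the single Latin-1 byte for "é" becomes a two-byte UTF-8 sequence.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = reencode_to_utf8(latin1_bytes, "latin-1")
```

Decoding raises ``UnicodeDecodeError`` if the input does not match the declared source encoding, which helps catch mislabeled files before training.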
fastText will tokenize (split text into pieces) based on the following
ASCII characters (bytes). In particular, it is not aware of UTF-8
whitespace. We advise the user to convert UTF-8 whitespace / word
boundaries into one of the following symbols as appropriate.
- space
- tab
- vertical tab
- carriage return
- formfeed
- the null character
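One way to perform this conversion is sketched below for Python 3. It replaces every character that Python's ``str.isspace`` treats as whitespace, other than the ASCII separators listed above and the newline, with a plain space. The helper name ``normalize_whitespace`` is hypothetical, and mapping everything to a space is an assumption; another symbol from the list may suit your data better.

```python
# ASCII separators recognized by fastText, plus newline, which delimits lines.
ASCII_SEPARATORS = {" ", "\t", "\v", "\r", "\f", "\0", "\n"}


def normalize_whitespace(text):
    """Replace non-ASCII whitespace characters with an ASCII space."""
    return "".join(
        " " if ch.isspace() and ch not in ASCII_SEPARATORS else ch
        for ch in text
    )


# U+00A0 (no-break space) and U+3000 (ideographic space) become spaces;
# the tab and the newline are left untouched.
cleaned = normalize_whitespace("a\u00a0b\u3000c\td\n")
```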
The newline character is used to delimit lines of text. In particular,
the EOS token is appended to a line of text if a newline character is
encountered. The only exception is if the number of tokens exceeds the
MAX\_LINE\_SIZE constant as defined in the `Dictionary
header `__.
This means if you have text that is not separated by newlines, such as
the `fil9 dataset `__, it will be
broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token is
not appended.
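The chunking behavior described above can be illustrated with a small pure-Python simulation. This is a simplified sketch and not the library's implementation: the function name ``chunk_tokens`` is hypothetical, the EOS string ``"</s>"`` is an assumption, and ``max_line_size`` is passed as a parameter set to a tiny value for readability.

```python
import re

EOS = "</s>"  # assumed EOS token string, for illustration only
# The ASCII separators listed above (newline is handled separately).
SEPARATORS = re.compile(r"[ \t\v\r\f\x00]+")


def chunk_tokens(text, max_line_size):
    """Split text into token chunks, appending EOS only at newlines."""
    chunks, current = [], []
    pieces = text.split("\n")
    for index, piece in enumerate(pieces):
        for token in filter(None, SEPARATORS.split(piece)):
            current.append(token)
            if len(current) == max_line_size:
                # Size limit reached: close the chunk without EOS.
                chunks.append(current)
                current = []
        if index < len(pieces) - 1:
            # A newline followed this piece: append EOS and close the chunk.
            current.append(EOS)
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_tokens("a b\nc d e f g", max_line_size=3)
```

Here the first chunk ends with the EOS token because a newline was seen, while the second chunk is cut at three tokens with no EOS token.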
The length of a token is the number of UTF-8 characters, counted by using
the `leading two bits of a
byte `__ to identify
`subsequent bytes of a multi-byte
sequence `__.
Knowing this is especially important when choosing the minimum and
maximum length of subwords. Further, the EOS token (as specified in the
`Dictionary
header `__)
is considered a character and will not be broken into subwords.
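The character count described above can be reproduced in Python 3 to check how long a token is in UTF-8 characters rather than bytes. This is a minimal sketch; the helper name ``utf8_char_count`` is hypothetical. A byte whose leading two bits are ``10`` is a continuation byte of a multi-byte sequence and is not counted.

```python
def utf8_char_count(token):
    """Count UTF-8 characters by skipping continuation bytes (10xxxxxx)."""
    data = token.encode("utf-8")
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


# "naïve" is 6 bytes in UTF-8 but 5 characters.
byte_length = len("naïve".encode("utf-8"))
char_length = utf8_char_count("naïve")
```

Because subword lengths are measured in characters, a minimum subword length of 3 covers three characters of a token even when each character occupies several bytes.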