
New release of python module

Summary:
This commit renames the python module from `fastText` to `fasttext`. It also defines the top-level functions from the unofficial api (`cbow`, `skipgram` and `supervised`) and makes them display an error message pointing users to the migration guide at https://fasttext.cc/blog/2019/06/25/blog-post.html

It also includes minor modifications to the FastText model object returned by the train functions, so that it behaves like the unofficial api's `WordVectorModel` and `SupervisedModel` classes.

Reviewed By: EdouardGrave

Differential Revision: D15770169

fbshipit-source-id: b13def267afd94b9a0f9fcf53a712a719a094f01
Onur Çelebi 6 years ago
parent commit: caec5be09d

+ 7 - 2
docs/faqs.md

@@ -7,7 +7,7 @@ title:FAQ
 
 FastText is a library for text classification and representation. It transforms text into continuous vectors that can later be used on any language related task. A few tutorials are available.
 
-## Why are my fastText models that big?
+## How can I reduce the size of my fastText models?
 
 fastText uses a hashtable for either word or character ngrams. The size of the hashtable directly impacts the size of a model. To reduce the size of the model, it is possible to reduce the size of this table with the option '-hash'. For example, a good value is 20000. Another option that greatly impacts the size of a model is the size of the vectors (-dim). This dimension can be reduced to save space, but this can significantly impact performance. If that still produces a model that is too big, one can further reduce the size of a trained model with the quantization option.
 ```bash
@@ -37,7 +37,8 @@ Please note that one of the goal of fastText is to be an efficient CPU tool, all
 
 ## Can I use fastText with python? Or other languages?
 
-There are few unofficial wrappers for python or lua available on github.
+[Python is officially supported](/docs/en/support.html#building-fasttext-python-module).
+There are a few unofficial wrappers for javascript, lua and other languages available on github.
 
 ## Can I use fastText with continuous data?
 
@@ -56,3 +57,7 @@ Try a newer version of your compiler. We try to maintain compatibility with olde
 
 ## How do I run fastText in a fully reproducible way? Each time I run it I get different results.
 If you run fastText multiple times you'll obtain slightly different results each time due to the optimization algorithm (asynchronous stochastic gradient descent, or Hogwild). If you need to get the same results (e.g. to compare different sets of input parameters) you have to set the 'thread' parameter to 1. In this way you'll get exactly the same performance at each run (with the same input parameters).
+
+
+## Why do I get a probability of 1.00001?
+This is a known rounding issue. You can consider it as 1.0.

+ 314 - 0
docs/python-module.md

@@ -0,0 +1,314 @@
+---
+id: python-module
+title: Python module
+---
+
+In this document we present how to use fastText in python.
+
+## Table of contents
+
+* [Requirements](#requirements)
+* [Installation](#installation)
+* [Usage overview](#usage-overview)
+   * [Word representation model](#word-representation-model)
+   * [Text classification model](#text-classification-model)
+   * [IMPORTANT: Preprocessing data / encoding conventions](#important-preprocessing-data-encoding-conventions)
+   * [More examples](#more-examples)
+* [API](#api)
+   * [`train_unsupervised` parameters](#train_unsupervised-parameters)
+   * [`train_supervised` parameters](#train_supervised-parameters)
+   * [`model` object](#model-object)
+
+
+# Requirements
+
+[fastText](https://fasttext.cc/) builds on modern Mac OS and Linux distributions.
+Since it uses C\++11 features, it requires a compiler with good C++11 support. You will need [Python](https://www.python.org/) (version 2.7 or ≥ 3.4), [NumPy](http://www.numpy.org/) & [SciPy](https://www.scipy.org/) and [pybind11](https://github.com/pybind/pybind11).
+
+
+# Installation
+
+To install the latest release, you can do :
+```bash
+$ pip install fasttext
+```
+
+or, to get the latest development version of fasttext, you can install from our github repository :
+```bash
+$ git clone https://github.com/facebookresearch/fastText.git
+$ cd fastText
+$ sudo pip install .
+$ # or :
+$ sudo python setup.py install
+```
+
+# Usage overview
+
+
+## Word representation model
+
+In order to learn word vectors, as [described here](/docs/en/references.html#enriching-word-vectors-with-subword-information), we can use `fasttext.train_unsupervised` function like this:
+
+
+```py
+import fasttext
+
+# Skipgram model :
+model = fasttext.train_unsupervised('data.txt', model='skipgram')
+
+# or, cbow model :
+model = fasttext.train_unsupervised('data.txt', model='cbow')
+
+```
+
+where `data.txt` is a training file containing utf-8 encoded text.
+
+
+The returned `model` object represents your learned model, and you can use it to retrieve information.
+
+```py
+print(model.words)   # list of words in dictionary
+print(model['king']) # get the vector of the word 'king'
+```
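Once you have vectors in hand, you can compare them yourself. As an illustration only (the vectors below are made up; a real model returns `dim`-dimensional arrays from `model.get_word_vector`), here is a minimal cosine-similarity sketch:

```python
import math

def cosine_similarity(u, v):
    # dot(u, v) / (|u| * |v|); returns 0.0 for zero vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Illustrative vectors only; real ones come from model.get_word_vector(word)
king = [0.5, 0.7, 0.1]
queen = [0.45, 0.72, 0.05]
print(cosine_similarity(king, queen))
```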
+
+
+### Saving and loading a model object
+
+You can save your trained model object by calling the function `save_model`.
+```py
+model.save_model("model_filename.bin")
+```
+
+and load it back later with the function `load_model`:
+```py
+model = fasttext.load_model("model_filename.bin")
+```
+
+For more information about word representation usage of fasttext, you can refer to our [word representations tutorial](/docs/en/unsupervised-tutorial.html).
+
+
+## Text classification model
+
+In order to train a text classifier using the method [described here](/docs/en/references.html#bag-of-tricks-for-efficient-text-classification), we can use `fasttext.train_supervised` function like this:
+
+
+```py
+import fasttext
+
+model = fasttext.train_supervised('data.train.txt')
+```
+
+where `data.train.txt` is a text file containing a training sentence per line along with the labels. By default, we assume that labels are words prefixed by the string `__label__`.
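To make the expected file format concrete, here is a stand-alone sketch (the example lines and the `split_labels` helper are invented for illustration; fastText itself does this parsing internally):

```python
# One training example per line: leading `__label__`-prefixed labels, then the text.
lines = [
    "__label__baking Which baking dish is best to bake a banana bread ?",
    "__label__equipment __label__cleaning Why not put knives in the dishwasher ?",
]

def split_labels(line, prefix="__label__"):
    # Separate label tokens from the rest of the sentence
    tokens = line.split()
    labels = [t for t in tokens if t.startswith(prefix)]
    words = [t for t in tokens if not t.startswith(prefix)]
    return labels, " ".join(words)

for line in lines:
    print(split_labels(line))
```

Note that an example may carry several labels, as the second line shows.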
+
+Once the model is trained, we can retrieve the list of words and labels:
+
+```py
+print(model.words)
+print(model.labels)
+```
+
+To evaluate our model by computing the precision at 1 (P@1) and the recall on a test set, we use the `test` function:
+
+```py
+def print_results(N, p, r):
+    print("N\t" + str(N))
+    print("P@{}\t{:.3f}".format(1, p))
+    print("R@{}\t{:.3f}".format(1, r))
+
+print_results(*model.test('test.txt'))
+```
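For intuition about what `test` reports: P@1 is the fraction of examples whose top predicted label is correct, and recall is the fraction of true labels recovered among the top-k predictions. A stand-alone sketch of that computation (the predictions below are invented, not from a real model):

```python
def precision_recall_at_k(predictions, gold, k=1):
    # predictions: list of ranked label lists; gold: list of true-label sets
    correct = 0    # predicted labels that are true
    retrieved = 0  # labels predicted in total (at most k per example)
    relevant = 0   # true labels in total
    for preds, truth in zip(predictions, gold):
        top = preds[:k]
        correct += sum(1 for p in top if p in truth)
        retrieved += len(top)
        relevant += len(truth)
    return correct / retrieved, correct / relevant

preds = [["__label__baking"], ["__label__cooking"]]
gold = [{"__label__baking"}, {"__label__equipment"}]
p, r = precision_recall_at_k(preds, gold, k=1)
print(p, r)  # 0.5 0.5
```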
+
+We can also predict labels for a specific text :
+
+```py
+model.predict("Which baking dish is best to bake a banana bread ?")
+```
+
+By default, `predict` returns only one label : the one with the highest probability. You can also predict more than one label by specifying the parameter `k`:
+```py
+model.predict("Which baking dish is best to bake a banana bread ?", k=3)
+```
+
+If you want to predict more than one sentence you can pass an array of strings :
+
+```py
+model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)
+```
+
+
+Of course, you can also save and load a model to/from a file as [in the word representation usage](#saving-and-loading-a-model-object).
+
+For more information about text classification usage of fasttext, you can refer to our [text classification tutorial](/docs/en/supervised-tutorial.html).
+
+
+
+
+### Compress model files with quantization
+
+When you want to save a supervised model file, fastText can compress it in order to have a much smaller model file, sacrificing only a little bit of performance.
+
+```py
+# with the previously trained `model` object, call :
+model.quantize(input='data.train.txt', retrain=True)
+
+# then display results and save the new model :
+print_results(*model.test(valid_data))
+model.save_model("model_filename.ftz")
+```
+
+`model_filename.ftz` will have a much smaller size than `model_filename.bin`.
+
+For further reading on quantization, you can refer to [this paragraph from our blog post](/blog/2017/10/02/blog-post.html#model-compression).
+
+
+## IMPORTANT: Preprocessing data / encoding conventions
+
+In general it is important to properly preprocess your data. In particular our example scripts in the [root folder](https://github.com/facebookresearch/fastText) do this.
+
+fastText assumes UTF-8 encoded text. All text must be [unicode for Python2](https://docs.python.org/2/library/functions.html#unicode) and [str for Python3](https://docs.python.org/3.5/library/stdtypes.html#textseq). The passed text will be [encoded as UTF-8 by pybind11](https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions) before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using [iconv](https://en.wikipedia.org/wiki/Iconv).
+
+fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.
+
+* space
+* tab
+* vertical tab
+* carriage return
+* formfeed
+* the null character
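A rough pure-Python approximation of this splitting rule (the real tokenizer lives in the C++ `Dictionary` class; this sketch only mirrors the byte set listed above):

```python
# The six token-separator bytes listed above
SEPARATORS = {" ", "\t", "\v", "\r", "\f", "\0"}

def naive_tokenize(line):
    # Split on the separator set; a real run also appends the EOS token
    tokens, current = [], []
    for ch in line:
        if ch in SEPARATORS:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

print(naive_tokenize("hello\tworld  again"))  # ['hello', 'world', 'again']
```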
+
+The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX\_LINE\_SIZE constant as defined in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h). This means that if you have text that is not separated by newlines, such as the [fil9 dataset](http://mattmahoney.net/dc/textdata), it will be broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token is not appended.
+
+The length of a token is the number of UTF-8 characters, obtained by considering the [leading two bits of a byte](https://en.wikipedia.org/wiki/UTF-8#Description) to identify [subsequent bytes of a multi-byte sequence](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc). Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h)) is considered a character and will not be broken into subwords.
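That character count can be reproduced in Python by skipping UTF-8 continuation bytes, i.e. bytes whose top two bits are `10`; a small sketch:

```python
def utf8_length(token):
    # Count UTF-8 characters: continuation bytes look like 10xxxxxx,
    # so count only the bytes that do NOT match that pattern.
    data = token.encode("utf-8")
    return sum(1 for b in data if (b & 0xC0) != 0x80)

print(utf8_length("cafe"))  # 4
print(utf8_length("café"))  # 4 ('é' is two bytes but one character)
```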
+
+## More examples
+
+In order to have a better knowledge of fastText models, please consider the main [README](https://github.com/facebookresearch/fastText/blob/master/README.md) and in particular [the tutorials on our website](https://fasttext.cc/docs/en/supervised-tutorial.html).
+
+You can find further python examples in [the doc folder](https://github.com/facebookresearch/fastText/tree/master/python/doc/examples).
+
+As with any package you can get help on any Python function using the help function.
+
+For example
+
+```
++>>> import fasttext
++>>> help(fasttext.FastText)
+
+Help on module fasttext.FastText in fasttext:
+
+NAME
+    fasttext.FastText
+
+DESCRIPTION
+    # Copyright (c) 2017-present, Facebook, Inc.
+    # All rights reserved.
+    #
+    # This source code is licensed under the MIT license found in the
+    # LICENSE file in the root directory of this source tree.
+
+FUNCTIONS
+    load_model(path)
+        Load a model given a filepath and return a model object.
+
+    tokenize(text)
+        Given a string of text, tokenize it and return a list of tokens
+[...]
+```
+
+
+# API
+
+
+## `train_unsupervised` parameters
+
+```python
+    input             # training file path (required)
+    model             # unsupervised fasttext model {cbow, skipgram} [skipgram]
+    lr                # learning rate [0.05]
+    dim               # size of word vectors [100]
+    ws                # size of the context window [5]
+    epoch             # number of epochs [5]
+    minCount          # minimal number of word occurrences [5]
+    minn              # min length of char ngram [3]
+    maxn              # max length of char ngram [6]
+    neg               # number of negatives sampled [5]
+    wordNgrams        # max length of word ngram [1]
+    loss              # loss function {ns, hs, softmax, ova} [ns]
+    bucket            # number of buckets [2000000]
+    thread            # number of threads [number of cpus]
+    lrUpdateRate      # change the rate of updates for the learning rate [100]
+    t                 # sampling threshold [0.0001]
+    verbose           # verbose [2]
+```
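To illustrate what `minn` and `maxn` control: fastText wraps each word in `<` and `>` boundary markers and extracts every character n-gram whose length falls in that range. A hedged sketch of the extraction (the real implementation, including how the full word token itself is handled, is in `dictionary.cc`):

```python
def char_ngrams(word, minn=3, maxn=6):
    # Wrap the word in boundary symbols, as fastText does internally
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams

print(char_ngrams("king", minn=3, maxn=4))
# ['<ki', 'kin', 'ing', 'ng>', '<kin', 'king', 'ing>']
```

The boundary symbols are why `<ki` and `ng>` are distinct from n-grams occurring in the middle of other words.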
+
+## `train_supervised` parameters
+
+```python
+    input             # training file path (required)
+    lr                # learning rate [0.1]
+    dim               # size of word vectors [100]
+    ws                # size of the context window [5]
+    epoch             # number of epochs [5]
+    minCount          # minimal number of word occurrences [1]
+    minCountLabel     # minimal number of label occurrences [1]
+    minn              # min length of char ngram [0]
+    maxn              # max length of char ngram [0]
+    neg               # number of negatives sampled [5]
+    wordNgrams        # max length of word ngram [1]
+    loss              # loss function {ns, hs, softmax, ova} [softmax]
+    bucket            # number of buckets [2000000]
+    thread            # number of threads [number of cpus]
+    lrUpdateRate      # change the rate of updates for the learning rate [100]
+    t                 # sampling threshold [0.0001]
+    label             # label prefix ['__label__']
+    verbose           # verbose [2]
+    pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
+```
+
+## `model` object
+
+`train_supervised`, `train_unsupervised` and `load_model` functions return an instance of the `_FastText` class, which we generally call the `model` object.
+
+This object exposes these training arguments as properties : `lr`, `dim`, `ws`, `epoch`, `minCount`, `minCountLabel`, `minn`, `maxn`, `neg`, `wordNgrams`, `loss`, `bucket`, `thread`, `lrUpdateRate`, `t`, `label`, `verbose`, `pretrainedVectors`. So `model.wordNgrams` will give you the max length of word ngram used for training this model.
+
+In addition, the object exposes several functions :
+
+```python
+    get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
+                            # This is equivalent to `dim` property.
+    get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
+    get_input_matrix        # Get a copy of the full input matrix of a Model.
+    get_labels              # Get the entire list of labels of the dictionary
+                            # This is equivalent to `labels` property.
+    get_line                # Split a line of text into words and labels.
+    get_output_matrix       # Get a copy of the full output matrix of a Model.
+    get_sentence_vector     # Given a string, get a single vector representation. This function
+                            # assumes it is given a single line of text. We split words on
+                            # whitespace (space, newline, tab, vertical tab) and the control
+                            # characters carriage return, formfeed and the null character.
+    get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
+    get_subwords            # Given a word, get the subwords and their indices.
+    get_word_id             # Given a word, get the word id within the dictionary.
+    get_word_vector         # Get the vector representation of word.
+    get_words               # Get the entire list of words of the dictionary
+                            # This is equivalent to `words` property.
+    is_quantized            # whether the model has been quantized
+    predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
+    quantize                # Quantize the model, reducing its size and memory footprint.
+    save_model              # Save the model to the given path
+    test                    # Evaluate supervised model using file given by path
+    test_label              # Return the precision and recall score for each label.    
+```
+
+The properties `words`, `labels` return the words and labels from the dictionary :
+```py
+model.words         # equivalent to model.get_words()
+model.labels        # equivalent to model.get_labels()
+```
+
+The object overrides `__getitem__` and `__contains__` functions in order to return the representation of a word and to check if a word is in the vocabulary.
+
+```py
+model['king']       # equivalent to model.get_word_vector('king')
+'king' in model     # equivalent to `'king' in model.get_words()`
+```
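Those two overrides can be emulated with a tiny stand-in class (purely illustrative; the real `_FastText` object delegates these lookups to the underlying C++ model):

```python
class TinyModel:
    # A stand-in mimicking `model['word']` and `'word' in model`
    def __init__(self, vectors):
        self._vectors = vectors  # word -> vector mapping

    def get_word_vector(self, word):
        return self._vectors[word]

    def get_words(self):
        return list(self._vectors)

    def __getitem__(self, word):
        # model['king'] -> vector lookup
        return self.get_word_vector(word)

    def __contains__(self, word):
        # 'king' in model -> vocabulary membership
        return word in self._vectors

m = TinyModel({"king": [0.1, 0.2]})
print(m["king"])     # [0.1, 0.2]
print("king" in m)   # True
print("queen" in m)  # False
```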

+ 23 - 1
docs/support.md

@@ -21,7 +21,7 @@ For the word-similarity evaluation script you will need:
 * python 2.6 or newer
 * numpy & scipy
 
-## Building fastText
+## Building fastText as a command line tool
 
 In order to build `fastText`, use the following:
 
@@ -34,3 +34,25 @@ $ make
 This will produce object files for all the classes as well as the main binary `fasttext`.
 If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).
 
+
+## Building `fasttext` python module
+
+In order to build `fasttext` module for python, use the following:
+
+```bash
+$ git clone https://github.com/facebookresearch/fastText.git
+$ cd fastText
+$ sudo pip install .
+$ # or :
+$ sudo python setup.py install
+```
+
+Then verify the installation went well :
+```bash
+$ python
Python 2.7.15 (default, May  1 2018, 18:37:05)
+Type "help", "copyright", "credits" or "license" for more information.
+>>> import fasttext
+>>>
+```
+If you don't see any error message, the installation was successful.

+ 265 - 39
python/README.md

@@ -1,64 +1,202 @@
-# fastText python
+# fastText [![CircleCI](https://circleci.com/gh/facebookresearch/fastText/tree/master.svg?style=svg)](https://circleci.com/gh/facebookresearch/fastText/tree/master)
 
 [fastText](https://fasttext.cc/) is a library for efficient learning of word representations and sentence classification.
-This folder contains python bindings of the fastText library.
 
-## Requirements
+In this document we present how to use fastText in python.
 
-[fastText](https://fasttext.cc/) builds on modern Mac OS and Linux distributions.
-Since it uses C\++11 features, it requires a compiler with good C++11 support.
-These include :
+## Table of contents
+
+* [Requirements](#requirements)
+* [Installation](#installation)
+* [Usage overview](#usage-overview)
+   * [Word representation model](#word-representation-model)
+   * [Text classification model](#text-classification-model)
+   * [IMPORTANT: Preprocessing data / encoding conventions](#important-preprocessing-data-encoding-conventions)
+   * [More examples](#more-examples)
+* [API](#api)
+   * [`train_unsupervised` parameters](#train_unsupervised-parameters)
+   * [`train_supervised` parameters](#train_supervised-parameters)
+   * [`model` object](#model-object)
 
-* (gcc-4.8 or newer) or (clang-3.3 or newer)
 
-You will need
+# Requirements
 
-* [Python](https://www.python.org/) version 2.7 or >=3.4
-* [NumPy](http://www.numpy.org/) & [SciPy](https://www.scipy.org/)
-* [pybind11](https://github.com/pybind/pybind11)
+[fastText](https://fasttext.cc/) builds on modern Mac OS and Linux distributions.
+Since it uses C\++11 features, it requires a compiler with good C++11 support. You will need [Python](https://www.python.org/) (version 2.7 or ≥ 3.4), [NumPy](http://www.numpy.org/) & [SciPy](https://www.scipy.org/) and [pybind11](https://github.com/pybind/pybind11).
 
-## Building fastText for python
 
-The easiest way to install fastText is to use [pip](https://pip.pypa.io/en/stable/).
+# Installation
 
+To install the latest release, you can do :
+```bash
+$ pip install fasttext
 ```
+
+or, to get the latest development version of fasttext, you can install from our github repository :
+```bash
 $ git clone https://github.com/facebookresearch/fastText.git
 $ cd fastText
-$ pip install .
+$ sudo pip install .
+$ # or :
+$ sudo python setup.py install
 ```
 
-Alternatively you can also install fastText using setuptools.
+# Usage overview
+
+
+## Word representation model
+
+In order to learn word vectors, as [described here](https://fasttext.cc/docs/en/references.html#enriching-word-vectors-with-subword-information), we can use `fasttext.train_unsupervised` function like this:
+
+
+```py
+import fasttext
+
+# Skipgram model :
+model = fasttext.train_unsupervised('data.txt', model='skipgram')
+
+# or, cbow model :
+model = fasttext.train_unsupervised('data.txt', model='cbow')
 
 ```
-$ git clone https://github.com/facebookresearch/fastText.git
-$ cd fastText
-$ python setup.py install
+
+where `data.txt` is a training file containing utf-8 encoded text.
+
+
+The returned `model` object represents your learned model, and you can use it to retrieve information.
+
+```py
+print(model.words)   # list of words in dictionary
+print(model['king']) # get the vector of the word 'king'
 ```
 
-Now you can import this library with
 
+### Saving and loading a model object
+
+You can save your trained model object by calling the function `save_model`.
+```py
+model.save_model("model_filename.bin")
+```
+
+and load it back later with the function `load_model`:
+```py
+model = fasttext.load_model("model_filename.bin")
 ```
-import fastText
+
+For more information about word representation usage of fasttext, you can refer to our [word representations tutorial](https://fasttext.cc/docs/en/unsupervised-tutorial.html).
+
+
+## Text classification model
+
+In order to train a text classifier using the method [described here](https://fasttext.cc/docs/en/references.html#bag-of-tricks-for-efficient-text-classification), we can use `fasttext.train_supervised` function like this:
+
+
+```py
+import fasttext
+
+model = fasttext.train_supervised('data.train.txt')
+```
+
+where `data.train.txt` is a text file containing a training sentence per line along with the labels. By default, we assume that labels are words prefixed by the string `__label__`.
+
+Once the model is trained, we can retrieve the list of words and labels:
+
+```py
+print(model.words)
+print(model.labels)
+```
+
+To evaluate our model by computing the precision at 1 (P@1) and the recall on a test set, we use the `test` function:
+
+```py
+def print_results(N, p, r):
+    print("N\t" + str(N))
+    print("P@{}\t{:.3f}".format(1, p))
+    print("R@{}\t{:.3f}".format(1, r))
+
+print_results(*model.test('test.txt'))
+```
+
+We can also predict labels for a specific text :
+
+```py
+model.predict("Which baking dish is best to bake a banana bread ?")
+```
+
+By default, `predict` returns only one label : the one with the highest probability. You can also predict more than one label by specifying the parameter `k`:
+```py
+model.predict("Which baking dish is best to bake a banana bread ?", k=3)
+```
+
+If you want to predict more than one sentence you can pass an array of strings :
+
+```py
+model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)
+```
+
+
+Of course, you can also save and load a model to/from a file as [in the word representation usage](#saving-and-loading-a-model-object).
+
+For more information about text classification usage of fasttext, you can refer to our [text classification tutorial](https://fasttext.cc/docs/en/supervised-tutorial.html).
+
+
+
+
+### Compress model files with quantization
+
+When you want to save a supervised model file, fastText can compress it in order to have a much smaller model file, sacrificing only a little bit of performance.
+
+```py
+# with the previously trained `model` object, call :
+model.quantize(input='data.train.txt', retrain=True)
+
+# then display results and save the new model :
+print_results(*model.test(valid_data))
+model.save_model("model_filename.ftz")
 ```
 
-## Examples
+`model_filename.ftz` will have a much smaller size than `model_filename.bin`.
+
+For further reading on quantization, you can refer to [this paragraph from our blog post](https://fasttext.cc/blog/2017/10/02/blog-post.html#model-compression).
+
+
+## IMPORTANT: Preprocessing data / encoding conventions
+
+In general it is important to properly preprocess your data. In particular our example scripts in the [root folder](https://github.com/facebookresearch/fastText) do this.
+
+fastText assumes UTF-8 encoded text. All text must be [unicode for Python2](https://docs.python.org/2/library/functions.html#unicode) and [str for Python3](https://docs.python.org/3.5/library/stdtypes.html#textseq). The passed text will be [encoded as UTF-8 by pybind11](https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions) before being passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using [iconv](https://en.wikipedia.org/wiki/Iconv).
+
+fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advise the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropriate.
+
+* space
+* tab
+* vertical tab
+* carriage return
+* formfeed
+* the null character
 
-In general it is assumed that the reader already has good knowledge of fastText. For this consider the main [README](https://github.com/facebookresearch/fastText/blob/master/README.md) and in particular [the tutorials on our website](https://fasttext.cc/docs/en/supervised-tutorial.html).
+The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX\_LINE\_SIZE constant as defined in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h). This means that if you have text that is not separated by newlines, such as the [fil9 dataset](http://mattmahoney.net/dc/textdata), it will be broken into chunks of MAX\_LINE\_SIZE tokens and the EOS token is not appended.
 
-We recommend you look at the [examples within the doc folder](https://github.com/facebookresearch/fastText/tree/master/python/doc/examples).
+The length of a token is the number of UTF-8 characters, obtained by considering the [leading two bits of a byte](https://en.wikipedia.org/wiki/UTF-8#Description) to identify [subsequent bytes of a multi-byte sequence](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc). Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h)) is considered a character and will not be broken into subwords.
+
+## More examples
+
+In order to have a better knowledge of fastText models, please consider the main [README](https://github.com/facebookresearch/fastText/blob/master/README.md) and in particular [the tutorials on our website](https://fasttext.cc/docs/en/supervised-tutorial.html).
+
+You can find further python examples in [the doc folder](https://github.com/facebookresearch/fastText/tree/master/python/doc/examples).
 
 As with any package you can get help on any Python function using the help function.
 
 For example
 
 ```
-+>>> import fastText
-+>>> help(fastText.FastText)
+>>> import fasttext
+>>> help(fasttext.FastText)
 
-Help on module fastText.FastText in fastText:
+Help on module fasttext.FastText in fasttext:
 
 NAME
-    fastText.FastText
+    fasttext.FastText
 
 DESCRIPTION
     # Copyright (c) 2017-present, Facebook, Inc.
@@ -76,21 +214,109 @@ FUNCTIONS
 [...]
 ```
 
-## IMPORTANT: Preprocessing data / enconding conventions
 
-In general it is important to properly preprocess your data. In particular our example scripts in the [root folder](https://github.com/facebookresearch/fastText) do this.
+# API
 
-fastText assumes UTF-8 encoded text. All text must be [unicode for Python2](https://docs.python.org/2/library/functions.html#unicode) and [str for Python3](https://docs.python.org/3.5/library/stdtypes.html#textseq). The passed text will be [encoded as UTF-8 by pybind11](https://pybind11.readthedocs.io/en/master/advanced/cast/strings.html?highlight=utf-8#strings-bytes-and-unicode-conversions) before passed to the fastText C++ library. This means it is important to use UTF-8 encoded text when building a model. On Unix-like systems you can convert text using [iconv](https://en.wikipedia.org/wiki/Iconv).
 
-fastText will tokenize (split text into pieces) based on the following ASCII characters (bytes). In particular, it is not aware of UTF-8 whitespace. We advice the user to convert UTF-8 whitespace / word boundaries into one of the following symbols as appropiate.
+## `train_unsupervised` parameters
 
-* space
-* tab
-* vertical tab
-* carriage return
-* formfeed
-* the null character
+```python
+    input             # training file path (required)
+    model             # unsupervised fasttext model {cbow, skipgram} [skipgram]
+    lr                # learning rate [0.05]
+    dim               # size of word vectors [100]
+    ws                # size of the context window [5]
+    epoch             # number of epochs [5]
+    minCount          # minimal number of word occurrences [5]
+    minn              # min length of char ngram [3]
+    maxn              # max length of char ngram [6]
+    neg               # number of negatives sampled [5]
+    wordNgrams        # max length of word ngram [1]
+    loss              # loss function {ns, hs, softmax, ova} [ns]
+    bucket            # number of buckets [2000000]
+    thread            # number of threads [number of cpus]
+    lrUpdateRate      # change the rate of updates for the learning rate [100]
+    t                 # sampling threshold [0.0001]
+    verbose           # verbose [2]
+```
 
-The newline character is used to delimit lines of text. In particular, the EOS token is appended to a line of text if a newline character is encountered. The only exception is if the number of tokens exceeds the MAX\_LINE\_SIZE constant as defined in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h). This means if you have text that is not separate by newlines, such as the [fil9 dataset](http://mattmahoney.net/dc/textdata), it will be broken into chunks with MAX\_LINE\_SIZE of tokens and the EOS token is not appended.
+## `train_supervised` parameters
 
-The length of a token is the number of UTF-8 characters by considering the [leading two bits of a byte](https://en.wikipedia.org/wiki/UTF-8#Description) to identify [subsequent bytes of a multi-byte sequence](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc). Knowing this is especially important when choosing the minimum and maximum length of subwords. Further, the EOS token (as specified in the [Dictionary header](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h)) is considered a character and will not be broken into subwords.
+```python
+    input             # training file path (required)
+    lr                # learning rate [0.1]
+    dim               # size of word vectors [100]
+    ws                # size of the context window [5]
+    epoch             # number of epochs [5]
+    minCount          # minimal number of word occurrences [1]
+    minCountLabel     # minimal number of label occurrences [1]
+    minn              # min length of char ngram [0]
+    maxn              # max length of char ngram [0]
+    neg               # number of negatives sampled [5]
+    wordNgrams        # max length of word ngram [1]
+    loss              # loss function {ns, hs, softmax, ova} [softmax]
+    bucket            # number of buckets [2000000]
+    thread            # number of threads [number of cpus]
+    lrUpdateRate      # change the rate of updates for the learning rate [100]
+    t                 # sampling threshold [0.0001]
+    label             # label prefix ['__label__']
+    verbose           # verbose [2]
+    pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
+```
+
+## `model` object
+
+`train_supervised`, `train_unsupervised` and `load_model` functions return an instance of the `_FastText` class, which we generally call the `model` object.
+
+This object exposes these training arguments as properties: `lr`, `dim`, `ws`, `epoch`, `minCount`, `minCountLabel`, `minn`, `maxn`, `neg`, `wordNgrams`, `loss`, `bucket`, `thread`, `lrUpdateRate`, `t`, `label`, `verbose`, `pretrainedVectors`. For example, `model.wordNgrams` gives the max length of word ngrams used to train this model.
+
+In addition, the object exposes several functions:
+
+```python
+    get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
+                            # This is equivalent to `dim` property.
+    get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
+    get_input_matrix        # Get a copy of the full input matrix of a Model.
+    get_labels              # Get the entire list of labels of the dictionary
+                            # This is equivalent to `labels` property.
+    get_line                # Split a line of text into words and labels.
+    get_output_matrix       # Get a copy of the full output matrix of a Model.
+    get_sentence_vector     # Given a string, get a single vector representation. This function
+                            # assumes it is given a single line of text. We split words on
+                            # whitespace (space, newline, tab, vertical tab) and the control
+                            # characters carriage return, formfeed and the null character.
+    get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
+    get_subwords            # Given a word, get the subwords and their indices.
+    get_word_id             # Given a word, get the word id within the dictionary.
+    get_word_vector         # Get the vector representation of word.
+    get_words               # Get the entire list of words of the dictionary
+                            # This is equivalent to `words` property.
+    is_quantized            # whether the model has been quantized
+    predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
+    quantize                # Quantize the model, reducing its size and memory footprint.
+    save_model              # Save the model to the given path
+    test                    # Evaluate supervised model using file given by path
+    test_label              # Return the precision and recall score for each label.
+```
+
+The properties `words` and `labels` return the words and labels from the dictionary:
+```py
+model.words         # equivalent to model.get_words()
+model.labels        # equivalent to model.get_labels()
+```
+
+The object overrides the `__getitem__` and `__contains__` functions in order to return the representation of a word and to check whether a word is in the vocabulary.
+
+```py
+model['king']       # equivalent to model.get_word_vector('king')
+'king' in model     # equivalent to `'king' in model.get_words()`
+```
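As a minimal sketch of how these two overrides work (a hypothetical `MiniModel` class for illustration, not part of fasttext; the vectors are made up):

```python
# Minimal sketch of the __getitem__ / __contains__ overrides described above.
class MiniModel:
    def __init__(self, vectors):
        self._vectors = vectors  # word -> vector mapping

    def get_words(self):
        return list(self._vectors)

    def get_word_vector(self, word):
        return self._vectors[word]

    def __getitem__(self, word):
        # model['king'] delegates to get_word_vector
        return self.get_word_vector(word)

    def __contains__(self, word):
        # 'king' in model checks the vocabulary
        return word in self.get_words()

model = MiniModel({'king': [0.1, 0.2], 'queen': [0.3, 0.4]})
print(model['king'])        # [0.1, 0.2]
print('king' in model)      # True
print('unknown' in model)   # False
```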
+
+
+Join the fastText community
+---------------------------
+
+- [Facebook page](https://www.facebook.com/groups/1174547215919768)
+- [Stack overflow](https://stackoverflow.com/questions/tagged/fasttext)
+- [Google group](https://groups.google.com/forum/#!forum/fasttext-library)
+- [GitHub](https://github.com/facebookresearch/fastText)

+ 327 - 53
python/README.rst

@@ -1,88 +1,196 @@
-fastText
-========
+fastText |CircleCI|
+===================
 
 `fastText <https://fasttext.cc/>`__ is a library for efficient learning
 of word representations and sentence classification.
 
+This document describes how to use fastText in Python.
+
+Table of contents
+-----------------
+
+-  `Requirements <#requirements>`__
+-  `Installation <#installation>`__
+-  `Usage overview <#usage-overview>`__
+-  `Word representation model <#word-representation-model>`__
+-  `Text classification model <#text-classification-model>`__
+-  `IMPORTANT: Preprocessing data / encoding
+   conventions <#important-preprocessing-data-encoding-conventions>`__
+-  `More examples <#more-examples>`__
+-  `API <#api>`__
+-  `train_unsupervised parameters <#train_unsupervised-parameters>`__
+-  `train_supervised parameters <#train_supervised-parameters>`__
+-  `model object <#model-object>`__
+
 Requirements
-------------
+============
 
 `fastText <https://fasttext.cc/>`__ builds on modern Mac OS and Linux
 distributions. Since it uses C++11 features, it requires a compiler with
-good C++11 support. These include :
+good C++11 support. You will need `Python <https://www.python.org/>`__
+(version 2.7 or ≥ 3.4), `NumPy <http://www.numpy.org/>`__ &
+`SciPy <https://www.scipy.org/>`__ and
+`pybind11 <https://github.com/pybind/pybind11>`__.
 
--  (gcc-4.8 or newer) or (clang-3.3 or newer)
+Installation
+============
 
-You will need
+To install the latest release, you can do:
 
--  `Python <https://www.python.org/>`__ version 2.7 or >=3.4
--  `NumPy <http://www.numpy.org/>`__ &
-   `SciPy <https://www.scipy.org/>`__
--  `pybind11 <https://github.com/pybind/pybind11>`__
+.. code:: bash
 
-Building fastText
------------------
+    $ pip install fasttext
 
-The easiest way to get the latest version of `fastText is to use
-pip <https://pypi.python.org/pypi/fasttext>`__.
+or, to get the latest development version of fasttext, you can install
+from our GitHub repository:
 
-::
+.. code:: bash
 
-    $ pip install fasttext
+    $ git clone https://github.com/facebookresearch/fastText.git
+    $ cd fastText
+    $ sudo pip install .
+    $ # or :
+    $ sudo python setup.py install
 
-If you want to use the latest unstable release you will need to build
-from source using setup.py.
+Usage overview
+==============
 
-Now you can import this library with
+Word representation model
+-------------------------
 
-::
+In order to learn word vectors, as `described
+here <https://fasttext.cc/docs/en/references.html#enriching-word-vectors-with-subword-information>`__,
+we can use the ``fasttext.train_unsupervised`` function like this:
 
-    import fastText
+.. code:: py
 
-Examples
---------
+    import fasttext
 
-In general it is assumed that the reader already has good knowledge of
-fastText. For this consider the main
-`README <https://github.com/facebookresearch/fastText/blob/master/README.md>`__
-and in particular `the tutorials on our
-website <https://fasttext.cc/docs/en/supervised-tutorial.html>`__.
+    # Skipgram model :
+    model = fasttext.train_unsupervised('data.txt', model='skipgram')
 
-We recommend you look at the `examples within the doc
-folder <https://github.com/facebookresearch/fastText/tree/master/python/doc/examples>`__.
+    # or, cbow model :
+    model = fasttext.train_unsupervised('data.txt', model='cbow')
 
-As with any package you can get help on any Python function using the
-help function.
+where ``data.txt`` is a training file containing UTF-8 encoded text.
 
-For example
+The returned ``model`` object represents your learned model, and you can
+use it to retrieve information.
 
-::
+.. code:: py
 
-    +>>> import fastText
-    +>>> help(fastText.FastText)
+    print(model.words)   # list of words in dictionary
+    print(model['king']) # get the vector of the word 'king'
 
-    Help on module fastText.FastText in fastText:
+Saving and loading a model object
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-    NAME
-        fastText.FastText
+You can save your trained model object by calling the function
+``save_model``.
 
-    DESCRIPTION
-        # Copyright (c) 2017-present, Facebook, Inc.
-        # All rights reserved.
-        #
-        # This source code is licensed under the MIT license found in the
-        # LICENSE file in the root directory of this source tree.
+.. code:: py
 
-    FUNCTIONS
-        load_model(path)
-            Load a model given a filepath and return a model object.
+    model.save_model("model_filename.bin")
 
-        tokenize(text)
-            Given a string of text, tokenize it and return a list of tokens
-    [...]
+and retrieve it later with the ``load_model`` function:
+
+.. code:: py
+
+    model = fasttext.load_model("model_filename.bin")
+
+For more information about word representation usage of fasttext, you
+can refer to our `word representations
+tutorial <https://fasttext.cc/docs/en/unsupervised-tutorial.html>`__.
+
+Text classification model
+-------------------------
+
+In order to train a text classifier using the method `described
+here <https://fasttext.cc/docs/en/references.html#bag-of-tricks-for-efficient-text-classification>`__,
+we can use the ``fasttext.train_supervised`` function like this:
+
+.. code:: py
+
+    import fasttext
+
+    model = fasttext.train_supervised('data.train.txt')
+
+where ``data.train.txt`` is a text file containing a training sentence
+per line along with the labels. By default, we assume that labels are
+words prefixed by the string ``__label__``.
+
+Once the model is trained, we can retrieve the list of words and labels:
+
+.. code:: py
+
+    print(model.words)
+    print(model.labels)
+
+To evaluate our model by computing the precision at 1 (P@1) and the
+recall on a test set, we use the ``test`` function:
+
+.. code:: py
+
+    def print_results(N, p, r):
+        print("N\t" + str(N))
+        print("P@{}\t{:.3f}".format(1, p))
+        print("R@{}\t{:.3f}".format(1, r))
+
+    print_results(*model.test('test.txt'))
+
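For intuition, the metrics reported by ``test`` can be sketched as follows (an illustrative re-implementation, not fastText's actual code): precision at 1 is the fraction of top-1 predictions that are correct, and recall is the fraction of gold labels that were retrieved.

```python
# Illustrative sketch of P@1 and recall over a labeled test set.
def precision_recall_at_1(predictions, gold_labels):
    # predictions: top-1 predicted label per example
    # gold_labels: set of gold labels per example
    correct = sum(1 for pred, gold in zip(predictions, gold_labels)
                  if pred in gold)
    n_examples = len(predictions)
    n_gold = sum(len(gold) for gold in gold_labels)
    precision = correct / n_examples   # correct top-1 predictions
    recall = correct / n_gold          # gold labels retrieved
    return n_examples, precision, recall

N, p, r = precision_recall_at_1(
    ['__label__a', '__label__b', '__label__a'],
    [{'__label__a'}, {'__label__a'}, {'__label__a', '__label__c'}],
)
print(N, p, r)  # N=3, p=2/3, r=0.5
```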
+We can also predict labels for a specific text:
+
+.. code:: py
+
+    model.predict("Which baking dish is best to bake a banana bread ?")
 
-IMPORTANT: Preprocessing data / enconding conventions
------------------------------------------------------
+By default, ``predict`` returns only one label: the one with the
+highest probability. You can also predict more than one label by
+specifying the parameter ``k``:
+
+.. code:: py
+
+    model.predict("Which baking dish is best to bake a banana bread ?", k=3)
+
+If you want to predict more than one sentence, you can pass an array
+of strings:
+
+.. code:: py
+
+    model.predict(["Which baking dish is best to bake a banana bread ?", "Why not put knives in the dishwasher?"], k=3)
+
+Of course, you can also save and load a model to/from a file as `in the
+word representation usage <#saving-and-loading-a-model-object>`__.
+
+For more information about text classification usage of fasttext, you
+can refer to our `text classification
+tutorial <https://fasttext.cc/docs/en/supervised-tutorial.html>`__.
+
+Compress model files with quantization
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When you want to save a supervised model file, fastText can compress it
+in order to produce a much smaller model file, at the cost of only a
+little performance.
+
+.. code:: py
+
+    # with the previously trained `model` object, call :
+    model.quantize(input='data.train.txt', retrain=True)
+
+    # then display results and save the new model :
+    print_results(*model.test(valid_data))
+    model.save_model("model_filename.ftz")
+
+``model_filename.ftz`` will have a much smaller size than
+``model_filename.bin``.
+
+For further reading on quantization, you can refer to `this paragraph
+from our blog
+post <https://fasttext.cc/blog/2017/10/02/blog-post.html#model-compression>`__.
+
+IMPORTANT: Preprocessing data / encoding conventions
+----------------------------------------------------
 
 In general it is important to properly preprocess your data. In
 particular our example scripts in the `root
@@ -130,3 +238,169 @@ maximum length of subwords. Further, the EOS token (as specified in the
 `Dictionary
 header <https://github.com/facebookresearch/fastText/blob/master/src/dictionary.h>`__)
 is considered a character and will not be broken into subwords.
+
+More examples
+-------------
+
+In order to gain a better understanding of fastText models, please
+consult the main
+`README <https://github.com/facebookresearch/fastText/blob/master/README.md>`__
+and in particular `the tutorials on our
+website <https://fasttext.cc/docs/en/supervised-tutorial.html>`__.
+
+You can find further python examples in `the doc
+folder <https://github.com/facebookresearch/fastText/tree/master/python/doc/examples>`__.
+
+As with any package you can get help on any Python function using the
+help function.
+
+For example
+
+::
+
+    >>> import fasttext
+    >>> help(fasttext.FastText)
+
+    Help on module fasttext.FastText in fasttext:
+
+    NAME
+        fasttext.FastText
+
+    DESCRIPTION
+        # Copyright (c) 2017-present, Facebook, Inc.
+        # All rights reserved.
+        #
+        # This source code is licensed under the MIT license found in the
+        # LICENSE file in the root directory of this source tree.
+
+    FUNCTIONS
+        load_model(path)
+            Load a model given a filepath and return a model object.
+
+        tokenize(text)
+            Given a string of text, tokenize it and return a list of tokens
+    [...]
+
+API
+===
+
+``train_unsupervised`` parameters
+---------------------------------
+
+.. code:: python
+
+        input             # training file path (required)
+        model             # unsupervised fasttext model {cbow, skipgram} [skipgram]
+        lr                # learning rate [0.05]
+        dim               # size of word vectors [100]
+        ws                # size of the context window [5]
+        epoch             # number of epochs [5]
+        minCount          # minimal number of word occurrences [5]
+        minn              # min length of char ngram [3]
+        maxn              # max length of char ngram [6]
+        neg               # number of negatives sampled [5]
+        wordNgrams        # max length of word ngram [1]
+        loss              # loss function {ns, hs, softmax, ova} [ns]
+        bucket            # number of buckets [2000000]
+        thread            # number of threads [number of cpus]
+        lrUpdateRate      # change the rate of updates for the learning rate [100]
+        t                 # sampling threshold [0.0001]
+        verbose           # verbose [2]
+
+``train_supervised`` parameters
+-------------------------------
+
+.. code:: python
+
+        input             # training file path (required)
+        lr                # learning rate [0.1]
+        dim               # size of word vectors [100]
+        ws                # size of the context window [5]
+        epoch             # number of epochs [5]
+        minCount          # minimal number of word occurrences [1]
+        minCountLabel     # minimal number of label occurrences [1]
+        minn              # min length of char ngram [0]
+        maxn              # max length of char ngram [0]
+        neg               # number of negatives sampled [5]
+        wordNgrams        # max length of word ngram [1]
+        loss              # loss function {ns, hs, softmax, ova} [softmax]
+        bucket            # number of buckets [2000000]
+        thread            # number of threads [number of cpus]
+        lrUpdateRate      # change the rate of updates for the learning rate [100]
+        t                 # sampling threshold [0.0001]
+        label             # label prefix ['__label__']
+        verbose           # verbose [2]
+        pretrainedVectors # pretrained word vectors (.vec file) for supervised learning []
+
+``model`` object
+----------------
+
+``train_supervised``, ``train_unsupervised`` and ``load_model``
+functions return an instance of the ``_FastText`` class, which we
+generally call the ``model`` object.
+
+This object exposes these training arguments as properties: ``lr``,
+``dim``, ``ws``, ``epoch``, ``minCount``, ``minCountLabel``, ``minn``,
+``maxn``, ``neg``, ``wordNgrams``, ``loss``, ``bucket``, ``thread``,
+``lrUpdateRate``, ``t``, ``label``, ``verbose``, ``pretrainedVectors``.
+For example, ``model.wordNgrams`` gives the max length of word ngrams
+used to train this model.
+
+In addition, the object exposes several functions:
+
+.. code:: python
+
+        get_dimension           # Get the dimension (size) of a lookup vector (hidden layer).
+                                # This is equivalent to `dim` property.
+        get_input_vector        # Given an index, get the corresponding vector of the Input Matrix.
+        get_input_matrix        # Get a copy of the full input matrix of a Model.
+        get_labels              # Get the entire list of labels of the dictionary
+                                # This is equivalent to `labels` property.
+        get_line                # Split a line of text into words and labels.
+        get_output_matrix       # Get a copy of the full output matrix of a Model.
+        get_sentence_vector     # Given a string, get a single vector representation. This function
+                                # assumes it is given a single line of text. We split words on
+                                # whitespace (space, newline, tab, vertical tab) and the control
+                                # characters carriage return, formfeed and the null character.
+        get_subword_id          # Given a subword, return the index (within input matrix) it hashes to.
+        get_subwords            # Given a word, get the subwords and their indices.
+        get_word_id             # Given a word, get the word id within the dictionary.
+        get_word_vector         # Get the vector representation of word.
+        get_words               # Get the entire list of words of the dictionary
+                                # This is equivalent to `words` property.
+        is_quantized            # whether the model has been quantized
+        predict                 # Given a string, get a list of labels and a list of corresponding probabilities.
+        quantize                # Quantize the model, reducing its size and memory footprint.
+        save_model              # Save the model to the given path
+        test                    # Evaluate supervised model using file given by path
+        test_label              # Return the precision and recall score for each label.
+
+The properties ``words`` and ``labels`` return the words and labels
+from the dictionary:
+
+.. code:: py
+
+    model.words         # equivalent to model.get_words()
+    model.labels        # equivalent to model.get_labels()
+
+The object overrides the ``__getitem__`` and ``__contains__`` functions
+in order to return the representation of a word and to check whether a
+word is in the vocabulary.
+
+.. code:: py
+
+    model['king']       # equivalent to model.get_word_vector('king')
+    'king' in model     # equivalent to `'king' in model.get_words()`
+
+Join the fastText community
+---------------------------
+
+-  `Facebook page <https://www.facebook.com/groups/1174547215919768>`__
+-  `Stack
+   overflow <https://stackoverflow.com/questions/tagged/fasttext>`__
+-  `Google
+   group <https://groups.google.com/forum/#!forum/fasttext-library>`__
+-  `GitHub <https://github.com/facebookresearch/fastText>`__
+
+.. |CircleCI| image:: https://circleci.com/gh/facebookresearch/fastText/tree/master.svg?style=svg
+   :target: https://circleci.com/gh/facebookresearch/fastText/tree/master

+ 2 - 2
python/benchmarks/get_word_vector.py

@@ -9,8 +9,8 @@ from __future__ import division
 from __future__ import print_function
 from __future__ import unicode_literals
 
-from fastText import load_model
-from fastText import tokenize
+from fasttext import load_model
+from fasttext import tokenize
 import sys
 import time
 import tempfile

+ 1 - 1
python/doc/examples/FastTextEmbeddingBag.py

@@ -20,7 +20,7 @@ import torch
 import random
 import string
 import time
-from fastText import load_model
+from fasttext import load_model
 from torch.autograd import Variable
 
 

+ 1 - 1
python/doc/examples/bin_to_vec.py

@@ -12,7 +12,7 @@ from __future__ import print_function
 from __future__ import unicode_literals
 from __future__ import division, absolute_import, print_function
 
-from fastText import load_model
+from fasttext import load_model
 import argparse
 import errno
 

+ 2 - 2
python/doc/examples/compute_accuracy.py

@@ -12,8 +12,8 @@ from __future__ import print_function
 from __future__ import unicode_literals
 from __future__ import division, absolute_import, print_function
 
-from fastText import load_model
-from fastText import util
+from fasttext import load_model
+from fasttext import util
 import argparse
 import numpy as np
 

+ 1 - 1
python/doc/examples/get_vocab.py

@@ -12,7 +12,7 @@ from __future__ import print_function
 from __future__ import unicode_literals
 from __future__ import division, absolute_import, print_function
 
-from fastText import load_model
+from fasttext import load_model
 import argparse
 import errno
 

+ 1 - 1
python/doc/examples/train_supervised.py

@@ -12,7 +12,7 @@ from __future__ import print_function
 from __future__ import unicode_literals
 
 import os
-from fastText import train_supervised
+from fasttext import train_supervised
 
 
 def print_results(N, p, r):

+ 1 - 1
python/doc/examples/train_unsupervised.py

@@ -12,7 +12,7 @@ from __future__ import print_function
 from __future__ import unicode_literals
 from __future__ import division, absolute_import, print_function
 
-from fastText import train_unsupervised
+from fasttext import train_unsupervised
 import numpy as np
 import os
 from scipy import stats

+ 125 - 58
python/fastText/FastText.py → python/fasttext_module/fasttext/FastText.py

@@ -12,6 +12,8 @@ from __future__ import unicode_literals
 import fasttext_pybind as fasttext
 import numpy as np
 import multiprocessing
+import sys
+from itertools import chain
 
 loss_name = fasttext.loss_name
 model_name = fasttext.model_name
@@ -20,7 +22,11 @@ BOW = "<"
 EOW = ">"
 
 
-class _FastText():
+def eprint(*args, **kwargs):
+    print(*args, file=sys.stderr, **kwargs)
+
+
+class _FastText(object):
     """
     This class defines the API to inspect models and should not be used to
     create objects. It will be returned by functions such as load_model or
@@ -31,10 +37,20 @@ class _FastText():
     strings are then encoded as UTF-8 and fed to the fastText C++ API.
     """
 
-    def __init__(self, model=None):
+    def __init__(self, model_path=None, args=None):
         self.f = fasttext.fasttext()
-        if model is not None:
-            self.f.loadModel(model)
+        if model_path is not None:
+            self.f.loadModel(model_path)
+        self._words = None
+        self._labels = None
+
+        if args:
+            arg_names = ['lr', 'dim', 'ws', 'epoch', 'minCount',
+                         'minCountLabel', 'minn', 'maxn', 'neg', 'wordNgrams',
+                         'loss', 'bucket', 'thread', 'lrUpdateRate', 't',
+                         'label', 'verbose', 'pretrainedVectors']
+            for arg_name in arg_names:
+                setattr(self, arg_name, getattr(args, arg_name))
 
     def is_quantized(self):
         return self.f.isQuant()
@@ -266,10 +282,23 @@ class _FastText():
             qnorm
         )
 
+    @property
+    def words(self):
+        if self._words is None:
+            self._words = self.get_words()
+        return self._words
+
+    @property
+    def labels(self):
+        if self._labels is None:
+            self._labels = self.get_labels()
+        return self._labels
 
-# TODO:
-# Not supported:
-# - pretrained vectors
+    def __getitem__(self, word):
+        return self.get_word_vector(word)
+
+    def __contains__(self, word):
+        return word in self.words
 
 
 def _parse_model_string(string):
@@ -317,30 +346,60 @@ def tokenize(text):
 
 def load_model(path):
     """Load a model given a filepath and return a model object."""
-    return _FastText(path)
-
-
-def train_supervised(
-    input,
-    lr=0.1,
-    dim=100,
-    ws=5,
-    epoch=5,
-    minCount=1,
-    minCountLabel=0,
-    minn=0,
-    maxn=0,
-    neg=5,
-    wordNgrams=1,
-    loss="softmax",
-    bucket=2000000,
-    thread=multiprocessing.cpu_count() - 1,
-    lrUpdateRate=100,
-    t=1e-4,
-    label="__label__",
-    verbose=2,
-    pretrainedVectors="",
-):
+    eprint("Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.")
+    return _FastText(model_path=path)
+
+
+unsupervised_default = {
+    'model' : "skipgram",
+    'lr' : 0.05,
+    'dim' : 100,
+    'ws' : 5,
+    'epoch' : 5,
+    'minCount' : 5,
+    'minCountLabel' : 0,
+    'minn' : 3,
+    'maxn' : 6,
+    'neg' : 5,
+    'wordNgrams' : 1,
+    'loss' : "ns",
+    'bucket' : 2000000,
+    'thread' : multiprocessing.cpu_count() - 1,
+    'lrUpdateRate' : 100,
+    't' : 1e-4,
+    'label' : "__label__",
+    'verbose' : 2,
+    'pretrainedVectors' : "",
+}
+
+
+def read_args(arg_list, arg_dict, arg_names, default_values):
+    param_map = {
+        'min_count' : 'minCount',
+        'word_ngrams' : 'wordNgrams',
+        'lr_update_rate' : 'lrUpdateRate',
+        'label_prefix' : 'label',
+        'pretrained_vectors' : 'pretrainedVectors'
+    }
+
+    ret = {}
+    for (arg_name, arg_value) in chain(zip(arg_names, arg_list), arg_dict.items()):
+        if arg_name in param_map:
+            arg_name = param_map[arg_name]
+        if arg_name not in arg_names:
+            raise TypeError("unexpected keyword argument '%s'" % arg_name)
+        if arg_name in ret:
+            raise TypeError("multiple values for argument '%s'" % arg_name)
+        ret[arg_name] = arg_value
+
+    for (arg_name, arg_value) in default_values.items():
+        if arg_name not in ret:
+            ret[arg_name] = arg_value
+
+    return ret
+
+
+def train_supervised(*kargs, **kwargs):
     """
     Train a supervised model and return a model object.
 
@@ -353,35 +412,27 @@ def train_supervised(
     example consult the example datasets which are part of the fastText
     repository such as the dataset pulled by classification-example.sh.
     """
-    model = "supervised"
-    a = _build_args(locals())
-    ft = _FastText()
+    supervised_default = unsupervised_default.copy()
+    supervised_default.update({
+        'lr' : 0.1,
+        'minCount' : 1,
+        'minn' : 0,
+        'maxn' : 0,
+        'loss' : "softmax",
+        'model' : "supervised"
+    })
+
+    arg_names = ['input', 'lr', 'dim', 'ws', 'epoch', 'minCount',
+        'minCountLabel', 'minn', 'maxn', 'neg', 'wordNgrams', 'loss', 'bucket',
+        'thread', 'lrUpdateRate', 't', 'label', 'verbose', 'pretrainedVectors']
+    params = read_args(kargs, kwargs, arg_names, supervised_default)
+    a = _build_args(params)
+    ft = _FastText(args=a)
     fasttext.train(ft.f, a)
     return ft
 
 
-def train_unsupervised(
-    input,
-    model="skipgram",
-    lr=0.05,
-    dim=100,
-    ws=5,
-    epoch=5,
-    minCount=5,
-    minCountLabel=0,
-    minn=3,
-    maxn=6,
-    neg=5,
-    wordNgrams=1,
-    loss="ns",
-    bucket=2000000,
-    thread=multiprocessing.cpu_count() -1,
-    lrUpdateRate=100,
-    t=1e-4,
-    label="__label__",
-    verbose=2,
-    pretrainedVectors="",
-):
+def train_unsupervised(*kargs, **kwargs):
     """
     Train an unsupervised model and return a model object.
 
@@ -395,7 +446,23 @@ def train_unsupervised(
     dataset pulled by the example script word-vector-example.sh, which is
     part of the fastText repository.
     """
-    a = _build_args(locals())
-    ft = _FastText()
+    arg_names = ['input', 'model', 'lr', 'dim', 'ws', 'epoch', 'minCount',
+        'minCountLabel', 'minn', 'maxn', 'neg', 'wordNgrams', 'loss', 'bucket',
+        'thread', 'lrUpdateRate', 't', 'label', 'verbose', 'pretrainedVectors']
+    params = read_args(kargs, kwargs, arg_names, unsupervised_default)
+    a = _build_args(params)
+    ft = _FastText(args=a)
     fasttext.train(ft.f, a)
     return ft
+
+
+def cbow(*kargs, **kwargs):
+    raise Exception("`cbow` is not supported any more. Please use `train_unsupervised` with model=`cbow`. For more information please refer to https://fasttext.cc/blog/2019/06/25/blog-post.html#2-you-were-using-the-unofficial-fasttext-module")
+
+
+def skipgram(*kargs, **kwargs):
+    raise Exception("`skipgram` is not supported any more. Please use `train_unsupervised` with model=`skipgram`. For more information please refer to https://fasttext.cc/blog/2019/06/25/blog-post.html#2-you-were-using-the-unofficial-fasttext-module")
+
+
+def supervised(*kargs, **kwargs):
+    raise Exception("`supervised` is not supported any more. Please use `train_supervised`. For more information please refer to https://fasttext.cc/blog/2019/06/25/blog-post.html#2-you-were-using-the-unofficial-fasttext-module")
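The `read_args` helper introduced in this hunk normalizes positional arguments, legacy snake_case keyword names and defaults into a single parameter dict. A simplified, self-contained sketch of that pattern (illustration only, with made-up parameter names; not the module's actual code):

```python
# Sketch of merging positional args, keyword args (with legacy aliases)
# and defaults into one parameter dict.
from itertools import chain

PARAM_MAP = {'min_count': 'minCount', 'word_ngrams': 'wordNgrams'}

def read_args(arg_list, arg_dict, arg_names, defaults):
    ret = {}
    for name, value in chain(zip(arg_names, arg_list), arg_dict.items()):
        name = PARAM_MAP.get(name, name)   # accept old snake_case names
        if name not in arg_names:
            raise TypeError("unexpected keyword argument '%s'" % name)
        if name in ret:
            raise TypeError("multiple values for argument '%s'" % name)
        ret[name] = value
    for name, value in defaults.items():   # fill in unset defaults
        ret.setdefault(name, value)
    return ret

params = read_args(['data.txt'], {'min_count': 2},
                   ['input', 'minCount', 'lr'],
                   {'minCount': 5, 'lr': 0.1})
print(params)  # {'input': 'data.txt', 'minCount': 2, 'lr': 0.1}
```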

+ 4 - 0
python/fastText/__init__.py → python/fasttext_module/fasttext/__init__.py

@@ -16,3 +16,7 @@ from .FastText import tokenize
 from .FastText import EOS
 from .FastText import BOW
 from .FastText import EOW
+
+from .FastText import cbow
+from .FastText import skipgram
+from .FastText import supervised

+ 0 - 0
python/fastText/pybind/fasttext_pybind.cc → python/fasttext_module/fasttext/pybind/fasttext_pybind.cc


+ 0 - 0
python/fastText/tests/__init__.py → python/fasttext_module/fasttext/tests/__init__.py


+ 1 - 1
python/fastText/tests/test_configurations.py → python/fasttext_module/fasttext/tests/test_configurations.py

@@ -14,7 +14,7 @@ import multiprocessing
 # This script represents a collection of integration tests
 # Each integration test comes with a full set of parameters,
 # a dataset, and expected metrics.
-# These configurations can be used by various fastText apis
+# These configurations can be used by various fastText APIs
 # to confirm some level of correctness.
 
 

+ 15 - 15
python/fastText/tests/test_script.py → python/fasttext_module/fasttext/tests/test_script.py

@@ -9,10 +9,10 @@ from __future__ import division
 from __future__ import print_function
 from __future__ import unicode_literals
 
-from fastText import train_supervised
-from fastText import train_unsupervised
-from fastText import util
-import fastText
+from fasttext import train_supervised
+from fasttext import train_unsupervised
+from fasttext import util
+import fasttext
 import os
 import subprocess
 import unittest
@@ -25,7 +25,7 @@ try:
     import unicode
 except ImportError:
     pass
-from fastText.tests.test_configurations import get_supervised_models
+from fasttext.tests.test_configurations import get_supervised_models
 
 
 def eprint(cls, *args, **kwargs):
@@ -290,11 +290,11 @@ class TestFastTextUnitPy(unittest.TestCase):
         words, freqs = f.get_words(include_freq=True)
         foundEOS = False
         for word, freq in zip(words, freqs):
-            if word == fastText.EOS:
+            if word == fasttext.EOS:
                 foundEOS = True
             else:
                 self.assertEqual(words_python[word], freq)
-        # EOS is special to fastText, but still part of the vocab
+        # EOS is special to fasttext, but still part of the vocab
         self.assertEqual(len(words_python), len(words) - 1)
         self.assertTrue(foundEOS)
 
@@ -316,16 +316,16 @@ class TestFastTextUnitPy(unittest.TestCase):
             f.get_subwords(w)
 
     def gen_test_tokenize(self, kwargs):
-        self.assertEqual(["asdf", "asdb"], fastText.tokenize("asdf asdb"))
-        self.assertEqual(["asdf"], fastText.tokenize("asdf"))
-        self.assertEqual([fastText.EOS], fastText.tokenize("\n"))
-        self.assertEqual(["asdf", fastText.EOS], fastText.tokenize("asdf\n"))
-        self.assertEqual([], fastText.tokenize(""))
-        self.assertEqual([], fastText.tokenize(" "))
+        self.assertEqual(["asdf", "asdb"], fasttext.tokenize("asdf asdb"))
+        self.assertEqual(["asdf"], fasttext.tokenize("asdf"))
+        self.assertEqual([fasttext.EOS], fasttext.tokenize("\n"))
+        self.assertEqual(["asdf", fasttext.EOS], fasttext.tokenize("asdf\n"))
+        self.assertEqual([], fasttext.tokenize(""))
+        self.assertEqual([], fasttext.tokenize(" "))
         # An empty string is not a token (it's just whitespace)
         # So the minimum length must be 1
         words = get_random_words(100, 1, 20)
-        self.assertEqual(words, fastText.tokenize(" ".join(words)))
+        self.assertEqual(words, fasttext.tokenize(" ".join(words)))
 
     def gen_test_unsupervised_dimension(self, kwargs):
         if "dim" in kwargs:
@@ -343,7 +343,7 @@ class TestFastTextUnitPy(unittest.TestCase):
         words += get_random_words(100, 1, 20)
         input_matrix = f.get_input_matrix()
         for word in words:
-            # Universal api to get word vector
+            # Universal API to get word vector
             vec1 = f.get_word_vector(word)
 
             # Build word vector from subwords

+ 0 - 0
python/fastText/util/__init__.py → python/fasttext_module/fasttext/util/__init__.py


+ 0 - 0
python/fastText/util/util.py → python/fasttext_module/fasttext/util/util.py


+ 2 - 2
runtests.py

@@ -19,8 +19,8 @@ from __future__ import unicode_literals
 
 import unittest
 import argparse
-from fastText.tests import gen_tests
-from fastText.tests import gen_unit_tests
+from fasttext.tests import gen_tests
+from fasttext.tests import gen_unit_tests
 
 
 def run_tests(tests):

+ 9 - 9
setup.py

@@ -20,7 +20,7 @@ import os
 import subprocess
 import platform
 
-__version__ = '0.8.22'
+__version__ = '0.9'
 FASTTEXT_SRC = "src"
 
 # Based on https://github.com/pybind/python_example
@@ -64,7 +64,7 @@ ext_modules = [
     Extension(
         str('fasttext_pybind'),
         [
-            str('python/fastText/pybind/fasttext_pybind.cc'),
+            str('python/fasttext_module/fasttext/pybind/fasttext_pybind.cc'),
         ] + fasttext_src_cc,
         include_dirs=[
             # Path to pybind11 headers
@@ -167,9 +167,9 @@ def _get_readme():
 setup(
     name='fasttext',
     version=__version__,
-    author='Christian Puhrsch',
-    author_email='cpuhrsch@fb.com',
-    description='fastText Python bindings',
+    author='Onur Celebi',
+    author_email='celebio@fb.com',
+    description='fasttext Python bindings',
     long_description=_get_readme(),
     ext_modules=ext_modules,
     url='https://github.com/facebookresearch/fastText',
@@ -193,10 +193,10 @@ setup(
     install_requires=['pybind11>=2.2', "setuptools >= 0.7.0", "numpy"],
     cmdclass={'build_ext': BuildExt},
     packages=[
-        str('fastText'),
-        str('fastText.util'),
-        str('fastText.tests'),
+        str('fasttext'),
+        str('fasttext.util'),
+        str('fasttext.tests'),
     ],
-    package_dir={str(''): str('python')},
+    package_dir={str(''): str('python/fasttext_module')},
     zip_safe=False,
 )

+ 168 - 0
website/blog/2019-06-25-blog-post.md

@@ -0,0 +1,168 @@
+---
+title: New release of python module
+author: Onur Çelebi
+authorURL: https://research.fb.com/people/celebi-onur/
+authorFBID: 663146146
+---
+
+Today, we are happy to release a new version of the fastText python library. The main goal of this release is to merge two existing python modules: the official `fastText` module which was available on our github repository and the unofficial `fasttext` module which was available on pypi.org. We hope that this new version will address the confusion due to the previous existence of two similar, but different, python modules.
+
+The new version of our library is now available on [pypi.org](https://pypi.org/project/fasttext/) as well as on our github repository, and you can find [an overview of its API here](/docs/en/python-module.html).
+
+
+
+fastText vs fasttext: what happened?
+----------------------------------
+There was ongoing confusion in our user community about the existence of both the `fastText` and `fasttext` modules.
+
+When fastText was first released in 2016, it was a command-line-only utility. Very soon, people wanted to use fastText's capabilities from python without having to call a binary for each action. In August 2016, [Bayu Aldi Yansyah](https://github.com/pyk), a developer outside of Facebook, published a python wrapper of fastText. His work was very helpful to a lot of people in our community, and he published his unofficial python library on pypi with the pretty straightforward module name `fasttext` (note the lowercase `t`).
+
+Later, our team began to work on an official python binding of fastText, which was published in the same github repository as the C++ source code. However, the module name for this official library was `fastText` (note the uppercase `T`).
+
+Last year, Bayu Aldi Yansyah gave us admin access to the pypi project so that we could merge the two libraries.
+
+To sum up, we ended up with two libraries that had:
+
+- almost the same name
+- different APIs
+- different versions
+- different ways to install
+
+That was a very confusing situation for the community.
+
+What actions did we take?
+--------------------------
+Today we are merging the two python libraries. We decided to keep the official API and top-level functions such as `train_unsupervised` and `train_supervised`, as well as returning numpy objects. We removed the `cbow`, `skipgram` and `supervised` functions from the unofficial API. However, [we brought nice ideas](#wordvectormodel-and-supervisedmodel-objects) from the unofficial API into the official one. In particular, we liked the pythonic approach of `WordVectorModel`. This new python module is named `fasttext`, and is available on both [pypi](https://pypi.org/project/fasttext/) and our [github](https://github.com/facebookresearch/fastText) repository.
+
+From now on, we will refer to the tool as "fastText", while the name of the python module is `fasttext`.
+
+
+
+What is the right way to do it now?
+--------------------------------
+
+Before, you would either use `fastText` (uppercase `T`):
+```python
+import fastText
+# and call:
+fastText.train_supervised
+fastText.train_unsupervised
+```
+
+or use `fasttext` (lowercase `t`):
+```python
+import fasttext
+# and call:
+fasttext.cbow
+fasttext.skipgram
+fasttext.supervised
+```
+
+
+Now, the right way is to `import fasttext` (lowercase `t`) and use:
+```python
+import fasttext
+# and call:
+fasttext.train_supervised
+fasttext.train_unsupervised
+```
+
+We are keeping the lowercase `fasttext` module name, while retaining the `fastText` API.
+
+This is because:
+
+- the standard way to name python modules is all lowercase
+- the API from `fastText` exposes numpy arrays, which are widely used by the machine learning community.
+
+
+You can find a more comprehensive overview of our python API [here](/docs/en/python-module.html).
+
+Should I modify my existing code?
+---------------------------------
+Depending on the version of the python module you were using, you might need to make some small modifications to your existing code.
+
+### 1) You were using the official `fastText` module:
+
+You don't have to do much. Just replace your `import fastText` lines with `import fasttext` and everything should work as usual.
+
+### 2) You were using the unofficial `fasttext` module:
+
+If you were using the functions `cbow`, `skipgram`, `supervised` and/or `WordVectorModel`, `SupervisedModel` objects, you were using the unofficial `fasttext` module.
+
+Updating your code should be pretty straightforward, but it still requires some small changes.
+
+#### `cbow` function: use `train_unsupervised` instead.
+For example, replace:
+
+```python
+fasttext.cbow("train.txt", "model_file", lr=0.05, dim=100, ws=5, epoch=5)
+```
+with
+```python
+model = fasttext.train_unsupervised("train.txt", model='cbow', lr=0.05, dim=100, ws=5, epoch=5)
+model.save_model("model_file.bin")
+```
+
+#### `skipgram` function: use `train_unsupervised` instead.
+For example, replace:
+
+```python
+fasttext.skipgram("train.txt", "model_file", lr=0.05, dim=100, ws=5, epoch=5)
+```
+with
+```python
+model = fasttext.train_unsupervised("train.txt", model='skipgram', lr=0.05, dim=100, ws=5, epoch=5)
+model.save_model("model_file.bin")
+```
+
+
+#### `supervised` function: use `train_supervised` instead
+For example, replace:
+```python
+fasttext.supervised("train.txt", "model_file", lr=0.1, dim=100, epoch=5, word_ngrams=2, loss='softmax')
+```
+with
+```python
+model = fasttext.train_supervised("train.txt", lr=0.1, dim=100, epoch=5, word_ngrams=2, loss='softmax')
+model.save_model("model_file.bin")
+```
+
+#### Parameters
+
+- As you can see, you can use either `word_ngrams` or `wordNgrams` as the parameter name, because the parameter names from the unofficial API are mapped to the official ones: `min_count` to `minCount`, `word_ngrams` to `wordNgrams`, `lr_update_rate` to `lrUpdateRate`, `label_prefix` to `label` and `pretrained_vectors` to `pretrainedVectors`.
+- The `silent` parameter is not supported. Use the `verbose` parameter instead.
+- The `encoding` parameter is not supported; every input should be encoded in `utf-8`.
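The alias mapping above can be sketched as a simple rename pass over keyword arguments. This is only an illustrative sketch of the idea, not the module's actual implementation; the helper name `normalize_args` is hypothetical and not part of the fasttext API.

```python
# Illustrative sketch (hypothetical helper, not part of fasttext):
# map unofficial snake_case parameter names to the official camelCase ones.
_PARAM_ALIASES = {
    "min_count": "minCount",
    "word_ngrams": "wordNgrams",
    "lr_update_rate": "lrUpdateRate",
    "label_prefix": "label",
    "pretrained_vectors": "pretrainedVectors",
}

def normalize_args(**kwargs):
    """Return kwargs with unofficial names replaced by official ones."""
    return {_PARAM_ALIASES.get(name, name): value
            for name, value in kwargs.items()}

print(normalize_args(lr=0.1, word_ngrams=2, min_count=5))
# → {'lr': 0.1, 'wordNgrams': 2, 'minCount': 5}
```

Official names pass through unchanged, so both spellings end up hitting the same underlying option.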
+
+
+### `WordVectorModel` and `SupervisedModel` objects
+
+Instead of `WordVectorModel` and `SupervisedModel` objects, we now return a single model object that adopts some of the nice ideas from the unofficial API.
+
+```python
+model = fasttext.train_unsupervised("train.txt", model='skipgram')
+print(model.words)      # list of words in dictionary
+print(model['king'])    # get the vector of the word 'king'
+print('king' in model)  # check if a word is in dictionary
+```
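The pythonic access pattern shown above (`model['king']`, `'king' in model`) boils down to a couple of dunder methods. Here is a toy sketch of that idea backed by a plain dict; `ToyModel` is hypothetical and stands in for a real trained model only to illustrate the mechanics.

```python
# Toy illustration (hypothetical class, not the real fasttext model):
# the same access pattern, backed by a plain dict instead of trained vectors.
class ToyModel:
    def __init__(self, vectors):
        self._vectors = vectors  # word -> vector

    @property
    def words(self):
        return list(self._vectors)

    def __getitem__(self, word):
        # enables model['king']
        return self._vectors[word]

    def __contains__(self, word):
        # enables 'king' in model
        return word in self._vectors

model = ToyModel({"king": [0.1, 0.2], "queen": [0.3, 0.4]})
print(model.words)      # → ['king', 'queen']
print(model["king"])    # → [0.1, 0.2]
print("king" in model)  # → True
```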
+
+
+
+```python
+model = fasttext.train_supervised("train.txt")
+print(model.words)      # list of words in dictionary
+print(model.labels)     # list of labels
+```
+
+The model object also exposes the training arguments:
+
+```python
+print(model.epoch)
+print(model.loss)
+print(model.wordNgrams)
+```
+
+Thank you!
+------------
+We want to thank our incredible community. We truly appreciate your feedback, and a big thank you to everyone reporting issues and contributing to the project. In particular, we want to express how grateful we are to [Bayu Aldi Yansyah](https://github.com/pyk), who did a great job with his python library and gave us ownership of the pypi `fasttext` project.

+ 2 - 2
website/sidebars.json

@@ -2,10 +2,10 @@
   "docs": {
     "Introduction": ["support", "cheatsheet", "options"],
     "Tutorials": ["supervised-tutorial", "unsupervised-tutorial"],
-    "Help": ["faqs", "api", "references"]
+    "Help": ["python-module", "faqs", "api", "references"]
   },
   "download": {
-    "Download": [
+    "Resources": [
       "english-vectors",
       "crawl-vectors",
       "pretrained-vectors",