|
|
@@ -18,14 +18,14 @@ The first step of this tutorial is to install and build fastText. It only requir
|
|
|
Let us start by downloading the [most recent release](https://github.com/facebookresearch/fastText/releases):
|
|
|
|
|
|
```bash
|
|
|
-$ wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
|
|
|
-$ unzip v0.1.0.zip
|
|
|
+$ wget https://github.com/facebookresearch/fastText/archive/v0.2.0.zip
|
|
|
+$ unzip v0.2.0.zip
|
|
|
```
|
|
|
|
|
|
Move to the fastText directory and build it:
|
|
|
|
|
|
```bash
|
|
|
-$ cd fastText-0.1.0
|
|
|
+$ cd fastText-0.2.0
|
|
|
$ make
|
|
|
```
|
|
|
|
|
|
@@ -62,12 +62,12 @@ As mentioned in the introduction, we need labeled data to train our supervised c
|
|
|
>> head cooking.stackexchange.txt
|
|
|
```
|
|
|
|
|
|
-Each line of the text file contains a list of labels, followed by the corresponding document. All the labels start by the `__label__` prefix, which is how fastText recognize what is a label or what is a word. The model is then trained to predict the labels given the word in the document.
|
|
|
+Each line of the text file contains a list of labels, followed by the corresponding document. All the labels start with the `__label__` prefix, which is how fastText recognizes what is a label and what is a word. The model is then trained to predict the labels given the words in the document.
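To make the line format concrete, here is a minimal Python sketch that separates the `__label__` tokens from the words of a document. It illustrates the convention only and is not fastText's own parser; the helper name `split_labels_and_text` and the example line are made up for this illustration.

```python
LABEL_PREFIX = "__label__"


def split_labels_and_text(line):
    """Separate the `__label__`-prefixed tokens from the remaining words."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            # Strip the prefix to recover the bare label name.
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, " ".join(words)


# Made-up example line in the format described above (an assumption).
example = "__label__baking __label__bread How do I keep my banana bread moist?"
labels, text = split_labels_and_text(example)
print(labels)
print(text)
```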
|
|
|
|
|
|
Before training our first classifier, we need to split the data into a training set and a validation set. We will use the validation set to evaluate how good the learned classifier is on new data.
|
|
|
|
|
|
```bash
|
|
|
->> wc cooking.stackexchange.txt
|
|
|
+>> wc cooking.stackexchange.txt
|
|
|
15404 169582 1401900 cooking.stackexchange.txt
|
|
|
```
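With 15404 lines in total, one simple way to split is to hold out a fixed number of lines from the end of the file. The sketch below assumes 3000 held-out lines, which matches the `N 3000` reported when testing on `cooking.valid`; the function name `split_train_valid` and the synthetic stand-in lines are assumptions of this example, and taking the tail of the file is only one possible choice.

```python
def split_train_valid(lines, n_valid):
    """Hold out the last `n_valid` lines for validation; train on the rest."""
    if not 0 < n_valid < len(lines):
        raise ValueError("n_valid must be between 1 and len(lines) - 1")
    return lines[:-n_valid], lines[-n_valid:]


# Synthetic stand-in for the 15404 lines of the data file (an assumption).
lines = [f"__label__tag{i % 7} question number {i}" for i in range(15404)]
train, valid = split_train_valid(lines, n_valid=3000)
print(len(train), len(valid))
```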
|
|
|
|
|
|
@@ -109,7 +109,7 @@ The predicted tag is `baking` which fits well to this question. Let us now try
|
|
|
The label predicted by the model is `food-safety`, which is not relevant. Somehow, the model seems to fail on simple examples. To get a better sense of its quality, let's test it on the validation data by running:
|
|
|
|
|
|
```bash
|
|
|
->> ./fasttext test model_cooking.bin cooking.valid
|
|
|
+>> ./fasttext test model_cooking.bin cooking.valid
|
|
|
N 3000
|
|
|
P@1 0.124
|
|
|
R@1 0.0541
|
|
|
@@ -119,7 +119,7 @@ Number of examples: 3000
|
|
|
The output of fastText is the precision at one (`P@1`) and the recall at one (`R@1`). We can also compute the precision at five and recall at five with:
|
|
|
|
|
|
```bash
|
|
|
->> ./fasttext test model_cooking.bin cooking.valid 5
|
|
|
+>> ./fasttext test model_cooking.bin cooking.valid 5
|
|
|
N 3000
|
|
|
P@5 0.0668
|
|
|
R@5 0.146
|
|
|
@@ -191,14 +191,14 @@ Progress: 100.0% words/sec/thread: 77633 lr: 0.000000 loss: 7.147976 eta: 0h
|
|
|
Let's test the new model:
|
|
|
|
|
|
```bash
|
|
|
->> ./fasttext test model_cooking.bin cooking.valid
|
|
|
+>> ./fasttext test model_cooking.bin cooking.valid
|
|
|
N 3000
|
|
|
P@1 0.501
|
|
|
R@1 0.218
|
|
|
Number of examples: 3000
|
|
|
```
|
|
|
|
|
|
-This is much better! Another way to change the learning speed of our model is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after processing each example. A learning rate of 0 would means that the model does not change at all, and thus, does not learn anything. Good values of the learning rate are in the range `0.1 - 1.0`.
|
|
|
+This is much better! Another way to change the learning speed of our model is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after processing each example. A learning rate of 0 would mean that the model does not change at all, and thus, does not learn anything. Good values of the learning rate are in the range `0.1 - 1.0`.
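The role of the learning rate can be pictured with a single gradient step. The sketch below is plain stochastic gradient descent on one scalar weight, not fastText's actual update code; `sgd_step` and the numbers are hypothetical and chosen only for illustration.

```python
def sgd_step(weight, gradient, lr):
    """One gradient step: the learning rate scales how far the weight moves."""
    return weight - lr * gradient


# Hypothetical weight and gradient, for illustration only.
weight, gradient = 2.0, 4.0
for lr in (0.0, 0.1, 1.0):
    # With lr = 0.0 the weight is returned unchanged: nothing is learned.
    print(lr, sgd_step(weight, gradient, lr))
```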
|
|
|
|
|
|
```bash
|
|
|
>> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0
|
|
|
@@ -207,7 +207,7 @@ Number of words: 9012
|
|
|
Number of labels: 734
|
|
|
Progress: 100.0% words/sec/thread: 81469 lr: 0.000000 loss: 6.405640 eta: 0h0m
|
|
|
|
|
|
->> ./fasttext test model_cooking.bin cooking.valid
|
|
|
+>> ./fasttext test model_cooking.bin cooking.valid
|
|
|
N 3000
|
|
|
P@1 0.563
|
|
|
R@1 0.245
|
|
|
@@ -223,7 +223,7 @@ Number of words: 9012
|
|
|
Number of labels: 734
|
|
|
Progress: 100.0% words/sec/thread: 76394 lr: 0.000000 loss: 4.350277 eta: 0h0m
|
|
|
|
|
|
->> ./fasttext test model_cooking.bin cooking.valid
|
|
|
+>> ./fasttext test model_cooking.bin cooking.valid
|
|
|
N 3000
|
|
|
P@1 0.585
|
|
|
R@1 0.255
|
|
|
@@ -243,7 +243,7 @@ Number of words: 9012
|
|
|
Number of labels: 734
|
|
|
Progress: 100.0% words/sec/thread: 75366 lr: 0.000000 loss: 3.226064 eta: 0h0m
|
|
|
|
|
|
->> ./fasttext test model_cooking.bin cooking.valid
|
|
|
+>> ./fasttext test model_cooking.bin cooking.valid
|
|
|
N 3000
|
|
|
P@1 0.599
|
|
|
R@1 0.261
|
|
|
@@ -272,7 +272,7 @@ It is common to refer to a word as a unigram.
|
|
|
|
|
|
## Scaling things up
|
|
|
|
|
|
-Since we are training our model on a few thousands of examples, the training only takes a few seconds. But training models on larger datasets, with more labels can start to be too slow. A potential solution to make the training faster is to use the hierarchical softmax, instead of the regular softmax [Add a quick explanation of the hierarchical softmax]. This can be done with the option `-loss hs`:
|
|
|
+Since we are training our model on a few thousand examples, the training only takes a few seconds. But training models on larger datasets with more labels can become too slow. A potential solution to make the training faster is to use the [hierarchical softmax](#advanced-readers-hierarchical-softmax) instead of the regular softmax. This can be done with the option `-loss hs`:
|
|
|
|
|
|
```bash
|
|
|
>> ./fasttext supervised -input cooking.train -output model_cooking -lr 1.0 -epoch 25 -wordNgrams 2 -bucket 200000 -dim 50 -loss hs
|
|
|
@@ -284,6 +284,66 @@ Progress: 100.0% words/sec/thread: 2199406 lr: 0.000000 loss: 1.718807 eta:
|
|
|
|
|
|
Training should now take less than a second.
|
|
|
|
|
|
+
|
|
|
+## Advanced readers: hierarchical softmax
|
|
|
+
|
|
|
+The hierarchical softmax is a loss function that approximates the softmax with a much faster computation.
|
|
|
+
|
|
|
+The idea is to build a binary tree whose leaves correspond to the labels. Each intermediate node has a binary decision activation (e.g. sigmoid) that is trained and predicts whether we should go to the left or to the right. The probability of an output unit is then given by the product of the probabilities of the intermediate nodes along the path from the root to the leaf of that output unit.
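The product along the path can be sketched in a few lines of Python. The tree, the node names (`root`, `n1`) and the node scores below are hypothetical; in a trained model each score would come from the node's learned parameters and the document representation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def leaf_probability(path, node_scores):
    """Multiply the left/right decision probabilities along a root-to-leaf path.

    `path` is a list of (node, go_right) pairs; `node_scores[node]` is the
    score that the node's sigmoid turns into P(go right).
    """
    prob = 1.0
    for node, go_right in path:
        p_right = sigmoid(node_scores[node])
        prob *= p_right if go_right else 1.0 - p_right
    return prob


# Hypothetical tree: root -> {"baking", n1}, n1 -> {"bread", "bananas"}.
node_scores = {"root": 0.3, "n1": -1.2}
paths = {
    "baking": [("root", False)],
    "bread": [("root", True), ("n1", False)],
    "bananas": [("root", True), ("n1", True)],
}
probs = {label: leaf_probability(path, node_scores) for label, path in paths.items()}
print(probs)
print(sum(probs.values()))
```

Because every internal node splits its probability mass between its two children, the leaf probabilities sum to 1 without normalizing over all labels, and computing one of them only touches the nodes on its path.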
|
|
|
+
|
|
|
+For a detailed explanation, you can have a look at [this video](https://www.youtube.com/watch?v=B95LTf2rVWM).
|
|
|
+
|
|
|
+In fastText, we use a Huffman tree, so that the lookup time is faster for more frequent outputs and thus the average lookup time for the output is optimal.
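To see why a Huffman tree favours frequent outputs, the following sketch builds one from label counts and reports the depth of each label, that is, the number of binary decisions on its path. It is a generic textbook Huffman construction, not fastText's implementation, and the label frequencies are made up.

```python
import heapq
from itertools import count


def huffman_depths(label_counts):
    """Return the depth of each label in a Huffman tree built from its counts."""
    tiebreak = count()  # keeps heap entries comparable when counts are equal
    heap = [(n, next(tiebreak), {label: 0}) for label, n in label_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; their labels move one level down.
        n1, _, depths1 = heapq.heappop(heap)
        n2, _, depths2 = heapq.heappop(heap)
        merged = {label: d + 1 for label, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (n1 + n2, next(tiebreak), merged))
    return heap[0][2]


# Made-up label frequencies, for illustration only.
counts = {"baking": 50, "bread": 20, "bananas": 10, "food-safety": 5}
depths = huffman_depths(counts)
print(depths)
```

In this toy example the most frequent label sits one step below the root, while the two rarest labels sit three steps down.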
|
|
|
+
|
|
|
+## Multi-label classification
|
|
|
+
|
|
|
+When we want to assign a document to multiple labels, we can still use the softmax loss and play with the parameters for prediction, namely the number of labels to predict and the threshold for the predicted probability. However, playing with these arguments can be tricky and unintuitive since the probabilities must sum to 1.
|
|
|
+
|
|
|
+A convenient way to handle multiple labels is to use independent binary classifiers for each label. This can be done with `-loss one-vs-all` or `-loss ova`.
|
|
|
+
|
|
|
+```bash
|
|
|
+>> ./fasttext supervised -input cooking.train -output model_cooking -lr 0.5 -epoch 25 -wordNgrams 2 -bucket 200000 -dim 50 -loss one-vs-all
|
|
|
+Read 0M words
|
|
|
+Number of words: 14543
|
|
|
+Number of labels: 735
|
|
|
+Progress: 100.0% words/sec/thread: 72104 lr: 0.000000 loss: 4.340807 ETA: 0h 0m
|
|
|
+```
|
|
|
+
|
|
|
+It is a good idea to decrease the learning rate compared to other loss functions.
|
|
|
+
|
|
|
+Now let's have a look at our predictions. We want as many predictions as possible (argument `-1`) and only labels with a probability higher than or equal to `0.5`:
|
|
|
+```bash
|
|
|
+>> ./fasttext predict-prob model_cooking.bin - -1 0.5
|
|
|
+```
|
|
|
+and then type the sentence:
|
|
|
+
|
|
|
+*Which baking dish is best to bake a banana bread ?*
|
|
|
+
|
|
|
+we get:
|
|
|
+```
|
|
|
+__label__baking 1.00000 __label__bananas 0.939923 __label__bread 0.592677
|
|
|
+```
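The effect of the two prediction arguments can be sketched as a filter over per-label probabilities. This is an illustration of the semantics described above, not fastText's code; the function `predict` and the probabilities are hypothetical.

```python
def predict(label_probs, k=-1, threshold=0.0):
    """Return up to `k` labels (all if k == -1) with probability >= threshold."""
    ranked = sorted(label_probs.items(), key=lambda item: item[1], reverse=True)
    kept = [(label, p) for label, p in ranked if p >= threshold]
    return kept if k == -1 else kept[:k]


# Made-up per-label probabilities; with independent binary classifiers
# they do not need to sum to 1.
label_probs = {"baking": 0.98, "bananas": 0.91, "bread": 0.62, "equipment": 0.31}
print(predict(label_probs, k=-1, threshold=0.5))
print(predict(label_probs, k=1, threshold=0.5))
```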
|
|
|
+
|
|
|
+We can also evaluate our results with the `test` command:
|
|
|
+
|
|
|
+```bash
|
|
|
+>> ./fasttext test model_cooking.bin cooking.valid -1 0.5
|
|
|
+N 3000
|
|
|
+P@-1 0.702
|
|
|
+R@-1 0.2
|
|
|
+Number of examples: 3000
|
|
|
+```
|
|
|
+
|
|
|
+and play with the threshold to obtain the desired precision/recall metrics:
|
|
|
+
|
|
|
+```bash
|
|
|
+>> ./fasttext test model_cooking.bin cooking.valid -1 0.1
|
|
|
+N 3000
|
|
|
+P@-1 0.591
|
|
|
+R@-1 0.272
|
|
|
+Number of examples: 3000
|
|
|
+```
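The trade-off seen above, where lowering the threshold reduces precision and raises recall, can be reproduced on a toy example. The sketch below assumes that precision and recall are aggregated over all examples (correct predictions divided by the number of predictions, and by the number of true labels, respectively); the helper names and the data are made up for illustration.

```python
def precision_recall(predictions, gold):
    """Precision and recall aggregated over a list of examples.

    `predictions` and `gold` are parallel lists of label sets.
    """
    correct = sum(len(p & g) for p, g in zip(predictions, gold))
    n_predicted = sum(len(p) for p in predictions)
    n_gold = sum(len(g) for g in gold)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


def labels_above(label_probs, threshold):
    """Keep every label whose probability reaches the threshold."""
    return {label for label, p in label_probs.items() if p >= threshold}


# Two made-up examples with hypothetical model probabilities.
scored = [
    {"baking": 0.9, "bread": 0.4, "equipment": 0.2},
    {"food-safety": 0.6, "storage": 0.15, "baking": 0.12},
]
gold = [{"baking", "bread"}, {"food-safety", "storage"}]

for threshold in (0.5, 0.1):
    predicted = [labels_above(s, threshold) for s in scored]
    print(threshold, precision_recall(predicted, gold))
```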
|
|
|
+
|
|
|
## Conclusion
|
|
|
|
|
|
In this tutorial, we gave a brief overview of how to use fastText to train powerful text classifiers, and took a quick look at some of the most important options to tune.
|