@@ -22,7 +22,7 @@ $ wget -c http://mattmahoney.net/dc/enwik9.zip -P data
$ unzip data/enwik9.zip -d data
```
-A raw Wikipedia dump contains a lot of HTML / XML data. We pre-process it with the wikifil.pl script bundled with fastText (this script was originally developed by Matt Mahoney, and can be found on his [website](http://mattmahoney.net/) )
+A raw Wikipedia dump contains a lot of HTML / XML data. We pre-process it with the wikifil.pl script bundled with fastText (this script was originally developed by Matt Mahoney, and can be found on his [website](http://mattmahoney.net/)).
```bash
$ perl wikifil.pl data/enwik9 > data/fil9
@@ -147,7 +147,7 @@ $ ./fasttext skipgram -input data/fil9 -output result/fil9 -minn 2 -maxn 5 -dim
```
<!--END_DOCUSAURUS_CODE_TABS-->
-Depending on the quantity of data you have, you may want to change the parameters of the training. The *epoch* parameter controls how many time will loop over your data. By default, we loop over the dataset 5 times. If you dataset is extremely massive, you may want to loop over it less often. Another important parameter is the learning rate -*lr*). The higher the learning rate is, the faster the model converge to a solution but at the risk of overfitting to the dataset. The default value is 0.05 which is a good compromise. If you want to play with it we suggest to stay in the range of [0.01, 1]:
+Depending on the quantity of data you have, you may want to change the parameters of the training. The *epoch* parameter controls how many times the model will loop over your data. By default, we loop over the dataset 5 times. If your dataset is extremely massive, you may want to loop over it less often. Another important parameter is the learning rate (*lr*). The higher the learning rate, the faster the model converges to a solution, but at the risk of overfitting to the dataset. The default value is 0.05, which is a good compromise. If you want to play with it, we suggest staying in the range of [0.01, 1]:
<!--DOCUSAURUS_CODE_TABS-->
<!--Command line-->
@@ -180,7 +180,7 @@ $ ./fasttext skipgram -input data/fil9 -output result/fil9 -thread 4
Searching and printing word vectors directly from the `fil9.vec` file is cumbersome. Fortunately, there is a `print-word-vectors` functionality in fastText.
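As context for the `print-word-vectors` step discussed in this hunk: the `fil9.vec` file is plain text, with a header line giving the vocabulary size and vector dimension, followed by one word and its components per line. A minimal Python sketch of loading such a file (the function name is illustrative, not part of fastText):

```python
# Sketch: load a fastText .vec text file into a dict (word -> list of floats).
# Assumes the standard format: a header line "num_words dim", then one line
# per word consisting of the word followed by its vector components.

def load_vectors(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        num_words, dim = map(int, f.readline().split())
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim
```

For quick lookups of a handful of words, the `print-word-vectors` command shown below avoids loading the whole file.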
-For examples, we can print the word vectors of words *asparagus,* *pidgey* and *yellow* with the following command:
+For example, we can print the word vectors of the words *asparagus*, *pidgey* and *yellow* with the following command:
<!--DOCUSAURUS_CODE_TABS-->
<!--Command line-->
```bash
@@ -226,7 +226,7 @@ $ echo "enviroment" | ./fasttext print-word-vectors result/fil9.bin
```
<!--END_DOCUSAURUS_CODE_TABS-->
-You still get a word vector for it! But how good it is? Let s find out in the next sections!
+You still get a word vector for it! But how good is it? Let's find out in the next sections!
## Nearest neighbor queries
@@ -322,7 +322,11 @@ In order to find nearest neighbors, we need to compute a similarity score betwee
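The similarity score mentioned in the hunk context above is cosine similarity between word vectors. As a rough plain-Python illustration (a sketch, not fastText's implementation):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    # Ranges from -1 (opposite) to 1 (same direction).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

The `nn` functionality ranks every word in the vocabulary by this score against the query vector.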
## Word analogies
-In a similar spirit, one can play around with word analogies. For example, we can see if our model can guess what is to France, what Berlin is to Germany.
+In a similar spirit, one can play around with word analogies. For example, we can see if our model can guess what is to France what Berlin is to Germany.
+
+
+
+
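Under the hood, an analogy query amounts to vector arithmetic: for a triplet *A B C*, the model returns the word whose vector is closest, by cosine similarity, to *B* - *A* + *C*. A toy sketch with made-up two-dimensional vectors (all values hypothetical):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c):
    # Rank every word outside the query triplet by cosine similarity
    # to the offset vector b - a + c.
    target = [x - y + z for x, y, z in zip(vectors[b], vectors[a], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# Hypothetical toy embeddings, chosen so the capital-of offset lines up:
toy = {
    "germany": [1.0, 0.0],
    "berlin":  [1.0, 1.0],
    "france":  [0.0, 0.0],
    "paris":   [0.1, 1.0],
    "yellow":  [-1.0, -0.5],
}
```

With real 300-dimensional vectors the same arithmetic is what the *analogies* command performs.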
This can be done with the *analogies* functionality. It takes a word triplet (like *Germany Berlin France*) and outputs the analogy:
@@ -350,7 +354,7 @@ pigneaux 0.736122
```
<!--END_DOCUSAURUS_CODE_TABS-->
-The answer provides by our model is *Paris*, which is correct. Let's have a look at a less obvious example:
+The answer provided by our model is *Paris*, which is correct. Let's have a look at a less obvious example:
<!--DOCUSAURUS_CODE_TABS-->
<!--Command line-->
@@ -408,7 +412,7 @@ gearboxes 0.73986
Most of the retrieved words share substantial substrings but a few are actually quite different, like *cogwheel*. You can try other words like *sunbathe* or *grandnieces*.
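The substring overlap seen above comes from fastText's subword units: each word is wrapped in `<` and `>` boundary markers and decomposed into character n-grams of length *minn* to *maxn* (2 to 5 in the training command used in this tutorial). A rough sketch of that decomposition (fastText proper also keeps the whole word as an extra token):

```python
# Sketch: enumerate the character n-grams fastText derives from a word.
# The word is wrapped in '<' and '>' boundary markers, and every substring
# of length minn..maxn of the wrapped word is collected.

def char_ngrams(word, minn=2, maxn=5):
    wrapped = "<" + word + ">"
    grams = set()
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

# Words like "gearshift" and "gearbox" share many n-grams ("ge", "gear", ...),
# which is why their vectors end up close.
shared = char_ngrams("gearshift") & char_ngrams("gearbox")
```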
-Now that we have seen the interest of subword information for unknown words, let s check how it compares to a model that do not use subword information. To train a model without subwords, just run the following command:
+Now that we have seen the interest of subword information for unknown words, let's check how it compares to a model that does not use subword information. To train a model without subwords, just run the following command:
<!--DOCUSAURUS_CODE_TABS-->
<!--Command line-->
@@ -423,7 +427,7 @@ The results are saved in result/fil9-non.vec and result/fil9-non.bin.
<!--END_DOCUSAURUS_CODE_TABS-->
-To illustrate the difference, let us take an uncommon word in Wikipedia, like *accomodation* which is a misspelling of *accommodation*. Here is the nearest neighbors obtained without subwords:
+To illustrate the difference, let us take an uncommon word in Wikipedia, like *accomodation*, which is a misspelling of *accommodation*. Here are the nearest neighbors obtained without subwords:
<!--DOCUSAURUS_CODE_TABS-->
<!--Command line-->
@@ -476,4 +480,4 @@ The nearest neighbors capture different variation around the word *accommodation
## Conclusion
-In this tutorial, we show how to obtain word vectors from Wikipedia. This can be done for any language and you we provide [pre-trained models](https://fasttext.cc/docs/en/pretrained-vectors.html) with the default setting for 294 of them.
+In this tutorial, we showed how to obtain word vectors from Wikipedia. This can be done for any language, and we provide [pre-trained models](https://fasttext.cc/docs/en/pretrained-vectors.html) with the default settings for 294 of them.