id: crawl-vectors
We distribute pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. We also distribute three new word analogy datasets, for French, Hindi and Polish.
The word vectors are available in both binary and text formats.
Using the binary models, vectors for out-of-vocabulary words can be obtained with
$ ./fasttext print-word-vectors wiki.it.300.bin < oov_words.txt
where the file oov_words.txt contains out-of-vocabulary words.
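Binary models can produce vectors for unseen words because they store vectors for character n-grams as well as for whole words. The sketch below illustrates only the n-gram extraction step for the n-gram length of 5 used by these models. It assumes the usual fastText convention of wrapping a word in `<` and `>` boundary markers; the function name `char_ngrams` is hypothetical and is not part of the fastText API, and the hashing and averaging of n-gram vectors done by the real tool are left out.

```python
def char_ngrams(word, n=5):
    """Return the character n-grams of `word`, including boundary markers.

    Assumption: the word is wrapped as "<word>" before extraction,
    following the convention described for fastText subword models.
    """
    wrapped = "<" + word + ">"
    # Slide a window of n characters over the wrapped word.
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("where"))
```

A word shorter than three characters yields no 5-grams under this sketch, since the wrapped form is then shorter than the window.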
In the text format, the first line gives the number of words and the vector dimension. Each following line contains a word followed by its vector. Values are separated by spaces, and words are sorted by frequency in descending order. These text models can easily be loaded in Python using the following code:
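import io

def load_vectors(fname):
    # Open as UTF-8 text, skipping bytes that cannot be decoded.
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    # The first line holds the vocabulary size and the vector dimension.
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        # Build a list: in Python 3, a bare map() is a one-shot iterator.
        data[tokens[0]] = list(map(float, tokens[1:]))
    fin.close()
    return data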
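Because the words are sorted by decreasing frequency, reading only the first lines of a text file yields the most frequent words, which can save a lot of memory. The following is a minimal sketch of that idea, not part of the fastText distribution: the function name `load_top_vectors` and its `limit` parameter are assumptions made for illustration.

```python
import io
from itertools import islice


def load_top_vectors(fname, limit):
    """Load vectors for the `limit` most frequent words of a text model."""
    data = {}
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        # Header line: vocabulary size and vector dimension (unused here).
        n, d = map(int, fin.readline().split())
        # Words are sorted by frequency, so the first lines are the most frequent.
        for line in islice(fin, limit):
            tokens = line.rstrip().split(' ')
            data[tokens[0]] = [float(x) for x in tokens[1:]]
    return data
```

For example, `load_top_vectors(fname, 200000)` would keep the 200,000 most frequent words of whichever text model `fname` points to.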
We used the Stanford word segmenter for Chinese, Mecab for Japanese and UETsegmenter for Vietnamese. For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the Europarl preprocessing tools. For the remaining languages, we used the ICU tokenizer.
More information about the training of these models can be found in the article Learning Word Vectors for 157 Languages.
The word vectors are distributed under the Creative Commons Attribution-Share-Alike License 3.0.
If you use these word vectors, please cite the following paper:
E. Grave*, P. Bojanowski*, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages
@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}
The analogy evaluation datasets described in the paper are available here: French, Hindi, Polish.
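As an illustration of how an analogy question of the form "a is to b as c is to ?" can be answered with word vectors, here is a simplified sketch of the common vector-offset approach. It is not necessarily the exact evaluation protocol used in the paper; the function names and the toy three-dimensional vectors are made up for the example, and real use would pass a dictionary such as the one returned by `load_vectors`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding the query words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors invented for this example (not taken from any released model).
toy = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "apple": [0.5, 0.5, 0.5],
}

if __name__ == "__main__":
    print(solve_analogy(toy, "man", "king", "woman"))
```

The query words are excluded from the candidates because the nearest neighbour of the offset vector is otherwise often one of the inputs.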
The models can be downloaded from:
| | | |
|-|-|-|
| Afrikaans: bin, text | Albanian: bin, text | Alemannic: bin, text |
| Amharic: bin, text | Arabic: bin, text | Aragonese: bin, text |
| Armenian: bin, text | Assamese: bin, text | Asturian: bin, text |
| Azerbaijani: bin, text | Bashkir: bin, text | Basque: bin, text |
| Bavarian: bin, text | Belarusian: bin, text | Bengali: bin, text |
| Bihari: bin, text | Bishnupriya Manipuri: bin, text | Bosnian: bin, text |
| Breton: bin, text | Bulgarian: bin, text | Burmese: bin, text |
| Catalan: bin, text | Cebuano: bin, text | Central Bicolano: bin, text |
| Chechen: bin, text | Chinese: bin, text | Chuvash: bin, text |
| Corsican: bin, text | Croatian: bin, text | Czech: bin, text |
| Danish: bin, text | Divehi: bin, text | Dutch: bin, text |
| Eastern Punjabi: bin, text | Egyptian Arabic: bin, text | Emilian-Romagnol: bin, text |
| English: bin, text | Erzya: bin, text | Esperanto: bin, text |
| Estonian: bin, text | Fiji Hindi: bin, text | Finnish: bin, text |
| French: bin, text | Galician: bin, text | Georgian: bin, text |
| German: bin, text | Goan Konkani: bin, text | Greek: bin, text |
| Gujarati: bin, text | Haitian: bin, text | Hebrew: bin, text |
| Hill Mari: bin, text | Hindi: bin, text | Hungarian: bin, text |
| Icelandic: bin, text | Ido: bin, text | Ilokano: bin, text |
| Indonesian: bin, text | Interlingua: bin, text | Irish: bin, text |
| Italian: bin, text | Japanese: bin, text | Javanese: bin, text |
| Kannada: bin, text | Kapampangan: bin, text | Kazakh: bin, text |
| Khmer: bin, text | Kirghiz: bin, text | Korean: bin, text |
| Kurdish (Kurmanji): bin, text | Kurdish (Sorani): bin, text | Latin: bin, text |
| Latvian: bin, text | Limburgish: bin, text | Lithuanian: bin, text |
| Lombard: bin, text | Low Saxon: bin, text | Luxembourgish: bin, text |
| Macedonian: bin, text | Maithili: bin, text | Malagasy: bin, text |
| Malay: bin, text | Malayalam: bin, text | Maltese: bin, text |
| Manx: bin, text | Marathi: bin, text | Mazandarani: bin, text |
| Meadow Mari: bin, text | Minangkabau: bin, text | Mingrelian: bin, text |
| Mirandese: bin, text | Mongolian: bin, text | Nahuatl: bin, text |
| Neapolitan: bin, text | Nepali: bin, text | Newar: bin, text |
| North Frisian: bin, text | Northern Sotho: bin, text | Norwegian (Bokmål): bin, text |
| Norwegian (Nynorsk): bin, text | Occitan: bin, text | Oriya: bin, text |
| Ossetian: bin, text | Palatinate German: bin, text | Pashto: bin, text |
| Persian: bin, text | Piedmontese: bin, text | Polish: bin, text |
| Portuguese: bin, text | Quechua: bin, text | Romanian: bin, text |
| Romansh: bin, text | Russian: bin, text | Sakha: bin, text |
| Sanskrit: bin, text | Sardinian: bin, text | Scots: bin, text |
| Scottish Gaelic: bin, text | Serbian: bin, text | Serbo-Croatian: bin, text |
| Sicilian: bin, text | Sindhi: bin, text | Sinhalese: bin, text |
| Slovak: bin, text | Slovenian: bin, text | Somali: bin, text |
| Southern Azerbaijani: bin, text | Spanish: bin, text | Sundanese: bin, text |
| Swahili: bin, text | Swedish: bin, text | Tagalog: bin, text |
| Tajik: bin, text | Tamil: bin, text | Tatar: bin, text |
| Telugu: bin, text | Thai: bin, text | Tibetan: bin, text |
| Turkish: bin, text | Turkmen: bin, text | Ukrainian: bin, text |
| Upper Sorbian: bin, text | Urdu: bin, text | Uyghur: bin, text |
| Uzbek: bin, text | Venetian: bin, text | Vietnamese: bin, text |
| Volapük: bin, text | Walloon: bin, text | Waray: bin, text |
| Welsh: bin, text | West Flemish: bin, text | West Frisian: bin, text |
| Western Punjabi: bin, text | Yiddish: bin, text | Yoruba: bin, text |
| Zazaki: bin, text | Zeelandic: bin, text |