id: crawl-vectors

title: Word vectors for 157 languages

We distribute pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using fastText. These models were trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives. We also distribute three new word analogy datasets, for French, Hindi and Polish.

Format

The word vectors are available in both binary and text formats.

Using the binary models, vectors for out-of-vocabulary words can be obtained with

$ ./fasttext print-word-vectors wiki.it.300.bin < oov_words.txt

where the file oov_words.txt contains out-of-vocabulary words.
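
The binary models can produce vectors for unseen words because fastText builds word representations from character n-grams. The following is a minimal sketch of how the length-5 n-grams of a word could be enumerated, assuming the `<` and `>` word-boundary markers described in the fastText papers. The helper `char_ngrams` is hypothetical and is not part of the fastText command-line tool or API; the actual implementation also hashes n-grams into a fixed number of buckets, which is not shown here.

```python
def char_ngrams(word, n=5):
    """Return the character n-grams of `word` after adding boundary markers.

    Hypothetical helper for illustration only; not a fastText function.
    """
    # Assumption: the word is wrapped in '<' and '>' before n-grams are extracted.
    padded = '<' + word + '>'
    # Slide a window of size n over the padded word.
    # Words shorter than the window yield an empty list.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


ngrams = char_ngrams('where')
```

For the word `where`, the padded form is `<where>`, which contains three windows of length 5. An out-of-vocabulary word still shares many such n-grams with known words, which is what lets the binary model return a vector for it.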

In the text format, the first line gives the number of words and the vector dimension, and each following line contains a word followed by its vector. Values are space-separated, and words are sorted by frequency in descending order. These text models can easily be loaded in Python using the following code:

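import io

def load_vectors(fname):
    # Map each word to its vector (a list of floats).
    data = {}
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        # The first line holds the vocabulary size and the vector dimension.
        n, d = map(int, fin.readline().split())
        for line in fin:
            tokens = line.rstrip().split(' ')
            # list() is needed in Python 3, where map() returns a lazy iterator.
            data[tokens[0]] = list(map(float, tokens[1:]))
    return data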
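
Because words are sorted by frequency, reading only the first lines of a text file gives the vectors of the most frequent words, which can save memory on the larger models. The sketch below illustrates this on a tiny made-up file; the function names `load_top_vectors` and `cosine`, the `limit` parameter, and the sample words and values are all assumptions introduced for illustration and are not taken from the distributed models.

```python
import io
import math
import os
import tempfile


def load_top_vectors(fname, limit):
    """Load only the first `limit` words of a text-format vector file."""
    data = {}
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        # Header line: vocabulary size and vector dimension.
        n, d = map(int, fin.readline().split())
        for i, line in enumerate(fin):
            if i >= limit:
                break  # later lines hold less frequent words
            tokens = line.rstrip().split(' ')
            data[tokens[0]] = [float(x) for x in tokens[1:]]
    return data, d


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up sample in the text format: a header line, then one word per line.
sample = "3 2\nthe 1.0 0.0\nof 0.8 0.6\nzebra 0.0 1.0\n"
with tempfile.NamedTemporaryFile('w', suffix='.vec', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(sample)

vectors, dim = load_top_vectors(tmp.name, limit=2)
os.remove(tmp.name)

similarity = cosine(vectors['the'], vectors['of'])
```

With `limit=2`, only the first two words of the sample file are kept and the third is never parsed. On a real model, `fname` would be the path to one of the downloaded text files.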

Tokenization

We used the Stanford word segmenter for Chinese, Mecab for Japanese and UETsegmenter for Vietnamese. For languages using the Latin, Cyrillic, Hebrew or Greek scripts, we used the tokenizer from the Europarl preprocessing tools. For the remaining languages, we used the ICU tokenizer.

More information about the training of these models can be found in the article Learning Word Vectors for 157 Languages.

License

The word vectors are distributed under the Creative Commons Attribution-Share-Alike License 3.0.

References

If you use these word vectors, please cite the following paper:

E. Grave*, P. Bojanowski*, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages

@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}

Evaluation datasets

The analogy evaluation datasets described in the paper are available here: French, Hindi, Polish.

Models

The models can be downloaded from:

| | | |
|-|-|-|
| Afrikaans: bin, text | Albanian: bin, text | Alemannic: bin, text |
| Amharic: bin, text | Arabic: bin, text | Aragonese: bin, text |
| Armenian: bin, text | Assamese: bin, text | Asturian: bin, text |
| Azerbaijani: bin, text | Bashkir: bin, text | Basque: bin, text |
| Bavarian: bin, text | Belarusian: bin, text | Bengali: bin, text |
| Bihari: bin, text | Bishnupriya Manipuri: bin, text | Bosnian: bin, text |
| Breton: bin, text | Bulgarian: bin, text | Burmese: bin, text |
| Catalan: bin, text | Cebuano: bin, text | Central Bicolano: bin, text |
| Chechen: bin, text | Chinese: bin, text | Chuvash: bin, text |
| Corsican: bin, text | Croatian: bin, text | Czech: bin, text |
| Danish: bin, text | Divehi: bin, text | Dutch: bin, text |
| Eastern Punjabi: bin, text | Egyptian Arabic: bin, text | Emilian-Romagnol: bin, text |
| English: bin, text | Erzya: bin, text | Esperanto: bin, text |
| Estonian: bin, text | Fiji Hindi: bin, text | Finnish: bin, text |
| French: bin, text | Galician: bin, text | Georgian: bin, text |
| German: bin, text | Goan Konkani: bin, text | Greek: bin, text |
| Gujarati: bin, text | Haitian: bin, text | Hebrew: bin, text |
| Hill Mari: bin, text | Hindi: bin, text | Hungarian: bin, text |
| Icelandic: bin, text | Ido: bin, text | Ilokano: bin, text |
| Indonesian: bin, text | Interlingua: bin, text | Irish: bin, text |
| Italian: bin, text | Japanese: bin, text | Javanese: bin, text |
| Kannada: bin, text | Kapampangan: bin, text | Kazakh: bin, text |
| Khmer: bin, text | Kirghiz: bin, text | Korean: bin, text |
| Kurdish (Kurmanji): bin, text | Kurdish (Sorani): bin, text | Latin: bin, text |
| Latvian: bin, text | Limburgish: bin, text | Lithuanian: bin, text |
| Lombard: bin, text | Low Saxon: bin, text | Luxembourgish: bin, text |
| Macedonian: bin, text | Maithili: bin, text | Malagasy: bin, text |
| Malay: bin, text | Malayalam: bin, text | Maltese: bin, text |
| Manx: bin, text | Marathi: bin, text | Mazandarani: bin, text |
| Meadow Mari: bin, text | Minangkabau: bin, text | Mingrelian: bin, text |
| Mirandese: bin, text | Mongolian: bin, text | Nahuatl: bin, text |
| Neapolitan: bin, text | Nepali: bin, text | Newar: bin, text |
| North Frisian: bin, text | Northern Sotho: bin, text | Norwegian (Bokmål): bin, text |
| Norwegian (Nynorsk): bin, text | Occitan: bin, text | Oriya: bin, text |
| Ossetian: bin, text | Palatinate German: bin, text | Pashto: bin, text |
| Persian: bin, text | Piedmontese: bin, text | Polish: bin, text |
| Portuguese: bin, text | Quechua: bin, text | Romanian: bin, text |
| Romansh: bin, text | Russian: bin, text | Sakha: bin, text |
| Sanskrit: bin, text | Sardinian: bin, text | Scots: bin, text |
| Scottish Gaelic: bin, text | Serbian: bin, text | Serbo-Croatian: bin, text |
| Sicilian: bin, text | Sindhi: bin, text | Sinhalese: bin, text |
| Slovak: bin, text | Slovenian: bin, text | Somali: bin, text |
| Southern Azerbaijani: bin, text | Spanish: bin, text | Sundanese: bin, text |
| Swahili: bin, text | Swedish: bin, text | Tagalog: bin, text |
| Tajik: bin, text | Tamil: bin, text | Tatar: bin, text |
| Telugu: bin, text | Thai: bin, text | Tibetan: bin, text |
| Turkish: bin, text | Turkmen: bin, text | Ukrainian: bin, text |
| Upper Sorbian: bin, text | Urdu: bin, text | Uyghur: bin, text |
| Uzbek: bin, text | Venetian: bin, text | Vietnamese: bin, text |
| Volapük: bin, text | Walloon: bin, text | Waray: bin, text |
| Welsh: bin, text | West Flemish: bin, text | West Frisian: bin, text |
| Western Punjabi: bin, text | Yiddish: bin, text | Yoruba: bin, text |
| Zazaki: bin, text | Zeelandic: bin, text |