Piotr Bojanowski ddb5e06e3b Corrected URL for lid models in the scripts to process CC. 7 years ago
README.md 61f1838321 Common crawl processing scripts 7 years ago
dedup.cc 61f1838321 Common crawl processing scripts 7 years ago
download_crawl.sh ddb5e06e3b Corrected URL for lid models in the scripts to process CC. 7 years ago
filter_dedup.sh 61f1838321 Common crawl processing scripts 7 years ago
filter_utf8.cc 61f1838321 Common crawl processing scripts 7 years ago
process_wet_file.sh 61f1838321 Common crawl processing scripts 7 years ago

README.md

Preprocessing Common Crawl

This code downloads data from Common Crawl, preprocesses it, and splits it by language.

This code uses the scripts and the language identifier of [1].

This code inherits its requirements from fastText.

Set the variable WET_PATHS_URL to the crawl you want to process. Also set the variables NUM_LANGID and NUM_DEDUP in download_crawl.sh according to the capacity of your machine. Langid processes are mostly limited by CPU usage, while dedup processes are likely to be limited by RAM usage (each uses 2 GB of RAM).
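As a rough starting point, the two process counts can be derived from the core count and the amount of RAM. The sketch below is not part of the repository: the helper names `suggest_num_langid` and `suggest_num_dedup` are hypothetical, and the "one langid process per core" rule is an assumption based on langid being CPU-bound. The dedup rule follows from the 2 GB per process figure above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not in the repository) for picking values of
# NUM_LANGID and NUM_DEDUP before editing download_crawl.sh.

# Assumption: one langid process per CPU core, since langid is CPU-bound.
suggest_num_langid() {
  local cores="$1"
  echo "$cores"
}

# One dedup process per 2 GB of RAM, with a minimum of one process.
suggest_num_dedup() {
  local ram_gb="$1"
  local n=$(( ram_gb / 2 ))
  if [ "$n" -lt 1 ]; then
    n=1
  fi
  echo "$n"
}

# Example: a machine with 16 cores and 64 GB of RAM.
NUM_LANGID="$(suggest_num_langid 16)"
NUM_DEDUP="$(suggest_num_dedup 64)"
echo "NUM_LANGID=${NUM_LANGID} NUM_DEDUP=${NUM_DEDUP}"
```

These values are upper bounds. In practice you may want to pass a smaller RAM figure to leave headroom for the operating system and for the langid processes running at the same time.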

Reference

If you use this code, please cite:

[1] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages

@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}