Piotr Bojanowski ddb5e06e3b Corrected URL for lid models in the scripts to process CC. 7 سال پیش
..
README.md 61f1838321 Common crawl processing scripts 7 سال پیش
dedup.cc 61f1838321 Common crawl processing scripts 7 سال پیش
download_crawl.sh ddb5e06e3b Corrected URL for lid models in the scripts to process CC. 7 سال پیش
filter_dedup.sh 61f1838321 Common crawl processing scripts 7 سال پیش
filter_utf8.cc 61f1838321 Common crawl processing scripts 7 سال پیش
process_wet_file.sh 61f1838321 Common crawl processing scripts 7 سال پیش

README.md

Preprocessing Common Crawl

This code downloads, preprocesses and splits per language the data from Common Crawl.

This script uses the scripts and language identifier of [1].

This code inherits its requirements form fastText.

Set the variable WET_PATHS_URL to the crawl you want to process. Please also set the variables NUM_LANGID and NUM_DEDUP in download_crawl.sh according to the capacity of your machine. Langid processes are mostly limited by CPU usage, while dedup processes are likely to be limited by RAM usage (each use 2GB of RAM).

Reference

If you use this code, please cite:

[1] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning Word Vectors for 157 Languages

@inproceedings{grave2018learning,
  title={Learning Word Vectors for 157 Languages},
  author={Grave, Edouard and Bojanowski, Piotr and Gupta, Prakhar and Joulin, Armand and Mikolov, Tomas},
  booktitle={Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}