Steps to reproduce the datasets from the web

1) Build the container

  • docker build -t bert_tf .

2) Run the container interactively

  • nvidia-docker run -it --ipc=host bert_tf
  • Optional: Mount data volumes (a combined example follows this list)
    • -v yourpath:/workspace/bert/data/wikipedia_corpus/download
    • -v yourpath:/workspace/bert/data/wikipedia_corpus/extracted_articles
    • -v yourpath:/workspace/bert/data/wikipedia_corpus/raw_data
    • -v yourpath:/workspace/bert/data/wikipedia_corpus/intermediate_files
    • -v yourpath:/workspace/bert/data/wikipedia_corpus/final_text_file_single
    • -v yourpath:/workspace/bert/data/wikipedia_corpus/final_text_files_sharded
    • -v yourpath:/workspace/bert/data/wikipedia_corpus/final_tfrecords_sharded
    • -v yourpath:/workspace/bert/data/bookcorpus/download
    • -v yourpath:/workspace/bert/data/bookcorpus/final_text_file_single
    • -v yourpath:/workspace/bert/data/bookcorpus/final_text_files_sharded
    • -v yourpath:/workspace/bert/data/bookcorpus/final_tfrecords_sharded
  • Optional: Select visible GPUs
    • -e CUDA_VISIBLE_DEVICES=0
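
For concreteness, here is one way to combine the optional mounts and GPU selection into a single command; the `yourpath` prefixes are placeholders for absolute host directories, and you only need the mounts you actually want to persist:

```bash
# Example invocation combining the pieces above: persist the Wikipedia
# download and the final tfrecords for both corpora, and expose only GPU 0.
nvidia-docker run -it --ipc=host \
  -v yourpath/wiki_download:/workspace/bert/data/wikipedia_corpus/download \
  -v yourpath/wiki_tfrecords:/workspace/bert/data/wikipedia_corpus/final_tfrecords_sharded \
  -v yourpath/books_tfrecords:/workspace/bert/data/bookcorpus/final_tfrecords_sharded \
  -e CUDA_VISIBLE_DEVICES=0 \
  bert_tf
```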

**Inside of the container starting here**

3) Download pretrained weights (they contain vocab files for preprocessing)

  • cd data/pretrained_models_google && python3 download_models.py
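
A quick sanity check after the download; the exact layout under `pretrained_models_google` varies by model, so `find` is used rather than assuming a subdirectory:

```bash
# Illustrative check: preprocessing needs the vocab file shipped with the
# pretrained weights; this just confirms at least one vocab.txt arrived.
find data/pretrained_models_google -name vocab.txt
```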

4) "One-click" SQuAD download

  • cd /workspace/bert/data/squad && . squad_download.sh

5) "One-click" Wikipedia data download and prep (provides tfrecords)

  • Set your configuration in data/wikipedia_corpus/config.sh (an illustrative sketch follows this step)
  • cd /data/wikipedia_corpus && ./run_preprocessing.sh
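
The variables that config.sh exposes differ between repository versions, so the sketch below is illustrative only; every name in it is an assumption, not the script's actual interface:

```bash
# Illustrative config.sh sketch (all variable names are assumptions; open
# the real data/wikipedia_corpus/config.sh to see what it actually defines).
N_SHARDS=256            # how many text/tfrecord shards to emit
N_PROCS_PREPROCESS=4    # parallel workers for extraction and formatting
MAX_SEQ_LENGTH=512      # sequence length for the generated tfrecords
```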

6) "One-click" BookCorpus data download and prep (provides tfrecords)

  • Set your configuration in data/bookcorpus/config.sh
  • cd /data/bookcorpus && ./run_preprocessing.sh
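
As an alternative to running steps 3) through 6) by hand, this directory also ships create_datasets_from_start.sh, which chains the downloads and preprocessing; the invocation below is a sketch, so check the script itself for any arguments your version expects:

```bash
# Hedged one-shot alternative: runs the full download + prep pipeline.
# Arguments (if any) vary by repository version; read the script first.
cd /workspace/bert/data && bash create_datasets_from_start.sh
```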