wav2vec 2.0 for PyTorch

This repository provides a script and recipe to train the wav2vec 2.0 model to achieve state-of-the-art accuracy. The content of this repository is tested and maintained by NVIDIA.

Model overview
Setup
- Requirements
Quick Start Guide
Advanced
Performance
- Benchmarking
  - Training performance benchmark
  - Inference performance benchmark
- Results
Release notes
- Changelog
- Known issues

Model overview

This repository provides an optimized implementation of the wav2vec 2.0 model, as described in the paper wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. It is based on the Fairseq codebase published by the authors of the paper. The wav2vec 2.0 model is pre-trained unsupervised on large corpora of speech recordings. Afterward, it can be quickly fine-tuned in a supervised way for speech recognition or serve as an extractor of high-level features and pseudo-phonemes for other applications.

The differences between this wav2vec 2.0 and the reference implementation are:

- Support for increased batch size, which does not change batch-dependent constants for negative sampling and loss calculation and improves hardware utilization
- Support for the Hourglass Transformer architecture, which in the default setting improves the training speed of the Base model by 1.4x, lowers memory consumption by 38%, and retains accuracy

This model is trained with mixed precision using Tensor Cores on the NVIDIA Volta, NVIDIA Turing, and NVIDIA Ampere GPU architectures. Therefore, researchers can get results up to 1.35x faster than training without Tensor Cores while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

Model architecture

The model takes raw waveforms as its input. A fully convolutional feature extractor reduces the resolution of the signal to a single vector roughly every 20 ms. Most of the computation is performed in the transformer encoder part of the model. The outputs of the transformer, and quantized outputs from the feature extractor, serve as inputs to the contrastive loss. During fine-tuning, this loss is replaced with the CTC loss, and quantization is not performed.

Figure 1. The architecture of wav2vec 2.0 ([source](https://proceedings.neurips.cc/paper/2020/file/92d1e1eb1c-Paper.pdf)). The model is composed of a convolutional feature extractor, and a transformer encoder. During fine-tuning, quantization is disabled and contrastive loss is replaced with the CTC loss function.
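As a rough illustration of the contrastive objective mentioned above, the following minimal sketch scores one transformer output against its true quantized target and a few distractors. It is not the code from this repository: the function names, the cosine similarity, and the `temperature=0.1` default are assumptions made for the example, and any auxiliary loss terms are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(context, positive, negatives, temperature=0.1):
    """Negative log-softmax of the positive's similarity among all candidates.

    `context` stands for a transformer output, `positive` for the matching
    quantized feature, and `negatives` for distractors (all hypothetical inputs).
    """
    candidates = [positive] + list(negatives)
    logits = [cosine(context, q) / temperature for q in candidates]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum - logits[0]


# The loss is small when the context matches the positive and large otherwise.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(good < bad)
```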

In addition, our model uses the Hourglass Transformer architecture for the encoder. This architecture uses fixed-sized pooling in order to reduce the time dimension T of the signal, and thus, lower the O(T²) cost of the self-attention mechanism.

Figure 2. The Hourglass Transformer module ([source](https://arxiv.org/abs/2110.13711)). The signal is processed by the initial layers and downsampled. Most of the layers operate on the downsampled signal. Finally, the signal is upsampled for the final layers. The Hourglass Transformer replaces a regular stack of transformer layers, typically improving throughput and lowering memory consumption.
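To make the effect of pooling on the O(T²) attention cost concrete, here is a small back-of-the-envelope sketch. It only counts the quadratic term for a single layer and assumes one feature vector every 20 ms, as stated above; the helper names are made up for this example.

```python
import math

FRAME_MS = 20  # one feature vector roughly every 20 ms


def num_frames(duration_ms):
    """Approximate sequence length T for an utterance of the given duration."""
    return duration_ms // FRAME_MS


def attention_cost_ratio(seq_len, pooling_factor):
    """Quadratic attention cost after pooling, relative to the unpooled cost."""
    pooled_len = math.ceil(seq_len / pooling_factor)
    return (pooled_len ** 2) / (seq_len ** 2)


frames = num_frames(10_000)              # a 10-second utterance
ratio = attention_cost_ratio(frames, 4)  # pooling factor 4
print(frames, ratio)
```

For a 10-second utterance this gives 500 frames, and a pooling factor of 4 lowers the quadratic term of a pooled layer to 1/16 of its original value. The end-to-end speedup is smaller, because only some layers are pooled and the rest of the model is not quadratic in T.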

Default configuration

The following features were implemented in this model:

general:
- multi-GPU and multi-node training
- Hourglass Transformer architecture
- dynamic loss scaling with backoff for tensor cores (mixed precision) training
- mixed-precision training with O2 optimization level, based on float16 or bfloat16
training:
- support for variable batch size without changing batch-dependent constants for the loss function
inference:
- masking for inference with a larger batch

Our main recipes replicate the Base model described in the wav2vec 2.0 paper, and use Hourglass Transformer with pooling factor 4. Note that Hourglass Transformer can be entirely disabled and this codebase is compatible with Fairseq checkpoints.

Below we present performance numbers for the Hourglass Transformer with different pooling factors (Base model, pre-training, A100 80GB GPU, bfloat16):

Configuration	Throughput speedup	GPU memory (% of Baseline)
Baseline	1.00	100.00%
Hourglass factor=2	1.25	70.98%
Hourglass factor=3	1.33	64.31%
Hourglass factor=4 (default)	1.37	62.35%
Hourglass factor=5	1.39	60.00%
Hourglass factor=6	1.40	59.61%

Feature support matrix

This model supports the following features:

Feature	wav2vec 2.0
Multi-node training	yes
Automatic mixed precision (AMP)	yes

Features

Automatic Mixed Precision (AMP) This implementation uses automatic mixed-precision training ported from Fairseq. It allows us to use FP16 or BF16 training with FP16 master weights.

Mixed precision training

Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in NVIDIA Volta, and following with both the NVIDIA Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training previously required two steps:

Porting the model to use the FP16 data type where appropriate.
Adding loss scaling to preserve small gradient values.
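The second step, loss scaling, is typically made dynamic: the scale is lowered when gradients overflow and raised again after a run of clean steps. The class below is a minimal sketch of that idea only. It is not the scaler used by this repository or by Fairseq, and the class name and all default constants are illustrative assumptions.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler with backoff (illustrative constants)."""

    def __init__(self, init_scale=2.0 ** 15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps without overflow

    def update(self, found_overflow):
        """Adjust the scale; return True if the optimizer step should be applied."""
        if found_overflow:
            # Inf/NaN gradients: shrink the scale and skip this update.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
print(scaler.update(found_overflow=True), scaler.scale)
```

In a training loop, the loss would be multiplied by `scaler.scale` before the backward pass and the gradients divided by it before the optimizer step.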

For information about:

How to train using mixed precision, refer to the Mixed Precision Training paper and Training With Mixed Precision documentation.
Techniques used for mixed precision training, refer to the Mixed-Precision Training of Deep Neural Networks blog.

Enabling mixed precision

For training and inference, mixed precision can be enabled by adding the --fp16 flag or --bf16 flag, depending on the target’s lower precision. NVIDIA Ampere and later architectures provide hardware support for bfloat16, which is beneficial for this model, as it skips certain stabilizing FP32 casts. For NVIDIA Volta and NVIDIA Turing architectures, select --fp16.
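The choice between the two flags can be sketched as a small helper keyed on the CUDA compute capability, where Volta is 7.0, Turing is 7.5, and Ampere is 8.x. The helper name is hypothetical and is not part of this repository; on a real system the major version could be read with `torch.cuda.get_device_capability()`.

```python
def precision_flag(compute_capability_major):
    """Pick a mixed-precision flag from the GPU's compute capability.

    Ampere and newer (major version 8 or higher) support bfloat16 in hardware;
    Volta and Turing (major version 7) fall back to float16.
    """
    return "--bf16" if compute_capability_major >= 8 else "--fp16"


print(precision_flag(8))  # Ampere
print(precision_flag(7))  # Volta or Turing
```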

Enabling TF32

TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on NVIDIA Volta GPUs.

TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require a high dynamic range for weights or activations.

For more information, refer to the TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x blog post.

TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.

Glossary

Brain Floating Point (bfloat16) A 16-bit floating point format that uses an 8-bit exponent, a 7-bit fraction, and a sign bit. Unlike float16, which uses a 5-bit exponent, bfloat16 keeps the same 8-bit exponent width as float32, and with it the robustness to wide ranges of values during training.

Fine-tuning Training an already pretrained model further using a task-specific dataset for subject-specific refinements by adding task-specific layers on top if required.

Hourglass Transformer Architecture proposed in the paper Hierarchical Transformers Are More Efficient Language Models, which improves resource consumption of a stack of transformer layers, in many cases retaining the accuracy.

Pre-training Training a model on vast amounts of data on the same (or different) task to build a general understanding.

Transformer The paper Attention Is All You Need introduces a novel architecture called transformer that uses an attention mechanism and transforms one sequence into another.

Connectionist Temporal Classification (CTC) Loss A loss function introduced in Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. It calculates the probability of all valid output sequences with repetitions, and allows training end-to-end ASR models without any prior alignment of transcriptions to audio.
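The bfloat16 layout from the glossary entry above can be inspected with the standard library alone. This sketch converts a float32 to bfloat16 by plain truncation of the lower 16 bits; real hardware and frameworks usually round instead, so treat the truncation as a simplifying assumption. The function names are made up for the example.

```python
import struct


def float32_bits(x):
    """Return the 32-bit pattern of x stored as an IEEE 754 float32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16_bits(x):
    """Keep the upper 16 bits of float32: sign, 8-bit exponent, 7-bit fraction."""
    return float32_bits(x) >> 16


def from_bfloat16_bits(bits):
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]


def split_fields(bits):
    """Split a bfloat16 bit pattern into (sign, exponent, fraction)."""
    return (bits >> 15) & 0x1, (bits >> 7) & 0xFF, bits & 0x7F


print(split_fields(to_bfloat16_bits(1.0)))
# 1e30 is far above the largest finite float16 value (65504), yet it survives.
print(from_bfloat16_bits(to_bfloat16_bits(1e30)))
```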

Setup

The following section lists the requirements you need to meet in order to start training the wav2vec 2.0 model.

Requirements

This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:

NVIDIA Docker
PyTorch 22.11-py3 NGC container or newer
Supported GPUs: NVIDIA Volta, NVIDIA Turing, or NVIDIA Ampere architecture

For more information about how to get started with NGC containers, refer to the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation.

For those unable to use the PyTorch NGC container to set up the required environment or create your own container, refer to the versioned NVIDIA Container Support Matrix.

Quick Start Guide

To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the wav2vec 2.0 model on the LibriSpeech dataset. For the specifics concerning training and inference, refer to the Advanced section.

Clone the repository.
```
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/PyTorch/SpeechRecognition/wav2vec2
```
Build the 22.11-py3 PyTorch NGC container and start an interactive session to run training/inference. DATASET_DIR on the host will be mounted as /datasets inside the container.
```
bash scripts/docker/build.sh
DATASET_DIR=[PATH] bash scripts/docker/run.sh
```
Download and preprocess the dataset. The dataset size is about 70GB and this step could take up to a few hours to complete.
```
bash scripts/download_data.sh
```
Generate filelists.
```
bash scripts/generate_filelists.sh
```
Start pre-training.
```
NUM_GPUS=[NUM] UPDATE_FREQUENCY=[NUM] NUM_CONCAT_BATCHES=[NUM] BF16=[true|false] FP16=[true|false] \
    bash scripts/pretrain_base.sh
```
Adjust the variables to maintain NUM_GPUS x NUM_CONCAT_BATCHES x UPDATE_FREQUENCY = 64. For more details, refer to Adjusting batch size and the number of GPUs and Adjusting mixed precision.

For instance:
```
# Mixed precision training on 4x A100 40GB
NUM_GPUS=4 NUM_CONCAT_BATCHES=8 UPDATE_FREQUENCY=2 BF16=true bash scripts/pretrain_base.sh
```
Start fine-tuning.
```
PRETRAINED_MODEL=[PATH] NUM_GPUS=[NUM] UPDATE_FREQUENCY=[NUM] BF16=[true|false] FP16=[true|false] \
    bash scripts/finetune_base_960h.sh
```
Adjust the variables to maintain NUM_GPUS x NUM_CONCAT_BATCHES x UPDATE_FREQUENCY = 8.

Start inference/predictions.
```
FINETUNED_MODEL=[PATH] BF16=[true|false] FP16=[true|false] BATCH_SIZE=[NUM] bash scripts/inference.sh
```

Now that you have your model trained and evaluated, you can choose to compare your training results with our Training accuracy results. You can also choose to benchmark your performance to Training performance benchmark or Inference performance benchmark. Following the steps in these sections ensures you achieve the same accuracy and performance results as stated in the Results section.

Advanced

The following sections provide greater details of the dataset, running training and inference, and the training results.

Scripts and sample code

In the root directory, the most important files are:

.
├── common                         # Generic code for training
│   ├── fairseq                    # Parts of https://github.com/facebookresearch/fairseq
│   └── ...
├── inference.py                   # Evaluates trained models and measures latency
├── scripts
│   ├── download_wav2vec2_base.sh  # Downloads pre-trained models from NGC
│   ├── finetune_base_960h.sh      # Helper script for fine-tuning with train.py
│   ├── inference.sh               # Helper script for inference.py
│   ├── pretrain_base.sh           # Helper script for pre-training with train.py
│   └── ...
├── train.py                       # Main pre-training and fine-tuning script
├── utils                          # Misc standalone Python scripts
└── wav2vec2                       # Code specific to wav2vec 2.0 model
    ├── arg_parser.py
    ├── criterion.py
    ├── logging.py
    ├── model.py
    └── utils.py

Parameters

Parameters can be set through environment variables. The most important available parameters for scripts/pretrain_base.sh script are:

OUTPUT_DIR              directory for results, logs, and created checkpoints
                        (default: "./results/pretrain_base")
NUM_GPUS                number of GPUs to use. (default: 8)
MAX_TOKENS              upper limit for the number of tokens in a batch; changing
                        this value alters loss function consts (default: 1400000)
NUM_CONCAT_BATCHES      number of sub-batches, each with MAX_TOKENS tokens,
                        to make up one large batch (default: 8)
UPDATE_FREQ             number of grad accumulation steps before the update (default: 1)
MAX_UPDATE              training length expressed as the number of updates (default: 400000)
LEARNING_RATE           peak learning rate (default: 0.0005)
SEED                    random seed controlling model weights and data shuffling (default: disabled)
FP16                    enables mixed-precision training with float16 (default: false)
BF16                    enables mixed-precision training with bfloat16 (default: false)
DATASET_DIR             directory with file lists (default: /datasets/LibriSpeech)
TRAIN_SUBSET            base name of the .tsv file list in the DATASET_DIR (default: "train-full-960")
VALID_SUBSET            base name of the validation .tsv file list in the DATASET_DIR (default: "dev-other")
SAVE_FREQUENCY          frequency of saving checkpoints to disk (default: 1)
HOURGLASS_CONFIG        configuration of Hourglass Transformer; refer to the section
                        below for details (default: "[2,(8,4),2]")

In addition, important parameters for scripts/finetune_base_960h.sh script are:

PRETRAINED_MODEL        a path to a pre-trained model checkpoint for fine-tuning
                        (default: "./results/pretrain_base/wav2vec2_update400000.pt")
FREEZE_FINETUNE_UPDATES freeze wav2vec 2.0 encoder for an initial number of steps and train only
                        the output linear projection (default: 0)

Below we present more details on how to set crucial parameters.

Adjusting batch size and the number of GPUs

Every training recipe assumes a constant world size, and variables need to be adjusted to maintain that world size, for example, NUM_GPUS x NUM_CONCAT_BATCHES x UPDATE_FREQUENCY = 64 for pre-training of the Base model:

first, set NUM_GPUS to the number of available GPUs,
then, adjust NUM_CONCAT_BATCHES to a high value that does not cause out-of-memory errors
finally, adjust the update frequency that controls gradient accumulation, to maintain the effective world size.

NUM_CONCAT_BATCHES controls the number of sub-batches that are forwarded through the model, each with --max_tokens tokens. In the case of out-of-memory errors, it has to be lowered. With Hourglass Transformer and mixed-precision training, the model should fit within 12GB of GPU memory on the lowest NUM_CONCAT_BATCHES=1 setting.
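The adjustment procedure above amounts to solving the product constraint for the update frequency. A minimal sketch, with a hypothetical helper name and the pre-training target of 64 as the default:

```python
def update_frequency(num_gpus, num_concat_batches, target=64):
    """Gradient accumulation steps needed to keep the product at `target`."""
    per_step = num_gpus * num_concat_batches
    if target % per_step != 0:
        raise ValueError(
            f"{num_gpus} GPUs x {num_concat_batches} concat batches "
            f"does not divide {target}"
        )
    return target // per_step


print(update_frequency(4, 8))            # 4 GPUs, 8 concatenated batches
print(update_frequency(8, 1, target=8))  # fine-tuning target of 8
```

For example, 4 GPUs with 8 concatenated batches need 2 accumulation steps to reach 64, while 8 GPUs with 8 concatenated batches need none beyond the first.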

Adjusting mixed precision

By default, the model is trained in TF32 (A100 GPUs) or FP32 (V100 and older GPUs). Mixed-precision training can be performed in float16 or bfloat16 precisions. Training in bfloat16 is more stable and requires fewer stabilizing casts to FP32; thus, it is a bit faster. It is supported on the hardware level in NVIDIA Ampere and newer architectures. Scripts scripts/pretrain_base.sh and scripts/finetune_base_960h.sh provide env vars for setting appropriate casting flags. In order to benefit from mixed-precision training, set either BF16=true or FP16=true, depending on the architecture of the GPU.

Adjusting Hourglass Transformer

The Hourglass Transformer architecture is configurable by four parameters:

the number of initial transformer layers,
the number of middle transformer layers that process the downsampled signal,
downsampling rate,
the number of output transformer layers.

These are expressed in that exact order by a Python list without whitespace. For instance, the default setting is HOURGLASS_CONFIG="[2,(8,4),2]". It uses 12 layers in total (two initial, eight middle with a downsampling rate 4, and two output layers).
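Because the setting is a valid Python literal, it can be unpacked with the standard library. The sketch below is one possible way to read it, not necessarily how this repository parses the value, and the cost estimate counts only the quadratic self-attention term per layer; both helper names are made up for the example.

```python
import ast


def parse_hourglass_config(config):
    """Split a string such as "[2,(8,4),2]" into its four parameters."""
    initial, (middle, factor), final = ast.literal_eval(config)
    return initial, middle, factor, final


def relative_attention_cost(config):
    """Quadratic attention cost relative to an unpooled stack of equal depth."""
    initial, middle, factor, final = parse_hourglass_config(config)
    total_layers = initial + middle + final
    # Middle layers see a sequence `factor` times shorter, so factor**2 cheaper.
    cost = initial + final + middle / factor ** 2
    return cost / total_layers


print(parse_hourglass_config("[2,(8,4),2]"))
print(relative_attention_cost("[2,(8,4),2]"))
```

With the default setting, the quadratic attention term drops to 0.375 of the 12-layer baseline under this simplified model. Measured throughput gains are smaller, since the feed-forward layers and the feature extractor are not affected in the same way.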

During fine-tuning, the same architecture as during pre-training has to be set.

Command-line options

To view the full list of available options and their descriptions, use the -h or --help command-line option, for example: python train.py -h. Most of the command-line options are a subset of those from the original Fairseq wav2vec 2.0 codebase.

Getting the data

The wav2vec 2.0 model described in the paper was pre-trained on either the LibriSpeech or LibriVox datasets. We publish recipes for pre-training and fine-tuning on the LibriSpeech dataset. The dev-other subset is used as a validation dataset, and test-other is used as a testing dataset.

The ./scripts/download_ls_dataset.sh [TARGET_DATA_DIR] script downloads and extracts the LibriSpeech dataset to the directory of choice, by default /datasets/LibriSpeech if the argument is omitted.

The ./scripts/generate_ls_filelists.sh [SOURCE_DATA_DIR] [TARGET_FILELISTS_DIR] script prepares filelists and collects transcriptions. Again, positional arguments are optional and default to /datasets/LibriSpeech.

Dataset guidelines

LibriSpeech data is kept at the default sampling rate of 16 kHz. The model works with either .wav or .flac files. Both are lossless, with .flac being more efficient in terms of storage but requiring extra computation during training. Files are listed in .tsv filelists. The first line is the top-level directory, and each subsequent line lists a path to a file and its number of samples, delimited by a tab:

/datasets/LibriSpeech/test-other
367/293981/367-293981-0017.flac\t46560
367/293981/367-293981-0009.flac\t52720
...
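A filelist in this layout can be read with a few lines of Python. The sketch below is only an illustration of the format; the function name is hypothetical, and the duration is derived from the 16 kHz sampling rate mentioned above.

```python
SAMPLE_RATE = 16_000  # LibriSpeech sampling rate in Hz


def parse_filelist(text):
    """Return (full_path, num_samples) pairs from the contents of a .tsv filelist."""
    lines = [line for line in text.splitlines() if line.strip()]
    root = lines[0]  # the first line holds the top-level directory
    entries = []
    for line in lines[1:]:
        rel_path, num_samples = line.split("\t")
        entries.append((f"{root}/{rel_path}", int(num_samples)))
    return entries


example = (
    "/datasets/LibriSpeech/test-other\n"
    "367/293981/367-293981-0017.flac\t46560\n"
    "367/293981/367-293981-0009.flac\t52720\n"
)
for path, num_samples in parse_filelist(example):
    print(path, f"{num_samples / SAMPLE_RATE:.2f} s")
```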

The .ltr files, generated alongside .tsv filelists, hold character-level transcriptions for filelists with the same basename. Filelists and transcription lists should list samples in matching order.

A N D | A | V E R Y | R E S P E C T A B L E | O N E | S A I D | T H E | I N N K E E P E R |
T H E | O F F I C E R | T U R N E D | T O | H I M | A N D | S A I D | W E L L | H O W | G O E S | I T | G O O D | M A N |
...
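In the transcription lines above, letters are separated by spaces and each word ends with a `|` delimiter. A minimal sketch of converting between this layout and plain text, with hypothetical helper names:

```python
def ltr_to_text(ltr_line):
    """Turn "A N D | A | V E R Y |" into "AND A VERY"."""
    words = "".join(ltr_line.split()).split("|")
    return " ".join(word for word in words if word)


def text_to_ltr(text):
    """Turn "AND A VERY" into "A N D | A | V E R Y |"."""
    return " ".join(" ".join(word) + " |" for word in text.split())


line = "A N D | A | V E R Y | R E S P E C T A B L E | O N E |"
print(ltr_to_text(line))
print(text_to_ltr("AND A VERY"))
```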

Finally, generate a dict.ltr.txt dictionary using training .ltr transcripts:

python utils/generate_dictionary.py /my/dataset/path/train.ltr /my/dataset/path/dict.ltr.txt

Multi-dataset

In order to train on multiple datasets, prepare a filelist and transcription list with all files from those datasets. Refer to scripts/generate_filelists.sh for an example of concatenating LibriSpeech training filelists.

Training process

Training of wav2vec 2.0 is performed in two stages: unsupervised pre-training and supervised fine-tuning. Both are performed with the train.py script.

Pre-training The scripts/pretrain_base.sh script sets command-line arguments for train.py and runs a job on a single node that trains the wav2vec 2.0 model from scratch. Key variables can be conveniently changed via env variables.

Fine-tuning The scripts/finetune_base_960h.sh script sets command-line arguments for train.py and runs a job on a single node that fine-tunes a pre-trained wav2vec 2.0 model. Key variables can be conveniently changed via env variables. Note that a checkpoint trained with Fairseq can be loaded and fine-tuned using this repository.

Apart from the arguments as listed in the Parameters section, by default both training scripts:

Run on eight GPUs with at least 80GB of memory with increased batch size, so that gradient accumulation is not necessary
Use TF32 precision (A100 GPU) or FP32 (other GPUs)
Use Hourglass Transformer architecture with shortening factor of 4
Train on 960 hours of LibriSpeech training data and evaluate on the dev-other subset
Remove old checkpoints and preserve milestone checkpoints automatically
Maintain a separate checkpoint with the lowest WER on the dev set
Create a DLLogger log file and a TensorBoard log
Set the remaining parameters according to the recipes published with the original paper

The current training setup reproduces the WER results published in the original paper, while significantly lowering the time and memory required for training.
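For reference, word error rate is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below shows the standard definition in percent; it is not the evaluation code from this repository, and the function names are chosen for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER in percent over paired reference/hypothesis strings."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return 100.0 * errors / words


print(word_error_rate(["THE CAT SAT"], ["THE CAT SIT"]))
```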

Inference process

Inference is performed using the inference.py script along with parameters defined in scripts/inference.sh. The scripts/inference.sh script runs the job on a single GPU, taking a fine-tuned wav2vec 2.0 model checkpoint and running it on the specified dataset. Apart from the default arguments as listed in the Parameters section, by default, the inference script:

Evaluates on the LibriSpeech test-other dataset and prints out the final word error rate
Uses a batch size of 8
Creates a log file with progress and results, which will be stored in the results folder
Does greedy decoding and optionally saves the transcriptions in the results folder
Has the option to save the model output tensors for more complex decoding, for example, beam search
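Greedy decoding of CTC outputs, as mentioned in the list above, takes the most likely symbol per frame, collapses consecutive repeats, and drops blanks. The sketch below assumes the per-frame argmax has already been taken; the blank index of 0 and the tiny vocabulary are assumptions for the example, not values taken from this repository.

```python
def ctc_greedy_decode(frame_ids, id_to_symbol, blank_id=0):
    """Collapse repeated frame labels, drop blanks, and join letters into words."""
    symbols = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank_id:
            symbols.append(id_to_symbol[idx])
        previous = idx
    # "|" marks word boundaries in the letter-level transcriptions.
    return "".join(symbols).replace("|", " ").strip()


vocab = {1: "H", 2: "I", 3: "|"}  # hypothetical symbol table
frames = [0, 1, 1, 0, 2, 2, 3, 3, 1, 0, 1, 2]
print(ctc_greedy_decode(frames, vocab))
```

Note how the blank between the two final `H` frames keeps them as separate letters, while the repeated `H` frames at the start collapse into one.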

To view all available options for inference, run python inference.py --help

Performance

The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.

Benchmarking

The following section shows how to run benchmarks measuring the model performance in training and inference modes.

Training performance benchmark

To benchmark the training performance with a number of specific configurations, run:

NUM_GPUS=[NUM] UPDATE_FREQ=[NUM] NUM_CONCAT_BATCHES=[NUM] NUM_EPOCHS=[NUM] NUM_WARMUP_EPOCHS=[NUM] \
    BF16=[true|false] FP16=[true|false] bash scripts/pretrain_base_benchmark.sh

NUM_GPUS=[NUM] UPDATE_FREQ=[NUM] NUM_CONCAT_BATCHES=[NUM] NUM_EPOCHS=[NUM] NUM_WARMUP_EPOCHS=[NUM] \
    BF16=[true|false] FP16=[true|false] bash scripts/finetune_base_benchmark.sh

for example:

NUM_GPUS=8 UPDATE_FREQ=1 NUM_CONCAT_BATCHES=8 BF16=true bash scripts/pretrain_base_benchmark.sh
NUM_GPUS=8 UPDATE_FREQ=1 NUM_CONCAT_BATCHES=1 BF16=true bash scripts/finetune_base_benchmark.sh

By default, these scripts run initially for NUM_WARMUP_EPOCHS=2, and collect performance results for another NUM_EPOCHS=5 on the train-clean-100 subset of LibriSpeech.

Inference performance benchmark

To benchmark the inference performance on a specific batch size, run:

NUM_WARMUP_REPEATS=[NUM] NUM_REPEATS=[NUM] BATCH_SIZE=[NUM] BF16=[true|false] FP16=[true|false] \
    bash scripts/inference_benchmark.sh

for example:

NUM_WARMUP_REPEATS=2 NUM_REPEATS=10 BATCH_SIZE=8 BF16=true bash scripts/inference_benchmark.sh

By default, the model will process all samples in the test-other subset of LibriSpeech initially NUM_WARMUP_REPEATS times for warmup, and then NUM_REPEATS times recording the measurements. The number of iterations will depend on the batch size.
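The warmup-then-measure scheme can be summarized as follows. This is a sketch with made-up helper names and a nearest-rank percentile; how the benchmark script itself aggregates latencies is not stated here, so the exact percentile method is an assumption.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def summarize(latencies_ms, num_warmup):
    """Discard warmup measurements, then report average and tail latencies."""
    measured = latencies_ms[num_warmup:]
    return {
        "avg": sum(measured) / len(measured),
        "p90": percentile(measured, 90),
        "p95": percentile(measured, 95),
        "p99": percentile(measured, 99),
    }


# Two slow warmup batches followed by ten measured batches (made-up numbers).
latencies = [100.0, 90.0] + [float(i) for i in range(1, 11)]
print(summarize(latencies, num_warmup=2))
```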

Results

The following sections provide details on how we achieved our performance and accuracy in training and inference.

Training accuracy results

Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

Pre-training results were obtained by running the scripts/pretrain_base.sh training script in the PyTorch 22.11-py3 NGC container on NVIDIA A100 (8x A100 80GB) GPUs. We report a median of eight (BF16 mixed precision) and three (TF32) runs.

GPUs	(Concatenated) batch size / GPU	Accuracy - TF32	Accuracy - mixed precision	Time to train - TF32	Time to train - mixed precision	Time to train speedup (TF32 to mixed precision)
8	8 x 1400k max tokens	0.619	0.633	64.9 h	48.1 h	1.35

Fine-tuning results were obtained by running the scripts/finetune_base_960h.sh training script in the PyTorch 22.11-py3 NGC container on NVIDIA A100 (8x A100 80GB) GPUs. We report a median of eight runs; each resumed from a different pre-training checkpoint.

GPUs	(Concatenated) batch size / GPU	WER - mixed precision	Time to train - TF32	Time to train - mixed precision	Time to train speedup (TF32 to mixed precision)
8	1 x 3200k max tokens	8.878	8.2 h	6.5 h	1.27

Training stability test

The wav2vec 2.0 Base model was pre-trained with eight different initial random seeds in bfloat16 precision in the PyTorch 22.11-py3 NGC container on NVIDIA DGX A100 with 8x A100 80GB.

Below we present accuracy of this model in the self-training task:

Update	Average	Std	Min	Max	Median
50k	0.491	0.011	0.471	0.514	0.493
100k	0.537	0.009	0.518	0.550	0.539
150k	0.564	0.009	0.544	0.577	0.564
200k	0.580	0.009	0.558	0.589	0.583
250k	0.599	0.008	0.586	0.607	0.602
300k	0.610	0.010	0.589	0.622	0.611
350k	0.619	0.009	0.607	0.634	0.617
400k	0.629	0.007	0.614	0.636	0.633

Afterward, each of those runs was fine-tuned on LibriSpeech 960 h dataset with yet another different initial random seed. Below we present the word error rate (WER) on the dev-other subset of LibriSpeech:

Update	Average	Std	Min	Max	Median
50k	11.198	0.303	10.564	11.628	11.234
100k	10.825	0.214	10.574	11.211	10.763
150k	10.507	0.160	10.224	10.778	10.518
200k	9.567	0.186	9.235	9.836	9.530
250k	9.115	0.193	8.764	9.339	9.194
300k	8.885	0.201	8.507	9.151	8.972
320k	8.827	0.188	8.440	9.043	8.878

Training performance results

Training performance: NVIDIA DGX A100 (8x A100 80GB)

Pre-training

Our results were obtained by running the scripts/pretrain_base_benchmark.sh training script in the PyTorch 22.11-py3 NGC container on NVIDIA A100 (8x A100 80GB) GPUs. Performance numbers in transformer tokens per second were averaged over an entire training epoch.

GPUs	Concat batches / GPU	Grad accumulation	Throughput - TF32	Throughput - mixed precision	Throughput speedup (TF32 to mixed precision)	Strong scaling - TF32	Strong scaling - mixed precision
1	8	8	28045.27	37609.84	1.34	1.00	1.00
4	8	2	103842.47	138956.38	1.34	3.70	3.69
8	8	1	194306.46	261881.29	1.35	6.93	6.96
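The derived columns in the table above follow directly from the throughput columns: the speedup is mixed-precision throughput divided by TF32 throughput, and strong scaling is throughput on N GPUs divided by throughput on one GPU. A short sketch that recomputes them from the reported numbers:

```python
# GPUs -> (TF32 throughput, mixed-precision throughput), copied from the table.
throughput = {
    1: (28045.27, 37609.84),
    4: (103842.47, 138956.38),
    8: (194306.46, 261881.29),
}


def speedup(tf32, mixed):
    """Throughput gain of mixed precision over TF32."""
    return mixed / tf32


def strong_scaling(throughput_n, throughput_1):
    """Throughput on N GPUs relative to a single GPU."""
    return throughput_n / throughput_1


base_tf32, base_mixed = throughput[1]
for gpus, (tf32, mixed) in throughput.items():
    print(gpus,
          f"{speedup(tf32, mixed):.2f}",
          f"{strong_scaling(tf32, base_tf32):.2f}",
          f"{strong_scaling(mixed, base_mixed):.2f}")
```

On eight GPUs, a TF32 scaling factor of about 6.93 corresponds to roughly 87% of ideal linear scaling.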

To achieve these same results, follow the steps in the Quick Start Guide.

Fine-tuning

Our results were obtained by running the scripts/finetune_base_benchmark.sh training script in the PyTorch 22.11-py3 NGC container on NVIDIA A100 (8x A100 80GB) GPUs. Performance numbers in transformer tokens per second were averaged over an entire training epoch.

GPUs	Concat batches / GPU	Grad accumulation	Throughput - TF32	Throughput - mixed precision	Throughput speedup (TF32 to mixed precision)	Strong scaling - TF32	Strong scaling - mixed precision
1	8	1	34813.46	41275.76	1.19	1.00	1.00
4	2	1	102326.57	132361.62	1.29	2.94	3.21
8	1	1	163610.16	207200.91	1.27	4.70	5.02

To achieve these same results, follow the steps in the Quick Start Guide.

Inference performance results

Inference performance: NVIDIA DGX A100 (1x A100 80GB)

Our results were obtained by running the scripts/inference_benchmark.sh inference benchmarking script in the PyTorch 22.11-py3 NGC container on the NVIDIA A100 (1x A100 80GB) GPU. The script runs inference on the test-other subset of LibriSpeech in variable-length batches.

BS	Duration (avg)	BF16 latency 90% (ms)	BF16 latency 95% (ms)	BF16 latency 99% (ms)	BF16 latency avg (ms)	TF32 latency 90% (ms)	TF32 latency 95% (ms)	TF32 latency 99% (ms)	TF32 latency avg (ms)	BF16/TF32 speedup (avg)
1	6.54 s	11.02	11.41	12.42	10.45	10.88	11.23	12.51	10.31	0.99
4	6.54 s	21.74	24.12	35.80	17.69	23.17	26.85	41.62	18.42	1.04
8	6.54 s	40.06	48.07	74.59	28.70	46.43	54.86	88.73	31.30	1.09
16	6.54 s	88.78	117.40	151.37	58.82	102.64	135.92	175.68	67.44	1.15

To achieve these same results, follow the steps in the Quick Start Guide.

Release notes

Changelog

December 2022

Initial release

Known issues

There are no known issues in this release.
