|
|
@@ -1,6 +1,6 @@
|
|
|
# GNMT v2 For TensorFlow
|
|
|
|
|
|
-This repository provides a script and recipe to train the GNMT v2 model to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
|
|
|
+This repository provides a script and recipe to train the GNMT v2 model to achieve state-of-the-art accuracy and is tested and maintained by NVIDIA.
|
|
|
|
|
|
## Table Of Contents
|
|
|
- [Model overview](#model-overview)
|
|
|
@@ -10,6 +10,7 @@ This repository provides a script and recipe to train the GNMT v2 model to achie
|
|
|
* [Features](#features)
|
|
|
* [Mixed precision training](#mixed-precision-training)
|
|
|
* [Enabling mixed precision](#enabling-mixed-precision)
|
|
|
+ * [Enabling TF32](#enabling-tf32)
|
|
|
- [Setup](#setup)
|
|
|
* [Requirements](#requirements)
|
|
|
- [Quick Start Guide](#quick-start-guide)
|
|
|
@@ -29,13 +30,17 @@ This repository provides a script and recipe to train the GNMT v2 model to achie
|
|
|
* [Inference performance benchmark](#inference-performance-benchmark)
|
|
|
* [Results](#results)
|
|
|
* [Training accuracy results](#training-accuracy-results)
|
|
|
- * [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-(8x-v100-16G))
|
|
|
+ * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
|
|
|
+ * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
|
|
|
* [Training stability test](#training-stability-test)
|
|
|
* [Inference accuracy results](#inference-accuracy-results)
|
|
|
- * [Inference accuracy: NVIDIA DGX-1 (8x V100 16G)](#inference-accuracy-nvidia-dgx-1-(8x-v100-16G))
|
|
|
+ * [Inference accuracy: NVIDIA DGX-1 (8x V100 16GB)](#inference-accuracy-nvidia-dgx-1-8x-v100-16gb)
|
|
|
* [Training performance results](#training-performance-results)
|
|
|
- * [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-(8x-v100-16G))
|
|
|
+ * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
|
|
|
+ * [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
|
|
|
* [Inference performance results](#inference-performance-results)
|
|
|
+ * [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-1x-a100-40gb)
|
|
|
+ * [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
|
|
|
* [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
|
|
|
- [Release notes](#release-notes)
|
|
|
* [Changelog](#changelog)
|
|
|
@@ -60,7 +65,7 @@ GNMT-like models from
|
|
|
and
|
|
|
[NVIDIA OpenSeq2Seq Toolkit](https://github.com/NVIDIA/OpenSeq2Seq).
|
|
|
|
|
|
-This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results 2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
|
|
|
+This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
|
|
|
|
|
|
### Model architecture
|
|
|
|
|
|
@@ -139,47 +144,64 @@ The following features are supported by this model.
|
|
|
|
|
|
The following features are supported by this model.
|
|
|
|
|
|
-* Automatic Mixed Precision (TF-AMP) - enables mixed precision training without any changes to the code-base by performing automatic graph rewrites and loss scaling controlled by an environmental variable.
|
|
|
+* Automatic Mixed Precision (AMP) - Computation graphs can be modified by TensorFlow on runtime to support mixed precision training. Detailed explanation of mixed precision can be found in the next section.
|
|
|
+
|
|
|
|
|
|
|
|
|
### Mixed precision training
|
|
|
|
|
|
-Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
|
|
|
+Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using [mixed precision training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) previously required two steps:
|
|
|
1. Porting the model to use the FP16 data type where appropriate.
|
|
|
2. Adding loss scaling to preserve small gradient values.
|
|
|
|
|
|
-The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
|
|
|
+This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full [mixed precision methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow) in your existing TensorFlow model code. AMP enables mixed precision training on Volta, Turing, and NVIDIA Ampere GPU architectures automatically. The TensorFlow framework code makes all necessary model changes internally.
|
|
|
+
|
|
|
+In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16, and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work with the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment variable to perform only the automatic mixed-precision optimization. It accomplishes this by automatically rewriting all computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
|
|
|
|
|
|
For information about:
|
|
|
-- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
|
|
|
+- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) documentation.
|
|
|
- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
|
|
|
- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
|
|
|
+- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
|
|
|
|
|
|
#### Enabling mixed precision
|
|
|
|
|
|
-To enable this feature inside the container, simply set a single environment variable:
|
|
|
+Mixed precision is enabled in TensorFlow by using the Automatic Mixed Precision (TF-AMP) extension which casts variables to half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients. In TensorFlow, loss scaling can be applied statically by using simple multiplication of loss by a constant value or automatically, by TF-AMP. Automatic mixed precision makes all the adjustments internally in TensorFlow, providing two benefits over manual operations. First, programmers need not modify network model code, reducing development and maintenance effort. Second, using AMP maintains forward and backward compatibility with all the APIs for defining and running TensorFlow models.
|
|
|
|
|
|
-```bash
|
|
|
-export TF_ENABLE_AUTO_MIXED_PRECISION=1
|
|
|
-```
|
|
|
+To enable mixed precision, you can simply add the values to the environmental variables inside your training script:
|
|
|
+- Enable TF-AMP graph rewrite:
|
|
|
+ ```
|
|
|
+ os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
|
|
|
+ ```
|
|
|
|
|
|
-As an alternative, the environment variable can be set inside the TensorFlow Python script:
|
|
|
+- Enable Automated Mixed Precision:
|
|
|
+ ```
|
|
|
+ os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
|
|
|
+ ```
|
|
|
|
|
|
-```python
|
|
|
-os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
|
|
|
-```
|
|
|
+#### Enabling TF32
|
|
|
+
|
|
|
+TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
|
|
|
+
|
|
|
+TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
|
|
|
+
|
|
|
+For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
|
|
|
+
|
|
|
+TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
|
|
|
|
|
|
## Setup
|
|
|
|
|
|
-The following section lists the requirements in order to start training the GNMT
|
|
|
-v2 model.
|
|
|
+The following section lists the requirements that you need to meet in order to start training the GNMT v2 model.
|
|
|
|
|
|
### Requirements
|
|
|
|
|
|
This repository contains Dockerfile which extends the TensorFlow NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
|
|
|
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
|
|
|
-- [TensorFlow 19.07-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow)
|
|
|
-- [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
|
|
|
+- [TensorFlow 20.06-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow)
|
|
|
+- Supported GPUs:
|
|
|
+ - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
|
|
|
+ - [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
|
|
|
+ - [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
|
|
|
|
|
|
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
|
|
|
- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
|
|
|
@@ -191,9 +213,10 @@ For those unable to use the TensorFlow NGC container, to set up the required env
|
|
|
|
|
|
## Quick Start Guide
|
|
|
|
|
|
-To train your model using mixed precision with Tensor Cores or using FP32,
|
|
|
+To train your model using mixed or TF32 precision with Tensor Cores or using FP32,
|
|
|
perform the following steps using the default parameters of the GNMT v2 model
|
|
|
on the WMT16 English German dataset.
|
|
|
+For the specifics concerning training and inference, see the [Advanced](#advanced) section.
|
|
|
|
|
|
**1. Clone the repository.**
|
|
|
```
|
|
|
@@ -232,25 +255,25 @@ argument.
|
|
|
To launch mixed precision training on 1 GPU, run:
|
|
|
|
|
|
```
|
|
|
-python nmt.py --output_dir=results --batch_size=128 --learning_rate=5e-4
|
|
|
+python nmt.py --output_dir=results --batch_size=128 --learning_rate=5e-4 --amp
|
|
|
```
|
|
|
|
|
|
To launch mixed precision training on 8 GPUs, run:
|
|
|
|
|
|
```
|
|
|
-python nmt.py --output_dir=results --batch_size=1024 --num_gpus=8 --learning_rate=2e-3
|
|
|
+python nmt.py --output_dir=results --batch_size=1024 --num_gpus=8 --learning_rate=2e-3 --amp
|
|
|
```
|
|
|
|
|
|
-To launch FP32 training on 1 GPU, run:
|
|
|
+To launch FP32 (TF32 on NVIDIA Ampere GPUs) training on 1 GPU, run:
|
|
|
|
|
|
```
|
|
|
-python nmt.py --output_dir=results --batch_size=128 --learning_rate=5e-4 --use_amp=false
|
|
|
+python nmt.py --output_dir=results --batch_size=128 --learning_rate=5e-4
|
|
|
```
|
|
|
|
|
|
-To launch FP32 training on 8 GPUs, run:
|
|
|
+To launch FP32 (TF32 on NVIDIA Ampere GPUs) training on 8 GPUs, run:
|
|
|
|
|
|
```
|
|
|
-python nmt.py --output_dir=results --batch_size=1024 --num_gpus=8 --learning_rate=2e-3 --use_amp=false
|
|
|
+python nmt.py --output_dir=results --batch_size=1024 --num_gpus=8 --learning_rate=2e-3
|
|
|
```
|
|
|
|
|
|
**6. Start evaluation.**
|
|
|
@@ -263,13 +286,13 @@ training.
|
|
|
To launch mixed precision inference on 1 GPU, run:
|
|
|
|
|
|
```
|
|
|
-python nmt.py --output_dir=results --infer_batch_size=128 --mode=infer
|
|
|
+python nmt.py --output_dir=results --infer_batch_size=128 --mode=infer --amp
|
|
|
```
|
|
|
|
|
|
-To launch FP32 inference on 1 GPU, run:
|
|
|
+To launch FP32 (TF32 on NVIDIA Ampere GPUs) inference on 1 GPU, run:
|
|
|
|
|
|
```
|
|
|
-python nmt.py --output_dir=results --infer_batch_size=128 --use_amp=false --mode=infer
|
|
|
+python nmt.py --output_dir=results --infer_batch_size=128 --mode=infer
|
|
|
```
|
|
|
|
|
|
**7. Start translation.**
|
|
|
@@ -341,7 +364,7 @@ The most useful arguments are as follows:
|
|
|
--beam_width BEAM_WIDTH
|
|
|
beam width when using beam search decoder. If 0, use
|
|
|
standard decoder with greedy helper.
|
|
|
- --use_amp USE_AMP use_amp for training and inference
|
|
|
+ --amp use amp for training and inference
|
|
|
--mode {train_and_eval,infer,translate}
|
|
|
```
|
|
|
|
|
|
@@ -449,7 +472,7 @@ pre-trained model checkpoint and tokenized input (for validation) and non-token
|
|
|
|
|
|
#### Validation process
|
|
|
|
|
|
-The `nmt.py` script, supports batched validation (`--mode=infer` flag). By
|
|
|
+The `nmt.py` script supports batched validation (`--mode=infer` flag). By
|
|
|
default, it launches beam search with beam size of 5, coverage penalty term and
|
|
|
length normalization term. Greedy decoding can be enabled by setting the
|
|
|
`--beam_width=1` flag for the `nmt.py` inference script. To control the
|
|
|
@@ -459,7 +482,7 @@ To view all available options for validation, run `python nmt.py --help`.
|
|
|
|
|
|
#### Translation process
|
|
|
|
|
|
-The `nmt.py` script, supports batched translation (`--mode=translate` flag). By
|
|
|
+The `nmt.py` script supports batched translation (`--mode=translate` flag). By
|
|
|
default, it launches beam search with beam size of 5, coverage penalty term and
|
|
|
length normalization term. Greedy decoding can be enabled by setting the
|
|
|
`--beam_width=1` flag for the `nmt.py` prediction script. To control the
|
|
|
@@ -482,8 +505,8 @@ The following section shows how to run benchmarks measuring the model performanc
|
|
|
|
|
|
To benchmark training performance, run:
|
|
|
|
|
|
-* `python nmt.py --output_dir=results --max_train_epochs=1 --num_gpus <num GPUs> --batch_size <total batch size>` for mixed precision
|
|
|
-* `python nmt.py --output_dir=results --max_train_epochs=1 --num_gpus <num GPUs> --batch_size <total batch size> --use_amp=false` for FP32
|
|
|
+* `python nmt.py --output_dir=results --max_train_epochs=1 --num_gpus <num GPUs> --batch_size <total batch size> --amp` for mixed precision
|
|
|
+* `python nmt.py --output_dir=results --max_train_epochs=1 --num_gpus <num GPUs> --batch_size <total batch size>` for FP32/TF32
|
|
|
|
|
|
|
|
|
The log file will contain training performance in the following format:
|
|
|
@@ -496,11 +519,11 @@ training time for epoch 1: 25.75 mins (3625.19 sent/sec, 173461.27 tokens/sec)
|
|
|
|
|
|
To benchmark inference performance, run the `scripts/translate.py` script:
|
|
|
|
|
|
-* For FP32:
|
|
|
- `python scripts/translate.py --output_dir=/path/to/trained/model --use_amp=false --beam_width <comma separated beam widths> --infer_batch_size <comma separated batch sizes>`
|
|
|
+* For FP32/TF32:
|
|
|
+ `python scripts/translate.py --output_dir=/path/to/trained/model --beam_width <comma separated beam widths> --infer_batch_size <comma separated batch sizes>`
|
|
|
|
|
|
* For mixed precision
|
|
|
- `python scripts/translate.py --output_dir=/path/to/trained/model --beam_width <comma separated beam widths> --infer_batch_size <comma separated batch sizes>`
|
|
|
+ `python scripts/translate.py --output_dir=/path/to/trained/model --amp --beam_width <comma separated beam widths> --infer_batch_size <comma separated batch sizes>`
|
|
|
|
|
|
The benchmark requires a checkpoint from a fully trained model.
|
|
|
|
|
|
@@ -512,10 +535,20 @@ accuracy in training and inference.
|
|
|
|
|
|
#### Training accuracy results
|
|
|
|
|
|
-##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
|
|
|
+##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
|
|
|
+
|
|
|
+Our results were obtained by running the `examples/DGXA100_{TF32,AMP}_8GPU.sh`
|
|
|
+training script in the tensorflow-20.06-tf1-py3 NGC container
|
|
|
+on NVIDIA DGX A100 (8x A100 40GB) GPUs.
|
|
|
+
|
|
|
+| **GPUs** | **Batch size / GPU** |**Accuracy - mixed precision (BLEU)** | **Accuracy - TF32 (BLEU)** | **Time to train - mixed precision** | **Time to train - TF32** | **Time to train speedup (TF32 to mixed precision)** |
|
|
|
+| --- | --- | ----- | ----- | -------- | -------- | ---- |
|
|
|
+| 8 | 128 | 25.1 | 24.31 | 96 min | 139 min | 1.45 |
|
|
|
+
|
|
|
+##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
|
|
|
|
|
|
Our results were obtained by running the `nmt.py` script in the
|
|
|
-tensorflow-19.07-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
|
|
|
+tensorflow-19.07-py3 NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
|
|
|
|
|
|
| **GPUs** | **Batch size / GPU** |**Accuracy - mixed precision (BLEU)** | **Accuracy - FP32 (BLEU)** | **Time to train - mixed precision** | **Time to train - FP32** | **Time to train speedup (FP32 to mixed precision)** |
|
|
|
| --- | --- | ----- | ----- | -------- | -------- | ---- |
|
|
|
@@ -534,30 +567,42 @@ configurations are displayed.
|
|
|
The GNMT v2 model was trained for 6 epochs, starting from 6 different initial
|
|
|
random seeds. After each training epoch, the model was evaluated on the test
|
|
|
dataset and the BLEU score was recorded. The training was performed in the
|
|
|
-tensorflow-19.07-py3 NGC container on NVIDIA DGX-1 with 8 Tesla V100 16G GPUs.
|
|
|
+tensorflow-20.06-tf1-py3 NGC container.
|
|
|
|
|
|
-In the following table, the BLEU scores after each training epoch for different
|
|
|
+In the following tables, the BLEU scores after each training epoch for different
|
|
|
initial random seeds are displayed.
|
|
|
|
|
|
-| **Epoch** | **Average** | **Standard deviation** | **Minimum** | **Maximum** | **Median** |
|
|
|
-| --- | ------ | ----- | ------ | ------ | ------ |
|
|
|
-| 1 | 20.365 | 0.096 | 20.200 | 20.480 | 20.385 |
|
|
|
-| 2 | 22.002 | 0.080 | 21.900 | 22.110 | 22.000 |
|
|
|
-| 3 | 22.793 | 0.078 | 22.690 | 22.890 | 22.790 |
|
|
|
-| 4 | 23.220 | 0.160 | 22.890 | 23.360 | 23.260 |
|
|
|
-| 5 | 24.007 | 0.153 | 23.870 | 24.220 | 23.925 |
|
|
|
-| 6 | 24.362 | 0.167 | 24.210 | 24.710 | 24.310 |
|
|
|
+###### NVIDIA DGX A100 with 8 Ampere A100 40GB GPUs with TF32.
|
|
|
+
|
|
|
+| Epoch | Average | Standard deviation | Minimum | Median | Maximum |
|
|
|
+| ----- | ------- | ------------------ | ------- | ------ | ------- |
|
|
|
+| 1 | 20.272 | 0.165 | 19.760 | 20.295 | 20.480 |
|
|
|
+| 2 | 21.911 | 0.145 | 21.650 | 21.910 | 22.230 |
|
|
|
+| 3 | 22.731 | 0.140 | 22.490 | 22.725 | 23.020 |
|
|
|
+| 4 | 23.142 | 0.164 | 22.930 | 23.090 | 23.440 |
|
|
|
+| 5 | 23.967 | 0.137 | 23.760 | 23.940 | 24.260 |
|
|
|
+| 6 | 24.358 | 0.143 | 24.120 | 24.360 | 24.610 |
|
|
|
|
|
|
+###### NVIDIA DGX-1 with 8 Tesla V100 16GB GPUs with FP32.
|
|
|
+
|
|
|
+| Epoch | Average | Standard deviation | Minimum | Median | Maximum |
|
|
|
+| ----- | ------- | ------------------ | ------- | ------ | ------- |
|
|
|
+| 1 | 20.259 | 0.225 | 19.820 | 20.300 | 20.590 |
|
|
|
+| 2 | 21.954 | 0.194 | 21.540 | 21.955 | 22.370 |
|
|
|
+| 3 | 22.729 | 0.150 | 22.480 | 22.695 | 23.110 |
|
|
|
+| 4 | 23.218 | 0.210 | 22.820 | 23.225 | 23.470 |
|
|
|
+| 5 | 23.921 | 0.114 | 23.680 | 23.910 | 24.080 |
|
|
|
+| 6 | 24.381 | 0.131 | 24.160 | 24.375 | 24.590 |
|
|
|
|
|
|
#### Inference accuracy results
|
|
|
|
|
|
-##### Inference accuracy: NVIDIA DGX-1 (8x V100 16G)
|
|
|
+##### Inference accuracy: NVIDIA DGX-1 (8x V100 16GB)
|
|
|
|
|
|
-Our results were obtained by running the `scripts/translate.py` script in the tensorflow-19.07-py3 NGC container on NVIDIA DGX-1 8x V100 16G GPUs.
|
|
|
+Our results were obtained by running the `scripts/translate.py` script in the tensorflow-19.07-py3 NGC container on NVIDIA DGX-1 8x V100 16GB GPUs.
|
|
|
|
|
|
-* For mixed precision: `python scripts/translate.py --output_dir=/path/to/trained/model --beam_width 1,2,5 --infer_batch_size 128`
|
|
|
+* For mixed precision: `python scripts/translate.py --output_dir=/path/to/trained/model --beam_width 1,2,5 --infer_batch_size 128 --amp`
|
|
|
|
|
|
-* For FP32: `python scripts/translate.py --output_dir=/path/to/trained/model --beam_width 1,2,5 --infer_batch_size 128 --use_amp=false`
|
|
|
+* For FP32: `python scripts/translate.py --output_dir=/path/to/trained/model --beam_width 1,2,5 --infer_batch_size 128`
|
|
|
|
|
|
| **Batch size** | **Beam size** | **Mixed precision BLEU** | **FP32 BLEU** |
|
|
|
|:---:|:---:|:---:|:---:|
|
|
|
@@ -567,7 +612,25 @@ Our results were obtained by running the `scripts/translate.py` script in the te
|
|
|
|
|
|
#### Training performance results
|
|
|
|
|
|
-##### Training performance: NVIDIA DGX-1 (8x V100 16G)
|
|
|
+##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
|
|
|
+
|
|
|
+Our results were obtained by running the `examples/DGXA100_{TF32,AMP}_{1,8}GPU.sh`
|
|
|
+training script in the tensorflow-20.06-tf1-py3 NGC container
|
|
|
+on NVIDIA DGX A100 (8x A100 40GB) GPUs.
|
|
|
+Performance numbers (in items/images per second)
|
|
|
+were averaged over an entire training epoch.
|
|
|
+
|
|
|
+| **GPUs** | **Batch size / GPU** | **Throughput - mixed precision (tokens/s)** | **Throughput - TF32 (tokens/s)** | **Throughput speedup (TF32 - mixed precision)** | **Weak scaling - mixed precision** | **Weak scaling - TF32** |
|
|
|
+| --- | --- | ------- | ------- | ---- | ---- | ---- |
|
|
|
+| 1 | 128 | 29 911 | 31 110 | 0.96 | 1.00 | 1.00 |
|
|
|
+| 8 | 128 | 181 384 | 175 292 | 1.03 | 6.06 | 5.63 |
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+To achieve these same results, follow the steps in the
|
|
|
+[Quick Start Guide](#quick-start-guide).
|
|
|
+
|
|
|
+##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
|
|
|
|
|
|
Our results were obtained by running the `nmt.py` script in the tensorflow-19.07-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
|
|
|
Performance numbers (in tokens per second) were averaged over an entire
|
|
|
@@ -588,18 +651,141 @@ The benchmark requires a checkpoint from a fully trained model.
|
|
|
To launch the inference benchmark in mixed precision on 1 GPU, run:
|
|
|
|
|
|
```
|
|
|
-python scripts/translate.py --output_dir=/path/to/trained/model --beam_width 1,2,5 --infer_batch_size 1,2,4,8,32,128,512
|
|
|
+python scripts/translate.py --output_dir=/path/to/trained/model --beam_width 1,2,5 --infer_batch_size 1,2,4,8,32,128,512 --amp
|
|
|
```
|
|
|
|
|
|
-To launch the inference benchmark in FP32 on 1 GPU, run:
|
|
|
+To launch the inference benchmark in FP32/TF32 on 1 GPU, run:
|
|
|
|
|
|
```
|
|
|
-python scripts/translate.py --output_dir=/path/to/trained/model --beam_width 1,2,5 --infer_batch_size 1,2,4,8,32,128,512 --use_amp=false
|
|
|
+python scripts/translate.py --output_dir=/path/to/trained/model --beam_width 1,2,5 --infer_batch_size 1,2,4,8,32,128,512
|
|
|
```
|
|
|
|
|
|
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide)
|
|
|
outlined above.
|
|
|
|
|
|
+##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
|
|
|
+
|
|
|
+Our results were obtained by running the
|
|
|
+`python scripts/translate.py --infer_batch_size 1,2,4,8,32,128,512 --beam_width 1,2,5 {--amp}`
|
|
|
+inferencing benchmarking script in the tensorflow-20.06-tf1-py3 NGC container
|
|
|
+on NVIDIA DGX A100 (1x A100 40GB) GPU.
|
|
|
+
|
|
|
+FP16
|
|
|
+
|
|
|
+| **Batch size** | **Beam width** | **Bleu** | **Sentences/sec** | **Tokens/sec** | **Latency Avg** | **Latency 50%** | **Latency 90%** | **Latency 95%** | **Latency 99%** | **Latency 100%** |
|
|
|
+|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|
|
|
+| 1 | 1 | 23.80 | 13.67 | 737.89 | 73.15 | 67.69 | 121.98 | 137.20 | 162.74 | 201.06 |
|
|
|
+| 1 | 2 | 24.58 | 13.40 | 721.18 | 74.65 | 69.12 | 123.99 | 138.82 | 169.58 | 198.49 |
|
|
|
+| 1 | 5 | 25.10 | 12.12 | 647.78 | 82.53 | 76.53 | 136.35 | 152.59 | 196.09 | 216.55 |
|
|
|
+| 2 | 1 | 23.80 | 21.55 | 1163.16 | 92.82 | 88.15 | 139.88 | 152.49 | 185.18 | 208.35 |
|
|
|
+| 2 | 2 | 24.58 | 21.07 | 1134.42 | 94.91 | 89.62 | 142.08 | 158.12 | 188.00 | 205.08 |
|
|
|
+| 2 | 5 | 25.10 | 19.59 | 1047.21 | 102.10 | 96.20 | 152.36 | 172.46 | 211.96 | 219.87 |
|
|
|
+| 4 | 1 | 23.80 | 36.98 | 1996.27 | 108.16 | 105.07 | 150.42 | 161.56 | 200.99 | 205.87 |
|
|
|
+| 4 | 2 | 24.57 | 34.92 | 1880.48 | 114.53 | 111.42 | 160.29 | 177.14 | 205.32 | 211.80 |
|
|
|
+| 4 | 5 | 25.10 | 31.56 | 1687.34 | 126.74 | 122.06 | 179.68 | 201.38 | 225.08 | 229.14 |
|
|
|
+| 8 | 1 | 23.80 | 64.52 | 3482.81 | 123.99 | 122.89 | 159.89 | 174.66 | 201.12 | 205.59 |
|
|
|
+| 8 | 2 | 24.57 | 59.04 | 3178.17 | 135.50 | 135.23 | 180.50 | 191.66 | 214.95 | 216.84 |
|
|
|
+| 8 | 5 | 25.09 | 55.51 | 2967.82 | 144.11 | 141.98 | 198.39 | 218.88 | 223.55 | 225.61 |
|
|
|
+| 32 | 1 | 23.80 | 193.54 | 10447.04 | 165.34 | 163.56 | 211.67 | 215.37 | 221.07 | 221.14 |
|
|
|
+| 32 | 2 | 24.57 | 182.00 | 9798.09 | 175.82 | 176.04 | 220.33 | 224.25 | 226.45 | 227.05 |
|
|
|
+| 32 | 5 | 25.10 | 141.63 | 7572.02 | 225.94 | 225.59 | 278.38 | 279.56 | 281.61 | 282.13 |
|
|
|
+| 128 | 1 | 23.80 | 556.57 | 30042.59 | 229.98 | 226.81 | 259.05 | 260.26 | 260.74 | 260.85 |
|
|
|
+| 128 | 2 | 24.57 | 400.02 | 21535.38 | 319.98 | 328.23 | 351.31 | 352.82 | 353.01 | 353.06 |
|
|
|
+| 128 | 5 | 25.10 | 235.14 | 12570.95 | 544.35 | 576.62 | 581.95 | 582.64 | 583.61 | 583.85 |
|
|
|
+| 512 | 1 | 23.80 | 903.83 | 48786.58 | 566.48 | 570.44 | 579.74 | 580.66 | 581.39 | 581.57 |
|
|
|
+| 512 | 2 | 24.58 | 588.63 | 31689.07 | 869.81 | 894.90 | 902.65 | 902.85 | 903.00 | 903.04 |
|
|
|
+| 512 | 5 | 25.10 | 285.86 | 15283.40 | 1791.06 | 1835.19 | 1844.29 | 1845.59 | 1846.63 | 1846.89 |
|
|
|
+
|
|
|
+TF32
|
|
|
+
|
|
|
+| **Batch size** | **Beam width** | **Bleu** | **Sentences/sec** | **Tokens/sec** | **Latency Avg** | **Latency 50%** | **Latency 90%** | **Latency 95%** | **Latency 99%** | **Latency 100%** |
|
|
|
+|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|
|
|
+| 1 | 1 | 23.82 | 13.25 | 715.47 | 75.45 | 69.81 | 125.63 | 141.89 | 169.70 | 209.78 |
|
|
|
+| 1 | 2 | 24.59 | 13.21 | 711.16 | 75.72 | 70.06 | 124.75 | 140.20 | 173.23 | 201.39 |
|
|
|
+| 1 | 5 | 25.08 | 12.38 | 661.99 | 80.76 | 74.90 | 131.93 | 148.91 | 187.05 | 208.39 |
|
|
|
+| 2 | 1 | 23.82 | 21.61 | 1166.56 | 92.55 | 87.25 | 139.54 | 151.77 | 180.24 | 209.05 |
|
|
|
+| 2 | 2 | 24.59 | 21.24 | 1143.63 | 94.17 | 88.78 | 139.70 | 156.61 | 189.09 | 205.06 |
|
|
|
+| 2 | 5 | 25.10 | 19.49 | 1042.17 | 102.62 | 96.14 | 153.38 | 172.89 | 213.99 | 219.54 |
|
|
|
+| 4 | 1 | 23.81 | 35.84 | 1934.49 | 111.62 | 108.73 | 154.52 | 165.42 | 207.88 | 211.29 |
|
|
|
+| 4 | 2 | 24.58 | 34.71 | 1869.20 | 115.24 | 111.24 | 161.24 | 177.73 | 208.12 | 212.74 |
|
|
|
+| 4 | 5 | 25.09 | 32.24 | 1723.86 | 124.07 | 119.35 | 177.54 | 196.69 | 221.10 | 223.52 |
|
|
|
+| 8 | 1 | 23.80 | 64.08 | 3459.74 | 124.84 | 123.61 | 161.92 | 177.06 | 205.47 | 206.47 |
|
|
|
+| 8 | 2 | 24.61 | 59.31 | 3193.52 | 134.89 | 133.44 | 182.92 | 192.71 | 216.04 | 218.78 |
|
|
|
+| 8 | 5 | 25.10 | 56.60 | 3026.29 | 141.35 | 138.61 | 194.52 | 213.65 | 220.24 | 221.45 |
|
|
|
+| 32 | 1 | 23.80 | 195.31 | 10544.22 | 163.85 | 162.80 | 212.71 | 215.41 | 216.92 | 217.34 |
|
|
|
+| 32 | 2 | 24.61 | 185.66 | 9996.59 | 172.36 | 171.07 | 216.46 | 221.64 | 223.68 | 225.25 |
|
|
|
+| 32 | 5 | 25.11 | 147.24 | 7872.61 | 217.34 | 214.97 | 269.75 | 270.71 | 271.44 | 272.87 |
|
|
|
+| 128 | 1 | 23.81 | 576.54 | 31123.19 | 222.02 | 219.25 | 249.44 | 249.75 | 249.88 | 249.91 |
|
|
|
+| 128 | 2 | 24.57 | 419.87 | 22609.82 | 304.86 | 314.47 | 332.18 | 334.13 | 336.22 | 336.74 |
|
|
|
+| 128 | 5 | 25.10 | 245.76 | 13138.84 | 520.83 | 552.68 | 558.89 | 559.09 | 559.13 | 559.13 |
|
|
|
+| 512 | 1 | 23.80 | 966.24 | 52156.34 | 529.89 | 534.82 | 558.30 | 559.33 | 560.16 | 560.36 |
|
|
|
+| 512 | 2 | 24.58 | 642.41 | 34590.81 | 797.00 | 812.40 | 824.23 | 825.92 | 827.27 | 827.61 |
|
|
|
+| 512 | 5 | 25.10 | 289.33 | 15468.09 | 1769.61 | 1817.19 | 1849.83 | 1855.17 | 1859.45 | 1860.51 |
|
|
|
+
|
|
|
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
|
|
|
+
|
|
|
+##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
|
|
|
+
|
|
|
+Our results were obtained by running the
|
|
|
+`python scripts/translate.py --infer_batch_size 1,2,4,8,32,128,512 --beam_width 1,2,5 {--amp}`
|
|
|
+inferencing benchmarking script in the tensorflow-20.06-tf1-py3 NGC container
|
|
|
+on NVIDIA DGX-1 with (1x V100 16GB) GPU.
|
|
|
+
|
|
|
+FP16
|
|
|
+
|
|
|
+| **Batch size** | **Sequence length** | **Throughput Avg** | **Latency Avg** | **Latency 90%** |**Latency 95%** |**Latency 99%** |
|
|
|
+|------------|-----------------|-----|-----|-----|-----|-----|
|
|
|
+| 1 | 1 | 23.78 | 9.06 | 489.00 | 110.41 | 102.80 | 183.54 | 206.33 | 242.44 | 306.21 |
|
|
|
+| 1 | 2 | 24.58 | 8.68 | 467.35 | 115.22 | 107.17 | 188.75 | 212.36 | 258.15 | 306.15 |
|
|
|
+| 1 | 5 | 25.09 | 8.39 | 448.32 | 119.25 | 109.79 | 195.68 | 220.56 | 276.41 | 325.65 |
|
|
|
+| 2 | 1 | 23.82 | 14.59 | 787.70 | 137.04 | 129.38 | 206.35 | 224.94 | 267.30 | 318.60 |
|
|
|
+| 2 | 2 | 24.57 | 14.44 | 777.60 | 138.51 | 131.07 | 206.67 | 228.95 | 275.56 | 311.23 |
|
|
|
+| 2 | 5 | 25.11 | 13.78 | 736.99 | 145.11 | 136.76 | 216.01 | 243.24 | 299.28 | 315.88 |
|
|
|
+| 4 | 1 | 23.82 | 23.79 | 1284.24 | 168.14 | 164.13 | 234.70 | 248.42 | 308.38 | 325.46 |
|
|
|
+| 4 | 2 | 24.59 | 22.67 | 1220.66 | 176.45 | 171.40 | 243.76 | 271.92 | 314.79 | 330.19 |
|
|
|
+| 4 | 5 | 25.08 | 22.33 | 1194.00 | 179.12 | 174.04 | 253.36 | 281.88 | 318.76 | 340.01 |
|
|
|
+| 8 | 1 | 23.81 | 43.33 | 2338.68 | 184.63 | 183.25 | 237.66 | 266.73 | 305.89 | 315.03 |
|
|
|
+| 8 | 2 | 24.60 | 39.12 | 2106.44 | 204.49 | 200.96 | 276.05 | 294.53 | 327.61 | 335.50 |
|
|
|
+| 8 | 5 | 25.10 | 37.16 | 1987.05 | 215.26 | 210.92 | 295.65 | 323.83 | 337.09 | 343.03 |
|
|
|
+| 32 | 1 | 23.82 | 129.52 | 6992.15 | 247.06 | 245.81 | 317.71 | 325.54 | 330.09 | 335.04 |
|
|
|
+| 32 | 2 | 24.55 | 123.28 | 6637.86 | 259.57 | 261.07 | 319.13 | 333.45 | 338.75 | 342.57 |
|
|
|
+| 32 | 5 | 25.05 | 88.74 | 4744.33 | 360.61 | 359.27 | 446.65 | 448.40 | 455.93 | 461.86 |
|
|
|
+| 128 | 1 | 23.80 | 332.81 | 17964.83 | 384.60 | 382.14 | 434.46 | 436.71 | 439.64 | 440.37 |
|
|
|
+| 128 | 2 | 24.59 | 262.87 | 14153.59 | 486.93 | 506.45 | 528.87 | 530.90 | 533.09 | 533.64 |
|
|
|
+| 128 | 5 | 25.08 | 143.91 | 7695.36 | 889.42 | 932.93 | 965.67 | 966.26 | 966.53 | 966.59 |
|
|
|
+| 512 | 1 | 23.80 | 613.57 | 33126.42 | 834.46 | 848.06 | 868.21 | 869.04 | 869.70 | 869.86 |
|
|
|
+| 512 | 2 | 24.59 | 387.72 | 20879.62 | 1320.54 | 1343.05 | 1354.40 | 1356.50 | 1358.19 | 1358.61 |
|
|
|
+| 512 | 5 | 25.10 | 199.48 | 10664.34 | 2566.67 | 2628.50 | 2642.59 | 2644.73 | 2646.44 | 2646.86 |
|
|
|
+
|
|
|
+
|
|
|
+FP32
|
|
|
+
|
|
|
+| **Batch size** | **Sequence length** | **Throughput Avg** | **Latency Avg** | **Latency 90%** |**Latency 95%** |**Latency 99%** |
|
|
|
+|------------|-----------------|-----|-----|-----|-----|-----|
|
|
|
+| 1 | 1 | 23.80 | 8.37 | 451.86 | 119.46 | 111.26 | 199.36 | 224.49 | 269.03 | 330.72 |
|
|
|
+| 1 | 2 | 24.59 | 8.83 | 475.11 | 113.31 | 104.54 | 187.79 | 210.64 | 260.42 | 317.45 |
|
|
|
+| 1 | 5 | 25.09 | 7.74 | 413.92 | 129.15 | 119.44 | 212.84 | 239.52 | 305.47 | 349.09 |
|
|
|
+| 2 | 1 | 23.80 | 13.96 | 753.79 | 143.22 | 135.73 | 213.96 | 235.89 | 284.62 | 330.71 |
|
|
|
+| 2 | 2 | 24.59 | 12.96 | 697.63 | 154.33 | 145.01 | 230.88 | 255.31 | 306.71 | 340.36 |
|
|
|
+| 2 | 5 | 25.09 | 12.67 | 677.23 | 157.88 | 148.24 | 236.50 | 266.91 | 322.94 | 349.55 |
|
|
|
+| 4 | 1 | 23.80 | 22.42 | 1209.97 | 178.44 | 172.70 | 247.51 | 266.07 | 326.95 | 343.86 |
|
|
|
+| 4 | 2 | 24.59 | 20.55 | 1106.07 | 194.68 | 188.83 | 271.75 | 295.08 | 345.76 | 364.00 |
|
|
|
+| 4 | 5 | 25.09 | 21.19 | 1132.58 | 188.81 | 182.77 | 268.18 | 298.53 | 331.96 | 357.36 |
|
|
|
+| 8 | 1 | 23.80 | 39.32 | 2122.26 | 203.48 | 201.89 | 263.28 | 286.71 | 332.70 | 348.93 |
|
|
|
+| 8 | 2 | 24.59 | 37.51 | 2019.43 | 213.26 | 211.55 | 283.67 | 302.28 | 338.47 | 356.51 |
|
|
|
+| 8 | 5 | 25.09 | 31.69 | 1694.02 | 252.46 | 245.33 | 348.95 | 378.16 | 392.72 | 401.73 |
|
|
|
+| 32 | 1 | 23.80 | 118.51 | 6396.93 | 270.02 | 269.22 | 337.17 | 352.12 | 361.36 | 361.40 |
|
|
|
+| 32 | 2 | 24.59 | 100.23 | 5395.33 | 319.28 | 318.89 | 399.80 | 403.12 | 414.51 | 423.41 |
|
|
|
+| 32 | 5 | 25.09 | 68.59 | 3666.77 | 466.55 | 466.84 | 581.77 | 586.42 | 589.04 | 593.41 |
|
|
|
+| 128 | 1 | 23.80 | 256.49 | 13845.09 | 499.04 | 492.36 | 562.12 | 567.20 | 571.18 | 572.18 |
|
|
|
+| 128 | 2 | 24.59 | 176.83 | 9519.12 | 723.86 | 754.89 | 792.12 | 793.86 | 796.44 | 797.09 |
|
|
|
+| 128 | 5 | 25.09 | 96.21 | 5143.17 | 1330.48 | 1420.94 | 1427.91 | 1431.02 | 1435.23 | 1436.28 |
|
|
|
+| 512 | 1 | 23.80 | 366.07 | 19759.97 | 1398.63 | 1421.81 | 1457.81 | 1461.04 | 1463.63 | 1464.27 |
|
|
|
+| 512 | 2 | 24.59 | 225.48 | 12137.77 | 2270.75 | 2323.62 | 2338.62 | 2340.94 | 2342.80 | 2343.27 |
|
|
|
+| 512 | 5 | 25.09 | 106.02 | 5667.78 | 4829.31 | 4946.65 | 4956.15 | 4957.85 | 4959.21 | 4959.55 |
|
|
|
+
|
|
|
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
|
|
|
+
|
|
|
##### Inference performance: NVIDIA T4
|
|
|
|
|
|
Our results were obtained by running the `scripts/translate.py` script in the tensorflow-19.07-py3 NGC container on NVIDIA T4.
|
|
|
@@ -638,6 +824,8 @@ Reported mixed precision speedups are relative to FP32 numbers for corresponding
|
|
|
* Initial release
|
|
|
2. June, 2019
|
|
|
* Performance improvements
|
|
|
+3. June, 2020
|
|
|
+ * Updated performance tables to include A100 results
|
|
|
|
|
|
### Known issues
|
|
|
There are no known issues in this release.
|