@@ -28,6 +28,7 @@ This repository provides a script and recipe to train the SIM model to achieve s
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
+ * [Prebatching](#prebatching)
* [BYO dataset](#byo-dataset)
* [Channel definitions and requirements](#channel-definitions-and-requirements)
* [Training process](#training-process)
@@ -78,7 +79,7 @@ In the author’s SIM implementation, the internals of submodels differs slightl
List of implementation differences between original SIM code and DIN/DIEN/SIM papers
</b></summary>

-- Batch normalization before NLP is not included in papers.
+- Batch normalization before MLP is not included in papers.
- Batch normalization in code used `trainable=False` during the training phase.
- ItemItemInteraction in DIN's attention module in the SIM implementation doesn't correspond to the activation unit in the DIN paper.
- Element-wise subtractions and multiplications are fed to the MLP, skipping the outer product operation.
@@ -375,7 +376,7 @@ The following section lists the requirements that you need to meet in order to s

This repository contains a Dockerfile that extends the TensorFlow2 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-- [TensorFlow2 21.10-py3](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow/tags) NGC container
+- [TensorFlow2 22.01-py3](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow/tags) NGC container
- Supported GPUs:
  - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
  - [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
@@ -417,9 +418,6 @@ To train your model using mixed or TF32 precision with Tensor Cores or using FP3
5. Start preprocessing.

For details of the required file format and certain preprocessing parameters refer to [BYO dataset](#byo-dataset).
-
-
- `${NUMBER_OF_USER_FEATURES}` defines how many user specific features are present in dataset. If using default Amazon Books dataset and `sim_preprocessing` script (as shown below), this parameter should be set to <b>1</b> (in this case, the only user specific features is <b>user_id</b>. Other features are item specific).

```bash
python preprocessing/sim_preprocessing.py \
@@ -428,8 +426,7 @@ To train your model using mixed or TF32 precision with Tensor Cores or using FP3

python preprocessing/parquet_to_tfrecord.py \
--amazon_dataset_path ${PARQUET_PATH} \
- --tfrecord_output_dir ${TF_RECORD_PATH} \
- --number_of_user_features ${NUMBER_OF_USER_FEATURES}
+ --tfrecord_output_dir ${TF_RECORD_PATH}
```

6. Start training (`${GPU}` is the number of GPUs to use).
@@ -496,10 +493,11 @@ The `main.py` script parameters are detailed in the following table.
| training | drop_remainder | Drop remainder batch for training set (flag) | False |
| training | disable_cache | Disable dataset caching after the first time it is iterated over (flag) | False |
| training | repeat_count | Repeat training dataset this number of times | 0 |
-| training | prefetch_train_size | Number of batches to prefetch in training. | -1 |
-| training | prefetch_test_size | Number of batches to prefetch in evaluation. | -1 |
-| training | train_dataset_size | Number of samples in training dataset (used to determine prefetch_train_size when --prefetch_train_size < 0) | 11796480 |
+| training | prefetch_train_size | Number of batches to prefetch in training. | 10 |
+| training | prefetch_test_size | Number of batches to prefetch in evaluation. | 2 |
| training | long_seq_length | Determines the long history - short history split of history features | 90 |
+| training | prebatch_train_size | Batch size of the prebatching applied to the training dataset during preprocessing. | 0 |
+| training | prebatch_test_size | Batch size of the prebatching applied to the test dataset during preprocessing. | 0 |
| results | results_dir | Path to the model result files storage | /tmp/sim |
| results | log_filename | Name of the file to store logger output | log.json |
| results | save_checkpoint_path | Directory to save model checkpoints | "" |
@@ -511,8 +509,10 @@ The `main.py` script parameters are detailed in the following table.
| run mode | affinity | Type of CPU affinity | socket_unique_interleaved |
| run mode | inter_op_parallelism | Number of inter op threads | 0 |
| run mode | intra_op_parallelism | Number of intra op threads | 0 |
+| run mode | num_parallel_calls | Parallelism level for the tf.data API. If None, a heuristic based on the number of CPUs and GPUs is used | None |
| reproducibility | seed | Random seed | -1 |

+
### Command-line options

To view the full list of available options and their descriptions, use the `--help` command-line option, for example:
@@ -534,6 +534,56 @@ The preprocessing steps applied to the raw data include:
- Determining embedding table sizes for categorical features needed to construct a model
- Filtering users for the training split based on their number of interactions (discarding users with fewer than 20 interactions)
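The user-filtering step above can be sketched in plain Python. This is an illustrative example only, not the repository's preprocessing code; the function name and data layout are hypothetical:

```python
from collections import Counter

def filter_users(interactions, min_interactions=20):
    """Keep interactions only from users with at least `min_interactions` events.

    `interactions` is a list of (user_id, item_id) pairs. Illustrative sketch
    of the "discard users with fewer than 20 interactions" rule above.
    """
    counts = Counter(user for user, _ in interactions)
    return [(u, i) for u, i in interactions if counts[u] >= min_interactions]

# Toy data: user "a" has 2 interactions, user "b" has 3.
data = [("a", 1), ("a", 2), ("b", 1), ("b", 2), ("b", 3)]
print(filter_users(data, min_interactions=3))  # [('b', 1), ('b', 2), ('b', 3)]
```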
+#### Prebatching
+
+The preprocessing scripts can batch the data before it reaches the model's dataloader. This reduces the size of the produced TFRecord files and speeds up dataloading.
+To do so, specify `--prebatch_train_size` and `--prebatch_test_size` while converting data using `scripts/parquet_to_tfrecord.py`. Later, while using the `main.py` script, pass the applied prebatch sizes via the same parameters.
+
+Example
+
+Start preprocessing from step 5. of the [Quick Start Guide](#quick-start-guide):
+
+```bash
+python preprocessing/sim_preprocessing.py \
+--amazon_dataset_path ${RAW_DATASET_PATH} \
+--output_path ${PARQUET_PATH}
+
+python preprocessing/parquet_to_tfrecord.py \
+--amazon_dataset_path ${PARQUET_PATH} \
+--tfrecord_output_dir ${TF_RECORD_PATH} \
+--prebatch_train_size ${PREBATCH_TRAIN_SIZE} \
+--prebatch_test_size ${PREBATCH_TEST_SIZE}
+```
+
+And then train the model (step 6.):
+
+```bash
+mpiexec --allow-run-as-root --bind-to socket -np ${GPU} python main.py \
+--dataset_dir ${TF_RECORD_PATH} \
+--mode train \
+--model_type sim \
+--embedding_dim 16 \
+--drop_remainder \
+--optimizer adam \
+--lr 0.01 \
+--epochs 3 \
+--global_batch_size 131072 \
+--amp \
+--prebatch_train_size ${PREBATCH_TRAIN_SIZE} \
+--prebatch_test_size ${PREBATCH_TEST_SIZE}
+```
+
+<details>
+<summary><b>Prebatching details</b></summary>
+
+- The last batch of each split is saved to a separate file, `remainder.tfrecord`, unless there are enough samples to form a full batch.
+- The final batch size used in the main script can be a multiple of the prebatch size.
+- The final batch size used in the main script can also be a divisor of the prebatch size. In this case, with multi-GPU training, the number of batches received by each worker can be greater than 1, resulting in an error during the allgather operation. Dataset size, batch size, and prebatch size have to be chosen with this limitation in mind.
+- For the original Amazon Books Dataset, parameters were set to PREBATCH_TRAIN_SIZE = PREBATCH_TEST_SIZE = 4096 for performance benchmarking purposes.
+</details>
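The multiple-or-divisor constraint described in the prebatching details can be expressed as a small validation helper. This is a hypothetical sketch for illustration; the repository does not necessarily expose a function with this name:

```python
def check_prebatch_compatibility(batch_size, prebatch_size):
    """Return True when the final batch size and the prebatch size are
    compatible, i.e. one evenly divides the other.

    Hypothetical helper illustrating the constraint; not repository code.
    """
    if batch_size <= 0 or prebatch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return batch_size % prebatch_size == 0 or prebatch_size % batch_size == 0

print(check_prebatch_compatibility(131072, 4096))  # True: a multiple of the prebatch size
print(check_prebatch_compatibility(2048, 4096))    # True: a divisor of the prebatch size
print(check_prebatch_compatibility(3000, 4096))    # False: neither
```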
#### BYO dataset

This implementation supports using other datasets thanks to BYO dataset functionality.

@@ -676,7 +726,7 @@ source_spec:
type: tfrecord
```

-`dimensions` should contain the length of the history to which the entries will be padded.
+`dimensions` should contain the length of the sequential features.

Note that corresponding features in `negative_history`, `positive_history`, and `target_item_features` need to be listed in the same order in the channel spec in each channel, since they share embedding tables in the model (for example, `item_id` needs to be first and `cat_id` second).
@@ -705,7 +755,7 @@ For performance reasons, the only supported dataset type is tfrecord.

### Training process

-Training can be run using `main.py` script by specifying the `--mode train` parameter. The speed of training is measured by throughput, that is, the number of samples processed per second. Evaluation is based on the [Area under ROC Curve (ROC AUC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric. Model checkpoints may be stored using Checkpoint manager as specified via (...). Training and inference logs are saved to a directory specified via the `--results_dir` parameter. Mixed precision training is supported via the `--amp` flag. Multi-GPU training is performed using mpiexec and Horovod libraries.
+Training can be run using the `main.py` script by specifying the `--mode train` parameter. The speed of training is measured by throughput, that is, the number of samples processed per second. Evaluation is based on the [Area under ROC Curve (ROC AUC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) metric. Model checkpoints may be stored using the Checkpoint manager via the `--save_checkpoint_path` and `--load_checkpoint_path` parameters. Training and inference logs are saved to a directory specified via the `--results_dir` parameter. Mixed precision training is supported via the `--amp` flag. Multi-GPU training is performed using the mpiexec and Horovod libraries.

### Inference process
@@ -778,7 +828,9 @@ mpiexec --allow-run-as-root --bind-to socket -np ${GPU} python main.py \
--global_batch_size 131072 \
--drop_remainder \
--amp \
- --benchmark
+ --benchmark \
+ --prebatch_train_size ${PREBATCH_TRAIN_SIZE} \
+ --prebatch_test_size ${PREBATCH_TEST_SIZE}
```

Equivalent:
@@ -787,7 +839,9 @@ scripts/run_model.sh \
--data_path ${TF_RECORD_PATH} \
--gpus ${GPU} \
--amp 1 \
- --benchmark 1
+ --benchmark 1 \
+ --prebatch_train_size ${PREBATCH_TRAIN_SIZE} \
+ --prebatch_test_size ${PREBATCH_TEST_SIZE}
```

#### Inference performance benchmark
@@ -801,7 +855,9 @@ mpiexec --allow-run-as-root --bind-to socket -np ${GPU} python main.py \
--model_type sim \
--global_batch_size 131072 \
--amp \
- --benchmark
+ --benchmark \
+ --prebatch_train_size ${PREBATCH_TRAIN_SIZE} \
+ --prebatch_test_size ${PREBATCH_TEST_SIZE}
```

Equivalent:
@@ -811,7 +867,8 @@ scripts/run_model.sh \
--gpus ${GPU} \
--amp 1 \
--benchmark 1 \
- --mode inference
+ --prebatch_train_size ${PREBATCH_TRAIN_SIZE} \
+ --prebatch_test_size ${PREBATCH_TEST_SIZE}
```

### Results
@@ -820,7 +877,7 @@ The following sections provide details on how we achieved our performance and ac

#### Training accuracy results

-Our results were obtained by running the `run_model.sh` bash script in the TensorFlow2 21.10-py3 NGC container. Experiments were run on 1 and 8 GPUs, with FP32/TF32 Precision and AMP and with XLA-OFF/XLA-ON. Other parameters were set to defaults.
+Our results were obtained by running the `run_model.sh` bash script in the TensorFlow2 21.10-py3 NGC container. Experiments were run on 1 and 8 GPUs, with FP32/TF32 precision and AMP, and with XLA-OFF/XLA-ON. The dataset was prebatched with a size of 16384. Other parameters were set to defaults.

There were 10 runs for each configuration. In the `Training accuracy` sections, average values are reported. In the `Training stability` sections, values from all runs are included in plots.
@@ -962,7 +1019,7 @@ Figure 8. ROC curve for different configurations of Ampere/Volta, 1/8 GPUs, doub

#### Training performance results

-Our results were obtained by running the `scripts/run_model.sh` script in the TensorFlow2 21.10-py3 NGC container.
+Our results were obtained by running the `scripts/run_model.sh` script in the TensorFlow2 21.10-py3 NGC container. The dataset was prebatched with a size of 16384.

Numbers were averaged over 10 separate runs for each configuration.
|
|
|
|
|
|
|
|
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
|
|
##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
|
|
|
|
|
|
|
|
-|GPUs |XLA |Throughput - TF32 (samples/s) |Throughput - mixed precision (samples/s) |Throughput speedup (mixed precision / TF32) | Strong scaling - TF32 | Strong scaling - mixed precision |
|
|
|
|
|
-|-----|-----|--------------------|------------------------------|---------------------------------------------|-----------|-------------|
|
|
|
|
|
-|1 |OFF |381211.31 |484360.65 |1.27 | 1.00 | 1.00 |
|
|
|
|
|
-|1 |ON |462012.86 |571727.91 |1.24 | 1.00 | 1.00 |
|
|
|
|
|
-|8 |OFF |2304284.08 |2475445.94 |1.07 | 6.04 | 5.11 |
|
|
|
|
|
-|8 |ON |2679300.61 |3006370.96 |1.12 | 5.80 | 5.26 |
|
|
|
|
|
|
|
+| GPUs | XLA | Throughput - TF32 (samples/s) | Throughput - mixed precision (samples/s) | Throughput speedup (mixed precision / TF32) | Strong scaling - TF32 | Strong scaling - mixed precision |
|
|
|
|
|
+|-------:|------:|--------------------------------:|-------------------------------------------:|----------------------------------------------:|------------------------:|-----------------------------------:|
|
|
|
|
|
+| 1 | OFF | 377254.65 | 479921.54 | 1.27 | 1.00 | 1.00 |
|
|
|
|
|
+| 1 | ON | 455724.01 | 565221.04 | 1.24 | 1.00 | 1.00 |
|
|
|
|
|
+| 8 | OFF | 2161681.55 | 2603489.60 | 1.20 | 5.73 | 5.42 |
|
|
|
|
|
+| 8 | ON | 2662368.18 | 2979441.80 | 1.12 | 5.84 | 5.27 |
|
|
|
|
|
|
|
|
<details>
|
|
<details>
|
|
|
<summary><b>
|
|
<summary><b>
|
|
@@ -990,24 +1047,24 @@ For each configuration of parameters present in the table, the `Speedup` column

|GPUs |Precision |Speedup |
|-----|---------------|--------|
-|1 |TF32 |1.212 |
-|1 |AMP |1.180 |
-|8 |TF32 |1.163 |
-|8 |AMP |1.214 |
+|1 |TF32 |1.208 |
+|1 |AMP |1.178 |
+|8 |TF32 |1.232 |
+|8 |AMP |1.119 |
</details>


##### Training performance: NVIDIA DGX-2 (16x V100 32GB)

-|GPUs |XLA |Throughput - FP32 (samples/s) |Throughput - mixed precision (samples/s) |Throughput speedup (mixed precision / FP32) | Strong scaling - FP32 | Strong scaling - mixed precision |
-|-----|-----|--------------------|------------------------------|---------------------------------------------|----------|-------------|
-|1 |OFF |210772.27 |312580.01 |1.48 | 1.00 | 1.00 |
-|1 |ON |248514.27 |358305.52 |1.44 | 1.00 | 1.00 |
-|8 |OFF |1357463.39 |1785361.62 |1.32 | 6.44 | 5.71 |
-|8 |ON |1584757.09 |2091403.04 |1.32 | 7.52 | 6.69 |
-|16 |OFF |2319719.76 |2837309.15 |1.22 | 11.00 | 9.08 |
-|16 |ON |2681789.69 |3168488.89 |1.18 | 12.73 | 10.14 |
+| GPUs | XLA | Throughput - FP32 (samples/s) | Throughput - mixed precision (samples/s) | Throughput speedup (mixed precision / FP32) | Strong scaling - FP32 | Strong scaling - mixed precision |
+|-------:|------:|--------------------------------:|-------------------------------------------:|----------------------------------------------:|------------------------:|-----------------------------------:|
+| 1 | OFF | 209376.38 | 309752.48 | 1.48 | 1.00 | 1.00 |
+| 1 | ON | 245414.62 | 348945.59 | 1.42 | 1.00 | 1.00 |
+| 8 | OFF | 1310239.01 | 1689602.79 | 1.29 | 6.26 | 5.45 |
+| 8 | ON | 1483120.32 | 1962226.32 | 1.32 | 6.04 | 5.62 |
+| 16 | OFF | 2127221.65 | 2555926.79 | 1.20 | 10.16 | 8.25 |
+| 16 | ON | 2450499.40 | 2788997.07 | 1.14 | 9.99 | 7.99 |

<details>
<summary><b>
@@ -1018,12 +1075,12 @@ For each configuration of parameters present in the table, the `Speedup` column

|GPUs |AMP |Speedup |
|-----|--------------------|---------------|
-|1 |FP32 |1.179 |
-|1 |AMP |1.146 |
-|8 |FP32 |1.167 |
-|8 |AMP |1.171 |
-|16 |FP32 |1.156 |
-|16 |AMP |1.117 |
+|1 |FP32 |1.172 |
+|1 |AMP |1.127 |
+|8 |FP32 |1.132 |
+|8 |AMP |1.161 |
+|16 |FP32 |1.152 |
+|16 |AMP |1.091 |
</details>

@@ -1033,16 +1090,17 @@ For each configuration of parameters present in the table, the `Speedup` column
NVIDIA DGX A100 / DGX-2 (Ampere / Volta) training speedup
</b></summary>

-|GPUs |XLA |Precision |Speedup|
-|-----|-------|---------------|-------|
-|1 |OFF |TF32/FP32 |1.809 |
-|1 |OFF |AMP |1.550 |
-|1 |ON |TF32/FP32 |1.860 |
-|1 |ON |AMP |1.596 |
-|8 |OFF |TF32/FP32 |1.697 |
-|8 |OFF |AMP |1.387 |
-|8 |ON |TF32/FP32 |1.691 |
-|8 |ON |AMP |1.437 |
+
+| GPUs | XLA | Precision | Speedup |
+|-------:|------:|:------------|----------:|
+| 1 | OFF | TF32/FP32 | 1.802 |
+| 1 | OFF | AMP | 1.549 |
+| 1 | ON | TF32/FP32 | 1.857 |
+| 1 | ON | AMP | 1.620 |
+| 8 | OFF | TF32/FP32 | 1.650 |
+| 8 | OFF | AMP | 1.541 |
+| 8 | ON | TF32/FP32 | 1.795 |
+| 8 | ON | AMP | 1.518 |

</details>
@@ -1060,74 +1118,44 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic

##### Inference performance: NVIDIA DGX A100 (8x A100 80GB)

-|GPUs |Global batch size|XLA |Throughput - TF32 (samples/s)|Throughput - mixed precision (samples/s)|Throughput speedup (mixed precision / TF32) | Strong scaling - TF32 | Strong scaling - mixed precision |
-|-----|----------|-----|---------------|----------------------------|---------------------------------------------|----------------|---------|
-|1 |4096 |ON |561967.1 |535674.63 |0.95 | 1.00 | 1.00 |
-|1 |8192 |ON |670885.47 |758801.43 |1.13 | 1.00 | 1.00 |
-|1 |16384 |ON |788890.79 |920695.88 |1.17 | 1.00 | 1.00 |
-|1 |32768 |ON |855056.39 |1035530.23 |1.21 | 1.00 | 1.00 |
-|1 |65536 |ON |918649.98 |1081408.05 |1.18 | 1.00 | 1.00 |
-|1 |131072 |ON |918555.37 |771119.78 |0.84 | 1.00 | 1.00 |
-|8 |4096 |ON |1130031.99 |935848.52 |0.83 | 2.01 | 1.75 |
-|8 |8192 |ON |2246441.94 |1885511.32 |0.84 | 3.64 | 2.48 |
-|8 |16384 |ON |4000071.31 |3303417.5 |0.83 | 5.07 | 3.59 |
-|8 |32768 |ON |5479754.01 |5762298.42 |1.05 | 6.41 | 5.56 |
-|8 |65536 |ON |6736333.91 |7869825.77 |1.17 | 7.33 | 7.28 |
-|8 |131072 |ON |7598665.72 |9002545.49 |1.18 | 8.27 | 11.67 |
+| Batch Size | XLA | Throughput - TF32 (samples/s) | Throughput - mixed precision (samples/s) | Throughput speedup (mixed precision / TF32) |
+|--------------------:|------:|--------------------------------:|-------------------------------------------:|----------------------------------------------:|
+| 4096 | ON | 618547.45 | 669640.65 | 1.08 |
+| 8192 | ON | 722801.14 | 849101.88 | 1.17 |
+| 16384 | ON | 859418.77 | 1051361.67 | 1.22 |
+| 32768 | ON | 976771.70 | 1269000.97 | 1.30 |
+| 65536 | ON | 1082688.51 | 1444729.52 | 1.33 |
+| 131072 | ON | 1094733.64 | 1483542.86 | 1.36 |

<details>
<summary><b> Complete table of DGX A100 inference performance results </b></summary>

-|GPUSs|Global Batch Size |XLA |Precision |Throughput (samples/s) |
-|-----|--------------------|-------|---------------|-----------------------|
-|1 |4096 |OFF |TF32 |585246.51 ± 10513.06 |
-|1 |8192 |OFF |TF32 |750729.14 ± 17029.41 |
-|1 |16384 |OFF |TF32 |803593.59 ± 11207.58 |
-|1 |32768 |OFF |TF32 |822162.85 ± 5071.85 |
-|1 |65536 |OFF |TF32 |775748.42 ± 36821.04 |
-|1 |131072 |OFF |TF32 |644740.49 ± 31148.79 |
-|1 |4096 |OFF |AMP |516164.09 ± 9916.80 |
-|1 |8192 |OFF |AMP |778740.41 ± 19384.36 |
-|1 |16384 |OFF |AMP |932211.18 ± 20331.07 |
-|1 |32768 |OFF |AMP |990696.89 ± 11554.34 |
-|1 |65536 |OFF |AMP |715678.16 ± 30944.63 |
-|1 |131072 |OFF |AMP |611740.50 ± 21392.81 |
-|1 |4096 |ON |TF32 |561967.10 ± 18100.55 |
-|1 |8192 |ON |TF32 |670885.47 ± 11149.51 |
-|1 |16384 |ON |TF32 |788890.79 ± 10058.99 |
-|1 |32768 |ON |TF32 |855056.39 ± 14349.13 |
-|1 |65536 |ON |TF32 |918649.98 ± 7571.32 |
-|1 |131072 |ON |TF32 |918555.37 ± 15036.89 |
-|1 |4096 |ON |AMP |535674.63 ± 14003.35 |
-|1 |8192 |ON |AMP |758801.43 ± 15225.76 |
-|1 |16384 |ON |AMP |920695.88 ± 15325.29 |
-|1 |32768 |ON |AMP |1035530.23 ± 16055.40 |
-|1 |65536 |ON |AMP |1081408.05 ± 41906.29 |
-|1 |131072 |ON |AMP |771119.78 ± 79589.50 |
-|8 |4096 |OFF |TF32 |765154.17 ± 30582.87 |
-|8 |8192 |OFF |TF32 |1396414.24 ± 99987.01 |
-|8 |16384 |OFF |TF32 |2281597.86 ± 77483.79 |
-|8 |32768 |OFF |TF32 |3555014.42 ± 145944.33 |
-|8 |65536 |OFF |TF32 |4792413.60 ± 203285.21 |
-|8 |131072 |OFF |TF32 |5941195.01 ± 182519.72 |
-|8 |4096 |OFF |AMP |642706.11 ± 28063.45 |
-|8 |8192 |OFF |AMP |1197789.38 ± 47262.95 |
-|8 |16384 |OFF |AMP |1961353.19 ± 49818.70 |
-|8 |32768 |OFF |AMP |3267263.60 ± 130680.70 |
-|8 |65536 |OFF |AMP |4847783.16 ± 257991.99 |
-|8 |131072 |OFF |AMP |6413842.15 ± 289543.64 |
-|8 |4096 |ON |TF32 |1130031.99 ± 75271.24 |
-|8 |8192 |ON |TF32 |2246441.94 ± 26132.90 |
-|8 |16384 |ON |TF32 |4000071.31 ± 48054.68 |
-|8 |32768 |ON |TF32 |5479754.01 ± 170421.20 |
-|8 |65536 |ON |TF32 |6736333.91 ± 153745.68 |
-|8 |131072 |ON |TF32 |7598665.72 ± 174188.78 |
-|8 |4096 |ON |AMP |935848.52 ± 14583.48 |
-|8 |8192 |ON |AMP |1885511.32 ± 22206.00 |
-|8 |16384 |ON |AMP |3303417.50 ± 210306.61 |
-|8 |32768 |ON |AMP |5762298.42 ± 140412.56 |
-|8 |65536 |ON |AMP |7869825.77 ± 305838.69 |
-|8 |131072 |ON |AMP |9002545.49 ± 438204.32 |
+| Batch Size | XLA | Precision | Throughput (samples/s) |
+|-------------:|:------|:------------|:--------------------------|
+| 4096 | OFF | TF32 | 708349.73 ± 14161.58 |
+| 8192 | OFF | TF32 | 873335.82 ± 8539.56 |
+| 16384 | OFF | TF32 | 937987.79 ± 12114.34 |
+| 32768 | OFF | TF32 | 943313.07 ± 8631.81 |
+| 65536 | OFF | TF32 | 960794.46 ± 7388.45 |
+| 131072 | OFF | TF32 | 966245.27 ± 8637.82 |
+| 4096 | OFF | AMP | 645394.94 ± 14844.27 |
+| 8192 | OFF | AMP | 919410.07 ± 11355.28 |
+| 16384 | OFF | AMP | 1136346.66 ± 14529.91 |
+| 32768 | OFF | AMP | 1216810.45 ± 21013.12 |
+| 65536 | OFF | AMP | 1287305.05 ± 19373.18 |
+| 131072 | OFF | AMP | 1298478.97 ± 10733.67 |
+| 4096 | ON | TF32 | 618547.45 ± 6569.97 |
+| 8192 | ON | TF32 | 722801.14 ± 9448.19 |
+| 16384 | ON | TF32 | 859418.77 ± 10012.61 |
+| 32768 | ON | TF32 | 976771.70 ± 13377.36 |
+| 65536 | ON | TF32 | 1082688.51 ± 8523.55 |
+| 131072 | ON | TF32 | 1094733.64 ± 11157.18 |
+| 4096 | ON | AMP | 669640.65 ± 9319.68 |
+| 8192 | ON | AMP | 849101.88 ± 14068.04 |
+| 16384 | ON | AMP | 1051361.67 ± 15310.42 |
+| 32768 | ON | AMP | 1269000.97 ± 23971.56 |
+| 65536 | ON | AMP | 1444729.52 ± 18011.54 |
+| 131072 | ON | AMP | 1483542.86 ± 6751.29 |

</details>
@@ -1138,32 +1166,20 @@ DGX A100 XLA-ON / XLA-OFF inference Speedup

For each configuration of parameters present in the table, the `Speedup` column shows the speedup achieved by turning on XLA.

-|GPUs |Global Batch Size |Precision |Speedup |
-|-----|--------------------|---------------|--------|
-|1 |4096 |TF32 |0.960 |
-|1 |8192 |TF32 |0.894 |
-|1 |16384 |TF32 |0.982 |
-|1 |32768 |TF32 |1.040 |
-|1 |65536 |TF32 |1.184 |
-|1 |131072 |TF32 |1.425 |
-|1 |4096 |AMP |1.038 |
-|1 |8192 |AMP |0.974 |
-|1 |16384 |AMP |0.988 |
-|1 |32768 |AMP |1.045 |
-|1 |65536 |AMP |1.511 |
-|1 |131072 |AMP |1.261 |
-|8 |4096 |TF32 |1.477 |
-|8 |8192 |TF32 |1.609 |
-|8 |16384 |TF32 |1.753 |
-|8 |32768 |TF32 |1.541 |
-|8 |65536 |TF32 |1.406 |
-|8 |131072 |TF32 |1.279 |
-|8 |4096 |AMP |1.456 |
-|8 |8192 |AMP |1.574 |
-|8 |16384 |AMP |1.684 |
-|8 |32768 |AMP |1.764 |
-|8 |65536 |AMP |1.623 |
-|8 |131072 |AMP |1.404 |
+|Batch Size |Precision |Speedup |
+|--------------------|---------------|--------|
+|4096 |TF32 |0.873 |
+|8192 |TF32 |0.828 |
+|16384 |TF32 |0.916 |
+|32768 |TF32 |1.035 |
+|65536 |TF32 |1.127 |
+|131072 |TF32 |1.133 |
+|4096 |AMP |1.038 |
+|8192 |AMP |0.924 |
+|16384 |AMP |0.925 |
+|32768 |AMP |1.043 |
+|65536 |AMP |1.187 |
+|131072 |AMP |1.143 |

</details>
@@ -1171,153 +1187,69 @@ For each configuration of parameters present in the table, the `Speedup` column

##### Inference performance: NVIDIA DGX-2 (16x V100 32GB)

-|GPUs |Global batch size|XLA |Throughput - FP32 (samples/s)|Throughput - mixed precision (samples/s)|Throughput speedup (mixed precision / FP32) | Strong scaling - FP32 | Strong scaling - mixed precision |
-|-----|----------|-----|---------------|----------------------------|---------------------------------------------|--------|--------|
-|1 |4096 |ON |403479.95 |479051.62 |1.19 | 1.00 | 1.00 |
-|1 |8192 |ON |480491.12 |600002.95 |1.25 | 1.00 | 1.00 |
-|1 |16384 |ON |538737.44 |713203.59 |1.32 | 1.00 | 1.00 |
-|1 |32768 |ON |580958.93 |790782.1 |1.36 | 1.00 | 1.00 |
-|1 |65536 |ON |586275.07 |818038.44 |1.40 | 1.00 | 1.00 |
-|1 |131072 |ON |613524.11 |734034.26 |1.20 | 1.00 | 1.00 |
-|8 |4096 |ON |1059775.22 |909719.3 |0.86 | 2.63 | 1.90 |
-|8 |8192 |ON |1845819.99 |1752510.62 |0.95 | 3.84 | 2.92 |
-|8 |16384 |ON |2801114.77 |2898423.08 |1.03 | 5.20 | 4.06 |
-|8 |32768 |ON |3396766.27 |4102026.01 |1.21 | 5.85 | 5.19 |
-|8 |65536 |ON |3911994.39 |4725023.23 |1.21 | 6.67 | 5.78 |
-|8 |131072 |ON |4197603.74 |5413542.58 |1.29 | 6.84 | 7.38 |
-|16 |4096 |ON |1142272.86 |924525.38 |0.81 | 2.83 | 1.93 |
-|16 |8192 |ON |2068920.7 |1917814.81 |0.93 | 4.31 | 3.20 |
-|16 |16384 |ON |3091676.83 |3496153.45 |1.13 | 5.74 | 4.90 |
-|16 |32768 |ON |5132772.75 |5063615.77 |0.99 | 8.84 | 6.40 |
-|16 |65536 |ON |6553882.87 |8247475.75 |1.26 | 11.18 | 10.08 |
-|16 |131072 |ON |7555906.17 |9571965.84 |1.27 | 12.32 | 13.04 |
+| Batch Size | XLA | Throughput - FP32 (samples/s) | Throughput - mixed precision (samples/s) | Throughput speedup (mixed precision / FP32) |
+|--------------------:|------:|--------------------------------:|-------------------------------------------:|----------------------------------------------:|
+| 4096 | ON | 444532.22 | 541975.24 | 1.22 |
+| 8192 | ON | 505047.64 | 642784.48 | 1.27 |
+| 16384 | ON | 549325.54 | 727077.63 | 1.32 |
+| 32768 | ON | 587452.73 | 788606.35 | 1.34 |
+| 65536 | ON | 605187.67 | 832651.59 | 1.38 |
+| 131072 | ON | 599557.03 | 840602.90 | 1.40 |

<details>
<summary><b>
-Complete table of DGX2 inference performance results
+Complete table of DGX-2 inference performance results
</b></summary>

-|GPUs |Global Batch Size |XLA |Precision |Throughput (samples/s) |
-|-----|--------------------|-------|---------------|-----------------------|
-|1 |4096 |OFF |FP32 |459149.07 ± 20971.34 |
-|1 |8192 |OFF |FP32 |488763.98 ± 15037.09 |
-|1 |16384 |OFF |FP32 |516804.05 ± 8355.49 |
-|1 |32768 |OFF |FP32 |534387.97 ± 4763.49 |
-|1 |65536 |OFF |FP32 |536215.89 ± 5794.77 |
-|1 |131072 |OFF |FP32 |538646.76 ± 6359.47 |
-|1 |4096 |OFF |AMP |488475.14 ± 6226.30 |
-|1 |8192 |OFF |AMP |632098.48 ± 27370.49 |
-|1 |16384 |OFF |AMP |705878.12 ± 7852.19 |
-|1 |32768 |OFF |AMP |739740.73 ± 6866.73 |
-|1 |65536 |OFF |AMP |618291.18 ± 26749.52 |
-|1 |131072 |OFF |AMP |544071.41 ± 19200.23 |
-|1 |4096 |ON |FP32 |403479.95 ± 4079.19 |
-|1 |8192 |ON |FP32 |480491.12 ± 6828.93 |
-|1 |16384 |ON |FP32 |538737.44 ± 10932.49 |
-|1 |32768 |ON |FP32 |580958.93 ± 9544.37 |
-|1 |65536 |ON |FP32 |586275.07 ± 7640.59 |
-|1 |131072 |ON |FP32 |613524.11 ± 7931.04 |
-|1 |4096 |ON |AMP |479051.62 ± 6076.26 |
-|1 |8192 |ON |AMP |600002.95 ± 16380.88 |
-|1 |16384 |ON |AMP |713203.59 ± 9515.25 |
-|1 |32768 |ON |AMP |790782.10 ± 10788.69 |
-|1 |65536 |ON |AMP |818038.44 ± 14132.80 |
-|1 |131072 |ON |AMP |734034.26 ± 34664.74 |
-|8 |4096 |OFF |FP32 |502947.25 ± 105758.96 |
-|8 |8192 |OFF |FP32 |809285.58 ± 112765.45 |
-|8 |16384 |OFF |FP32 |1974085.95 ± 476616.90 |
-|8 |32768 |OFF |FP32 |2990517.14 ± 645490.89 |
-|8 |65536 |OFF |FP32 |3662830.22 ± 191010.11 |
-|8 |131072 |OFF |FP32 |3978985.17 ± 142801.19 |
-|8 |4096 |OFF |AMP |596945.98 ± 92977.56 |
-|8 |8192 |OFF |AMP |730694.36 ± 67972.28 |
-|8 |16384 |OFF |AMP |1758189.25 ± 340547.41 |
-|8 |32768 |OFF |AMP |3873856.45 ± 528746.35 |
-|8 |65536 |OFF |AMP |4863371.50 ± 297299.34 |
-|8 |131072 |OFF |AMP |5134261.52 ± 473726.31 |
-|8 |4096 |ON |FP32 |1059775.22 ± 24386.54 |
-|8 |8192 |ON |FP32 |1845819.99 ± 250767.40 |
-|8 |16384 |ON |FP32 |2801114.77 ± 210397.18 |
-|8 |32768 |ON |FP32 |3396766.27 ± 221795.61 |
-|8 |65536 |ON |FP32 |3911994.39 ± 239259.17 |
-|8 |131072 |ON |FP32 |4197603.74 ± 158110.80 |
-|8 |4096 |ON |AMP |909719.30 ± 135634.13 |
-|8 |8192 |ON |AMP |1752510.62 ± 87042.91 |
-|8 |16384 |ON |AMP |2898423.08 ± 231659.28 |
-|8 |32768 |ON |AMP |4102026.01 ± 254242.94 |
-|8 |65536 |ON |AMP |4725023.23 ± 322597.53 |
-|8 |131072 |ON |AMP |5413542.58 ± 364633.26 |
-|16 |4096 |OFF |FP32 |865109.29 ± 40032.58 |
-|16 |8192 |OFF |FP32 |1565843.18 ± 305582.99 |
-|16 |16384 |OFF |FP32 |3109303.21 ± 240314.57 |
-|16 |32768 |OFF |FP32 |5750753.42 ± 898435.09 |
-|16 |65536 |OFF |FP32 |6456324.48 ± 730326.61 |
-|16 |131072 |OFF |FP32 |7415730.04 ± 434928.14 |
-|16 |4096 |OFF |AMP |742890.53 ± 27541.80 |
-|16 |8192 |OFF |AMP |1468615.49 ± 67548.46 |
-|16 |16384 |OFF |AMP |2591245.05 ± 394504.75 |
-|16 |32768 |OFF |AMP |4671719.91 ± 721705.81 |
-|16 |65536 |OFF |AMP |7982733.55 ± 1242742.25|
-|16 |131072 |OFF |AMP |9867894.78 ± 679119.71 |
-|16 |4096 |ON |FP32 |1142272.86 ± 43154.49 |
-|16 |8192 |ON |FP32 |2068920.70 ± 130214.35 |
-|16 |16384 |ON |FP32 |3091676.83 ± 991449.61 |
-|16 |32768 |ON |FP32 |5132772.75 ± 525201.10 |
-|16 |65536 |ON |FP32 |6553882.87 ± 400638.86 |
-|16 |131072 |ON |FP32 |7555906.17 ± 626110.02 |
-|16 |4096 |ON |AMP |924525.38 ± 163488.57 |
-|16 |8192 |ON |AMP |1917814.81 ± 59114.71 |
-|16 |16384 |ON |AMP |3496153.45 ± 190771.71 |
-|16 |32768 |ON |AMP |5063615.77 ± 1281699.58|
-|16 |65536 |ON |AMP |8247475.75 ± 539827.60 |
-|16 |131072 |ON |AMP |9571965.84 ± 764075.50 |
+| Batch Size | XLA | Precision | Throughput (samples/s) |
+|-------------:|:------|:------------|:--------------------------|
+| 4096 | OFF | FP32 | 459175.30 ± 23184.33 |
+| 8192 | OFF | FP32 | 499179.20 ± 15967.26 |
+| 16384 | OFF | FP32 | 525180.72 ± 2521.56 |
+| 32768 | OFF | FP32 | 532042.10 ± 4020.44 |
+| 65536 | OFF | FP32 | 534307.20 ± 7276.26 |
+| 131072 | OFF | FP32 | 532311.44 ± 6195.16 |
+| 4096 | OFF | AMP | 581771.66 ± 6163.50 |
+| 8192 | OFF | AMP | 665048.04 ± 4607.95 |
+| 16384 | OFF | AMP | 716355.19 ± 7174.98 |
+| 32768 | OFF | AMP | 741642.61 ± 4981.04 |
+| 65536 | OFF | AMP | 755141.25 ± 6175.05 |
+| 131072 | OFF | AMP | 744459.46 ± 8183.17 |
+| 4096 | ON | FP32 | 444532.22 ± 6239.01 |
+| 8192 | ON | FP32 | 505047.64 ± 6543.06 |
+| 16384 | ON | FP32 | 549325.54 ± 2841.21 |
+| 32768 | ON | FP32 | 587452.73 ± 2366.43 |
+| 65536 | ON | FP32 | 605187.67 ± 3740.07 |
+| 131072 | ON | FP32 | 599557.03 ± 11811.28 |
+| 4096 | ON | AMP | 541975.24 ± 4441.93 |
+| 8192 | ON | AMP | 642784.48 ± 4721.08 |
+| 16384 | ON | AMP | 727077.63 ± 5332.80 |
+| 32768 | ON | AMP | 788606.35 ± 11705.36 |
+| 65536 | ON | AMP | 832651.59 ± 10401.17 |
+| 131072 | ON | AMP | 840602.90 ± 16358.73 |

</details>

<details>
<summary><b>
-DGX A100 XLA-ON / XLA-OFF inference speedup
+DGX-2 XLA-ON / XLA-OFF inference speedup
</b></summary>

For each configuration of parameters present in the table, the `Speedup` column shows the speedup achieved by turning on XLA.

-|GPUs |Global Batch Size |Precision |Speedup |
-|-----|--------------------|---------------|--------|
-|1 |4096 |FP32 |0.879 |
-|1 |8192 |FP32 |0.983 |
-|1 |16384 |FP32 |1.042 |
-|1 |32768 |FP32 |1.087 |
-|1 |65536 |FP32 |1.093 |
-|1 |131072 |FP32 |1.139 |
-|1 |4096 |AMP |0.981 |
-|1 |8192 |AMP |0.949 |
-|1 |16384 |AMP |1.010 |
-|1 |32768 |AMP |1.069 |
-|1 |65536 |AMP |1.323 |
-|1 |131072 |AMP |1.349 |
-|8 |4096 |FP32 |2.107 |
-|8 |8192 |FP32 |2.281 |
-|8 |16384 |FP32 |1.419 |
-|8 |32768 |FP32 |1.136 |
-|8 |65536 |FP32 |1.068 |
-|8 |131072 |FP32 |1.055 |
-|8 |4096 |AMP |1.524 |
-|8 |8192 |AMP |2.398 |
-|8 |16384 |AMP |1.649 |
-|8 |32768 |AMP |1.059 |
-|8 |65536 |AMP |0.972 |
-|8 |131072 |AMP |1.054 |
-|16 |4096 |FP32 |1.320 |
-|16 |8192 |FP32 |1.321 |
-|16 |16384 |FP32 |0.994 |
-|16 |32768 |FP32 |0.893 |
-|16 |65536 |FP32 |1.015 |
-|16 |131072 |FP32 |1.019 |
-|16 |4096 |AMP |1.244 |
-|16 |8192 |AMP |1.306 |
-|16 |16384 |AMP |1.349 |
-|16 |32768 |AMP |1.084 |
-|16 |65536 |AMP |1.033 |
-|16 |131072 |AMP |0.970 |
+|Batch Size |Precision |Speedup |
+|--------------------|---------------|--------|
+|4096 |FP32 |0.968 |
+|8192 |FP32 |1.012 |
+|16384 |FP32 |1.046 |
+|32768 |FP32 |1.104 |
+|65536 |FP32 |1.133 |
+|131072 |FP32 |1.126 |
+|4096 |AMP |0.932 |
+|8192 |AMP |0.967 |
+|16384 |AMP |1.384 |
+|32768 |AMP |1.063 |
+|65536 |AMP |1.103 |
+|131072 |AMP |1.129 |

</details>
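As a sanity check on the tables above (a sketch, not repository code), each `Speedup` entry is simply the ratio of XLA-ON to XLA-OFF mean throughput at the same batch size and precision. For the DGX-2 FP32 results at batch size 4096:

```python
# Reproduce one `Speedup` entry from the DGX-2 throughput tables above.
# Mean throughputs (samples/s) for FP32 at batch size 4096:
throughput_xla_on = 444532.22
throughput_xla_off = 459175.30

speedup = round(throughput_xla_on / throughput_xla_off, 3)
print(speedup)  # 0.968, matching the XLA-ON / XLA-OFF speedup table
```

Values below 1.0, as here, indicate that XLA compilation overhead outweighs its kernel-fusion gains at small batch sizes.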
@@ -1327,56 +1259,32 @@ For each configuration of parameters present in the table, the `Speedup` column

NVIDIA A100 / DGX-2 (Ampere / Volta) inference speedup
</b></summary>

-|GPUs |Global Batch Size |XLA |Precision |Speedup |
-|-----|--------------------|-------|---------------|--------|
-|1 |4096 |OFF |TF32/FP32 |1.275 |
-|1 |8192 |OFF |TF32/FP32 |1.536 |
-|1 |16384 |OFF |TF32/FP32 |1.555 |
-|1 |32768 |OFF |TF32/FP32 |1.539 |
-|1 |65536 |OFF |TF32/FP32 |1.447 |
-|1 |131072 |OFF |TF32/FP32 |1.197 |
-|1 |4096 |OFF |AMP |1.057 |
-|1 |8192 |OFF |AMP |1.232 |
-|1 |16384 |OFF |AMP |1.321 |
-|1 |32768 |OFF |AMP |1.339 |
-|1 |65536 |OFF |AMP |1.158 |
-|1 |131072 |OFF |AMP |1.124 |
-|1 |4096 |ON |TF32/FP32 |1.393 |
-|1 |8192 |ON |TF32/FP32 |1.396 |
-|1 |16384 |ON |TF32/FP32 |1.464 |
-|1 |32768 |ON |TF32/FP32 |1.472 |
-|1 |65536 |ON |TF32/FP32 |1.567 |
-|1 |131072 |ON |TF32/FP32 |1.497 |
-|1 |4096 |ON |AMP |1.118 |
-|1 |8192 |ON |AMP |1.265 |
-|1 |16384 |ON |AMP |1.291 |
-|1 |32768 |ON |AMP |1.310 |
-|1 |65536 |ON |AMP |1.322 |
-|1 |131072 |ON |AMP |1.051 |
-|8 |4096 |OFF |TF32/FP32 |1.521 |
-|8 |8192 |OFF |TF32/FP32 |1.725 |
-|8 |16384 |OFF |TF32/FP32 |1.156 |
-|8 |32768 |OFF |TF32/FP32 |1.189 |
-|8 |65536 |OFF |TF32/FP32 |1.308 |
-|8 |131072 |OFF |TF32/FP32 |1.493 |
-|8 |4096 |OFF |AMP |1.077 |
-|8 |8192 |OFF |AMP |1.639 |
-|8 |16384 |OFF |AMP |1.116 |
-|8 |32768 |OFF |AMP |0.843 |
-|8 |65536 |OFF |AMP |0.997 |
-|8 |131072 |OFF |AMP |1.249 |
-|8 |4096 |ON |TF32/FP32 |1.066 |
-|8 |8192 |ON |TF32/FP32 |1.217 |
-|8 |16384 |ON |TF32/FP32 |1.428 |
-|8 |32768 |ON |TF32/FP32 |1.613 |
-|8 |65536 |ON |TF32/FP32 |1.722 |
-|8 |131072 |ON |TF32/FP32 |1.810 |
-|8 |4096 |ON |AMP |1.029 |
-|8 |8192 |ON |AMP |1.076 |
-|8 |16384 |ON |AMP |1.140 |
-|8 |32768 |ON |AMP |1.405 |
-|8 |65536 |ON |AMP |1.666 |
-|8 |131072 |ON |AMP |1.663 |
+| Batch Size | XLA | Precision | Speedup |
+|-------------:|:------|:------------|----------:|
+| 4096 | OFF | TF32/FP32 | 1.54 |
+| 8192 | OFF | TF32/FP32 | 1.75 |
+| 16384 | OFF | TF32/FP32 | 1.79 |
+| 32768 | OFF | TF32/FP32 | 1.77 |
+| 65536 | OFF | TF32/FP32 | 1.80 |
+| 131072 | OFF | TF32/FP32 | 1.81 |
+| 4096 | OFF | AMP | 1.11 |
+| 8192 | OFF | AMP | 1.38 |
+| 16384 | OFF | AMP | 1.59 |
+| 32768 | OFF | AMP | 1.64 |
+| 65536 | OFF | AMP | 1.71 |
+| 131072 | OFF | AMP | 1.74 |
+| 4096 | ON | TF32/FP32 | 1.39 |
+| 8192 | ON | TF32/FP32 | 1.43 |
+| 16384 | ON | TF32/FP32 | 1.56 |
+| 32768 | ON | TF32/FP32 | 1.66 |
+| 65536 | ON | TF32/FP32 | 1.79 |
+| 131072 | ON | TF32/FP32 | 1.83 |
+| 4096 | ON | AMP | 1.24 |
+| 8192 | ON | AMP | 1.32 |
+| 16384 | ON | AMP | 1.45 |
+| 32768 | ON | AMP | 1.61 |
+| 65536 | ON | AMP | 1.74 |
+| 131072 | ON | AMP | 1.76 |

</details>
@@ -1388,10 +1296,12 @@ NVIDIA A100 / DGX-2 (Ampere / Volta) inference speedup

May 2022
- Initial release

-### Known issues
+November 2022
+- Moved batching and padding operations to preprocessing
+- Added support for prebatched samples during dataloading
+- Reduced throughput variance (previously appearing mainly during inference)

-- While benchmarking inference on a single GPU, sometimes throughput drops drastically in the middle of the epoch and remains low until the end of the epoch.
-- On a multi-GPU setup, the summary of throughput (in the last line of the logfile) is lower than it would result from each step`s throughput (sample/s). It is probably the case when a single GPU is slower than the one on the logging node. In this case, the overhead for synchronization before the final throughput calculation is higher than usual.
+### Known issues
- The SIM model results are non-deterministic, even using the same random seed. The reason for this non-determinism is the [tf.math.unsorted_segment_sum](https://www.tensorflow.org/api_docs/python/tf/math/unsorted_segment_sum) operation called within an optimization step. Its influence depends on categorical data distribution within a batch, and this issue is more severe for momentum-based optimizers. A potential solution is to use a deterministic version of this op which allows perfect reproduction, but is up to six times slower training.
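The root cause of this known issue can be illustrated without TensorFlow (a minimal sketch, not repository code): `unsorted_segment_sum` does not fix its accumulation order, and floating-point addition is not associative, so different run-to-run orders can produce slightly different sums.

```python
# Why an unordered reduction is non-deterministic: float addition is not
# associative, so two accumulation orders over the same values can disagree
# in the last bits.
values = [0.1, 0.2, 0.3]

left_to_right = (values[0] + values[1]) + values[2]   # sum as ((a + b) + c)
right_to_left = values[0] + (values[1] + values[2])   # sum as (a + (b + c))

print(left_to_right == right_to_left)  # False
```

In recent TensorFlow releases, `tf.config.experimental.enable_op_determinism()` selects deterministic kernels where available, at the cost of speed, which matches the slowdown trade-off described above.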