@@ -31,11 +31,11 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
* [Inference performance benchmark](#inference-performance-benchmark)
* [Results](#results)
* [Training accuracy results](#training-accuracy-results)
- * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
+ * [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
* [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
* [Example plots](#example-plots)
* [Training performance results](#training-performance-results)
- * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
+ * [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
* [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
* [Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)](#training-performance-nvidia-dgx-1-32gb-8x-v100-32gb)
* [Inference performance results](#inference-performance-results)

@@ -45,7 +45,6 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
* [Changelog](#changelog)
* [Known issues](#known-issues)
-

## Model overview

The SE-ResNeXt101-32x4d is a [ResNeXt101-32x4d](https://arxiv.org/pdf/1611.05431.pdf)

@@ -106,13 +105,14 @@ This model uses the following data augmentation:
* Scale to 256x256
* Center crop to 224x224
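
The crop arithmetic above (scale to 256x256, then take a centered 224x224 crop) can be sanity-checked in plain Python; `center_crop_box` is an illustrative helper, not part of this repository:

```python
def center_crop_box(width: int, height: int, size: int):
    """Return the (left, top, right, bottom) box of a centered size x size crop."""
    left = (width - size) // 2
    top = (height - size) // 2
    return (left, top, left + size, top + size)

# A 224x224 center crop of a 256x256 image trims a 16-pixel border on each side.
box = center_crop_box(256, 256, 224)
print(box)  # (16, 16, 240, 240)
```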
+
### Feature support matrix

The following features are supported by this model:

-| Feature | ResNeXt101-32x4d
+| Feature | SE-ResNeXt101-32x4d
|-----------------------|--------------------------
-|[DALI](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html) | Yes
+|[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes
|[APEX AMP](https://nvidia.github.io/apex/amp.html) | Yes |

#### Features

@@ -129,11 +129,11 @@ which speeds up data loading when CPU becomes a bottleneck.
DALI can use CPU or GPU, and outperforms the PyTorch native dataloader.

Run training with `--data-backends dali-gpu` or `--data-backends dali-cpu` to enable DALI.
-For ResNeXt101-32x4d, for DGX1 and DGX2 we recommend `--data-backends dali-cpu`.
+For DGXA100 and DGX1 we recommend `--data-backends dali-cpu`.

### Mixed precision training

-Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
+Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:

1. Porting the model to use the FP16 data type where appropriate.
2. Adding loss scaling to preserve small gradient values.
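
As a toy, stdlib-only illustration of step 2 (not the repository's AMP implementation), round-tripping a small gradient through IEEE half precision shows why loss scaling is needed:

```python
import struct

def to_fp16(x: float) -> float:
    # Round-trip a Python float through IEEE 754 half precision ('e' format).
    return struct.unpack('<e', struct.pack('<e', x))[0]

# The smallest positive FP16 subnormal is 2**-24 (~5.96e-8),
# so a gradient of 1e-8 underflows to zero in half precision.
grad = to_fp16(1e-8)
assert grad == 0.0

# Scaling the loss (and hence every gradient) by e.g. 1024 moves the
# value back into FP16's representable range.
scale = 1024
scaled = to_fp16(1e-8 * scale)  # 1.024e-5 is representable
assert scaled > 0.0

# Before the optimizer step, gradients are unscaled in full precision,
# approximately recovering the original value.
recovered = scaled / scale
print(f"{recovered:.3e}")
```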
@@ -191,7 +191,7 @@ The following section lists the requirements that you need to meet in order to s
This repository contains a Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:

* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
+* [PyTorch 20.12-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* Supported GPUs:
* [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
* [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)

@@ -216,7 +216,7 @@ cd DeepLearningExamples/PyTorch/Classification/
### 2. Download and preprocess the dataset.

-The ResNeXt101-32x4d script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
+The SE-ResNeXt101-32x4d script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.

PyTorch can work directly on JPEGs; therefore, preprocessing/augmentation is not needed.

@@ -243,27 +243,28 @@ For the specifics concerning training and inference, see the [Advanced](#advance
The directory in which the `train/` and `val/` directories are placed is referred to as `<path to imagenet>` in this document.

-### 3. Build the SE-RNXT101-32x4d PyTorch NGC container.
+### 3. Build the SE-ResNeXt101-32x4d PyTorch NGC container.

```
-docker build . -t nvidia_se-rnxt101-32x4d
+docker build . -t nvidia_se-resnext101-32x4d
```

### 4. Start an interactive session in the NGC container to run training/inference.
```
-nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_se-rnxt101-32x4d
+nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_se-resnext101-32x4d
```
+
### 5. Start training

-To run training for a standard configuration (DGXA100/DGX1/DGX2, AMP/TF32/FP32, 90/250 Epochs),
+To run training for a standard configuration (DGXA100/DGX1V, AMP/TF32/FP32, 90/250 Epochs),
run one of the scripts in the `./se-resnext101-32x4d/training` directory
-called `./se-resnext101-32x4d/training/{AMP, TF32, FP32}/{DGXA100, DGX1, DGX2}_SE-RNXT101-32x4d_{AMP, TF32, FP32}_{90,250}E.sh`.
+called `./se-resnext101-32x4d/training/{AMP, TF32, FP32}/{ DGXA100, DGX1V }_se-resnext101-32x4d_{AMP, TF32, FP32}_{ 90, 250 }E.sh`.

Ensure ImageNet is mounted in the `/imagenet` directory.

Example:
- `bash ./se-resnext101-32x4d/training/AMP/DGX1_SE-RNXT101-32x4d_AMP_250E.sh <path were to store checkpoints and logs>`
+ `bash ./se-resnext101-32x4d/training/AMP/DGX1_se-resnext101-32x4d_AMP_250E.sh <path where to store checkpoints and logs>`

### 6. Start inference

@@ -281,7 +282,7 @@ To run inference on ImageNet, run:
To run inference on JPEG image using pretrained weights:

-`python classify.py --arch se-resnext101-32x4d -c fanin --weights nvidia_se-resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
+`python classify.py --arch se-resnext101-32x4d -c fanin --weights nvidia_se-resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
## Advanced

@@ -320,7 +321,7 @@ usage: main.py [-h] [--data-backend BACKEND] [--arch ARCH]
[--lr-schedule SCHEDULE] [--warmup E] [--label-smoothing S]
[--mixup ALPHA] [--momentum M] [--weight-decay W]
[--bn-weight-decay] [--nesterov] [--print-freq N]
- [--resume PATH] [--pretrained-weights PATH] [--fp16]
+ [--resume PATH] [--pretrained-weights PATH]
[--static-loss-scale STATIC_LOSS_SCALE] [--dynamic-loss-scale]
[--prof N] [--amp] [--seed SEED] [--gather-checkpoints]
[--raport-file RAPORT_FILE] [--evaluate] [--training-only]

@@ -339,8 +340,10 @@ optional arguments:
data backend: pytorch | syntetic | dali-gpu | dali-cpu
(default: dali-cpu)
--arch ARCH, -a ARCH model architecture: resnet18 | resnet34 | resnet50 |
- resnet101 | resnet152 | resnext101-32x4d | se-
- resnext101-32x4d (default: resnet50)
+ resnet101 | resnet152 | resnext50-32x4d |
+ resnext101-32x4d | resnext101-32x8d |
+ resnext101-32x8d-basic | se-resnext101-32x4d (default:
+ resnet50)
--model-config CONF, -c CONF
model configs: classic | fanin | grp-fanin | grp-
fanout(default: classic)

@@ -369,10 +372,9 @@ optional arguments:
--resume PATH path to latest checkpoint (default: none)
--pretrained-weights PATH
load weights from here
- --fp16 Run model fp16 mode.
--static-loss-scale STATIC_LOSS_SCALE
Static loss scale, positive power of 2 values can
- improve fp16 convergence.
+ improve amp convergence.
--dynamic-loss-scale Use dynamic loss scaling. If supplied, this argument
supersedes --static-loss-scale.
--prof N Run only N iterations

@@ -390,6 +392,7 @@ optional arguments:
--workspace DIR path to directory where checkpoints will be stored
--memory-format {nchw,nhwc}
memory layout, nchw or nhwc
+
```
@@ -400,25 +403,7 @@ To use your own dataset, divide it in directories as in the following scheme:
- Training images - `train/<class id>/<image>`
- Validation images - `val/<class id>/<image>`

-If your dataset's has number of classes different than 1000, you need to add a custom config
-in the `image_classification/resnet.py` file.
-
-```python
-resnet_versions = {
- ...
- 'se-resnext101-32x4d-custom' : {
- 'net' : ResNet,
- 'block' : SEBottleneck,
- 'cardinality' : 32,
- 'layers' : [3, 4, 23, 3],
- 'widths' : [128, 256, 512, 1024],
- 'expansion' : 2,
- 'num_classes' : <custom number of classes>,
- }
-}
-```
-
-After adding the config, run the training script with `--arch resnext101-32x4d-custom` flag.
+If your dataset has a number of classes different than 1000, you need to pass the `--num-classes N` flag to the training script.
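
For illustration, the folder-per-class layout expected above can be sketched with a throwaway directory (the class names here are made up):

```python
import pathlib
import tempfile

root = pathlib.Path(tempfile.mkdtemp())

# Build the expected scheme: train/<class id>/<image> and val/<class id>/<image>
for split in ("train", "val"):
    for class_id in ("birds", "cats", "dogs"):
        class_dir = root / split / class_id
        class_dir.mkdir(parents=True)
        (class_dir / "example.jpg").touch()

# A folder-per-class layout lets the loader infer labels from directory names.
classes = sorted(p.name for p in (root / "train").iterdir())
print(classes)  # ['birds', 'cats', 'dogs']
```

A three-class dataset like this one would then be trained with `--num-classes 3`.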
### Training process

@@ -441,7 +426,7 @@ To restart training from checkpoint use `--resume` option.
To start training from pretrained weights (e.g. downloaded from NGC) use `--pretrained-weights` option.

The difference between those two is that the pretrained weights contain only model weights,
-and checkpoints, apart from model weights, contain optimizer state, LR scheduler state, RNG state.
+and checkpoints, apart from model weights, contain optimizer state, LR scheduler state.

Checkpoints are suitable for dividing the training into parts, for example in order
to divide the training job into shorter stages, or restart training after an infrastructure failure.
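
A minimal sketch of the distinction, with illustrative field names rather than the repository's actual `.pth.tar` checkpoint format:

```python
import json
import pathlib
import tempfile

# Hypothetical contents: pretrained weights carry only the model parameters,
# while a checkpoint also bundles the training state needed to resume.
pretrained_weights = {"state_dict": {"fc.weight": [0.1, 0.2]}}
checkpoint = {
    "state_dict": {"fc.weight": [0.1, 0.2]},       # model weights
    "optimizer": {"momentum_buffer": [0.0, 0.0]},  # optimizer state
    "lr_scheduler": {"last_epoch": 41},            # LR scheduler state
    "epoch": 42,
}

path = pathlib.Path(tempfile.mkdtemp()) / "checkpoint.json"
path.write_text(json.dumps(checkpoint))

resumed = json.loads(path.read_text())
start_epoch = resumed["epoch"] + 1  # resuming continues from the next epoch
print(start_epoch)  # 43
```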
@@ -487,14 +472,13 @@ wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/seresnext
unzip seresnext101_32x4d_pyt_amp_20.06.0.zip
```
-
To run inference on ImageNet, run:

`python ./main.py --arch se-resnext101-32x4d --evaluate --epochs 1 --pretrained-weights nvidia_se-resnext101-32x4d_200821.pth.tar -b <batch size> <path to imagenet>`

To run inference on JPEG image using pretrained weights:

-`python classify.py --arch se-resnext101-32x4d -c fanin --weights nvidia_se-resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`
+`python classify.py --arch se-resnext101-32x4d --weights nvidia_se-resnext101-32x4d_200821.pth.tar --precision AMP|FP32 --image <path to JPEG image>`

## Performance

@@ -508,71 +492,62 @@ The following section shows how to run benchmarks measuring the model performanc
To benchmark training, run:

* For 1 GPU
- * FP32
-`python ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
+ * FP32 (V100 GPUs only)
+ `python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
+ * TF32 (A100 GPUs only)
+ `python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
-`python ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 --memory-format nhwc <path to imagenet>`
+ `python ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* For multiple GPUs
- * FP32
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
+ * FP32 (V100 GPUs only)
+ `python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_training --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
+ * TF32 (A100 GPUs only)
+ `python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_training --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`
* AMP
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --memory-format nhwc --epochs 1 --prof 100 <path to imagenet>`
+ `python ./multiproc.py --nproc_per_node 8 ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_training --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`

Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
-Batch size should be picked appropriately depending on the hardware configuration.
-
-| *Platform* | *Precision* | *Batch Size* |
-|:----------:|:-----------:|:------------:|
-| DGXA100 | AMP | 128 |
-| DGXA100 | TF32 | 128 |
-| DGX-1 | AMP | 128 |
-| DGX-1 | FP32 | 64 |
-
#### Inference performance benchmark

To benchmark inference, run:

-* FP32
+* FP32 (V100 GPUs only)

-`python ./main.py --arch se-resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
+`python ./launch.py --model se-resnext101-32x4d --precision FP32 --mode benchmark_inference --platform DGX1V <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`

-* AMP
+* TF32 (A100 GPUs only)

-`python ./main.py --arch se-resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp --memory-format nhwc <path to imagenet>`
+`python ./launch.py --model se-resnext101-32x4d --precision TF32 --mode benchmark_inference --platform DGXA100 <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`

-Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
+* AMP

-Batch size should be picked appropriately depending on the hardware configuration.
+`python ./launch.py --model se-resnext101-32x4d --precision AMP --mode benchmark_inference --platform <DGX1V|DGXA100> <path to imagenet> --raport-file benchmark.json --epochs 1 --prof 100`

-| *Platform* | *Precision* | *Batch Size* |
-|:----------:|:-----------:|:------------:|
-| DGXA100 | AMP | 128 |
-| DGXA100 | TF32 | 128 |
-| DGX-1 | AMP | 128 |
-| DGX-1 | FP32 | 64 |
+Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
### Results

-Our results were obtained by running the applicable training script in the pytorch-20.06 NGC container.
+Our results were obtained by running the applicable training script in the pytorch-20.12 NGC container.

To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).

#### Training accuracy results

-##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
+##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)

-| **epochs** | **Mixed Precision Top1** | **TF32 Top1** |
-|:------:|:--------------------:|:--------------:|
-| 90 | 79.95 +/- 0.09 | 79.97 +/- 0.08 |
+| **Epochs** | **Mixed Precision Top1** | **TF32 Top1** |
+|:----------:|:------------------------:|:--------------:|
+| 90 | 80.03 +/- 0.11 | 79.92 +/- 0.07 |
+| 250 | 80.9 +/- 0.08 | 80.98 +/- 0.07 |

-##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
-
-| **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
-|:-:|:-:|:-:|
-| 90 | 80.04 +/- 0.10 | 79.93 +/- 0.10 |
-| 250 | 80.96 +/- 0.04 | 80.97 +/- 0.09 |
+##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
+
+| **Epochs** | **Mixed Precision Top1** | **FP32 Top1** |
+|:----------:|:------------------------:|:--------------:|
+| 90 | 80.04 +/- 0.07 | 79.93 +/- 0.10 |
+| 250 | 80.92 +/- 0.09 | 80.97 +/- 0.09 |

##### Example plots
@@ -587,26 +562,29 @@ The following images show a 250 epochs configuration on a DGX-1V.
#### Training performance results

-##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
+##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
+
+| **GPUs** | **Mixed Precision** | **TF32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Strong Scaling** | **TF32 Training Time (90E)** |
+|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
+| 1 | 804 img/s | 360 img/s | 2.22 x | 1.0 x | ~42 hours | 1.0 x | ~94 hours |
+| 8 | 5248 img/s | 2665 img/s | 1.96 x | 6.52 x | ~7 hours | 7.38 x | ~13 hours |

-|**GPUs**|**Mixed Precision**| **TF32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**TF32 Strong Scaling**|**TF32 Training Time (90E)**|
-|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
-| 1 | 641.57 img/s |258.75 img/s | 2.48x | 1.00x | ~52 hours | 1.00x | ~129 hours |
-| 8 | 4758.40 img/s |2038.03 img/s| 2.33x | 7.42x | ~7 hours | 7.88x | ~17 hours |

##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)

-|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
-|:------:|:-----------------:|:----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
-| 1 | 383.15 img/s |130.48 img/s| 2.94x | 1.00x | ~87 hours | 1.00x | ~255 hours |
-| 8 | 2695.10 img/s |996.04 img/s| 2.71x | 7.03x | ~13 hours | 7.63x | ~34 hours |
+| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
+|:--------:|:-------------------:|:---------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
+| 1 | 430 img/s | 133 img/s | 3.21 x | 1.0 x | ~79 hours | 1.0 x | ~252 hours |
+| 8 | 2716 img/s | 994 img/s | 2.73 x | 6.31 x | ~13 hours | 7.42 x | ~34 hours |
+

##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)

-|**GPUs**|**Mixed Precision**| **FP32** |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
-|:------:|:-----------------:|:----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
-| 1 | 364.65 img/s |123.46 img/s| 2.95x | 1.00x | ~92 hours | 1.00x | ~270 hours |
-| 8 | 2540.49 img/s |959.94 img/s| 2.65x | 6.97x | ~13 hours | 7.78x | ~35 hours |
+| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision Speedup** | **Mixed Precision Strong Scaling** | **Mixed Precision Training Time (90E)** | **FP32 Strong Scaling** | **FP32 Training Time (90E)** |
+|:--------:|:-------------------:|:----------:|:---------------------------:|:----------------------------------:|:---------------------------------------:|:-----------------------:|:----------------------------:|
+| 1 | 413 img/s | 134 img/s | 3.08 x | 1.0 x | ~82 hours | 1.0 x | ~251 hours |
+| 8 | 2572 img/s | 1011 img/s | 2.54 x | 6.22 x | ~14 hours | 7.54 x | ~34 hours |
+

#### Inference performance results
@@ -614,62 +592,65 @@ The following images show a 250 epochs configuration on a DGX-1V.
###### FP32 Inference Latency

-| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 33.58 img/s | 29.72ms | 30.92ms | 31.77ms | 34.65ms |
-| 2 | 66.47 img/s | 29.94ms | 31.30ms | 32.74ms | 34.79ms |
-| 4 | 135.31 img/s | 29.36ms | 29.78ms | 32.61ms | 33.90ms |
-| 8 | 261.52 img/s | 30.42ms | 32.73ms | 33.99ms | 35.61ms |
-| 16 | 356.05 img/s | 44.61ms | 44.93ms | 45.17ms | 46.90ms |
-| 32 | 391.83 img/s | 80.91ms | 81.28ms | 81.64ms | 82.69ms |
-| 64 | 443.91 img/s | 142.70ms | 142.99ms | 143.46ms | 145.01ms |
-| 128 | N/A | N/A | N/A | N/A | N/A |
+| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
+|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
+| 1 | 37 img/s | 26.81 ms | 27.89 ms | 31.44 ms |
+| 2 | 75 img/s | 27.01 ms | 28.89 ms | 31.17 ms |
+| 4 | 144 img/s | 28.09 ms | 30.14 ms | 32.47 ms |
+| 8 | 259 img/s | 31.23 ms | 33.65 ms | 38.4 ms |
+| 16 | 332 img/s | 48.7 ms | 48.35 ms | 48.8 ms |
+| 32 | 394 img/s | 83.02 ms | 81.55 ms | 81.9 ms |
+| 64 | 471 img/s | 138.88 ms | 136.24 ms | 136.54 ms |
+| 128 | 505 img/s | 261.4 ms | 253.07 ms | 254.29 ms |
+| 256 | 513 img/s | 516.66 ms | 496.06 ms | 497.05 ms |

-###### Mixed Precision Inference Latency
-
-| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 35.08 img/s | 28.40ms | 29.75ms | 31.77ms | 35.85ms |
-| 2 | 68.85 img/s | 28.92ms | 30.24ms | 31.46ms | 37.07ms |
-| 4 | 131.78 img/s | 30.17ms | 31.39ms | 32.66ms | 37.17ms |
-| 8 | 260.21 img/s | 30.52ms | 31.20ms | 32.92ms | 34.46ms |
-| 16 | 506.62 img/s | 31.36ms | 32.48ms | 34.13ms | 36.49ms |
-| 32 | 778.92 img/s | 40.69ms | 40.90ms | 41.07ms | 43.67ms |
-| 64 | 880.49 img/s | 72.10ms | 72.29ms | 72.34ms | 76.46ms |
-| 128 | 977.86 img/s | 130.19ms | 130.34ms | 130.41ms | 131.12ms |
+###### Mixed Precision Inference Latency
+
+| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
+|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
+| 1 | 29 img/s | 34.24 ms | 36.67 ms | 39.4 ms |
+| 2 | 53 img/s | 37.81 ms | 43.03 ms | 45.1 ms |
+| 4 | 103 img/s | 39.1 ms | 43.05 ms | 46.16 ms |
+| 8 | 226 img/s | 35.66 ms | 38.39 ms | 41.13 ms |
+| 16 | 458 img/s | 35.4 ms | 37.38 ms | 39.97 ms |
+| 32 | 882 img/s | 37.37 ms | 40.12 ms | 42.64 ms |
+| 64 | 1356 img/s | 49.31 ms | 47.21 ms | 49.87 ms |
+| 112 | 1448 img/s | 81.27 ms | 77.35 ms | 78.28 ms |
+| 128 | 1486 img/s | 90.59 ms | 86.15 ms | 87.04 ms |
+| 256 | 1534 img/s | 176.72 ms | 166.2 ms | 167.53 ms |
##### Inference performance: NVIDIA T4

###### FP32 Inference Latency

-| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 40.47 img/s | 24.72ms | 26.94ms | 29.33ms | 33.03ms |
-| 2 | 84.16 img/s | 23.66ms | 24.53ms | 25.96ms | 29.42ms |
-| 4 | 165.10 img/s | 24.08ms | 24.59ms | 25.75ms | 27.57ms |
-| 8 | 266.04 img/s | 29.90ms | 30.51ms | 30.84ms | 33.07ms |
-| 16 | 325.89 img/s | 48.57ms | 48.91ms | 49.02ms | 51.01ms |
-| 32 | 365.99 img/s | 86.94ms | 87.15ms | 87.41ms | 90.74ms |
-| 64 | 410.43 img/s | 155.30ms | 156.07ms | 156.36ms | 164.74ms |
-| 128 | N/A | N/A | N/A | N/A | N/A |
-
-###### Mixed Precision Inference Latency
-
-| **batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** | **Latency 95%** | **Latency 99%** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 38.80 img/s | 25.74ms | 26.10ms | 29.28ms | 31.72ms |
-| 2 | 78.79 img/s | 25.29ms | 25.83ms | 27.18ms | 33.07ms |
-| 4 | 160.22 img/s | 24.81ms | 25.58ms | 26.25ms | 27.93ms |
-| 8 | 298.01 img/s | 26.69ms | 27.59ms | 29.13ms | 32.69ms |
-| 16 | 567.48 img/s | 28.05ms | 28.36ms | 31.28ms | 34.44ms |
-| 32 | 709.56 img/s | 44.58ms | 44.69ms | 44.98ms | 47.99ms |
-| 64 | 799.72 img/s | 79.32ms | 79.40ms | 79.49ms | 84.34ms |
-| 128 | 856.19 img/s | 147.92ms | 149.02ms | 149.13ms | 151.90ms |
+| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
+|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
+| 1 | 52 img/s | 19.39 ms | 20.39 ms | 21.18 ms |
+| 2 | 102 img/s | 19.98 ms | 21.4 ms | 23.75 ms |
+| 4 | 134 img/s | 30.12 ms | 30.14 ms | 30.54 ms |
+| 8 | 136 img/s | 59.07 ms | 60.63 ms | 61.49 ms |
+| 16 | 154 img/s | 104.38 ms | 105.21 ms | 105.81 ms |
+| 32 | 169 img/s | 190.12 ms | 189.64 ms | 190.24 ms |
+| 64 | 171 img/s | 376.19 ms | 374.16 ms | 375.6 ms |
+| 128 | 168 img/s | 771.4 ms | 761.64 ms | 764.7 ms |
+| 256 | 159 img/s | 1639.15 ms | 1603.45 ms | 1605.47 ms |

+###### Mixed Precision Inference Latency
+
+| **Batch Size** | **Throughput Avg** | **Latency Avg** | **Latency 95%** | **Latency 99%** |
+|:--------------:|:------------------:|:---------------:|:---------------:|:---------------:|
+| 1 | 42 img/s | 24.17 ms | 27.26 ms | 29.98 ms |
+| 2 | 87 img/s | 23.24 ms | 24.66 ms | 26.77 ms |
+| 4 | 170 img/s | 23.87 ms | 24.89 ms | 29.59 ms |
+| 8 | 334 img/s | 24.49 ms | 27.92 ms | 35.66 ms |
+| 16 | 472 img/s | 34.45 ms | 34.29 ms | 35.72 ms |
+| 32 | 502 img/s | 64.93 ms | 64.47 ms | 65.16 ms |
+| 64 | 517 img/s | 126.24 ms | 125.03 ms | 125.86 ms |
+| 128 | 522 img/s | 250.99 ms | 245.87 ms | 247.1 ms |
+| 256 | 523 img/s | 502.41 ms | 487.58 ms | 489.69 ms |
## Release notes

@@ -681,9 +662,10 @@ The following images show a 250 epochs configuration on a DGX-1V.
2. July 2020
* Added A100 scripts
* Updated README
+3. February 2021
+ * Moved from APEX AMP to Native AMP

### Known issues

There are no known issues with this model.
-