|
|
@@ -10,6 +10,7 @@ This repository provides a script and recipe to train the SSD300 v1.1 model to a
|
|
|
* [Features](#features)
|
|
|
* [Mixed precision training](#mixed-precision-training)
|
|
|
* [Enabling mixed precision](#enabling-mixed-precision)
|
|
|
+ * [Enabling TF32](#enabling-tf32)
|
|
|
- [Setup](#setup)
|
|
|
* [Requirements](#requirements)
|
|
|
- [Quick Start Guide](#quick-start-guide)
|
|
|
@@ -22,6 +23,7 @@ This repository provides a script and recipe to train the SSD300 v1.1 model to a
|
|
|
* [Data preprocessing](#data-preprocessing)
|
|
|
* [Data augmentation](#data-augmentation)
|
|
|
* [Training process](#training-process)
|
|
|
+ * [Evaluation process](#evaluation-process)
|
|
|
* [Inference process](#inference-process)
|
|
|
- [Performance](#performance)
|
|
|
* [Benchmarking](#benchmarking)
|
|
|
@@ -29,11 +31,16 @@ This repository provides a script and recipe to train the SSD300 v1.1 model to a
|
|
|
* [Inference performance benchmark](#inference-performance-benchmark)
|
|
|
* [Results](#results)
|
|
|
* [Training accuracy results](#training-accuracy-results)
|
|
|
- * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
|
|
|
+ * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
|
|
|
+ * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
|
|
|
+ * [Training loss plot](#training-loss-plot)
|
|
|
+ * [Training stability test](#training-stability-test)
|
|
|
* [Training performance results](#training-performance-results)
|
|
|
- * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g-1)
|
|
|
+ * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
|
|
|
+ * [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
|
|
|
* [Inference performance results](#inference-performance-results)
|
|
|
- * [NVIDIA DGX-1 (1x V100 16G)](#nvidia-dgx-1-1x-v100-16g)
|
|
|
+ * [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-1x-a100-40gb)
|
|
|
+ * [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
|
|
|
- [Release notes](#release-notes)
|
|
|
* [Changelog](#changelog)
|
|
|
* [Known issues](#known-issues)
|
|
|
@@ -67,9 +74,9 @@ To fully utilize GPUs during training we are using the
|
|
|
[NVIDIA DALI](https://github.com/NVIDIA/DALI) library
|
|
|
to accelerate data preparation pipelines.
|
|
|
|
|
|
-This model is trained with mixed precision using Tensor Cores on NVIDIA
|
|
|
-Volta and Turing GPUs. Therefore, researchers can get results 2x faster
|
|
|
-than training without Tensor Cores, while experiencing the benefits of
|
|
|
+This model is trained with mixed precision using Tensor Cores on Volta, Turing,
|
|
|
+and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results
|
|
|
+2x faster than training without Tensor Cores, while experiencing the benefits of
|
|
|
mixed precision training. This model is tested against each NGC monthly
|
|
|
container release to ensure consistent accuracy and performance over time.
|
|
|
|
|
|
@@ -109,31 +116,27 @@ To enable warmup provide argument the `--warmup 300`
|
|
|
by the number of GPUs and multiplied by the batch size divided by 32).
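The warmup flag and learning-rate scaling rule mentioned above can be sketched in a few lines. This is only an illustration of the arithmetic, with hypothetical helper names, not the repository's implementation:

```python
def scaled_lr(base_lr, num_gpus, batch_size):
    # Linear scaling rule: the reference batch size is 32, so the
    # learning rate grows with num_gpus * batch_size / 32.
    return base_lr * num_gpus * batch_size / 32

def warmup_lr(step, warmup_iters, target_lr):
    # Linearly ramp the learning rate up to target_lr over the
    # first warmup_iters iterations (cf. `--warmup 300`).
    if step >= warmup_iters:
        return target_lr
    return target_lr * (step + 1) / warmup_iters
```

With `--warmup 300`, the ramp would cover the first 300 iterations of training.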
|
|
|
|
|
|
### Feature support matrix
|
|
|
-
|
|
|
-The following features are supported by this model.
|
|
|
-
|
|
|
-| Feature | SSD300 v1.1 PyTorch |
|
|
|
-|-----------------------|--------------------------
|
|
|
-|Multi-GPU training with [Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) | Yes |
|
|
|
-|[NVIDIA DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes |
|
|
|
-
|
|
|
+
|
|
|
+The following features are supported by this model.
|
|
|
+
|
|
|
+| **Feature** | **SSD300 v1.1 PyTorch** |
|
|
|
+|:---------:|:----------:|
|
|
|
+|[APEX AMP](https://github.com/NVIDIA/apex) | Yes |
|
|
|
+|[APEX DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) | Yes |
|
|
|
+|[NVIDIA DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) | Yes |
|
|
|
|
|
|
#### Features
|
|
|
+
|
|
|
+[APEX](https://github.com/NVIDIA/apex) is a PyTorch extension with NVIDIA-maintained utilities to streamline mixed precision and distributed training, whereas [AMP](https://nvidia.github.io/apex/amp.html) is an abbreviation used for automatic mixed precision training.
|
|
|
+
|
|
|
+[DDP](https://nvidia.github.io/apex/parallel.html) stands for DistributedDataParallel and is used for multi-GPU training.
|
|
|
|
|
|
-Multi-GPU training with Distributed Data Parallel - Our model uses Apex's
|
|
|
-DDP to implement efficient multi-GPU training with NCCL.
|
|
|
-To enable multi-GPU training with DDP, you have to wrap your model
|
|
|
-with a proper class, and change the way you launch training.
|
|
|
-For details, see example sources in this repo or see
|
|
|
-the [PyTorch tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
|
|
|
-
|
|
|
-NVIDIA DALI - DALI is a library accelerating data preparation pipeline.
|
|
|
+[NVIDIA DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) - DALI is a library that accelerates data preparation pipelines.
|
|
|
To accelerate your input pipeline, you only need to define your data loader
|
|
|
with the DALI library.
|
|
|
For details, see example sources in this repo or see
|
|
|
the [DALI documentation](https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html)
|
|
|
|
|
|
-
|
|
|
### Mixed precision training
|
|
|
|
|
|
Mixed precision is the combined use of different numerical precisions in
|
|
|
@@ -142,7 +145,7 @@ training offers significant computational speedup by performing operations
|
|
|
in half-precision format, while storing minimal information in single-precision
|
|
|
to retain as much information as possible in critical parts of the network.
|
|
|
Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores)
|
|
|
-in the Volta and Turing architecture, significant training speedups are
|
|
|
+in Volta, and following with both the Turing and Ampere architectures, significant training speedups are
|
|
|
experienced by switching to mixed precision -- up to 3x overall speedup
|
|
|
on the most arithmetically intense model architectures. Using mixed precision
|
|
|
training requires two steps:
|
|
|
@@ -160,8 +163,6 @@ documentation.
|
|
|
- Techniques used for mixed precision training, see the [Mixed-Precision
|
|
|
Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)
|
|
|
blog.
|
|
|
-- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp)
|
|
|
-from the TensorFlow User Guide.
|
|
|
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools
|
|
|
for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
|
|
|
|
|
|
@@ -201,7 +202,7 @@ To enable mixed precision, you can:
|
|
|
optimizer = amp_handle.wrap_optimizer(optimizer)
|
|
|
```
|
|
|
- Scale loss before backpropagation (assuming loss is stored in a variable called `losses`)
|
|
|
- - Default backpropagate for FP32:
|
|
|
+ - Default backpropagation for FP32/TF32:
|
|
|
|
|
|
```
|
|
|
losses.backward()
|
|
|
@@ -213,6 +214,18 @@ To enable mixed precision, you can:
|
|
|
scaled_losses.backward()
|
|
|
```
|
|
|
|
|
|
+#### Enabling TF32
|
|
|
+
|
|
|
+TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
|
|
|
+
|
|
|
+TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models that require a high dynamic range for weights or activations.
|
|
|
+
|
|
|
+For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
|
|
|
+
|
|
|
+TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
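As a concrete example, recent PyTorch versions expose TF32 through global backend flags. The fragment below is a configuration illustration, not code from this repository, and the flags only have an effect on NVIDIA Ampere architecture GPUs:

```python
import torch

# On Ampere GPUs, TF32 is used by default for matrix multiplications
# and cuDNN convolutions; these flags let you disable or re-enable it.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```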
|
|
|
+
|
|
|
### Glossary
|
|
|
|
|
|
backbone
|
|
|
@@ -242,12 +255,15 @@ The following section lists the requirements in order to start training the SSD3
|
|
|
|
|
|
|
|
|
### Requirements
|
|
|
-This repository contains `Dockerfile` which extends the PyTorch 19.08 NGC container
|
|
|
+This repository contains `Dockerfile` which extends the PyTorch 20.06 NGC container
|
|
|
and encapsulates some dependencies. Aside from these dependencies,
|
|
|
ensure you have the following software:
|
|
|
* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
|
|
|
-* [PyTorch 19.08-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
|
|
|
-* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
|
|
|
+* [PyTorch 20.06-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
|
|
|
+* Supported GPU architectures:
|
|
|
+ * [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
|
|
|
+ * [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/)
|
|
|
+ * [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
|
|
|
|
|
|
For more information about how to get started with NGC containers, see the
|
|
|
following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
|
|
|
@@ -256,14 +272,14 @@ Documentation:
|
|
|
* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
|
|
|
* [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
|
|
|
|
|
|
-For those unable to use the [PyTorch 19.08-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch),
|
|
|
+For those unable to use the [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch),
|
|
|
to set up the required environment or create your own container,
|
|
|
see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
|
|
|
|
|
|
|
|
|
## Quick Start Guide
|
|
|
|
|
|
-To train your model using mixed precision with Tensor Cores or using FP32,
|
|
|
+To train your model using mixed precision or TF32 with Tensor Cores, or using FP32,
|
|
|
perform the following steps using the default parameters of the SSD v1.1 model
|
|
|
on the [COCO 2017](http://cocodataset.org/#download) dataset.
|
|
|
For the specifics concerning training and inference,
|
|
|
@@ -304,8 +320,8 @@ The example scripts need two arguments:
|
|
|
|
|
|
Remaining arguments are passed to the `main.py` script.
|
|
|
|
|
|
-The `--save` flag, saves the model after each epoch.
|
|
|
-The checkpoints are stored as `./models/epoch_*.pt`.
|
|
|
+The `--save <save_dir>` flag saves the model after each epoch in the `<save_dir>` directory.
|
|
|
+The checkpoints are stored as `<save_dir>/epoch_*.pt`.
|
|
|
|
|
|
Use `python main.py -h` to obtain the list of available options in the `main.py` script.
|
|
|
For example, if you want to run 8 GPU training with Tensor Core acceleration and
|
|
|
@@ -320,26 +336,6 @@ bash ./examples/SSD300_FP16_8GPU.sh . /coco --save
|
|
|
The `main.py` training script automatically runs validation during training.
|
|
|
The results from the validation are printed to `stdout`.
|
|
|
|
|
|
-Pycocotools’ open-sourced scripts provides a consistent way
|
|
|
-to evaluate models on the COCO dataset. We are using these scripts
|
|
|
-during validation to measure a models performance in AP metric.
|
|
|
-Metrics below are evaluated using pycocotools’ methodology, in the following format:
|
|
|
-```
|
|
|
- Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.250
|
|
|
- Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.423
|
|
|
- Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.257
|
|
|
- Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.076
|
|
|
- Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.269
|
|
|
- Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.399
|
|
|
- Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.237
|
|
|
- Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.342
|
|
|
- Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.358
|
|
|
- Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.118
|
|
|
- Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.394
|
|
|
- Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.548
|
|
|
-```
|
|
|
-The metric reported in our results is present in the first row.
|
|
|
-
|
|
|
To evaluate a checkpointed model saved in the previous point, run:
|
|
|
|
|
|
```
|
|
|
@@ -360,29 +356,6 @@ Start with running a Docker container with a Jupyter notebook server:
|
|
|
nvidia-docker run --rm -it --ulimit memlock=-1 --ulimit stack=67108864 -v $SSD_CHECKPINT_PATH:/checkpoints/SSD300v1.1.pt -v $COCO_PATH:/datasets/coco2017 --ipc=host -p 8888:8888 nvidia_ssd jupyter-notebook --ip 0.0.0.0 --allow-root
|
|
|
```
|
|
|
|
|
|
-The container prints Jupyter notebook logs like this:
|
|
|
-```
|
|
|
-[I 16:17:58.935 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
|
|
|
-[I 16:17:59.769 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
|
|
|
-[I 16:17:59.769 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
|
|
|
-[I 16:17:59.770 NotebookApp] Serving notebooks from local directory: /workspace
|
|
|
-[I 16:17:59.770 NotebookApp] The Jupyter Notebook is running at:
|
|
|
-[I 16:17:59.770 NotebookApp] http://(65935d756c71 or 127.0.0.1):8888/?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
|
|
|
-[I 16:17:59.770 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
|
|
|
-[W 16:17:59.774 NotebookApp] No web browser found: could not locate runnable browser.
|
|
|
-[C 16:17:59.774 NotebookApp]
|
|
|
-
|
|
|
- To access the notebook, open this file in a browser:
|
|
|
- file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
|
|
|
- Or copy and paste one of these URLs:
|
|
|
- http://(65935d756c71 or 127.0.0.1):8888/?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
|
|
|
-```
|
|
|
-
|
|
|
-Use the token printed in the last line to start your notebook session.
|
|
|
-The notebook is in `examples/inference.ipynb`, for example:
|
|
|
-
|
|
|
-http://127.0.0.1:8888/notebooks/examples/inference.ipynb?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
|
|
|
-
|
|
|
## Advanced
|
|
|
|
|
|
The following sections provide greater details of the dataset,
|
|
|
@@ -423,7 +396,7 @@ under the `/coco` directory.
|
|
|
: allows you to specify the path to the pre-trained model.
|
|
|
|
|
|
`--save`
|
|
|
-: when the flag is turned on, the script will save the trained model to the disc.
|
|
|
+: when the flag is turned on, the script will save the trained model checkpoints to the specified directory.
|
|
|
|
|
|
`--seed`
|
|
|
: Use it to specify the seed for RNGs.
|
|
|
@@ -530,7 +503,29 @@ the COCO dataset.
|
|
|
Which epochs should be evaluated can be reconfigured with the `--evaluation` argument.
|
|
|
|
|
|
To run training with Tensor Cores, use the `--amp` flag when running the `main.py` script.
|
|
|
-The flag `--save` flag enables storing checkpoints after each epoch under `./models/epoch_*.pt`.
|
|
|
+The `--save ./models` flag enables storing checkpoints after each epoch under `./models/epoch_*.pt`.
|
|
|
+
|
|
|
+### Evaluation process
|
|
|
+
|
|
|
+Pycocotools’ open-sourced scripts provide a consistent way
|
|
|
+to evaluate models on the COCO dataset. We are using these scripts
|
|
|
+during validation to measure a model’s performance in the AP metric.
|
|
|
+The metrics below are evaluated using pycocotools’ methodology, in the following format:
|
|
|
+```
|
|
|
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.250
|
|
|
+ Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.423
|
|
|
+ Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.257
|
|
|
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.076
|
|
|
+ Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.269
|
|
|
+ Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.399
|
|
|
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.237
|
|
|
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.342
|
|
|
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.358
|
|
|
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.118
|
|
|
+ Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.394
|
|
|
+ Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.548
|
|
|
+```
|
|
|
+The metric reported in our results is the one in the first row (AP at IoU=0.50:0.95).
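The IoU thresholds in the listing above refer to intersection over union between a predicted and a ground-truth box. A minimal sketch of that computation, assuming `(x1, y1, x2, y2)` corner coordinates (a hypothetical helper, not pycocotools code):

```python
def box_iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A detection counts as a true positive at, say, IoU=0.50 only if its overlap with a ground-truth box of the same class is at least 0.5.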
|
|
|
|
|
|
### Inference process
|
|
|
|
|
|
@@ -539,10 +534,37 @@ To get meaningful results, you need a pre-trained model checkpoint.
|
|
|
|
|
|
One way is to run an interactive session on Jupyter notebook, as described in a 8th step of the [Quick Start Guide](#quick-start-guide).
|
|
|
|
|
|
+The container prints Jupyter notebook logs like this:
|
|
|
+```
|
|
|
+[I 16:17:58.935 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
|
|
|
+[I 16:17:59.769 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
|
|
|
+[I 16:17:59.769 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
|
|
|
+[I 16:17:59.770 NotebookApp] Serving notebooks from local directory: /workspace
|
|
|
+[I 16:17:59.770 NotebookApp] The Jupyter Notebook is running at:
|
|
|
+[I 16:17:59.770 NotebookApp] http://(65935d756c71 or 127.0.0.1):8888/?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
|
|
|
+[I 16:17:59.770 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
|
|
|
+[W 16:17:59.774 NotebookApp] No web browser found: could not locate runnable browser.
|
|
|
+[C 16:17:59.774 NotebookApp]
|
|
|
+
|
|
|
+ To access the notebook, open this file in a browser:
|
|
|
+ file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
|
|
|
+ Or copy and paste one of these URLs:
|
|
|
+ http://(65935d756c71 or 127.0.0.1):8888/?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
|
|
|
+```
|
|
|
+
|
|
|
+Use the token printed in the last line to start your notebook session.
|
|
|
+The notebook is in `examples/inference.ipynb`, for example:
|
|
|
+
|
|
|
+http://127.0.0.1:8888/notebooks/examples/inference.ipynb?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
|
|
|
+
|
|
|
Another way is to run a script `examples/SSD300_inference.py`. It contains the logic from the notebook, wrapped into a Python script. The script contains sample usage.
|
|
|
|
|
|
To use the inference example script in your own code, you can call the `main` function, providing input image URIs as an argument. The result will be a list of detections for each input image.
|
|
|
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
## Performance
|
|
|
|
|
|
### Benchmarking
|
|
|
@@ -551,7 +573,7 @@ The following section shows how to run benchmarks measuring the model performanc
|
|
|
|
|
|
#### Training performance benchmark
|
|
|
|
|
|
-The training benchmark was run in various scenarios on V100 16G GPU. For each scenario, the batch size was set to 32. The benchmark does not require a checkpoint from a fully trained model.
|
|
|
+The training benchmark was run in various scenarios on A100 40GB and V100 16GB GPUs. The benchmark does not require a checkpoint from a fully trained model.
|
|
|
|
|
|
To benchmark training, run:
|
|
|
```
|
|
|
@@ -573,7 +595,7 @@ Tensor Cores, and the `{data}` is the location of the COCO 2017 dataset.
|
|
|
|
|
|
#### Inference performance benchmark
|
|
|
|
|
|
-Inference benchmark was run on 1x V100 16G GPU. To benchmark inference, run:
|
|
|
+The inference benchmark was run on 1x A100 40GB GPU and 1x V100 16GB GPU. To benchmark inference, run:
|
|
|
```
|
|
|
python main.py --eval-batch-size {bs} \
|
|
|
--mode benchmark-inference \
|
|
|
@@ -593,66 +615,130 @@ The following sections provide details on how we achieved our performance and ac
|
|
|
|
|
|
#### Training accuracy results
|
|
|
|
|
|
+##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
|
|
|
|
|
|
-##### NVIDIA DGX-1 (8x V100 16G)
|
|
|
+Our results were obtained by running the `./examples/SSD300_A100_{FP16,TF32}_{1,4,8}GPU.sh`
|
|
|
+script in the `pytorch-20.06-py3` NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
|
|
|
+
|
|
|
+|GPUs |Batch size / GPU|Accuracy - TF32|Accuracy - mixed precision|Time to train - TF32|Time to train - mixed precision|Time to train speedup (TF32 to mixed precision)|
|
|
|
+|-----------|----------------|---------------|---------------------------|--------------------|--------------------------------|------------------------------------------------|
|
|
|
+|1 |64 |0.251 |0.252 |16:00:00 |8:00:00 |200.00% |
|
|
|
+|4 |64 |0.250 |0.251 |3:00:00 |1:36:00 |187.50% |
|
|
|
+|8 |64 |0.252 |0.251 |1:40:00 |1:00:00 |167.00% |
|
|
|
+|1 |128 |0.251 |0.251 |13:05:00 |7:00:00 |189.05% |
|
|
|
+|4 |128 |0.252 |0.253 |2:45:00 |1:30:00 |183.33% |
|
|
|
+|8 |128 |0.248 |0.249 |1:20:00 |0:43:00 |186.00% |
|
|
|
+
|
|
|
+##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
|
|
|
|
|
|
Our results were obtained by running the `./examples/SSD300_FP{16,32}_{1,4,8}GPU.sh`
|
|
|
-script in the `pytorch-19.08-py3` NGC container on NVIDIA DGX-1 with 8x
|
|
|
-V100 16G GPUs. Performance numbers (in items/images per second) were averaged
|
|
|
-over an entire training epoch.
|
|
|
+script in the `pytorch-20.06-py3` NGC container on NVIDIA DGX-1 with 8x
|
|
|
+V100 16GB GPUs.
|
|
|
|
|
|
|GPUs |Batch size / GPU|Accuracy - FP32|Accuracy - mixed precision|Time to train - FP32|Time to train - mixed precision|Time to train speedup (FP32 to mixed precision)|
|
|
|
|-----------|----------------|---------------|---------------------------|--------------------|--------------------------------|------------------------------------------------|
|
|
|
|1 |32 |0.250 |0.250 |20:20:13 |10:23:46 |195.62% |
|
|
|
|4 |32 |0.249 |0.250 |5:11:17 |2:39:28 |195.20% |
|
|
|
-|8 |32 |0.250 |0.250 |2:37:35 |1:25:38 |184.01% |
|
|
|
+|8 |32 |0.250 |0.250 |2:37:00 |1:32:00 |170.60% |
|
|
|
|1 |64 |<N/A> |0.252 |<N/A> |9:27:33 |215.00% |
|
|
|
|4 |64 |<N/A> |0.251 |<N/A> |2:24:43 |215.10% |
|
|
|
-|8 |64 |<N/A> |0.252 |<N/A> |1:13:01 |215.85% |
|
|
|
+|8 |64 |<N/A> |0.252 |<N/A> |1:31:00 |172.50% |
|
|
|
+
|
|
|
+Due to their smaller memory footprint, mixed precision models can be trained with bigger batches. In such cases, the mixed precision speedup is calculated against FP32 training with the maximum batch size for that precision.
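The time-to-train speedup columns are simply the ratio of the two wall-clock times, expressed as a percentage. A small helper illustrating the arithmetic (hypothetical, not part of the training scripts):

```python
def to_seconds(hms):
    # Parse an "H:MM:SS" time-to-train entry into seconds.
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def speedup_percent(baseline, accelerated):
    # For example, 20:20:13 (FP32) vs 10:23:46 (AMP) gives ~195.62%.
    return 100.0 * to_seconds(baseline) / to_seconds(accelerated)
```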
|
|
|
|
|
|
-Here are example graphs of FP32 and FP16 training on 8 GPU configuration:
|
|
|
+##### Training loss plot
|
|
|
+
|
|
|
+Here are example graphs of FP32, TF32 and AMP training on an 8-GPU configuration:
|
|
|
|
|
|

|
|
|
|
|
|
-
|
|
|
+##### Training stability test
|
|
|
+
|
|
|
+The SSD300 v1.1 model was trained for 65 epochs, starting
|
|
|
+from 15 different initial random seeds. The training was performed in the `pytorch-20.06-py3` NGC container on
|
|
|
+NVIDIA DGX A100 (8x A100 40GB) GPUs with batch size per GPU = 128.
|
|
|
+After training, the models were evaluated on the test dataset. The following
|
|
|
+table summarizes the final mAP on the test set.
|
|
|
+
|
|
|
+|**Precision**|**Average mAP**|**Standard deviation**|**Minimum**|**Maximum**|**Median**|
|
|
|
+|------------:|--------------:|---------------------:|----------:|----------:|---------:|
|
|
|
+| AMP | 0.2491314286 | 0.001498316675 | 0.24456 | 0.25182 | 0.24907 |
|
|
|
+| TF32 | 0.2489106667 | 0.001749463047 | 0.24487 | 0.25148 | 0.24848 |
|
|
|
|
|
|
|
|
|
#### Training performance results
|
|
|
|
|
|
-##### NVIDIA DGX-1 (8x V100 16G)
|
|
|
+##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
|
|
|
|
|
|
Our results were obtained by running the `main.py` script with the `--mode
|
|
|
-benchmark-training` flag in the `pytorch-19.08-py3` NGC container on NVIDIA
|
|
|
-DGX-1 with 8x V100 16G GPUs. Performance numbers (in items/images per second)
|
|
|
+benchmark-training` flag in the `pytorch-20.06-py3` NGC container on NVIDIA
|
|
|
+DGX A100 (8x A100 40GB) GPUs. Performance numbers (in items/images per second)
|
|
|
+were averaged over an entire training epoch.
|
|
|
+
|
|
|
+|GPUs |Batch size / GPU|Throughput - TF32|Throughput - mixed precision|Throughput speedup (TF32 - mixed precision)|Weak scaling - TF32 |Weak scaling - mixed precision |
|
|
|
+|-----------|----------------|-----------------|-----------------------------|-------------------------------------------|--------------------------------|------------------------------------------------|
|
|
|
+|1 |64 |201.43 |367.15 |182.27% |100.00% |100.00% |
|
|
|
+|4 |64 |791.50 |1,444.00 |182.44% |392.94% |393.30% |
|
|
|
+|8 |64 |1,582.72 |2,872.48 |181.49% |785.74% |782.37% |
|
|
|
+|1 |128 |206.28 |387.95 |188.07% |100.00% |100.00% |
|
|
|
+|4 |128 |822.39 |1,530.15 |186.06% |398.68% |397.73% |
|
|
|
+|8 |128 |1,647.00 |3,092.00 |187.74% |798.43% |773.00% |
|
|
|
+
|
|
|
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
|
|
|
+
|
|
|
+##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
|
|
|
+
|
|
|
+Our results were obtained by running the `main.py` script with the `--mode
|
|
|
+benchmark-training` flag in the `pytorch-20.06-py3` NGC container on NVIDIA
|
|
|
+DGX-1 with 8x V100 16GB GPUs. Performance numbers (in items/images per second)
|
|
|
were averaged over an entire training epoch.
|
|
|
|
|
|
|GPUs |Batch size / GPU|Throughput - FP32|Throughput - mixed precision|Throughput speedup (FP32 - mixed precision)|Weak scaling - FP32 |Weak scaling - mixed precision |
|
|
|
|-----------|----------------|-----------------|-----------------------------|-------------------------------------------|--------------------------------|------------------------------------------------|
|
|
|
|1 |32 |133.67 |215.30 |161.07% |100.00% |100.00% |
|
|
|
|4 |32 |532.05 |828.63 |155.74% |398.04% |384.88% |
|
|
|
-|8 |32 |1,060.33 |1,647.74 |155.40% |793.27% |765.33% |
|
|
|
+|8 |32 |820.70 |1,647.74 |200.77% |614.02% |802.00% |
|
|
|
|1 |64 |<N/A> |232.22 |173.73% |<N/A> |100.00% |
|
|
|
|4 |64 |<N/A> |910.77 |171.18% |<N/A> |392.20% |
|
|
|
-|8 |64 |<N/A> |1,769.48 |166.88% |<N/A> |761.99% |
|
|
|
+|8 |64 |<N/A> |1,728.00 |210.55% |<N/A> |761.99% |
|
|
|
+
|
|
|
+Due to their smaller memory footprint, mixed precision models can be trained with bigger batches. In such cases, the mixed precision speedup is calculated against FP32 training with the maximum batch size for that precision.
|
|
|
|
|
|
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
|
|
|
|
|
|
#### Inference performance results
|
|
|
|
|
|
+##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
|
|
|
|
|
|
-##### NVIDIA DGX-1 (1x V100 16G)
|
|
|
+Our results were obtained by running the `main.py` script with `--mode
|
|
|
+benchmark-inference` flag in the `pytorch-20.06-py3` NGC container on NVIDIA
|
|
|
+DGX A100 (1x A100 40GB) GPU.
|
|
|
+
|
|
|
+|Batch size |Throughput - TF32|Throughput - mixed precision|Throughput speedup (TF32 - mixed precision)|Weak scaling - TF32 |Weak scaling - mixed precision |
|
|
|
+|-----------|-----------------|-----------------------------|-------------------------------------------|--------------------|--------------------------------|
|
|
|
+|1 |113.51 |109.93 | 96.85% |100.00% |100.00% |
|
|
|
+|2 |203.07 |214.43 |105.59% |178.90% |195.06% |
|
|
|
+|4 |338.76 |368.45 |108.76% |298.30% |335.17% |
|
|
|
+|8 |485.65 |526.97 |108.51% |427.85% |479.37% |
|
|
|
+|16 |493.64 |867.42 |175.72% |434.89% |789.07% |
|
|
|
+|32         |548.75           |910.17                       |165.86%                                    |483.44%             |827.95%                         |
|
|
|
+
|
|
|
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
|
|
|
+
|
|
|
+##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
|
|
|
|
|
|
Our results were obtained by running the `main.py` script with `--mode
|
|
|
-benchmark-inference` flag in the pytorch-19.08-py3 NGC container on NVIDIA
|
|
|
-DGX-1 with (1x V100 16G) GPUs.
|
|
|
+benchmark-inference` flag in the `pytorch-20.06-py3` NGC container on NVIDIA
|
|
|
+DGX-1 with 1x V100 16GB GPU.
|
|
|
|
|
|
|Batch size |Throughput - FP32|Throughput - mixed precision|Throughput speedup (FP32 - mixed precision)|Weak scaling - FP32 |Weak scaling - mixed precision |
|
|
|
|-----------|-----------------|-----------------------------|-------------------------------------------|--------------------|--------------------------------|
|
|
|
-|2 |148.99 |186.60 |125.24% |100.00% |100.00% |
|
|
|
-|4 |203.35 |326.69 |160.66% |136.48% |175.08% |
|
|
|
-|8 |227.32 |433.45 |190.68% |152.57% |232.29% |
|
|
|
-|16 |278.02 |493.19 |177.39% |186.60% |264.31% |
|
|
|
-|32 |299.81 |545.84 |182.06% |201.23% |292.53% |
|
|
|
+|1 |82.50 |80.50 | 97.58% |100.00% |100.00% |
|
|
|
+|2 |124.05 |147.46 |118.87% |150.36% |183.18% |
|
|
|
+|4 |155.51 |255.16 |164.08% |188.50% |316.97% |
|
|
|
+|8 |182.37 |334.94 |183.66% |221.05% |416.07% |
|
|
|
+|16 |222.83 |358.25 |160.77% |270.10% |445.03% |
|
|
|
+|32 |271.73 |438.85 |161.50% |329.37% |545.16% |
|
|
|
|
|
|
To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
|
|
|
|
|
|
@@ -660,6 +746,11 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
|
|
|
|
|
|
### Changelog
|
|
|
|
|
|
+June 2020
|
|
|
+ * upgrade the PyTorch container to 20.06
|
|
|
+ * update performance tables to include A100 results
|
|
|
+ * update examples with A100 configs
|
|
|
+
|
|
|
August 2019
|
|
|
* upgrade the PyTorch container to 19.08
|
|
|
* update Results section in the README
|