
Merge pull request #591 from NVIDIA/ssds-ampere

Ssds ampere
nv-kkudrynski committed 5 years ago
parent
commit
7f4ea44729
30 files changed, with 792 additions and 197 deletions
  1. +3 -0
      PyTorch/Detection/SSD/.gitmodules
  2. +3 -6
      PyTorch/Detection/SSD/Dockerfile
  3. +193 -102
      PyTorch/Detection/SSD/README.md
  4. +4 -0
      PyTorch/Detection/SSD/examples/SSD300_A100_FP16_1GPU.sh
  5. +4 -0
      PyTorch/Detection/SSD/examples/SSD300_A100_FP16_4GPU.sh
  6. +4 -0
      PyTorch/Detection/SSD/examples/SSD300_A100_FP16_8GPU.sh
  7. +4 -0
      PyTorch/Detection/SSD/examples/SSD300_A100_FP32_8GPU.sh
  8. BIN
      PyTorch/Detection/SSD/img/training_loss.png
  9. BIN
      PyTorch/Detection/SSD/img/validation_accuracy.png
  10. +49 -11
      PyTorch/Detection/SSD/main.py
  11. +6 -6
      PyTorch/Detection/SSD/src/coco_pipeline.py
  12. +53 -8
      PyTorch/Detection/SSD/src/logger.py
  13. +6 -2
      PyTorch/Detection/SSD/src/train.py
  14. +1 -1
      PyTorch/Detection/SSD/src/utils.py
  15. +10 -8
      TensorFlow/Detection/SSD/Dockerfile
  16. +316 -34
      TensorFlow/Detection/SSD/README.md
  17. +1 -2
      TensorFlow/Detection/SSD/examples/SSD320_FP16_1GPU.sh
  18. +1 -2
      TensorFlow/Detection/SSD/examples/SSD320_FP16_1GPU_BENCHMARK.sh
  19. +1 -2
      TensorFlow/Detection/SSD/examples/SSD320_FP16_4GPU.sh
  20. +1 -2
      TensorFlow/Detection/SSD/examples/SSD320_FP16_4GPU_BENCHMARK.sh
  21. +1 -2
      TensorFlow/Detection/SSD/examples/SSD320_FP16_8GPU.sh
  22. +1 -2
      TensorFlow/Detection/SSD/examples/SSD320_FP16_8GPU_BENCHMARK.sh
  23. +24 -4
      TensorFlow/Detection/SSD/examples/SSD320_inference.py
  24. BIN
      TensorFlow/Detection/SSD/img/training_loss.png
  25. BIN
      TensorFlow/Detection/SSD/img/validation_accuracy.png
  26. +3 -0
      TensorFlow/Detection/SSD/models/research/object_detection/builders/dataset_builder.py
  27. +3 -0
      TensorFlow/Detection/SSD/models/research/object_detection/metrics/coco_tools.py
  28. +3 -1
      TensorFlow/Detection/SSD/models/research/object_detection/model_lib.py
  29. +41 -2
      TensorFlow/Detection/SSD/models/research/object_detection/model_main.py
  30. +56 -0
      TensorFlow/Detection/SSD/models/research/object_detection/utils/exp_utils.py

+ 3 - 0
PyTorch/Detection/SSD/.gitmodules

@@ -0,0 +1,3 @@
+[submodule "submodules/dllogger"]
+	path = submodules/dllogger
+	url = ssh://[email protected]:12051/dl/JoC/dllogger.git

+ 3 - 6
PyTorch/Detection/SSD/Dockerfile

@@ -1,16 +1,13 @@
-FROM nvcr.io/nvidia/pytorch:19.08-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.06-py3
+FROM ${FROM_IMAGE_NAME}
 
 # Set working directory
 WORKDIR /workspace
 
 ENV PYTHONPATH "${PYTHONPATH}:/workspace"
 
-RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y python3-tk python-pip git tmux htop tree
-
-# Necessary pip packages
-RUN pip install --upgrade pip
-
 COPY requirements.txt .
+RUN pip install --no-cache-dir git+https://github.com/NVIDIA/dllogger.git#egg=dllogger
 RUN pip install -r requirements.txt
 RUN python3 -m pip install pycocotools==2.0.0
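
The `ARG FROM_IMAGE_NAME` / `FROM ${FROM_IMAGE_NAME}` pattern introduced above lets the base container be overridden at build time without editing the Dockerfile. A hedged usage sketch (the alternate image tag is an example, not taken from this PR):

```
# Build with the default base image (pytorch:20.06-py3)
docker build -t nvidia_ssd .

# Build against a different NGC release by overriding the build argument
docker build --build-arg FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.07-py3 -t nvidia_ssd .
```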
 

+ 193 - 102
PyTorch/Detection/SSD/README.md

@@ -10,6 +10,7 @@ This repository provides a script and recipe to train the SSD300 v1.1 model to a
         * [Features](#features)
     * [Mixed precision training](#mixed-precision-training)
         * [Enabling mixed precision](#enabling-mixed-precision)
+        * [Enabling TF32](#enabling-tf32)
 - [Setup](#setup)
     * [Requirements](#requirements)
 - [Quick Start Guide](#quick-start-guide)
@@ -22,6 +23,7 @@ This repository provides a script and recipe to train the SSD300 v1.1 model to a
             * [Data preprocessing](#data-preprocessing)
             * [Data augmentation](#data-augmentation)
     * [Training process](#training-process)
+    * [Evaluation process](#evaluation-process)
     * [Inference process](#inference-process)
 - [Performance](#performance)
     * [Benchmarking](#benchmarking)
@@ -29,11 +31,16 @@ This repository provides a script and recipe to train the SSD300 v1.1 model to a
         * [Inference performance benchmark](#inference-performance-benchmark)
     * [Results](#results)
         * [Training accuracy results](#training-accuracy-results)
-            * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
+            * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
+            * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
+            * [Training loss plot](#training-loss-plot)
+            * [Training stability test](#training-stability-test)
         * [Training performance results](#training-performance-results)
-            * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g-1)
+            * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb) 
+            * [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
         * [Inference performance results](#inference-performance-results)
-            * [NVIDIA DGX-1 (1x V100 16G)](#nvidia-dgx-1-1x-v100-16g)
+            * [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-1x-a100-40gb)
+            * [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
 - [Release notes](#release-notes)
     * [Changelog](#changelog)
     * [Known issues](#known-issues)
@@ -67,9 +74,9 @@ To fully utilize GPUs during training we are using the
 [NVIDIA DALI](https://github.com/NVIDIA/DALI) library
 to accelerate data preparation pipelines.
 
-This model is trained with mixed precision using Tensor Cores on NVIDIA
-Volta and Turing GPUs. Therefore, researchers can get results 2x faster
-than training without Tensor Cores, while experiencing the benefits of
+This model is trained with mixed precision using Tensor Cores on Volta, Turing,
+and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results
+2x faster than training without Tensor Cores, while experiencing the benefits of
 mixed precision training. This model is tested against each NGC monthly
 container release to ensure consistent accuracy and performance over time.
 
@@ -109,31 +116,27 @@ To enable warmup provide argument the `--warmup 300`
 by the number of GPUs and multiplied by the batch size divided by 32).
 
 ### Feature support matrix
-
-The following features are supported by this model.
-
-| Feature               | SSD300 v1.1 PyTorch             |
-|-----------------------|--------------------------
-|Multi-GPU training with [Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)  |  Yes |
-|[NVIDIA DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html)                |  Yes |
-
+ 
+The following features are supported by this model.  
+ 
+| **Feature** | **SSD300 v1.1 PyTorch** |
+|:---------:|:----------:|
+|[APEX AMP](https://github.com/NVIDIA/apex)                                             |  Yes |
+|[APEX DDP](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html)               |  Yes |
+|[NVIDIA DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html)  |  Yes |
 
 #### Features
+ 
+[APEX](https://github.com/NVIDIA/apex) is a PyTorch extension with NVIDIA-maintained utilities to streamline mixed precision and distributed training, whereas [AMP](https://nvidia.github.io/apex/amp.html) is an abbreviation used for automatic mixed precision training.
+ 
+[DDP](https://nvidia.github.io/apex/parallel.html) stands for DistributedDataParallel and is used for multi-GPU training.
 
-Multi-GPU training with Distributed Data Parallel - Our model uses Apex's
-DDP to implement efficient multi-GPU training with NCCL.
-To enable multi-GPU training with DDP, you have to wrap your model
-with a proper class, and change the way you launch training.
-For details, see example sources in this repo or see
-the [PyTorch tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
-
-NVIDIA DALI - DALI is a library accelerating data preparation pipeline.
+[NVIDIA DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html) - DALI is a library that accelerates data preparation pipelines.
 To accelerate your input pipeline, you only need to define your data loader
 with the DALI library.
 For details, see example sources in this repo or see
 the [DALI documentation](https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html)
 
-
 ### Mixed precision training
 
 Mixed precision is the combined use of different numerical precisions in
@@ -142,7 +145,7 @@ training offers significant computational speedup by performing operations
 in half-precision format, while storing minimal information in single-precision
 to retain as much information as possible in critical parts of the network.
 Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores)
-in the Volta and Turing architecture, significant training speedups are
+in the Volta, Turing, and Ampere architectures, significant training speedups are
 experienced by switching to mixed precision -- up to 3x overall speedup
 on the most arithmetically intense model architectures. Using mixed precision
 training requires two steps:
@@ -160,8 +163,6 @@ documentation.
 -   Techniques used for mixed precision training, see the [Mixed-Precision
 Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)
 blog.
--   How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp)
-from the TensorFlow User Guide.
 -   APEX tools for mixed precision training, see the [NVIDIA Apex: Tools
 for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
 
@@ -201,7 +202,7 @@ To enable mixed precision, you can:
   optimizer = amp_handle.wrap_optimizer(optimizer)
   ```
 - Scale loss before backpropagation (assuming loss is stored in a variable called `losses`)
-  - Default backpropagate for FP32:
+  - Default backpropagate for FP32/TF32:
 
     ```
     losses.backward()
@@ -213,6 +214,18 @@ To enable mixed precision, you can:
        scaled_losses.backward()
     ```
 
+#### Enabling TF32
+
+
+TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
+
+TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
+
+For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
+
+TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
+
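
Because TF32 is on by default, the practical control is opting out. A minimal sketch, assuming a container whose NVIDIA math libraries honor the `NVIDIA_TF32_OVERRIDE` environment variable (this knob is our example and is not mentioned in this README):

```shell
# Ask NVIDIA math libraries (cuBLAS, cuDNN) to compute in FP32 instead of TF32.
# Leaving the variable unset keeps TF32 enabled, the default on Ampere GPUs.
export NVIDIA_TF32_OVERRIDE=0
```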
 ### Glossary
 
 backbone
@@ -242,12 +255,15 @@ The following section lists the requirements in order to start training the SSD3
 
 
 ### Requirements
-This repository contains `Dockerfile` which extends the PyTorch 19.08 NGC container
+This repository contains `Dockerfile` which extends the PyTorch 20.06 NGC container
 and encapsulates some dependencies.  Aside from these dependencies,
 ensure you have the following software:
 * [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 19.08-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
-* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
+* [PyTorch 20.06-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
+* GPU-based architecture:
+    * [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
+    * [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/)
+    * [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
 
 For more information about how to get started with NGC containers, see the
 following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
@@ -256,14 +272,14 @@ Documentation:
 * [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
 * [Running PyTorch](https://docs.nvidia.com/deeplearning/dgx/pytorch-release-notes/running.html#running)
 
-For those unable to use the [PyTorch 19.08-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch),
+For those unable to use the [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch),
 to set up the required environment or create your own container,
 see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
 
 
 ## Quick Start Guide
 
-To train your model using mixed precision with Tensor Cores or using FP32,
+To train your model using mixed or TF32 precision with Tensor Cores or using FP32,
 perform the following steps using the default parameters of the SSD v1.1 model
 on the [COCO 2017](http://cocodataset.org/#download) dataset.
 For the specifics concerning training and inference,
@@ -304,8 +320,8 @@ The example scripts need two arguments:
 
 Remaining arguments are passed to the `main.py` script.
 
-The `--save` flag, saves the model after each epoch.
-The checkpoints are stored as `./models/epoch_*.pt`.
+The `--save save_dir` flag saves the model after each epoch in the `save_dir` directory.
+The checkpoints are stored as `<save_dir>/epoch_*.pt`.
 
 Use `python main.py -h` to obtain the list of available options in the `main.py` script.
 For example, if you want to run 8 GPU training with Tensor Core acceleration and
@@ -320,26 +336,6 @@ bash ./examples/SSD300_FP16_8GPU.sh . /coco --save
 The `main.py` training script automatically runs validation during training.
 The results from the validation are printed to `stdout`.
 
-Pycocotools’ open-sourced scripts provides a consistent way
-to evaluate models on the COCO dataset. We are using these scripts
-during validation to measure a models performance in AP metric.
-Metrics below are evaluated using pycocotools’ methodology, in the following format:
-```
- Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.250
- Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.423
- Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.257
- Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.076
- Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.269
- Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.399
- Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.237
- Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.342
- Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.358
- Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.118
- Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.394
- Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.548
-```
-The metric reported in our results is present in the first row.
-
 To evaluate a checkpointed model saved in the previous point, run:
 
 ```
@@ -360,29 +356,6 @@ Start with running a Docker container with a Jupyter notebook server:
 nvidia-docker run --rm -it --ulimit memlock=-1 --ulimit stack=67108864 -v $SSD_CHECKPINT_PATH:/checkpoints/SSD300v1.1.pt -v $COCO_PATH:/datasets/coco2017 --ipc=host -p 8888:8888 nvidia_ssd jupyter-notebook --ip 0.0.0.0 --allow-root
 ```
 
-The container prints Jupyter notebook logs like this:
-```
-[I 16:17:58.935 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
-[I 16:17:59.769 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
-[I 16:17:59.769 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
-[I 16:17:59.770 NotebookApp] Serving notebooks from local directory: /workspace
-[I 16:17:59.770 NotebookApp] The Jupyter Notebook is running at: 
-[I 16:17:59.770 NotebookApp] http://(65935d756c71 or 127.0.0.1):8888/?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
-[I 16:17:59.770 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
-[W 16:17:59.774 NotebookApp] No web browser found: could not locate runnable browser.
-[C 16:17:59.774 NotebookApp] 
-        
-    To access the notebook, open this file in a browser:
-        file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
-    Or copy and paste one of these URLs:
-        http://(65935d756c71 or 127.0.0.1):8888/?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
-```
-
-Use the token printed in the last line to start your notebook session.
-The notebook is in `examples/inference.ipynb`, for example:
-
-http://127.0.0.1:8888/notebooks/examples/inference.ipynb?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
-
 ## Advanced
 
 The following sections provide greater details of the dataset,
@@ -423,7 +396,7 @@ under the `/coco` directory.
 : allows you to specify the path to the pre-trained model.
 
 `--save`
-: when the flag is turned on, the script will save the trained model to the disc.
+: when used, the script saves the trained model checkpoints in the specified directory.
 
 `--seed`
 : Use it to specify the seed for RNGs.
@@ -530,7 +503,29 @@ the COCO dataset.
  Which epochs should be evaluated can be reconfigured with the `--evaluation` argument.
 
 To run training with Tensor Cores, use the `--amp` flag when running the `main.py` script.
-The flag `--save` flag enables storing checkpoints after each epoch under `./models/epoch_*.pt`.
+The `--save ./models` flag enables storing checkpoints after each epoch under `./models/epoch_*.pt`.
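
As a convenience for resuming from the newest checkpoint produced by `--save`, a small helper can pick the latest `epoch_*.pt` file. A sketch under the checkpoint layout described above (the helper name is ours, not part of the repo):

```python
from pathlib import Path

def latest_checkpoint(save_dir):
    """Return the newest epoch_*.pt in save_dir, or None if there are none."""
    ckpts = sorted(Path(save_dir).glob("epoch_*.pt"),
                   key=lambda p: int(p.stem.split("_")[1]))  # numeric epoch sort
    return ckpts[-1] if ckpts else None
```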
+
+### Evaluation process
+
+Pycocotools’ open-sourced scripts provide a consistent way
+to evaluate models on the COCO dataset. We use these scripts
+during validation to measure a model's performance in the AP metric.
+Metrics below are evaluated using pycocotools’ methodology, in the following format:
+```
+ Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.250
+ Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.423
+ Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.257
+ Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.076
+ Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.269
+ Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.399
+ Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.237
+ Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.342
+ Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.358
+ Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.118
+ Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.394
+ Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.548
+```
+The metric reported in our results is present in the first row.
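
That first row can also be pulled out of the pycocotools summary printout programmatically; a minimal sketch (the function name is ours, not from the repo):

```python
def primary_ap(coco_summary):
    """Extract the first-row AP (IoU=0.50:0.95, area=all, maxDets=100)
    from a pycocotools summary printout."""
    for line in coco_summary.splitlines():
        flat = line.replace(" ", "")  # normalize column padding
        if (flat.startswith("AveragePrecision")
                and "IoU=0.50:0.95" in flat and "area=all|" in flat):
            return float(flat.rsplit("=", 1)[1])
    raise ValueError("no matching AP line found")
```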
 
 ### Inference process
 
@@ -539,10 +534,37 @@ To get meaningful results, you need a pre-trained model checkpoint.
 
 One way is to run an interactive session on Jupyter notebook, as described in a 8th step of the [Quick Start Guide](#quick-start-guide).
 
+The container prints Jupyter notebook logs like this:
+```
+[I 16:17:58.935 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
+[I 16:17:59.769 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
+[I 16:17:59.769 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
+[I 16:17:59.770 NotebookApp] Serving notebooks from local directory: /workspace
+[I 16:17:59.770 NotebookApp] The Jupyter Notebook is running at: 
+[I 16:17:59.770 NotebookApp] http://(65935d756c71 or 127.0.0.1):8888/?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
+[I 16:17:59.770 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
+[W 16:17:59.774 NotebookApp] No web browser found: could not locate runnable browser.
+[C 16:17:59.774 NotebookApp] 
+        
+    To access the notebook, open this file in a browser:
+        file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
+    Or copy and paste one of these URLs:
+        http://(65935d756c71 or 127.0.0.1):8888/?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
+```
+
+Use the token printed in the last line to start your notebook session.
+The notebook is in `examples/inference.ipynb`, for example:
+
+http://127.0.0.1:8888/notebooks/examples/inference.ipynb?token=04c78049c67f45a4d759c8f6ddd0b2c28ac4eab60d81be4e
+
+Another way is to run the `examples/SSD300_inference.py` script. It contains the logic from the notebook, wrapped into a Python script. The script contains sample usage.
 
 To use the inference example script in your own code, you can call the `main` function, providing input image URIs as an argument. The result will be a list of detections for each input image.
 
+
 ## Performance
 
 ### Benchmarking
@@ -551,7 +573,7 @@ The following section shows how to run benchmarks measuring the model performanc
 
 #### Training performance benchmark
 
-The training benchmark was run in various scenarios on V100 16G GPU. For each scenario, the batch size was set to 32. The benchmark does not require a checkpoint from a fully trained model.
+The training benchmark was run in various scenarios on A100 40GB and V100 16GB GPUs. The benchmark does not require a checkpoint from a fully trained model.
 
 To benchmark training, run:
 ```
@@ -573,7 +595,7 @@ Tensor Cores, and the `{data}` is the location of the COCO 2017 dataset.
 
 #### Inference performance benchmark
 
-Inference benchmark was run on 1x V100 16G GPU.  To benchmark inference, run:
+Inference benchmark was run on 1x A100 40GB GPU and 1x V100 16GB GPU. To benchmark inference, run:
 ```
 python main.py --eval-batch-size {bs} \
                --mode benchmark-inference \
@@ -593,66 +615,130 @@ The following sections provide details on how we achieved our performance and ac
 
 #### Training accuracy results
 
+##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
 
-##### NVIDIA DGX-1 (8x V100 16G)
+Our results were obtained by running the `./examples/SSD300_A100_{FP16,TF32}_{1,4,8}GPU.sh`
+script in the `pytorch-20.06-py3` NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
+
+|GPUs       |Batch size / GPU|Accuracy - TF32|Accuracy  - mixed precision|Time to train - TF32|Time to train  - mixed precision|Time to train speedup  (TF32 to mixed precision)|
+|-----------|----------------|---------------|---------------------------|--------------------|--------------------------------|------------------------------------------------|
+|1          |64              |0.251          |0.252                      |16:00:00            |8:00:00                         |200.00%                                         |
+|4          |64              |0.250          |0.251                      |3:00:00             |1:36:00                         |187.50%                                         |
+|8          |64              |0.252          |0.251                      |1:40:00             |1:00:00                         |167.00%                                         |
+|1          |128             |0.251          |0.251                      |13:05:00            |7:00:00                         |189.05%                                         |               
+|4          |128             |0.252          |0.253                      |2:45:00             |1:30:00                         |183.33%                                         |
+|8          |128             |0.248          |0.249                      |1:20:00             |0:43:00                         |186.00%                                         | 
+
+##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
 
 Our results were obtained by running the `./examples/SSD300_FP{16,32}_{1,4,8}GPU.sh`
-script in the `pytorch-19.08-py3` NGC container on NVIDIA DGX-1 with 8x
-V100 16G GPUs. Performance numbers (in items/images per second) were averaged
-over an entire training epoch.
+script in the `pytorch-20.06-py3` NGC container on NVIDIA DGX-1 with 8x
+V100 16GB GPUs.
 
 |GPUs       |Batch size / GPU|Accuracy - FP32|Accuracy  - mixed precision|Time to train - FP32|Time to train  - mixed precision|Time to train speedup  (FP32 to mixed precision)|
 |-----------|----------------|---------------|---------------------------|--------------------|--------------------------------|------------------------------------------------|
 |1          |32              |0.250          |0.250                      |20:20:13            |10:23:46                        |195.62%                                         |
 |4          |32              |0.249          |0.250                      |5:11:17             |2:39:28                         |195.20%                                         |
-|8          |32              |0.250          |0.250                      |2:37:35             |1:25:38                         |184.01%                                         |
+|8          |32              |0.250          |0.250                      |2:37:00             |1:32:00                         |170.60%                                         |
 |1          |64              |<N/A>          |0.252                      |<N/A>               |9:27:33                         |215.00%                                         |
 |4          |64              |<N/A>          |0.251                      |<N/A>               |2:24:43                         |215.10%                                         |
-|8          |64              |<N/A>          |0.252                      |<N/A>               |1:13:01                         |215.85%                                         |
+|8          |64              |<N/A>          |0.252                      |<N/A>               |1:31:00                         |172.50%                                         |
+
+Because mixed precision uses less memory, models can be trained with bigger batches. In such cases, the mixed precision speedup is calculated against FP32 training with the maximum batch size for that precision.
 
-Here are example graphs of FP32 and FP16 training on 8 GPU configuration:
+##### Training loss plot
+
+Here are example graphs of FP32, TF32 and AMP training on an 8 GPU configuration:
 
 ![TrainingLoss](./img/training_loss.png)
 
-![ValidationAccuracy](./img/validation_accuracy.png)
+##### Training stability test
+
+The SSD300 v1.1 model was trained for 65 epochs, starting
+from 15 different initial random seeds. The training was performed in the `pytorch-20.06-py3` NGC container on
+NVIDIA DGX A100 8x A100 40GB GPUs with batch size per GPU = 128.
+After training, the models were evaluated on the test dataset. The following
+table summarizes the final mAP on the test set.
+
+|**Precision**|**Average mAP**|**Standard deviation**|**Minimum**|**Maximum**|**Median**|
+|------------:|--------------:|---------------------:|----------:|----------:|---------:|
+| AMP         | 0.2491314286  | 0.001498316675       | 0.24456   | 0.25182   | 0.24907  |
+| TF32        | 0.2489106667  | 0.001749463047       | 0.24487   | 0.25148   | 0.24848  |
 
 
 #### Training performance results
 
-##### NVIDIA DGX-1 (8x V100 16G)
+##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
 
 Our results were obtained by running the `main.py` script with the `--mode
-benchmark-training` flag in the `pytorch-19.08-py3` NGC container on NVIDIA
-DGX-1 with 8x V100 16G GPUs. Performance numbers (in items/images per second)
+benchmark-training` flag in the `pytorch-20.06-py3` NGC container on NVIDIA
+DGX A100 (8x A100 40GB) GPUs. Performance numbers (in items/images per second)
+were averaged over an entire training epoch.
+
+|GPUs       |Batch size / GPU|Throughput - TF32|Throughput  - mixed precision|Throughput speedup (TF32 - mixed precision)|Weak scaling - TF32             |Weak scaling  - mixed precision                 |
+|-----------|----------------|-----------------|-----------------------------|-------------------------------------------|--------------------------------|------------------------------------------------|
+|1          |64              |201.43           |367.15                       |182.27%                                    |100.00%                         |100.00%                                         |
+|4          |64              |791.50           |1,444.00                     |182.44%                                    |392.94%                         |393.30%                                         |
+|8          |64              |1,582.72         |2,872.48                     |181.49%                                    |785.74%                         |782.37%                                         |
+|1          |128             |206.28           |387.95                       |188.07%                                    |100.00%                         |100.00%                                         |
+|4          |128             |822.39           |1,530.15                     |186.06%                                    |398.68%                         |397.73%                                         |
+|8          |128             |1,647.00         |3,092.00                     |187.74%                                    |798.43%                         |773.00%                                         |
+
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
+
+##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
+
+Our results were obtained by running the `main.py` script with the `--mode
+benchmark-training` flag in the `pytorch-20.06-py3` NGC container on NVIDIA
+DGX-1 with 8x V100 16GB GPUs. Performance numbers (in items/images per second)
 were averaged over an entire training epoch.
 
 |GPUs       |Batch size / GPU|Throughput - FP32|Throughput  - mixed precision|Throughput speedup (FP32 - mixed precision)|Weak scaling - FP32             |Weak scaling  - mixed precision                 |
 |-----------|----------------|-----------------|-----------------------------|-------------------------------------------|--------------------------------|------------------------------------------------|
 |1          |32              |133.67           |215.30                       |161.07%                                    |100.00%                         |100.00%                                         |
 |4          |32              |532.05           |828.63                       |155.74%                                    |398.04%                         |384.88%                                         |
-|8          |32              |1,060.33         |1,647.74                     |155.40%                                    |793.27%                         |765.33%                                         |
+|8          |32              |820.70           |1,647.74                     |200.77%                                    |614.02%                         |802.00%                                         |
 |1          |64              |<N/A>            |232.22                       |173.73%                                    |<N/A>                           |100.00%                                         |
 |4          |64              |<N/A>            |910.77                       |171.18%                                    |<N/A>                           |392.20%                                         |
-|8          |64              |<N/A>            |1,769.48                     |166.88%                                    |<N/A>                           |761.99%                                         |
+|8          |64              |<N/A>            |1,728.00                     |210.55%                                    |<N/A>                           |761.99%                                         |
+
+Due to their smaller memory footprint, mixed precision models can be trained with larger batch sizes. In such cases, the mixed precision speedup is calculated against FP32 training with the maximum batch size for that precision.
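As an illustration, the derived columns in the tables above are simple ratios of the published throughputs (a sketch; last-digit differences can occur because the published numbers are themselves rounded):

```python
def pct(ratio):
    """Format a throughput ratio as a percentage, as in the tables above."""
    return f"{ratio * 100:.2f}%"

# Throughputs (images/sec) from the batch-size-32 rows of the DGX-1 table.
fp32_1gpu, amp_1gpu = 133.67, 215.30
fp32_4gpu, amp_4gpu = 532.05, 828.63

# Speedup column: mixed precision throughput versus FP32 at the same GPU count.
print(pct(amp_1gpu / fp32_1gpu))  # -> 161.07%
print(pct(amp_4gpu / fp32_4gpu))  # -> 155.74%

# Weak scaling column: throughput at N GPUs versus 1 GPU, same precision
# (may differ from the table in the last digit due to rounding).
print(pct(fp32_4gpu / fp32_1gpu))
```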
 
 To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
 
 #### Inference performance results
 
+##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
 
-##### NVIDIA DGX-1 (1x V100 16G)
+Our results were obtained by running the `main.py` script with the `--mode
+benchmark-inference` flag in the `pytorch-20.06-py3` NGC container on NVIDIA
+DGX A100 (1x A100 40GB) GPU.
+
+|Batch size |Throughput - TF32|Throughput  - mixed precision|Throughput speedup (TF32 - mixed precision)|Weak scaling - TF32 |Weak scaling  - mixed precision |
+|-----------|-----------------|-----------------------------|-------------------------------------------|--------------------|--------------------------------|
+|1          |113.51           |109.93                       |96.85%                                     |100.00%             |100.00%                         |
+|2          |203.07           |214.43                       |105.59%                                    |178.90%             |195.06%                         |
+|4          |338.76           |368.45                       |108.76%                                    |298.30%             |335.17%                         |
+|8          |485.65           |526.97                       |108.51%                                    |427.85%             |479.37%                         |
+|16         |493.64           |867.42                       |175.72%                                    |434.89%             |789.07%                         |
+|32         |548.75           |910.17                       |165.86%                                    |483.44%             |827.95%                         |
+
+To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
+
+##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
 
 Our results were obtained by running the `main.py` script with the `--mode
-benchmark-inference` flag in the pytorch-19.08-py3 NGC container on NVIDIA
-DGX-1 with (1x V100 16G) GPUs.
+benchmark-inference` flag in the `pytorch-20.06-py3` NGC container on NVIDIA
+DGX-1 with 1x V100 16GB GPU.
 
 |Batch size |Throughput - FP32|Throughput  - mixed precision|Throughput speedup (FP32 - mixed precision)|Weak scaling - FP32 |Weak scaling  - mixed precision |
 |-----------|-----------------|-----------------------------|-------------------------------------------|--------------------|--------------------------------|
-|2          |148.99           |186.60                       |125.24%                                    |100.00%             |100.00%                         |
-|4          |203.35           |326.69                       |160.66%                                    |136.48%             |175.08%                         |
-|8          |227.32           |433.45                       |190.68%                                    |152.57%             |232.29%                         |
-|16         |278.02           |493.19                       |177.39%                                    |186.60%             |264.31%                         |
-|32         |299.81           |545.84                       |182.06%                                    |201.23%             |292.53%                         |
+|1          |82.50            |80.50                        |97.58%                                     |100.00%             |100.00%                         |
+|2          |124.05           |147.46                       |118.87%                                    |150.36%             |183.18%                         |
+|4          |155.51           |255.16                       |164.08%                                    |188.50%             |316.97%                         |
+|8          |182.37           |334.94                       |183.66%                                    |221.05%             |416.07%                         |
+|16         |222.83           |358.25                       |160.77%                                    |270.10%             |445.03%                         |
+|32         |271.73           |438.85                       |161.50%                                    |329.37%             |545.16%                         |
 
 To achieve these same results, follow the [Quick Start Guide](#quick-start-guide) outlined above.
 
@@ -660,6 +746,11 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
 
 ### Changelog
 
+June 2020
+ * upgrade the PyTorch container to 20.06
+ * update performance tables to include A100 results
+ * update examples with A100 configs
+
 August 2019
  * upgrade the PyTorch container to 19.08
  * update Results section in the README

+ 4 - 0
PyTorch/Detection/SSD/examples/SSD300_A100_FP16_1GPU.sh

@@ -0,0 +1,4 @@
+# This script launches SSD300 training in FP16 on 1 GPU with a batch size of 256
+# Usage: bash SSD300_A100_FP16_1GPU.sh <path to this repository> <path to dataset> <additional flags>
+
+python $1/main.py --backbone resnet50 --warmup 300 --bs 256 --amp --data $2 ${@:3}

+ 4 - 0
PyTorch/Detection/SSD/examples/SSD300_A100_FP16_4GPU.sh

@@ -0,0 +1,4 @@
+# This script launches SSD300 training in FP16 on 4 GPUs with a total batch size of 1024 (256 per GPU)
+# Usage: ./SSD300_A100_FP16_4GPU.sh <path to this repository> <path to dataset> <additional flags>
+
+python -m torch.distributed.launch --nproc_per_node=4 $1/main.py --backbone resnet50 --learning-rate 2.7e-3 --warmup 1200 --bs 256 --amp --data $2 ${@:3}

+ 4 - 0
PyTorch/Detection/SSD/examples/SSD300_A100_FP16_8GPU.sh

@@ -0,0 +1,4 @@
+# This script launches SSD300 training in FP16 on 8 GPUs with a total batch size of 1024 (128 per GPU)
+# Usage: ./SSD300_A100_FP16_8GPU.sh <path to this repository> <path to dataset> <additional flags>
+
+python -m torch.distributed.launch --nproc_per_node=8 $1/main.py --backbone resnet50 --learning-rate 2.7e-3 --warmup 1200 --bs 128 --amp --data $2 ${@:3}

+ 4 - 0
PyTorch/Detection/SSD/examples/SSD300_A100_FP32_8GPU.sh

@@ -0,0 +1,4 @@
+# This script launches SSD300 training in FP32 on 8 GPUs with a total batch size of 1024 (128 per GPU)
+# Usage: ./SSD300_A100_FP32_8GPU.sh <path to this repository> <path to dataset> <additional flags>
+
+python -m torch.distributed.launch --nproc_per_node=8 $1/main.py --backbone resnet50 --learning-rate 2.7e-3 --warmup 1200 --bs 128 --data $2 ${@:3}
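The example scripts above all share the same argument-forwarding pattern: `$1` is the repository path, `$2` is the dataset path, and `${@:3}` passes every remaining flag straight through to `main.py`. A minimal sketch of this pattern (the `launch` function and the paths are hypothetical, for illustration only):

```shell
#!/bin/bash
# Hypothetical demonstration of the ${@:3} forwarding used by the example scripts.
launch() {
    local repo=$1 data=$2
    # "${@:3}" expands to every positional argument from the third onward.
    echo python "$repo/main.py" --backbone resnet50 --data "$data" "${@:3}"
}

launch /workspace/ssd /coco --amp --seed 42
# prints: python /workspace/ssd/main.py --backbone resnet50 --data /coco --amp --seed 42
```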

BIN=BIN
PyTorch/Detection/SSD/img/training_loss.png


BIN=BIN
PyTorch/Detection/SSD/img/validation_accuracy.png


+ 49 - 11
PyTorch/Detection/SSD/main.py

@@ -27,6 +27,9 @@ from src.evaluate import evaluate
 from src.train import train_loop, tencent_trick, load_checkpoint, benchmark_train_loop, benchmark_inference_loop
 from src.data import get_train_loader, get_val_dataset, get_val_dataloader, get_coco_ground_truth
 
+import dllogger as DLLogger
+
+
 # Apex imports
 try:
     from apex.parallel.LARC import LARC
@@ -72,8 +75,8 @@ def make_parser():
                         help='manually set random seed for torch')
     parser.add_argument('--checkpoint', type=str, default=None,
                         help='path to model checkpoint file')
-    parser.add_argument('--save', action='store_true',
-                        help='save model checkpoints')
+    parser.add_argument('--save', type=str, default=None,
+                        help='save model checkpoints in the specified directory')
     parser.add_argument('--mode', type=str, default='training',
                         choices=['training', 'evaluation', 'benchmark-training', 'benchmark-inference'])
     parser.add_argument('--evaluation', nargs='*', type=int, default=[21, 31, 37, 42, 48, 53, 59, 64],
@@ -89,7 +92,6 @@ def make_parser():
     parser.add_argument('--weight-decay', '--wd', type=float, default=0.0005,
                        help='weight decay argument for SGD optimizer')
 
-    parser.add_argument('--profile', type=int, default=None)
     parser.add_argument('--warmup', type=int, default=None)
     parser.add_argument('--benchmark-iterations', type=int, default=20, metavar='N',
                         help='Run N iterations while benchmarking (ignored when training and validation)')
@@ -104,10 +106,14 @@ def make_parser():
                              ' When it is not provided, pretrained model from torchvision'
                              ' will be downloaded.')
     parser.add_argument('--num-workers', type=int, default=4)
-    parser.add_argument('--amp', action='store_true')
+    parser.add_argument('--amp', action='store_true',
+                        help='Whether to enable AMP ops. When false, uses TF32 on A100 and FP32 on V100 GPUS.')
+    parser.add_argument('--json-summary', type=str, default=None,
+                        help='If provided, the json summary will be written to '
+                             'the specified file.')
 
     # Distributed
-    parser.add_argument('--local_rank', default=0, type=int,
+    parser.add_argument('--local_rank', default=os.getenv('LOCAL_RANK', 0), type=int,
                         help='Used for multi-process training. Can either be manually set ' +
                              'or automatically set by using \'python -m multiproc\'.')
 
@@ -222,29 +228,61 @@ def train(train_loop_func, logger, args):
                 obj['model'] = ssd300.module.state_dict()
             else:
                 obj['model'] = ssd300.state_dict()
-            torch.save(obj, './models/epoch_{}.pt'.format(epoch))
+            save_path = os.path.join(args.save, f'epoch_{epoch}.pt')
+            torch.save(obj, save_path)
+            logger.log('model path', save_path)
         train_loader.reset()
-    print('total training time: {}'.format(total_time))
-
+    DLLogger.log((), { 'total time': total_time })
+    logger.log_summary()
+
+
+def log_params(logger, args):
+    logger.log_params({
+        "dataset path": args.data,
+        "epochs": args.epochs,
+        "batch size": args.batch_size,
+        "eval batch size": args.eval_batch_size,
+        "no cuda": args.no_cuda,
+        "seed": args.seed,
+        "checkpoint path": args.checkpoint,
+        "mode": args.mode,
+        "eval on epochs": args.evaluation,
+        "lr decay epochs": args.multistep,
+        "learning rate": args.learning_rate,
+        "momentum": args.momentum,
+        "weight decay": args.weight_decay,
+        "lr warmup": args.warmup,
+        "backbone": args.backbone,
+        "backbone path": args.backbone_path,
+        "num workers": args.num_workers,
+        "AMP": args.amp,
+        "precision": 'amp' if args.amp else 'fp32',
+    })
 
 if __name__ == "__main__":
     parser = make_parser()
     args = parser.parse_args()
+    args.local_rank = int(os.environ.get('LOCAL_RANK', args.local_rank))
     if args.local_rank == 0:
         os.makedirs('./models', exist_ok=True)
 
     torch.backends.cudnn.benchmark = True
 
+    # write json only in the main process
+    args.json_summary = args.json_summary if args.local_rank == 0 else None
+
     if args.mode == 'benchmark-training':
         train_loop_func = benchmark_train_loop
-        logger = BenchLogger('Training benchmark')
+        logger = BenchLogger('Training benchmark', json_output=args.json_summary)
         args.epochs = 1
     elif args.mode == 'benchmark-inference':
         train_loop_func = benchmark_inference_loop
-        logger = BenchLogger('Inference benchmark')
+        logger = BenchLogger('Inference benchmark', json_output=args.json_summary)
         args.epochs = 1
     else:
         train_loop_func = train_loop
-        logger = Logger('Training logger', print_freq=1)
+        logger = Logger('Training logger', print_freq=1, json_output=args.json_summary)
+
+    log_params(logger, args)
 
     train(train_loop_func, logger, args)

+ 6 - 6
PyTorch/Detection/SSD/src/coco_pipeline.py

@@ -187,7 +187,7 @@ class DALICOCOIterator(object):
             for j in range(len(bboxes)):
                 bboxes_shape.append([])
                 for k in range(len(bboxes[j])):
-                    bboxes_shape[j].append(bboxes[j].at(k).shape())
+                    bboxes_shape[j].append(bboxes[j][k].shape())
 
             # Prepare labels shapes and offsets
             labels_shape = []
@@ -198,14 +198,14 @@ class DALICOCOIterator(object):
                 labels_shape.append([])
                 bbox_offsets.append([0])
                 for k in range(len(labels[j])):
-                    lshape = labels[j].at(k).shape()
+                    lshape = labels[j][k].shape()
                     bbox_offsets[j].append(bbox_offsets[j][k] + lshape[0])
                     labels_shape[j].append(lshape)
 
             # We always need to alocate new memory as bboxes and labels varies in shape
             images_torch_type = to_torch_type[np.dtype(images[0].dtype())]
-            bboxes_torch_type = to_torch_type[np.dtype(bboxes[0].at(0).dtype())]
-            labels_torch_type = to_torch_type[np.dtype(labels[0].at(0).dtype())]
+            bboxes_torch_type = to_torch_type[np.dtype(bboxes[0][0].dtype())]
+            labels_torch_type = to_torch_type[np.dtype(labels[0][0].dtype())]
 
             torch_gpu_device = torch.device('cuda', dev_id)
             torch_cpu_device = torch.device('cpu')
@@ -224,13 +224,13 @@ class DALICOCOIterator(object):
             for j, b_list in enumerate(bboxes):
                 for k in range(len(b_list)):
                     if (pyt_bboxes[j][k].shape[0] != 0):
-                        feed_ndarray(b_list.at(k), pyt_bboxes[j][k])
+                        feed_ndarray(b_list[k], pyt_bboxes[j][k])
                 pyt_bboxes[j] = torch.cat(pyt_bboxes[j])
 
             for j, l_list in enumerate(labels):
                 for k in range(len(l_list)):
                     if (pyt_labels[j][k].shape[0] != 0):
-                        feed_ndarray(l_list.at(k), pyt_labels[j][k])
+                        feed_ndarray(l_list[k], pyt_labels[j][k])
                 pyt_labels[j] = torch.cat(pyt_labels[j]).squeeze(dim=1)
 
             for j in range(len(pyt_offsets)):

+ 53 - 8
PyTorch/Detection/SSD/src/logger.py

@@ -15,6 +15,7 @@
 import math
 import numpy as np
 
+import dllogger as DLLogger
 
 class EpochMeter:
     def __init__(self, name):
@@ -53,26 +54,63 @@ class IterationAverageMeter:
 
 
 class Logger:
-    def __init__(self, name, print_freq=20):
+    def __init__(self, name, json_output=None, print_freq=20):
         self.name = name
         self.train_loss_logger = IterationAverageMeter("Training loss")
         self.train_epoch_time_logger = EpochMeter("Training 1 epoch time")
         self.val_acc_logger = EpochMeter("Validation accuracy")
         self.print_freq = print_freq
 
+        backends = [ DLLogger.StdOutBackend(DLLogger.Verbosity.DEFAULT) ]
+        if json_output:
+            backends.append(DLLogger.JSONStreamBackend(DLLogger.Verbosity.VERBOSE, json_output))
+
+        DLLogger.init(backends)
+
+        self.epoch = 0
+        self.train_iter = 0
+        self.summary = {}
+
+    def step(self):
+        return (
+            self.epoch,
+            self.train_iter,
+        )
+
+    def log_params(self, data):
+        DLLogger.log("PARAMETER", data)
+        DLLogger.flush()
+
+    def log(self, key, value):
+        DLLogger.log(self.step(), { key: value })
+        DLLogger.flush()
+
+    def add_to_summary(self, data):
+        for key, value in data.items():
+            self.summary[key] = value
+
+    def log_summary(self):
+        DLLogger.log((), self.summary)
+
     def update_iter(self, epoch, iteration, loss):
+        self.train_iter = iteration
         self.train_loss_logger.update_iter(loss)
         if iteration % self.print_freq == 0:
-            print('epoch: {}\titeraion: {}\tloss: {}'.format(epoch, iteration, loss))
+            self.log('loss', loss)
 
     def update_epoch(self, epoch, acc):
+        self.epoch = epoch
         self.train_loss_logger.update_epoch(epoch)
         self.val_acc_logger.update(epoch, acc)
-        print('epoch: {}\tmAP accuracy: {}'.format(epoch, acc))
+
+        data = { 'mAP': acc }
+        self.add_to_summary(data)
+        DLLogger.log((self.epoch,), data)
 
     def update_epoch_time(self, epoch, time):
+        self.epoch = epoch
         self.train_epoch_time_logger.update(epoch, time)
-        print('epoch: {}\ttime: {}'.format(epoch, time))
+        DLLogger.log((self.epoch,), { 'time': time })
 
     def print_results(self):
         return self.train_loss_logger.data, self.val_acc_logger.data, self.train_epoch_time_logger
@@ -94,9 +132,8 @@ class BenchmarkMeter:
 
 
 class BenchLogger(Logger):
-    def __init__(self, name):
-        super().__init__(name)
-        self.name = name
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
         self.images_per_ses = BenchmarkMeter(self.name)
 
     def update(self, bs, time):
@@ -106,8 +143,16 @@ class BenchLogger(Logger):
         total_bs = self.images_per_ses.total_images
         total_time = self.images_per_ses.total_time
         avr = self.images_per_ses.avr_images_per_second
-        med = np.median(self.images_per_ses.data)
 
+        data = np.array(self.images_per_ses.data)
+        med = np.median(data)
+
+        DLLogger.log((), {
+            'avg_img/sec': avr,
+            'med_img/sec': med,
+            'min_img/sec': np.min(data),
+            'max_img/sec': np.max(data),
+        })
         print("Done benchmarking. Total images: {}\ttotal time: {:.3f}\tAverage images/sec: {:.3f}\tMedian images/sec: {:.3f}".format(
             total_bs,
             total_time,

+ 6 - 2
PyTorch/Detection/SSD/src/train.py

@@ -84,6 +84,7 @@ def benchmark_train_loop(model, loss_func, epoch, optim, train_dataloader, val_d
     result = torch.zeros((1,)).cuda()
     for i, data in enumerate(loop(train_dataloader)):
         if i >= args.benchmark_warmup:
+            torch.cuda.synchronize()
             start_time = time.time()
 
         img = data[0][0][0]
@@ -144,6 +145,7 @@ def benchmark_train_loop(model, loss_func, epoch, optim, train_dataloader, val_d
             break
 
         if i >= args.benchmark_warmup:
+            torch.cuda.synchronize()
             logger.update(args.batch_size, time.time() - start_time)
 
 
@@ -155,10 +157,12 @@ def benchmark_train_loop(model, loss_func, epoch, optim, train_dataloader, val_d
 
 
 
-def loop(dataloader):
+def loop(dataloader, reset=True):
     while True:
         for data in dataloader:
             yield data
+        if reset:
+            dataloader.reset()
 
 def benchmark_inference_loop(model, loss_func, epoch, optim, train_dataloader, val_dataloader, encoder, iteration, logger, args, mean, std):
     assert args.N_gpu == 1, 'Inference benchmark only on 1 gpu'
@@ -166,7 +170,7 @@ def benchmark_inference_loop(model, loss_func, epoch, optim, train_dataloader, v
     model.eval()
 
     i = -1
-    val_datas = loop(val_dataloader)
+    val_datas = loop(val_dataloader, False)
 
     while True:
         i += 1

+ 1 - 1
PyTorch/Detection/SSD/src/utils.py

@@ -257,7 +257,7 @@ class DefaultBoxes(object):
                     cx, cy = (j+0.5)/fk[idx], (i+0.5)/fk[idx]
                     self.default_boxes.append((cx, cy, w, h))
 
-        self.dboxes = torch.tensor(self.default_boxes)
+        self.dboxes = torch.tensor(self.default_boxes, dtype=torch.float)
         self.dboxes.clamp_(min=0, max=1)
         # For IoU calculation
         self.dboxes_ltrb = self.dboxes.clone()

+ 10 - 8
TensorFlow/Detection/SSD/Dockerfile

@@ -1,14 +1,16 @@
-FROM nvcr.io/nvidia/tensorflow:19.05-py3 as base
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.06-tf1-py3
+FROM ${FROM_IMAGE_NAME}
 
-FROM base as sha
 
-RUN mkdir /sha
-RUN cat `cat HEAD | cut -d' ' -f2` > /sha/repo_sha
-
-FROM base as final
 
 WORKDIR /workdir
 
+RUN export DEBIAN_FRONTEND=noninteractive \
+ && apt-get update \
+ && apt-get install -y --no-install-recommends \
+        libpmi2-0-dev \
+ && rm -rf /var/lib/apt/lists/*
+
 RUN PROTOC_VERSION=3.0.0 && \
     PROTOC_ZIP=protoc-${PROTOC_VERSION}-linux-x86_64.zip && \
     curl -OL https://github.com/google/protobuf/releases/download/v$PROTOC_VERSION/$PROTOC_ZIP && \
@@ -18,6 +20,7 @@ RUN PROTOC_VERSION=3.0.0 && \
 COPY requirements.txt .
 RUN pip install Cython
 RUN pip install -r requirements.txt
+RUN pip --no-cache-dir --no-cache install 'git+https://github.com/NVIDIA/dllogger'
 
 WORKDIR models/research/
 COPY models/research/ .
@@ -26,6 +29,5 @@ ENV PYTHONPATH="/workdir/models/research/:/workdir/models/research/slim/:$PYTHON
 
 COPY examples/ examples
 COPY configs/ configs/
+COPY qa/ qa/
 COPY download_all.sh download_all.sh
-
-COPY --from=sha /sha .

+ 316 - 34
TensorFlow/Detection/SSD/README.md

@@ -4,11 +4,20 @@ This repository provides a script and recipe to train SSD320 v1.2 to achieve sta
 
 ## Table Of Contents
 * [Model overview](#model-overview)
+  * [Model architecture](#model-architecture)
   * [Default configuration](#default-configuration)
+  * [Feature support matrix](#feature-support-matrix)
+    * [Features](#features)
+  * [Mixed precision training](#mixed-precision-training)
+    * [Enabling mixed precision](#enabling-mixed-precision)
+    * [Enabling TF32](#enabling-tf32)
+  * [Glossary](#glossary)
 * [Setup](#setup)
   * [Requirements](#requirements)
 * [Quick Start Guide](#quick-start-guide)
 * [Advanced](#advanced)
+  * [Scripts and sample code](#scripts-and-sample-code)
+  * [Parameters](#parameters)
   * [Command line options](#command-line-options)
   * [Getting the data](#getting-the-data)
   * [Training process](#training-process)
@@ -21,15 +30,24 @@ This repository provides a script and recipe to train SSD320 v1.2 to achieve sta
     * [Inference performance benchmark](#inference-performance-benchmark)
   * [Results](#results)
     * [Training accuracy results](#training-accuracy-results)
+      * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb) 
+      * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
     * [Training performance results](#training-performance-results)
+      * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb) 
+      * [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
     * [Inference performance results](#inference-performance-results)
+      * [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-1x-a100-40gb)
+      * [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
+      * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
 * [Release notes](#release-notes)
   * [Changelog](#changelog)
   * [Known issues](#known-issues)
 
 ## Model overview
 
-The SSD320 v1.2 model is based on the [SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325) paper, which describes SSD as “a method for detecting objects in images using a single deep neural network”.
+The SSD320 v1.2 model is based on the [SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325) paper, which describes SSD as "a method for detecting objects in images using a single deep neural network".
+This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 1.5x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
+
+### Model architecture
 
 Our implementation is based on the existing [model from the TensorFlow models repository](https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync.config).
 The network was altered in order to improve accuracy and increase throughput. Changes include:
@@ -38,15 +56,6 @@ The network was altered in order to improve accuracy and increase throughput. Ch
 - Replacing the original hard negative mining loss function with [Focal Loss](https://arxiv.org/pdf/1708.02002.pdf).
 - Decreasing the input size to 320 x 320.
 
-This model trains with mixed precision tensor cores on NVIDIA Volta GPUs, therefore you can get results much faster than training without tensor cores. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
-
-The following features were implemented in this model:
-- Data-parallel multi-GPU training with Horovod.
-- Mixed precision support with TensorFlow Automatic Mixed Precision (TF-AMP), which enables mixed precision training without any changes to the code-base by performing automatic graph rewrites and loss scaling controlled by an environmental variable.
-- Tensor Core operations to maximize throughput using NVIDIA Volta GPUs.
-- Dynamic loss scaling for tensor cores (mixed precision) training.
-
-Because of these enhancements, the SSD320 v1.2 model achieves higher accuracy.
 
 ### Default configuration
 We trained the model for 12500 steps (27 epochs) with the following setup:
@@ -58,6 +67,110 @@ We trained the model for 12500 steps (27 epochs) with the following setup:
 - Batch size per GPU = 32
 - Number of GPUs = 8
 
+### Feature support matrix
+
+The following features are supported by this model:
+
+| **Feature** | **SSD320 v1.2** |
+|:------------|-------------------:|
+|[Automatic mixed precision (AMP)](https://nvidia.github.io/apex/amp.html) | Yes |
+|[Horovod Multi-GPU (NCCL)](https://github.com/horovod/horovod) | Yes |
+
+#### Features
+
+[TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) - a 
+tool that enables Tensor Core-accelerated training. Refer to the [Enabling
+mixed precision](#enabling-mixed-precision) section for more details.
+
+[Horovod](https://github.com/horovod/horovod) - Horovod 
+is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet.
+The goal of Horovod is to make distributed deep learning fast and easy to use.
+For more information about how to get started with Horovod, see the [Horovod:
+Official repository](https://github.com/horovod/horovod).
+
+[Multi-GPU training with Horovod](https://github.com/horovod/horovod/#usage) - our model 
+uses Horovod to implement efficient multi-GPU training with NCCL. For details,
+see example sources in this repository or see the [TensorFlow
+tutorial](https://github.com/horovod/horovod/#usage).
+
+### Mixed precision training
+
+Mixed precision is the combined use of different numerical precisions in a
+computational method.
+[Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant
+computational speedup by performing operations in half-precision format while
+storing minimal information in single-precision to retain as much information
+as possible in critical parts of the network. Since the introduction of [Tensor
+Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the
+Turing and Ampere architectures, significant training speedups are experienced by switching to
+mixed precision -- up to 3x overall speedup on the most arithmetically intense
+model architectures. Using mixed precision training previously required two
+steps:
+
+1.  Porting the model to use the FP16 data type where appropriate.    
+2.  Adding loss scaling to preserve small gradient values.
+
+This can now be achieved using Automatic Mixed Precision (AMP) for TensorFlow to enable the full
+[mixed precision methodology](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#tensorflow)
+in your existing TensorFlow model code.  AMP enables mixed precision training on Volta and Turing GPUs automatically.
+The TensorFlow framework code makes all necessary model changes internally.
+
+In TF-AMP, the computational graph is optimized to use as few casts as necessary and maximize the use of FP16,
+and the loss scaling is automatically applied inside of supported optimizers. AMP can be configured to work
+with the existing tf.contrib loss scaling manager by disabling the AMP scaling with a single environment
+variable to perform only the automatic mixed-precision optimization. It accomplishes this by automatically
+rewriting all computation graphs with the necessary operations to enable mixed precision training and automatic loss scaling.
+
+For information about:
+
+* How to train using mixed precision, see the [Mixed Precision
+  Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed
+  Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html)
+  documentation.
+* Techniques used for mixed precision training, see the [Mixed-Precision
+  Training of Deep Neural
+  Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/)
+  blog.
+* How to access and enable AMP for TensorFlow, see [Using
+  TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp)
+  from the TensorFlow User Guide. 
+
+#### Enabling mixed precision
+
+Mixed precision is enabled in TensorFlow by using the Automatic Mixed Precision (TF-AMP)
+extension which casts variables to half-precision upon retrieval, while storing variables
+in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation,
+a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling)
+step must be included when applying gradients. In TensorFlow, loss scaling can be applied statically by
+using simple multiplication of loss by a constant value or automatically, by TF-AMP. Automatic mixed
+precision makes all the adjustments internally in TensorFlow, providing two benefits over manual operations.
+First, programmers need not modify network model code, reducing development and maintenance effort.
+Second, using AMP maintains forward and backward compatibility with all the APIs for defining and running TensorFlow models.
+
+To enable mixed precision, set the following environment variables inside your training script:
+- Enable TF-AMP graph rewrite:
+  ```
+  os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
+  ```
+  
+- Enable Automated Mixed Precision:
+  ```
+  os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
+  ```
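Both variables take effect only if they are set before TensorFlow constructs any graphs. As a minimal, self-contained sketch (not part of the repository scripts), a training script could set them at its very top, ahead of the TensorFlow import:

```python
import os

# Setting these before TensorFlow is imported is the safest way to
# ensure the automatic graph rewrite and loss scaling are applied.
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"

print(os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"])  # -> 1
```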
+
+#### Enabling TF32
+
+TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. 
+
+TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
+
+For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
+
+TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
+
+
 ## Setup
 
 The following section lists the requirements to start training the SSD320 v1.2 model.
@@ -65,8 +178,12 @@ The following section list the requirements in order to start training the SSD32
 ### Requirements
 This repository contains `Dockerfile` which extends the TensorFlow NGC container and encapsulates some dependencies.  Aside from these dependencies, ensure you have the following software:
 * [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [TensorFlow 19.03-py3](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) (or later) NGC container
-* [NVIDIA Volta based GPU](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
+* [TensorFlow 20.06-py3](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) (or later) NGC container
+* GPU-based architecture:
+    * [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
+    * [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/)
+    * [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
+
 
 For more information about how to get started with NGC containers, see the
 following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
@@ -77,7 +194,7 @@ Documentation:
 
 
 ## Quick Start Guide
-To train your model using mixed precision with tensor cores or using FP32,
+To train your model using mixed precision with Tensor Cores, or using TF32 or FP32,
 perform the following steps using the default parameters of the SSD320 v1.2 model on the
 [COCO 2017](http://cocodataset.org/#download) dataset.
 
@@ -165,6 +282,52 @@ bash examples/SSD320_evaluate.sh <path to checkpoint>
 
 The following sections provide greater details of the dataset, running training and inference, and the training results.
 
+### Scripts and sample code
+
+* `Dockerfile`: a container with the basic set of dependencies to run SSD
+
+In the `models/research/object_detection` directory, the most important files are:
+
+* `model_main.py`: serves as the entry point to launch the training and inference
+* `models/ssd_resnet_v1_fpn_feature_extractor.py`: implementation of the model
+* `metrics/coco_tools.py`: implementation of mAP metric
+* `utils/exp_utils.py`: utility functions for running training and benchmarking
+
+### Parameters
+
+The complete list of available parameters for the `models/research/object_detection/model_main.py` script is as follows:
+
+```
+./object_detection/model_main.py:
+  --[no]allow_xla: Enable XLA compilation
+    (default: 'false')
+  --checkpoint_dir: Path to directory holding a checkpoint.  If `checkpoint_dir` is provided, this binary operates in
+    eval-only mode, writing resulting metrics to `model_dir`.
+  --eval_count: How many times the evaluation should be run
+    (default: '1')
+    (an integer)
+  --[no]eval_training_data: If training data should be evaluated for this job. Note that one can only use this in eval-
+    only mode, and `checkpoint_dir` must be supplied.
+    (default: 'false')
+  --hparams_overrides: Hyperparameter overrides, represented as a string containing comma-separated hparam_name=value
+    pairs.
+  --model_dir: Path to output model directory where event and checkpoint files will be written.
+  --num_train_steps: Number of train steps.
+    (an integer)
+  --pipeline_config_path: Path to pipeline config file.
+  --raport_file: Path to the dllogger JSON report file
+    (default: 'summary.json')
+  --[no]run_once: If running in eval-only mode, whether to run just one round of eval vs running continuously (default).
+    (default: 'false')
+  --sample_1_of_n_eval_examples: Will sample one of every n eval input examples, where n is provided.
+    (default: '1')
+    (an integer)
+  --sample_1_of_n_eval_on_train_examples: Will sample one of every n train input examples for evaluation, where n is
+    provided. This is only used if `eval_training_data` is True.
+    (default: '5')
+    (an integer)
+```
+
 ### Command line options
 The SSD model training is conducted by `model_main.py`, a script from the object_detection library.
 Our experiments were done with settings described in the `examples` directory.
@@ -275,38 +438,76 @@ bash examples/SSD320_FP16_inference.sh --help
 The following sections provide details on how we achieved our performance and accuracy in training and inference.
 
 #### Training accuracy results
-Our results were obtained by running the `./examples/SSD320_FP{16,32}_{1,4,8}GPU.sh` script in the TensorFlow-19.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
+
+##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
+
+Our results were obtained by running the `./examples/SSD320_FP{16,32}_{1,4,8}GPU.sh` script in the TensorFlow-20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
+
+All the results are obtained with batch size set to 32.
+
+| **Number of GPUs** | **Mixed precision mAP** | **Training time with mixed precision** | **TF32 mAP** | **Training time with TF32** |
+|:------------------:|:-----------------------:|:--------------------------------------:|:------------:|:---------------------------:|
+| 1                  | 0.279                   | 4h 48min                               | 0.280        | 6h 40min                   |
+| 4                  | 0.280                   | 1h 20min                               | 0.279        | 1h 53min                    |
+| 8                  | 0.281                   | 0h 53min                               | 0.282        | 1h 05min                    |
+
+##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
+
+Our results were obtained by running the `./examples/SSD320_FP{16,32}_{1,4,8}GPU.sh` script in the TensorFlow-20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
 All the results are obtained with batch size set to 32.
 
 | **Number of GPUs** | **Mixed precision mAP** | **Training time with mixed precision** | **FP32 mAP** | **Training time with FP32** |
 |:------------------:|:-----------------------:|:--------------------------------------:|:------------:|:---------------------------:|
-| 1                  | 0.276                   | 7h 17min                               | 0.278        | 10h 20min                   |
-| 4                  | 0.277                   | 2h 15min                               | 0.275        | 2h 53min                    |
-| 8                  | 0.269                   | 1h 19min                               | 0.268        | 1h 37min                    |
+| 1                  | 0.279                   | 7h 36min                               | 0.278        | 10h 38min                   |
+| 4                  | 0.277                   | 2h 18min                               | 0.279        | 2h 58min                    |
+| 8                  | 0.280                   | 1h 28min                               | 0.282        | 1h 55min                    |
 
 
-Here are example graphs of FP32 and FP16 training on 8 GPU configuration:
+Here are example graphs of TF32, FP32 and FP16 training on an 8 GPU configuration:
 
 ![TrainingLoss](./img/training_loss.png)
 
-![ValidationAccuracy](./img/validation_accuracy.png)
-
 #### Training performance results
 
+##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
+
+Our results were obtained by running:
+
+```
+bash examples/SSD320_FP*GPU_BENCHMARK.sh
+```
+
+scripts in the TensorFlow-20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
+
+
+| **Number of GPUs** | **Batch size per GPU** | **Mixed precision img/s** | **TF32 img/s** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with mixed precision** | **Multi-gpu weak scaling with TF32** |
+|:------------------:|:----------------------:|:-------------------------:|:--------------:|:---------------------------------:|:-----------------------------------------------:|:------------------------------------:|
+| 1                  | 32                     |  180.55                   |  123.48        | 1.46                              | 1.00                                            | 1.00                                 |
+| 4                  | 32                     |  624.35                   |  449.17        | 1.39                              | 3.46                                            | 3.64                                 |
+| 8                  | 32                     |  1008.46                  |  779.96        | 1.29                              | 5.59                                            | 6.32                                 |
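
The derived columns follow directly from the raw img/s numbers; a minimal sketch of the arithmetic, using the DGX A100 values from the table above:

```python
# Sketch of how the speed-up and weak-scaling columns are computed
# from the measured img/s values (copied from the DGX A100 table).
amp  = {1: 180.55, 4: 624.35, 8: 1008.46}   # mixed precision img/s
tf32 = {1: 123.48, 4: 449.17, 8: 779.96}    # TF32 img/s

for gpus in (1, 4, 8):
    speedup   = amp[gpus] / tf32[gpus]      # mixed precision vs TF32
    weak_amp  = amp[gpus] / amp[1]          # weak scaling, mixed precision
    weak_tf32 = tf32[gpus] / tf32[1]        # weak scaling, TF32
    print(f"{gpus} GPUs: speed-up {speedup:.2f}, "
          f"scaling {weak_amp:.2f} (AMP) / {weak_tf32:.2f} (TF32)")
```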
+
+To achieve the same results, follow the [Quick start guide](#quick-start-guide) outlined above.
+
+These results can be improved when [XLA](https://www.tensorflow.org/xla) is used
+in conjunction with mixed precision, delivering up to a 2x speedup over FP32 on a single GPU (~179 img/s).
+However, XLA is still considered experimental.
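
For reference, the `*_BENCHMARK.sh` scripts convert the `global_step/sec` values logged by the estimator into img/s by averaging the last quarter of the run (to skip warm-up) and multiplying by the global batch size; a rough Python equivalent of the one-liner inside those scripts:

```python
# Sketch of the img/s computation performed by the *_BENCHMARK.sh scripts:
# keep only the tail (last quarter) of the logged `global_step/sec` values
# and scale by the per-GPU batch size (32) times the number of GPUs.
def imgs_per_sec(steps_per_sec, batch_size=32, gpus=1):
    tail = steps_per_sec[int(len(steps_per_sec) * 3 / 4):]
    return batch_size * gpus * sum(tail) / len(tail)

print(imgs_per_sec([3.0, 3.5, 4.0, 4.0], gpus=8))  # averages only the tail
```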
+
+##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
+
 Our results were obtained by running:
 
 ```
 bash examples/SSD320_FP*GPU_BENCHMARK.sh
 ```
 
-scripts in the TensorFlow-19.03-py3 NGC container on NVIDIA DGX-1 with V100 16G GPUs. 
+scripts in the TensorFlow-20.06-py3 NGC container on NVIDIA DGX-1 with V100 16G GPUs. 
 
 
 | **Number of GPUs** | **Batch size per GPU** | **Mixed precision img/s** | **FP32 img/s** | **Speed-up with mixed precision** | **Multi-gpu weak scaling with mixed precision** | **Multi-gpu weak scaling with FP32** |
 |:------------------:|:----------------------:|:-------------------------:|:--------------:|:---------------------------------:|:-----------------------------------------------:|:------------------------------------:|
-| 1                  | 32                     |  124.97                   |   87.87        | 1.42                              | 1.00                                            | 1.00                                 |
-| 4                  | 32                     |  430.79                   |  330.35        | 1.39                              | 3.45                                            | 3.76                                 |
-| 8                  | 32                     |  752.04                   |  569.01        | 1.32                              | 6.02                                            | 6.48                                 |
+| 1                  | 32                     |  127.96                   |   84.96        | 1.51                              | 1.00                                            | 1.00                                 |
+| 4                  | 32                     |  396.38                   |  283.30        | 1.40                              | 3.10                                            | 3.33                                 |
+| 8                  | 32                     |  676.83                   |  501.30        | 1.35                              | 5.29                                            | 5.90                                 |
 
 To achieve the same results, follow the [Quick start guide](#quick-start-guide) outlined above.
 
@@ -316,25 +517,106 @@ However XLA is still considered experimental.
 
 #### Inference performance results
 
-Our results were obtained by running the `examples/SSD320_FP{16,32}_inference.sh` script in the TensorFlow-19.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPUs.
+##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
+
+Our results were obtained by running the `examples/SSD320_FP{16,32}_inference.sh` script in the TensorFlow-20.06-py3 NGC container on NVIDIA DGX A100 (1x A100 40GB) GPU.
 
+FP16
 
-| **Batch size** | **Mixed precision img/s** | **FP32 img/s** |
-|:--------------:|:-------------------------:|:--------------:|
-|              1 |                    93.37  |        97.29   |
-|              2 |                   135.33  |       134.04   |
-|              4 |                   171.70  |       163.38   |
-|              8 |                   189.25  |       174.47   |
-|             16 |                   187.62  |       175.42   |
-|             32 |                   187.37  |       175.07   |
-|             64 |                   191.40  |       177.75   |
+| **Batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** |**Latency 95%** |**Latency 99%** |
+|------------|----------------|-------|-------|-------|-------|
+|          1 |          40.88 | 24.46 | 25.76 | 26.47 | 27.91 |
+|          2 |          49.26 | 40.60 | 42.09 | 42.61 | 45.26 |
+|          4 |          58.81 | 68.01 | 73.12 | 76.02 | 80.38 |
+|          8 |          69.13 |115.73 |121.58 |123.87 |129.00 |
+|         16 |          78.10 |204.85 |212.40 |216.38 |225.80 |
+|         32 |          76.19 |420.00 |437.24 |443.21 |479.80 |
+|         64 |          77.92 |821.37 |840.82 |867.62 |1204.64|
+
+TF32
+
+| **Batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** |**Latency 95%** |**Latency 99%** |
+|------------|----------------|-------|-------|-------|-------|
+|          1 |          36.93 | 27.08 | 29.10 | 29.89 | 32.24 |
+|          2 |          44.03 | 45.42 | 48.67 | 49.56 | 51.12 |
+|          4 |          54.65 | 73.20 | 77.50 | 78.89 | 85.81 |
+|          8 |          62.96 |127.06 |137.04 |141.64 |152.92 |
+|         16 |          71.48 |223.83 |231.36 |233.35 |247.51 |
+|         32 |          73.11 |437.71 |450.86 |455.14 |467.11 |
+|         64 |          73.74 |867.88 |898.99 |912.07 |1077.13|
+
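The latency and throughput columns in these tables are produced by `examples/SSD320_inference.py`, which records per-batch times and reports percentiles; in outline (an illustrative summary of that script's arithmetic, not a verbatim excerpt):

```python
import numpy as np

# Outline of how the inference benchmark summarizes recorded per-batch
# times: average throughput in img/s plus latency percentiles in ms.
def summarize(times_s, batch_size, percentiles=(90, 95, 99)):
    times = np.asarray(times_s, dtype=np.float64)
    lat_ms = 1000.0 * times
    summary = {
        'infer_throughput': float(np.mean(batch_size / times)),
        'eval_avg_latency': float(np.mean(lat_ms)),
    }
    for p in percentiles:
        summary[f'eval_{p}%_latency'] = float(np.percentile(lat_ms, p))
    return summary
```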
+To achieve the same results, follow the [Quick start guide](#quick-start-guide) outlined above.
+
+##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
+
+Our results were obtained by running the `examples/SSD320_FP{16,32}_inference.sh` script in the TensorFlow-20.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU.
+
+FP16
+
+| **Batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** |**Latency 95%** |**Latency 99%** |
+|------------|----------------|-------|-------|-------|-------|
+|          1 |          28.34 | 35.29 | 38.09 | 39.06 | 41.07 |
+|          2 |          41.21 | 48.54 | 52.77 | 54.45 | 57.10 |
+|          4 |          55.41 | 72.19 | 75.44 | 76.99 | 84.15 |
+|          8 |          61.83 |129.39 |133.37 |136.89 |145.69 |
+|         16 |          66.36 |241.12 |246.05 |249.47 |259.79 |
+|         32 |          65.01 |492.21 |510.01 |516.45 |526.83 |
+|         64 |          64.75 |988.47 |1012.11|1026.19|1290.54|
+
+FP32
+
+| **Batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** |**Latency 95%** |**Latency 99%** |
+|------------|----------------|-------|-------|-------|-------|
+|          1 |          29.15 | 34.31 | 36.26 | 37.63 | 39.95 |
+|          2 |          41.20 | 48.54 | 53.08 | 54.47 | 57.32 |
+|          4 |          50.72 | 78.86 | 82.49 | 84.08 | 92.15 |
+|          8 |          55.72 |143.57 |147.20 |148.92 |152.44 |
+|         16 |          59.41 |269.32 |278.30 |281.06 |286.54 |
+|         32 |          59.81 |534.99 |542.49 |551.58 |572.16 |
+|         64 |          58.93 |1085.96|1111.20|1118.21|1253.74|
+
+To achieve the same results, follow the [Quick start guide](#quick-start-guide) outlined above.
+
+
+##### Inference performance: NVIDIA T4
+
+Our results were obtained by running the `examples/SSD320_FP{16,32}_inference.sh` script in the TensorFlow-20.06-py3 NGC container on NVIDIA T4.
+
+FP16
+
+| **Batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** |**Latency 95%** |**Latency 99%** |
+|------------|----------------|-------|-------|-------|-------|
+|          1 |          19.29 | 51.90 | 53.77 | 54.95 | 59.21 |
+|          2 |          30.36 | 66.04 | 70.13 | 71.49 | 73.97 |
+|          4 |          37.71 |106.21 |111.32 |113.04 |118.03 |
+|          8 |          40.95 |195.49 |201.66 |204.00 |210.32 |
+|         16 |          41.04 |390.05 |399.73 |402.88 |410.02 |
+|         32 |          40.36 |794.48 |815.81 |825.39 |841.45 |
+|         64 |          40.27 |1590.98|1631.00|1642.22|1838.95|
+
+FP32
+
+| **Batch size** | **Throughput Avg** | **Latency Avg** | **Latency 90%** |**Latency 95%** |**Latency 99%** |
+|------------|----------------|-------|-------|-------|-------|
+|          1 |          14.30 | 69.99 | 72.30 | 73.29 | 76.35 |
+|          2 |          20.04 | 99.87 |104.50 |106.03 |108.15 |
+|          4 |          25.01 |159.99 |163.00 |164.13 |168.63 |
+|          8 |          28.42 |281.58 |286.57 |289.01 |294.37 |
+|         16 |          32.56 |492.08 |501.98 |505.29 |509.95 |
+|         32 |          34.14 |939.11 |961.35 |968.26 |983.77 |
+|         64 |          33.47 |1915.36|1971.90|1992.24|2030.54|
 
 To achieve the same results, follow the [Quick start guide](#quick-start-guide) outlined above.
 
+
+
 ## Release notes
 
 ### Changelog
 
+June 2020
+ * Updated performance tables to include A100 results
+
 March 2019
  * Initial release
 

+ 1 - 2
TensorFlow/Detection/SSD/examples/SSD320_FP16_1GPU.sh

@@ -15,8 +15,6 @@
 CKPT_DIR=${1:-"/results/SSD320_FP16_1GPU"}
 PIPELINE_CONFIG_PATH=${2:-"/workdir/models/research/configs"}"/ssd320_full_1gpus.config"
 
-export TF_ENABLE_AUTO_MIXED_PRECISION=1
-
 TENSOR_OPS=0
 export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
 export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
@@ -26,4 +24,5 @@ time python -u ./object_detection/model_main.py \
        --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
        --model_dir=${CKPT_DIR} \
        --alsologtostder \
+       --amp \
        "${@:3}"

+ 1 - 2
TensorFlow/Detection/SSD/examples/SSD320_FP16_1GPU_BENCHMARK.sh

@@ -16,8 +16,6 @@ CKPT_DIR=${1:-"/results/SSD320_FP16_1GPU"}
 PIPELINE_CONFIG_PATH=${2:-"/workdir/models/research/configs"}"/ssd320_bench.config"
 GPUS=1
 
-export TF_ENABLE_AUTO_MIXED_PRECISION=1
-
 TENSOR_OPS=0
 export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
 export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
@@ -27,6 +25,7 @@ TRAIN_LOG=$(python -u ./object_detection/model_main.py \
        --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
        --model_dir=${CKPT_DIR} \
        --alsologtostder \
+       --amp \
        "${@:3}" 2>&1)
 PERF=$(echo "$TRAIN_LOG" | sed -n 's|.*global_step/sec: \(\S\+\).*|\1|p' | python -c "import sys; x = sys.stdin.readlines(); x = [float(a) for a in x[int(len(x)*3/4):]]; print(32*$GPUS*sum(x)/len(x), 'img/s')")
 

+ 1 - 2
TensorFlow/Detection/SSD/examples/SSD320_FP16_4GPU.sh

@@ -16,8 +16,6 @@ CKPT_DIR=${1:-"/results/SSD320_FP16_4GPU"}
 PIPELINE_CONFIG_PATH=${2:-"/workdir/models/research/configs"}"/ssd320_full_4gpus.config"
 GPUS=4
 
-export TF_ENABLE_AUTO_MIXED_PRECISION=1
-
 TENSOR_OPS=0
 export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
 export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
@@ -37,4 +35,5 @@ time mpirun --allow-run-as-root \
                --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
                --model_dir=${CKPT_DIR} \
                --alsologtostder \
+               --amp \
                "${@:3}"

+ 1 - 2
TensorFlow/Detection/SSD/examples/SSD320_FP16_4GPU_BENCHMARK.sh

@@ -16,8 +16,6 @@ CKPT_DIR=${1:-"/results/SSD320_FP16_4GPU"}
 PIPELINE_CONFIG_PATH=${2:-"/workdir/models/research/configs"}"/ssd320_bench.config"
 GPUS=4
 
-export TF_ENABLE_AUTO_MIXED_PRECISION=1
-
 TENSOR_OPS=0
 export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
 export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
@@ -37,6 +35,7 @@ TRAIN_LOG=$(mpirun --allow-run-as-root \
                --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
                --model_dir=${CKPT_DIR} \
                --alsologtostder \
+               --amp \
                "${@:3}" 2>&1)
 PERF=$(echo "$TRAIN_LOG" | sed -n 's|.*global_step/sec: \(\S\+\).*|\1|p' | python -c "import sys; x = sys.stdin.readlines(); x = [float(a) for a in x[int(len(x)*3/4):]]; print(32*$GPUS*sum(x)/len(x), 'img/s')")
 

+ 1 - 2
TensorFlow/Detection/SSD/examples/SSD320_FP16_8GPU.sh

@@ -16,8 +16,6 @@ CKPT_DIR=${1:-"/results/SSD320_FP16_8GPU"}
 PIPELINE_CONFIG_PATH=${2:-"/workdir/models/research/configs"}"/ssd320_full_8gpus.config"
 GPUS=8
 
-export TF_ENABLE_AUTO_MIXED_PRECISION=1
-
 TENSOR_OPS=0
 export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
 export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
@@ -39,4 +37,5 @@ time mpirun --allow-run-as-root \
                --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
                --model_dir=${CKPT_DIR} \
                --alsologtostder \
+               --amp \
                "${@:3}" 2>&1 | tee $CKPT_DIR/train_log

+ 1 - 2
TensorFlow/Detection/SSD/examples/SSD320_FP16_8GPU_BENCHMARK.sh

@@ -16,8 +16,6 @@ CKPT_DIR=${1:-"/results/SSD320_FP16_8GPU"}
 PIPELINE_CONFIG_PATH=${2:-"/workdir/models/research/configs"}"/ssd320_bench.config"
 GPUS=8
 
-export TF_ENABLE_AUTO_MIXED_PRECISION=1
-
 TENSOR_OPS=0
 export TF_ENABLE_CUBLAS_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
 export TF_ENABLE_CUDNN_TENSOR_OP_MATH_FP32=${TENSOR_OPS}
@@ -37,6 +35,7 @@ TRAIN_LOG=$(mpirun --allow-run-as-root \
                --pipeline_config_path=${PIPELINE_CONFIG_PATH} \
                --model_dir=${CKPT_DIR} \
                --alsologtostder \
+               --amp \
                "${@:3}" 2>&1)
 PERF=$(echo "$TRAIN_LOG" | sed -n 's|.*global_step/sec: \(\S\+\).*|\1|p' | python -c "import sys; x = sys.stdin.readlines(); x = [float(a) for a in x[int(len(x)*3/4):]]; print(32*$GPUS*sum(x)/len(x), 'img/s')")
 

+ 24 - 4
TensorFlow/Detection/SSD/examples/SSD320_inference.py

@@ -19,17 +19,26 @@ from absl import flags
 from time import time
 
 import tensorflow as tf
+import dllogger
 
 from object_detection import model_hparams
 from object_detection import model_lib
+from object_detection.utils.exp_utils import setup_dllogger
 
+import numpy as np
 
 flags.DEFINE_string('checkpoint_dir', None, 'Path to directory holding a checkpoint.  If '
                     '`checkpoint_dir` is not provided, benchmark is running on random model')
 flags.DEFINE_string('pipeline_config_path', None, 'Path to pipeline config file.')
+flags.DEFINE_string("raport_file", default="summary.json",
+                    help="Path to the dllogger JSON report file")
 flags.DEFINE_integer('warmup_iters', 100, 'Number of iterations skipped during benchmark')
 flags.DEFINE_integer('benchmark_iters', 300, 'Number of iterations measured by benchmark')
 flags.DEFINE_integer('batch_size', 1, 'Number of inputs processed in parallel')
+flags.DEFINE_list("percentiles", default=['90', '95', '99'],
+                  help="percentiles for latency confidence intervals")
+
+
 FLAGS = flags.FLAGS
 
 flags.mark_flag_as_required('pipeline_config_path')
@@ -58,6 +67,7 @@ def build_benchmark_input_fn(input_fn):
 class TimingHook(tf.train.SessionRunHook):
     def __init__(self):
         super(TimingHook, self).__init__()
+        setup_dllogger(enabled=True, filename=FLAGS.raport_file)
         self.times = []
 
     def before_run(self, *args, **kwargs):
@@ -73,13 +83,23 @@ class TimingHook(tf.train.SessionRunHook):
         self.times.append(time() - self.start_time)
         self.log_progress()
 
-    def collect_result(self):
-        return FLAGS.batch_size * FLAGS.benchmark_iters / sum(self.times[FLAGS.benchmark_iters:])
-
     def end(self, *args, **kwargs):
         super(TimingHook, self).end(*args, **kwargs)
+        throughput = sum([1/x for x in self.times[FLAGS.warmup_iters:]]) * FLAGS.batch_size / FLAGS.benchmark_iters
+        latency_avg = 1000 * sum(self.times[FLAGS.warmup_iters:]) / FLAGS.benchmark_iters
+        latency_data = 1000 * np.array(self.times[FLAGS.warmup_iters:])
+        summary = {
+            'infer_throughput': throughput,
+            'eval_avg_latency': latency_avg
+        }
         print()
-        print('Benchmark result:', self.collect_result(), 'img/s')
+        print('Benchmark result:', throughput, 'img/s')
+        for p in FLAGS.percentiles:
+            p = int(p)
+            tf.logging.info("Latency {}%: {:>4.2f} ms".format(
+                p, np.percentile(latency_data, p)))
+            summary[f'eval_{p}%_latency'] = np.percentile(latency_data, p)
+        dllogger.log(step=tuple(), data=summary)
 
 
 def main(unused_argv):

BIN=BIN
TensorFlow/Detection/SSD/img/training_loss.png


BIN=BIN
TensorFlow/Detection/SSD/img/validation_accuracy.png


+ 3 - 0
TensorFlow/Detection/SSD/models/research/object_detection/builders/dataset_builder.py

@@ -74,6 +74,9 @@ def read_dataset(file_read_func, input_files, config):
   """
   # Shard, shuffle, and read files.
   filenames = tf.gfile.Glob(input_files)
+  if not filenames:
+      raise ValueError('Invalid input path specified in '
+                       '`input_reader_config`.')
   num_readers = config.num_readers
   if num_readers > len(filenames):
     num_readers = len(filenames)

+ 3 - 0
TensorFlow/Detection/SSD/models/research/object_detection/metrics/coco_tools.py

@@ -42,6 +42,7 @@ then evaluation (in multi-class mode) can be invoked as follows:
 from collections import OrderedDict
 import copy
 import time
+import dllogger
 import numpy as np
 
 from pycocotools import coco
@@ -251,6 +252,8 @@ class COCOEvalWrapper(cocoeval.COCOeval):
         ('Recall/AR@100 (medium)', self.stats[10]),
         ('Recall/AR@100 (large)', self.stats[11])
     ])
+    dllogger.log(step=tuple(), data=summary_metrics)
+
     if not include_metrics_per_category:
       return summary_metrics, {}
     if not hasattr(self, 'category_stats'):

+ 3 - 1
TensorFlow/Detection/SSD/models/research/object_detection/model_lib.py

@@ -567,6 +567,7 @@ def create_estimator_and_inputs(run_config,
     'predict_input_fn': A prediction input function.
     'train_steps': Number of training steps. Either directly from input or from
       configuration.
+    'train_batch_size': train batch size per GPU
   """
   get_configs_from_pipeline_file = MODEL_BUILD_UTIL_MAP[
       'get_configs_from_pipeline_file']
@@ -666,7 +667,8 @@ def create_estimator_and_inputs(run_config,
       eval_input_names=eval_input_names,
       eval_on_train_input_fn=eval_on_train_input_fn,
       predict_input_fn=predict_input_fn,
-      train_steps=train_steps)
+      train_steps=train_steps,
+      train_batch_size=train_config.batch_size)
 
 
 def create_train_and_eval_specs(train_input_fn,

+ 41 - 2
TensorFlow/Detection/SSD/models/research/object_detection/model_main.py

@@ -36,15 +36,21 @@ from absl import flags
 
 import tensorflow as tf
 import horovod.tensorflow as hvd
+import dllogger
+import time
+import os
 
 from object_detection import model_hparams
 from object_detection import model_lib
+from object_detection.utils.exp_utils import AverageMeter, setup_dllogger
 
 flags.DEFINE_string(
     'model_dir', None, 'Path to output model directory '
     'where event and checkpoint files will be written.')
 flags.DEFINE_string('pipeline_config_path', None, 'Path to pipeline config '
                     'file.')
+flags.DEFINE_string("raport_file", default="summary.json",
+                    help="Path to the dllogger JSON report file")
 flags.DEFINE_integer('num_train_steps', None, 'Number of train steps.')
 flags.DEFINE_boolean('eval_training_data', False,
                      'If training data should be evaluated for this job. Note '
@@ -67,15 +73,48 @@ flags.DEFINE_string(
     'writing resulting metrics to `model_dir`.')
 flags.DEFINE_boolean(
     'allow_xla', False, 'Enable XLA compilation')
+flags.DEFINE_boolean(
+    'amp', False, 'Whether to enable AMP ops. When false, uses TF32 on A100 and FP32 on V100 GPUs.')
 flags.DEFINE_boolean(
     'run_once', False, 'If running in eval-only mode, whether to run just '
     'one round of eval vs running continuously (default).'
 )
 FLAGS = flags.FLAGS
 
+class DLLoggerHook(tf.estimator.SessionRunHook):
+  def __init__(self, global_batch_size, rank=-1):
+    self.global_batch_size = global_batch_size
+    self.rank = rank
+    setup_dllogger(enabled=True, filename=FLAGS.raport_file, rank=rank)
+
+  def after_create_session(self, session, coord):
+    self.meters = {}
+    warmup = 100
+    self.meters['train_throughput'] = AverageMeter(warmup=warmup)
+
+  def before_run(self, run_context):
+    self.t0 = time.time()
+    return tf.estimator.SessionRunArgs(fetches=['global_step:0', 'learning_rate:0'])
+
+  def after_run(self, run_context, run_values):
+    throughput = self.global_batch_size/(time.time() - self.t0)
+    global_step, lr = run_values.results
+    self.meters['train_throughput'].update(throughput)
+
+  def end(self, session):
+    summary = {
+      'train_throughput': self.meters['train_throughput'].avg,
+    }
+    dllogger.log(step=tuple(), data=summary)
+
+
 
 def main(unused_argv):
   tf.logging.set_verbosity(tf.logging.INFO)
+  if FLAGS.amp:
+      os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "1"
+  else:
+      os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "0"
 
   hvd.init()
 
@@ -130,9 +169,9 @@ def main(unused_argv):
         train_steps,
         eval_on_train_data=False)
 
-    train_hooks = [hvd.BroadcastGlobalVariablesHook(0)]
+    train_hooks = [hvd.BroadcastGlobalVariablesHook(0), DLLoggerHook(hvd.size()*train_and_eval_dict['train_batch_size'], hvd.rank())]
     eval_hooks = []
-    
+
     for x in range(FLAGS.eval_count):
         estimator.train(train_input_fn,
                         hooks=train_hooks,

+ 56 - 0
TensorFlow/Detection/SSD/models/research/object_detection/utils/exp_utils.py

@@ -0,0 +1,56 @@
+# Copyright (c) 2020 NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#       http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import dllogger
+import os
+
+class AverageMeter:
+    """
+    Computes and stores the average and current value
+    """
+    def __init__(self, warmup=0, keep=False):
+        self.reset()
+        self.warmup = warmup
+        self.keep = keep
+
+    def reset(self):
+        self.val = 0
+        self.avg = 0
+        self.sum = 0
+        self.count = 0
+        self.iters = 0
+        self.vals = []
+
+    def update(self, val, n=1):
+        self.iters += 1
+        self.val = val
+
+        if self.iters > self.warmup:
+            self.sum += val * n
+            self.count += n
+            self.avg = self.sum / self.count
+            if self.keep:
+                self.vals.append(val)
+
+def setup_dllogger(enabled=True, filename=os.devnull, rank=0):
+    if enabled and rank == 0:
+        backends = [
+            dllogger.JSONStreamBackend(
+                dllogger.Verbosity.VERBOSE,
+                filename,
+                ),
+            ]
+        dllogger.init(backends)
+    else:
+        dllogger.init([])