
[ResNet50/PaddlePaddle] Initial Release

Ming Huang · 3 years ago
commit 76da8548da
33 files changed, 3749 insertions and 0 deletions
  1. PaddlePaddle/Classification/RN50v1.5/Dockerfile (+8, -0)
  2. PaddlePaddle/Classification/RN50v1.5/README.md (+950, -0)
  3. PaddlePaddle/Classification/RN50v1.5/dali.py (+238, -0)
  4. PaddlePaddle/Classification/RN50v1.5/export_model.py (+75, -0)
  5. PaddlePaddle/Classification/RN50v1.5/img/loss.png (BIN)
  6. PaddlePaddle/Classification/RN50v1.5/img/top1.png (BIN)
  7. PaddlePaddle/Classification/RN50v1.5/img/top5.png (BIN)
  8. PaddlePaddle/Classification/RN50v1.5/inference.py (+213, -0)
  9. PaddlePaddle/Classification/RN50v1.5/lr_scheduler.py (+77, -0)
  10. PaddlePaddle/Classification/RN50v1.5/models/__init__.py (+15, -0)
  11. PaddlePaddle/Classification/RN50v1.5/models/resnet.py (+222, -0)
  12. PaddlePaddle/Classification/RN50v1.5/optimizer.py (+64, -0)
  13. PaddlePaddle/Classification/RN50v1.5/profile.py (+69, -0)
  14. PaddlePaddle/Classification/RN50v1.5/program.py (+449, -0)
  15. PaddlePaddle/Classification/RN50v1.5/requirements.txt (+1, -0)
  16. PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_AMP.sh (+21, -0)
  17. PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_TF32.sh (+19, -0)
  18. PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_AMP.sh (+21, -0)
  19. PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_TF32.sh (+21, -0)
  20. PaddlePaddle/Classification/RN50v1.5/scripts/nsys_profiling.sh (+47, -0)
  21. PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_90E_DGXA100.sh (+20, -0)
  22. PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_ASP_90E_DGXA100.sh (+26, -0)
  23. PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_TF32_90E_DGXA100.sh (+15, -0)
  24. PaddlePaddle/Classification/RN50v1.5/train.py (+167, -0)
  25. PaddlePaddle/Classification/RN50v1.5/utils/__init__.py (+13, -0)
  26. PaddlePaddle/Classification/RN50v1.5/utils/affinity.py (+214, -0)
  27. PaddlePaddle/Classification/RN50v1.5/utils/config.py (+424, -0)
  28. PaddlePaddle/Classification/RN50v1.5/utils/cuda_bind.py (+39, -0)
  29. PaddlePaddle/Classification/RN50v1.5/utils/logger.py (+59, -0)
  30. PaddlePaddle/Classification/RN50v1.5/utils/misc.py (+47, -0)
  31. PaddlePaddle/Classification/RN50v1.5/utils/mode.py (+26, -0)
  32. PaddlePaddle/Classification/RN50v1.5/utils/save_load.py (+164, -0)
  33. PaddlePaddle/Classification/RN50v1.5/utils/utility.py (+25, -0)

+ 8 - 0
PaddlePaddle/Classification/RN50v1.5/Dockerfile

@@ -0,0 +1,8 @@
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/paddlepaddle:22.05-py3
+FROM ${FROM_IMAGE_NAME}
+
+ADD requirements.txt /workspace/
+WORKDIR /workspace/
+RUN pip install --no-cache-dir -r requirements.txt
+ADD . /workspace/rn50
+WORKDIR /workspace/rn50

+ 950 - 0
PaddlePaddle/Classification/RN50v1.5/README.md

@@ -0,0 +1,950 @@
+# ResNet50 v1.5 For PaddlePaddle
+
+This repository provides a script and recipe to train the ResNet50 model to
+achieve state-of-the-art accuracy. The content of this repository is tested and maintained by NVIDIA.
+
+## Table Of Contents
+
+* [Model overview](#model-overview)
+  * [Default configuration](#default-configuration)
+    * [Optimizer](#optimizer)
+    * [Data augmentation](#data-augmentation)
+  * [DALI](#dali)
+  * [Feature support matrix](#feature-support-matrix)
+    * [Features](#features)
+  * [Mixed precision training](#mixed-precision-training)
+    * [Enabling mixed precision](#enabling-mixed-precision)
+    * [Enabling TF32](#enabling-tf32)
+  * [Automatic SParsity](#automatic-sparsity)
+    * [Enable Automatic SParsity](#enable-automatic-sparsity)
+* [Setup](#setup)
+  * [Requirements](#requirements)
+* [Quick Start Guide](#quick-start-guide)
+* [Advanced](#advanced)
+  * [Scripts and sample code](#scripts-and-sample-code)
+  * [Command-line options](#command-line-options)
+  * [Dataset guidelines](#dataset-guidelines)
+  * [Training process](#training-process)
+  * [Automatic SParsity training process](#automatic-sparsity-training-process)
+  * [Inference process](#inference-process)
+* [Performance](#performance)
+  * [Benchmarking](#benchmarking)
+    * [Training performance benchmark](#training-performance-benchmark)
+    * [Inference performance benchmark](#inference-performance-benchmark)
+  * [Results](#results)
+    * [Training accuracy results](#training-accuracy-results)
+      * [Training accuracy: NVIDIA DGX A100 (8x A100 80GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-80gb)
+      * [Example plots](#example-plots)
+      * [Accuracy recovering of Automatic SParsity: NVIDIA DGX A100 (8x A100 80GB)](#accuracy-recovering-of-automatic-sparsity-nvidia-dgx-a100-8x-a100-80gb)
+    * [Training performance results](#training-performance-results)
+      * [Training performance: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-nvidia-dgx-a100-8x-a100-80gb)
+      * [Training performance of Automatic SParsity: NVIDIA DGX A100 (8x A100 80GB)](#training-performance-of-automatic-sparsity-nvidia-dgx-a100-8x-a100-80gb)
+    * [Inference performance results](#inference-performance-results)
+      * [Inference performance: NVIDIA DGX A100 (1x A100 80GB)](#inference-performance-nvidia-dgx-a100-1x-a100-80gb)
+    * [Paddle-TRT performance results](#paddle-trt-performance-results)
+      * [Paddle-TRT performance: NVIDIA DGX A100 (1x A100 80GB)](#paddle-trt-performance-nvidia-dgx-a100-1x-a100-80gb)
+      * [Paddle-TRT performance: NVIDIA A30 (1x A30 24GB)](#paddle-trt-performance-nvidia-a30-1x-a30-24gb)
+      * [Paddle-TRT performance: NVIDIA A10 (1x A10 24GB)](#paddle-trt-performance-nvidia-a10-1x-a10-24gb)
+* [Release notes](#release-notes)
+  * [Changelog](#changelog)
+  * [Known issues](#known-issues)
+
+## Model overview
+The ResNet50 v1.5 model is a modified version of the [original ResNet50 v1 model](https://arxiv.org/abs/1512.03385).
+
+The difference between v1 and v1.5 is that in the bottleneck blocks that require
+downsampling, v1 has stride = 2 in the first 1x1 convolution, whereas v1.5 has stride = 2 in the 3x3 convolution.
+
+This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1 but comes with a small performance drawback (~5% imgs/sec).
+
+The model is initialized as described in [Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification](https://arxiv.org/pdf/1502.01852.pdf)
+
+This model is trained with mixed precision using Tensor Cores on the NVIDIA Ampere GPU architecture. Therefore, researchers can get results over 2x faster than training without Tensor Cores while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
+
+
+### Default configuration
+
+The following sections highlight the default configurations for the ResNet50 model.
+
+#### Optimizer
+
+This model uses the SGD optimizer with momentum and the following hyperparameters:
+
+* Momentum (0.875)
+* Learning rate (LR) = 0.256 for a global batch size of 256; for other batch sizes, the learning rate is scaled
+linearly. For example, the default LR is 2.048 for a global batch size of 2048 on 8x A100 (256 batch size per GPU).
+* Learning rate schedule - we use cosine LR schedule
+* Linear warmup of the learning rate during the first 5 epochs according to [Training ImageNet in 1 hour](https://arxiv.org/abs/1706.02677).
+* Weight decay (WD)= 3.0517578125e-05 (1/32768).
+* We do not apply WD on Batch Norm trainable parameters (gamma/bias)
+* Label smoothing = 0.1
+* We train for: 
+    * 50 Epochs -> configuration that reaches 75.9% top1 accuracy 
+    * 90 Epochs -> configuration that reaches 76.9% top1 accuracy (90 epochs is a standard for ImageNet networks)
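The schedule above (linear LR scaling, 5-epoch linear warmup, cosine decay) can be sketched in plain Python. This is an illustrative function built from the stated hyperparameters, not the repository's `lr_scheduler.py`:

```python
import math

def learning_rate(epoch, base_lr=0.256, base_batch=256, batch_size=2048,
                  warmup_epochs=5, total_epochs=90):
    """Linearly scaled LR with 5-epoch linear warmup followed by cosine decay."""
    peak_lr = base_lr * batch_size / base_batch  # linear scaling rule
    if epoch < warmup_epochs:
        # Linear warmup from 0 toward peak_lr
        return peak_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

print(learning_rate(0))   # first warmup epoch: 0.4096
print(learning_rate(4))   # end of warmup, peak LR: 2.048 for global batch 2048
print(learning_rate(89))  # near zero at the end of training
```

With the default 8x A100 setup (global batch 2048), the peak LR matches the 2.048 quoted above.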
+
+
+#### Data augmentation
+
+This model uses the following data augmentation:
+
+* For training:
+  * Normalization
+  * Random resized crop to 224x224
+    * Scale from 8% to 100%
+    * Aspect ratio from 3/4 to 4/3
+  * Random horizontal flip
+* For inference:
+  * Normalization
+  * Scale to 256x256
+  * Center crop to 224x224
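The random resized crop above first samples an area fraction (8% to 100%) and an aspect ratio (3/4 to 4/3), then derives the crop size. A minimal stdlib-only sketch of that sampling logic follows; it illustrates the idea only, the actual pipeline is implemented with DALI in `dali.py`:

```python
import math
import random

def sample_crop(height, width, scale=(0.08, 1.0), ratio=(3/4, 4/3), attempts=10):
    """Sample a crop (h, w) whose area fraction and aspect ratio fall in the given ranges."""
    area = height * width
    for _ in range(attempts):
        target_area = random.uniform(*scale) * area
        # Sample the aspect ratio uniformly in log space
        aspect = math.exp(random.uniform(math.log(ratio[0]), math.log(ratio[1])))
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            return h, w
    # Fallback: square crop of the shorter side
    side = min(height, width)
    return side, side

random.seed(0)
print(sample_crop(500, 375))  # a crop that is then resized to 224x224
```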
+
+#### Other training recipes
+
+This script does not target any specific benchmark.
+There are changes that others have made which can speed up convergence and/or increase accuracy.
+
+One of the more popular training recipes is provided by [fast.ai](https://github.com/fastai/imagenet-fast).
+
+The fast.ai recipe introduces many changes to the training procedure, one of which is progressive resizing of the training images.
+
+The first part of training uses 128px images, the middle part uses 224px images, and the last part uses 288px images.
+The final validation is performed on 288px images.
+
+The training script in this repository performs validation on 224px images, just like the original paper described.
+
+These two approaches can't be directly compared, since the fast.ai recipe requires validation on 288px images,
+and this recipe keeps the original assumption that validation is done on 224px images.
+
+Using 288px images means that more FLOPs are needed during inference to reach the same accuracy.
+
+
+
+### Feature support matrix
+
+This model supports the following features:
+
+| Feature               | ResNet50 |
+|-----------------------|----------|
+|[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html)   | Yes |
+|[Paddle AMP](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/01_paddle2.0_introduction/basic_concept/amp_en.html) | Yes |
+|[Paddle ASP](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/static/sparsity/decorate_en.html) | Yes |
+|[Paddle-TRT](https://github.com/PaddlePaddle/Paddle-Inference-Demo/blob/master/docs/optimize/paddle_trt_en.rst) | Yes |
+
+#### Features
+
+- NVIDIA DALI - DALI is a library accelerating the data preparation pipeline. To accelerate your input pipeline, you only need to define your data loader
+with the DALI library. For more information about DALI, refer to the [DALI product documentation](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html).
+
+- Paddle AMP is a PaddlePaddle built-in module that provides functions to construct AMP workflow. The details can be found in [Automatic Mixed Precision (AMP)](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/01_paddle2.0_introduction/basic_concept/amp_en.html#automatic-mixed-precision-training), which requires minimal network code changes to leverage Tensor Cores performance. Refer to the [Enabling mixed precision](#enabling-mixed-precision) section for more details.
+
+- Paddle ASP is a PaddlePaddle built-in module that provides functions to enable the automatic sparsity workflow with only a few inserted lines of code. The full APIs can be found in [Paddle.static.sparsity](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/static/sparsity/calculate_density_en.html). Paddle ASP currently supports static graph mode only (dynamic graph support is under development). Refer to the [Enable Automatic SParsity](#enable-automatic-sparsity) section for more details.
+
+- Paddle-TRT is a PaddlePaddle inference integration with [TensorRT](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html). It selects subgraph to be accelerated by TensorRT, while leaving the rest of the operations to be executed natively by PaddlePaddle. Refer to the [Inference with TensorRT](#inference-with-tensorrt) section for more details.
+
+### DALI
+
+We use [NVIDIA DALI](https://github.com/NVIDIA/DALI),
+which speeds up data loading when the CPU becomes a bottleneck.
+DALI can use CPU or GPU and outperforms the PaddlePaddle native data loader.
+
+Currently, DALI is the only supported data loader.
+
+
+### Mixed precision training
+
+Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in NVIDIA Volta, and following with both the NVIDIA Turing and NVIDIA Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
+1.  Porting the model to use the FP16 data type where appropriate.
+2.  Adding loss scaling to preserve small gradient values.
+
+The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
+
+For information about:
+-   How to train using mixed precision in PaddlePaddle, refer to the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Automatic Mixed Precision Training](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/01_paddle2.0_introduction/basic_concept/amp_en.html#automatic-mixed-precision-training) documentation.
+-   Techniques used for mixed precision training, refer to the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
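The loss-scaling step mentioned above can be illustrated with a small simulation of the dynamic strategy: shrink the scale when gradients overflow, and cautiously grow it back after a run of clean steps. This toy class only mirrors the idea behind dynamic loss scaling; it is not PaddlePaddle's `GradScaler`:

```python
import math

class DynamicLossScaler:
    """Toy dynamic loss scaler: halve on overflow, double after N clean steps."""
    def __init__(self, init_scale=2.0**15, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, grads):
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            self.scale /= 2.0          # skip this step and shrink the scale
            self._clean_steps = 0
        else:
            self._clean_steps += 1
            if self._clean_steps == self.growth_interval:
                self.scale *= 2.0      # grow back cautiously
                self._clean_steps = 0
        return not overflow            # whether the optimizer should step

scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
print(scaler.update([0.1, float('inf')]), scaler.scale)  # False 4.0 (overflow)
print(scaler.update([0.1, 0.2]), scaler.scale)           # True 4.0
print(scaler.update([0.3, 0.4]), scaler.scale)           # True 8.0 (grown back)
```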
+
+
+#### Enabling mixed precision
+
+Mixed precision is enabled in PaddlePaddle by using Automatic Mixed Precision (AMP), which performs eligible operations in half precision
+while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients.
+In PaddlePaddle, loss scaling can be easily applied by passing in arguments to [GradScaler()](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/amp/GradScaler_en.html). The scaling value to be used can be dynamic or fixed.
+
+For an in-depth walkthrough of AMP, check out the sample usage [here](https://www.paddlepaddle.org.cn/documentation/docs/en/guides/01_paddle2.0_introduction/basic_concept/amp_en.html). AMP requires minimal network code changes to leverage Tensor Cores performance.
+
+
+Code example to enable mixed precision for static graph:
+- Use `paddle.static.amp.decorate` to wrap optimizer
+  ```python
+  import paddle.static.amp as amp
+  mp_optimizer = amp.decorate(optimizer=optimizer, init_loss_scaling=8.0)
+  ```
+- Minimize `loss` and get `scaled_loss`, which is useful when you need a customized loss.
+  ```python
+  ops, param_grads = mp_optimizer.minimize(loss)
+  scaled_loss = mp_optimizer.get_scaled_loss()
+  ```
+- For distributed training, it is recommended to use Fleet to enable amp, which is a unified API for distributed training of PaddlePaddle. For more information, refer to [Fleet](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/distributed/fleet/Fleet_en.html#fleet)
+
+  ```python
+  import paddle.distributed.fleet as fleet
+  strategy = fleet.DistributedStrategy()
+  strategy.amp = True  # by default this is False
+  optimizer = fleet.distributed_optimizer(optimizer, strategy=strategy)
+  ```
+
+#### Enabling TF32
+
+TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. 
+
+TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require a high dynamic range for weights or activations.
+
+For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
+
+TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
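Numerically, TF32 keeps the 8-bit exponent (and thus the range) of FP32 but only 10 explicit mantissa bits, like FP16. The stdlib-only sketch below emulates that reduced precision by truncating the low mantissa bits of an FP32 value; the hardware rounds rather than truncates, so this is illustrative only:

```python
import struct

def to_tf32(x):
    """Approximate TF32 precision: FP32 exponent range, 10 explicit mantissa bits."""
    # Reinterpret the FP32 bit pattern as an unsigned 32-bit integer
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    # Drop the low 13 of the 23 FP32 mantissa bits (truncation)
    bits &= ~((1 << 13) - 1)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

print(to_tf32(1.0))  # exactly representable: 1.0
print(to_tf32(1/3))  # 0.333251953125 -- only about 3 decimal digits survive
```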
+
+### Automatic SParsity
+
+Automatic SParsity (ASP) provides a workflow to convert dense deep learning models to 2:4 structured sparse models, which allows inference to leverage the Sparse Tensor Cores introduced in the NVIDIA Ampere architecture to theoretically reach a 2x speedup and save almost 50% of memory usage. The ASP workflow generally includes two steps:
+- Prune a well-trained dense model to 2:4 sparsity.
+- Retrain the sparse model with the same hyperparameters to recover accuracy.
+
+For more information, refer to
+- [GTC 2020: Accelerating Sparsity in the NVIDIA Ampere Architecture.](https://developer.nvidia.com/gtc/2020/video/s22085#)
+- Mishra, Asit, et al. "Accelerating Sparse Deep Neural Networks." arXiv preprint arXiv:2104.08378 (2021).
+- Pool, Jeff, and Chong Yu. "Channel Permutations for N: M Sparsity." Advances in Neural Information Processing Systems 34 (2021).
+
+#### Enable Automatic SParsity
+There is a built-in module in PaddlePaddle to enable ASP training, which only requires inserting a couple of lines into the original codebase: [optimizer decoration](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/static/sparsity/decorate_en.html) and [model pruning](https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/static/sparsity/prune_model_en.html).
+```python
+optimizer = sparsity.decorate(optimizer)
+...
+sparsity.prune_model(main_program)
+```
+Moreover, ASP is also compatible with mixed precision training.
+
+Note that currently ASP only supports static graphs (Dynamic graph support is under development).
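For intuition, a 2:4 structured-sparse tensor keeps at most two nonzero values in every group of four. The stdlib-only sketch below mimics a `mask_1d`-style selection by zeroing the two smallest-magnitude entries per group; it is an illustration of the pattern, not Paddle's pruning code:

```python
def prune_2_4(weights):
    """Zero the two smallest-magnitude entries in each group of four."""
    assert len(weights) % 4 == 0
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest magnitudes in this group of four
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

print(prune_2_4([0.5, -0.1, 0.05, -0.9, 0.2, 0.3, -0.25, 0.01]))
# [0.5, 0.0, 0.0, -0.9, 0.0, 0.3, -0.25, 0.0]
```

Each group of four retains exactly two values, which is what lets Sparse Tensor Cores skip half of the multiply-accumulates.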
+
+
+## Setup
+
+The following section lists the requirements you need to meet to start training the ResNet50 model.
+
+### Requirements
+This repository contains a Dockerfile that extends the PaddlePaddle NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
+
+* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+* [PaddlePaddle 22.05-py3 NGC container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/paddlepaddle) or newer
+* Supported GPUs:
+    * [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
+
+For more information about how to get started with NGC containers, refer to the
+following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
+DGX Documentation:
+* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
+* [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
+
+
+## Quick Start Guide
+
+### 1. Clone the repository.
+```bash
+git clone https://github.com/NVIDIA/DeepLearningExamples.git
+cd DeepLearningExamples/PaddlePaddle/Classification/RN50v1.5
+```
+
+### 2. Download and preprocess the dataset.
+
+The ResNet50 script operates on ImageNet 1k, a widely popular image classification dataset from the ILSVRC challenge.
+
+Paddle can work directly on JPEGs; therefore, preprocessing/augmentation is not needed.
+
+To train your model using mixed or TF32 precision with Tensor Cores or using FP32,
+perform the following steps using the default parameters of the resnet50 model on the ImageNet dataset.
+For the specifics concerning training and inference, refer to the [Advanced](#advanced) section.
+
+
+1. [Download the images](http://image-net.org/download-images).
+
+2. Extract the training data:
+  ```bash
+  cd <path to imagenet>
+  mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
+  tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
+  find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
+  cd ..
+  ```
+
+3. Extract the validation data and move the images to subfolders:
+  ```bash
+  mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
+  wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
+  ```
+
+The directory in which the `train/` and `val/` directories are placed is referred to as `<path to imagenet>` in this document.
+
+### 3. Build the ResNet50 PaddlePaddle NGC container.
+```bash
+docker build . -t nvidia_resnet50
+```
+
+### 4. Start an interactive session in the NGC container to run training/inference.
+```bash
+nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_resnet50
+```
+
+### 5. Start training
+
+To run training for a standard configuration (DGX A100, AMP/TF32),
+use one of the scripts in `scripts/training` to launch training. (Ensure ImageNet is mounted in the `/imagenet` directory.)
+
+Example:
+```bash
+# For TF32 and 8 GPUs training in 90 epochs
+bash scripts/training/train_resnet50_TF32_90E_DGXA100.sh
+
+# For AMP and 8 GPUs training in 90 epochs
+bash scripts/training/train_resnet50_AMP_90E_DGXA100.sh
+```
+
+Alternatively, you can launch training manually with `paddle.distributed.launch`, a built-in PaddlePaddle module that spawns multiple distributed training processes on each of the training nodes.
+
+Example:
+```bash
+# For single GPU training with AMP
+python -m paddle.distributed.launch --gpus=0 train.py \
+  --epochs 90 \
+  --amp \
+  --scale-loss 128.0 \
+  --use-dynamic-loss-scaling \
+  --data-layout NHWC
+
+# For 8 GPUs training with AMP
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+  --epochs 90 \
+  --amp \
+  --scale-loss 128.0 \
+  --use-dynamic-loss-scaling \
+  --data-layout NHWC
+```
+
+To initialize training from a checkpoint or from pretrained parameters, refer to [Training process](#training-process) for more details.
+
+### 6. Start validation/evaluation.
+To evaluate the validation dataset located in `/imagenet/val`, specify the pretrained parameters with `--from-pretrained-params` and set `--run-scope` to `eval_only`.
+
+Example:
+* TF32
+```bash
+# For single GPU evaluation
+python -m paddle.distributed.launch --gpus=0 train.py \
+  --from-pretrained-params <path_to_pretrained_params> \
+  --run-scope eval_only
+
+# For 8 GPUs evaluation
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+  --from-pretrained-params <path_to_pretrained_params> \
+  --run-scope eval_only
+```
+
+* AMP
+```bash
+# For single GPU evaluation
+python -m paddle.distributed.launch --gpus=0 train.py \
+  --from-pretrained-params <path_to_pretrained_params> \
+  --run-scope eval_only \
+  --amp \
+  --data-layout NHWC
+
+# For 8 GPUs evaluation
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+  --from-pretrained-params <path_to_pretrained_params> \
+  --run-scope eval_only \
+  --amp \
+  --data-layout NHWC
+```
+
+We also provide scripts for inference with TensorRT, which can achieve better performance. Refer to [Inference process](#inference-process) in [Advanced](#advanced) for more details.
+
+## Advanced
+
+The following sections provide greater details of the dataset, running training and inference, and the training results.
+
+### Scripts and sample code
+
+To run a non-standard configuration, use:
+
+```bash
+# For single GPU training
+python -m paddle.distributed.launch --gpus=0 train.py
+
+# For 8 GPUs training
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py
+```
+
+### Command-line options
+To find the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example:
+`python [train.py|export_model.py|inference.py] -h`
+
+```bash
+PaddlePaddle RN50v1.5 training script
+
+optional arguments:
+  -h, --help            show this help message and exit
+
+Global:
+  --output-dir OUTPUT_DIR
+                        A path to store trained models. (default: ./output/)
+  --run-scope {train_eval,train_only,eval_only}
+                        Running scope. It should be one of {train_eval, train_only, eval_only}. (default: train_eval)
+  --epochs EPOCHS       The number of epochs for training. (default: 90)
+  --save-interval SAVE_INTERVAL
+                        The iteration interval to save checkpoints. (default: 1)
+  --eval-interval EVAL_INTERVAL
+                        The iteration interval to test trained models on a given validation dataset. Ignored when --run-scope is train_only.
+                        (default: 1)
+  --print-interval PRINT_INTERVAL
+                        The iteration interval to show training/evaluation message. (default: 10)
+  --report-file REPORT_FILE
+                        A file in which to store JSON experiment report. (default: ./report.json)
+  --data-layout {NCHW,NHWC}
+                        Data format. It should be one of {NCHW, NHWC}. (default: NCHW)
+  --benchmark           To enable benchmark mode. (default: False)
+  --benchmark-steps BENCHMARK_STEPS
+                        Steps for benchmark run, only be applied when --benchmark is set. (default: 100)
+  --benchmark-warmup-steps BENCHMARK_WARMUP_STEPS
+                        Warmup steps for benchmark run, only be applied when --benchmark is set. (default: 100)
+  --from-pretrained-params FROM_PRETRAINED_PARAMS
+                        A path to pretrained parameters. It should be a file name without the suffix .pdparams, and must not be set together
+                        with --from-checkpoint. (default: None)
+  --from-checkpoint FROM_CHECKPOINT
+                        A checkpoint path to resume training. It should not be set with --from-pretrained-params at the same time. (default:
+                        None)
+  --last-epoch-of-checkpoint LAST_EPOCH_OF_CHECKPOINT
+                        The epoch id of the checkpoint given by --from-checkpoint. The default of -1 means training starts from the 0-th epoch.
+                        (default: -1)
+  --show-config SHOW_CONFIG
+                        To show arguments. (default: True)
+  --enable-cpu-affinity ENABLE_CPU_AFFINITY
+                        To enable in-built GPU-CPU affinity. (default: True)
+
+Dataset:
+  --image-root IMAGE_ROOT
+                        A root folder of train/val images. It should contain train and val folders, which store corresponding images.
+                        (default: /imagenet)
+  --image-shape IMAGE_SHAPE
+                        The image shape. Its shape should be [channel, height, width]. (default: [4, 224, 224])
+  --batch-size BATCH_SIZE
+                        The batch size for both training and evaluation. (default: 256)
+  --dali-random-seed DALI_RANDOM_SEED
+                        The random seed for DALI data loader. (default: 42)
+  --dali-num-threads DALI_NUM_THREADS
+                        The number of threads applied to DALI data loader. (default: 4)
+  --dali-output-fp16    Output FP16 data from DALI data loader. (default: False)
+
+Data Augmentation:
+  --crop-size CROP_SIZE
+                        The size to crop input images. (default: 224)
+  --rand-crop-scale RAND_CROP_SCALE
+                        Range from which to choose a random area fraction. (default: [0.08, 1.0])
+  --rand-crop-ratio RAND_CROP_RATIO
+                        Range from which to choose a random aspect ratio (width/height). (default: [0.75, 1.3333333333333333])
+  --normalize-scale NORMALIZE_SCALE
+                        A scalar to normalize images. (default: 0.00392156862745098)
+  --normalize-mean NORMALIZE_MEAN
+                        The mean values to normalize RGB images. (default: [0.485, 0.456, 0.406])
+  --normalize-std NORMALIZE_STD
+                        The std values to normalize RGB images. (default: [0.229, 0.224, 0.225])
+  --resize-short RESIZE_SHORT
+                        The length of the shorter dimension of the resized image. (default: 256)
+
+Model:
+  --model-arch-name MODEL_ARCH_NAME
+                        The model architecture name. It should be one of {ResNet50}. (default: ResNet50)
+  --num-of-class NUM_OF_CLASS
+                        The number of image classes. (default: 1000)
+  --bn-weight-decay     Apply weight decay to BatchNorm shift and scale. (default: False)
+
+Training:
+  --label-smoothing LABEL_SMOOTHING
+                        The ratio of label smoothing. (default: 0.1)
+  --optimizer OPTIMIZER
+                        The name of optimizer. It should be one of {Momentum}. (default: Momentum)
+  --momentum MOMENTUM   The momentum value of optimizer. (default: 0.875)
+  --weight-decay WEIGHT_DECAY
+                        The coefficient of weight decay. (default: 3.0517578125e-05)
+  --lr-scheduler LR_SCHEDULER
+                        The name of learning rate scheduler. It should be one of {Cosine}. (default: Cosine)
+  --lr LR               The initial learning rate. (default: 0.256)
+  --warmup-epochs WARMUP_EPOCHS
+                        The number of epochs for learning rate warmup. (default: 5)
+  --warmup-start-lr WARMUP_START_LR
+                        The initial learning rate for warmup. (default: 0.0)
+
+Advanced Training:
+  --amp                 Enable automatic mixed precision training (AMP). (default: False)
+  --scale-loss SCALE_LOSS
+                        The loss scalar for AMP training, only be applied when --amp is set. (default: 1.0)
+  --use-dynamic-loss-scaling
+                        Enable dynamic loss scaling in AMP training, only be applied when --amp is set. (default: False)
+  --use-pure-fp16       Enable pure FP16 training, only be applied when --amp is set. (default: False)
+  --asp                 Enable automatic sparse training (ASP). (default: False)
+  --prune-model         Prune model to 2:4 sparse pattern, only be applied when --asp is set. (default: False)
+  --mask-algo {mask_1d,mask_2d_greedy,mask_2d_best}
+                        The algorithm to generate sparse masks. It should be one of {mask_1d, mask_2d_greedy, mask_2d_best}. This only be
+                        applied when --asp and --prune-model is set. (default: mask_1d)
+
+Paddle-TRT:
+  --trt-inference-dir TRT_INFERENCE_DIR
+                        A path to store/load inference models. export_model.py would export models to this folder, then inference.py would
+                        load from here. (default: ./inference)
+  --trt-precision {FP32,FP16,INT8}
+                        The precision of TensorRT. It should be one of {FP32, FP16, INT8}. (default: FP32)
+  --trt-workspace-size TRT_WORKSPACE_SIZE
+                        The memory workspace of TensorRT in MB. (default: 1073741824)
+  --trt-min-subgraph-size TRT_MIN_SUBGRAPH_SIZE
+                        The minimal subgraph size to enable PaddleTRT. (default: 3)
+  --trt-use-static TRT_USE_STATIC
+                        Fix TensorRT engine at first running. (default: False)
+  --trt-use-calib-mode TRT_USE_CALIB_MODE
+                        Use the PTQ calibration of PaddleTRT int8. (default: False)
+  --trt-export-log-path TRT_EXPORT_LOG_PATH
+                        A file in which to store JSON model exporting report. (default: ./export.json)
+  --trt-log-path TRT_LOG_PATH
+                        A file in which to store JSON inference report. (default: ./inference.json)
+  --trt-use-synthat TRT_USE_SYNTHAT
+                        Apply synthetic data for benchmark. (default: False)
+```
+
+### Dataset guidelines
+
+To use your own dataset, divide it in directories as in the following scheme:
+
+ - Training images - `train/<class id>/<image>`
+ - Validation images - `val/<class id>/<image>`
+
+If the number of classes in your dataset is not 1000, specify it with `--num-of-class`.
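The expected layout can be sketched with a small stdlib-only helper (the function and file names here are illustrative, not part of the repository):

```python
from pathlib import Path

def make_imagenet_layout(root, classes, images_per_class=1):
    """Create the train/val directory scheme expected by the dataloader:
    <root>/train/<class id>/<image> and <root>/val/<class id>/<image>."""
    root = Path(root)
    for split in ("train", "val"):
        for class_id in classes:
            class_dir = root / split / str(class_id)
            class_dir.mkdir(parents=True, exist_ok=True)
            for i in range(images_per_class):
                # Placeholder files standing in for real JPEG images
                (class_dir / f"img_{i}.jpg").touch()
    return root

def count_classes(root, split="train"):
    """Number of classes = number of sub-directories under <root>/<split>."""
    return sum(1 for p in (Path(root) / split).iterdir() if p.is_dir())
```

If `count_classes` reports something other than 1000 for your dataset, pass that number via `--num-of-class`.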
+
+### Training process
+The model will be stored in the directory specified by `--output-dir` as three files:
+- `.pdparams`: The model parameters, containing all trainable tensors, saved to a file with the suffix `.pdparams`.
+- `.pdopts`: The optimizer information, containing all tensors used by the optimizer. For the Adam optimizer, this includes beta1, beta2, momentum, and so on, saved to a file with the suffix `.pdopts`. (If the optimizer has no tensors to save, as with SGD, this file is not generated.)
+- `.pdmodel`: The network description of the program, used only for deployment, saved to a file with the suffix `.pdmodel`.
+
+The default prefix of the model files is `paddle_example`. The model for each epoch is stored in the directory `./output/ResNet/<epoch_id>/` as the three files above: `paddle_example.pdparams`, `paddle_example.pdopts`, and `paddle_example.pdmodel`. Note that `epoch_id` is 0-based, so it runs from 0 to 89 for a total of 90 epochs. For example, the model of the last epoch (`epoch_id` 89) is stored under `./output/ResNet/89/paddle_example`.
+
+Assume you want to train ResNet50 for 90 epochs, but the training process aborts during the 50th epoch due to an infrastructure fault. To resume training from the checkpoint, specify `--from-checkpoint` and `--last-epoch-of-checkpoint` as follows:
+- Set `--from-checkpoint` to `./output/ResNet/49/paddle_example`.
+- Set `--last-epoch-of-checkpoint` to `49`.
+
+Then rerun training to resume from the 50th epoch through the 89th epoch.
+
+Example:
+```bash
+# Resume AMP training from the checkpoint of the 50th epoch
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+  --epochs 90 \
+  --amp \
+  --scale-loss 128.0 \
+  --use-dynamic-loss-scaling \
+  --data-layout NHWC \
+  --from-checkpoint ./output/ResNet/49/paddle_example \
+  --last-epoch-of-checkpoint 49
+```
+
+To start training from pretrained weights, set `--from-pretrained-params` to `./output/ResNet/<epoch_id>/paddle_example`.
+
+Example:
+```bash
+# Train AMP with model initialization by <./your_own_path_to/paddle_example>
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+  --epochs 90 \
+  --amp \
+  --scale-loss 128.0 \
+  --use-dynamic-loss-scaling \
+  --data-layout NHWC \
+  --from-pretrained-params ./your_own_path_to/paddle_example
+```
+
+Make sure:
+- To resume from a checkpoint: both `paddle_example.pdparams` and `paddle_example.pdopts` must be in the given path.
+- To start from pretrained weights: `paddle_example.pdparams` must be in the given path.
+- The prefix `paddle_example` must be appended to the given path. For example, set the path to `./output/ResNet/89/paddle_example` instead of `./output/ResNet/89/`.
+- Don't set `--from-checkpoint` and `--from-pretrained-params` at the same time.
+
+The difference between the two is that `--from-pretrained-params` contains only the model weights, while `--from-checkpoint` additionally contains the optimizer state and the LR scheduler state.
+
+`--from-checkpoint` is suitable for dividing a training job into shorter stages or for restarting training after infrastructure faults.
+
+`--from-pretrained-params` can be used as a base for finetuning the model on a different dataset or as a backbone for detection models.
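The file-presence rules above can be sketched as a small check (a hypothetical helper for illustration only; the training script performs its own validation):

```python
import os

def validate_model_prefix(prefix, resume=False):
    """Check that the files required for the given load mode exist.

    `prefix` is a path ending with the file prefix and no suffix,
    e.g. ./output/ResNet/89/paddle_example.
    Resuming (--from-checkpoint) needs both weights (.pdparams) and
    optimizer state (.pdopts); loading pretrained weights
    (--from-pretrained-params) needs only .pdparams.
    """
    required = [".pdparams", ".pdopts"] if resume else [".pdparams"]
    missing = [s for s in required if not os.path.isfile(prefix + s)]
    if missing:
        raise FileNotFoundError(f"missing {missing} for prefix {prefix}")
    return True
```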
+
+Metrics gathered during both training and evaluation:
+ - `[train|val].loss` - loss
+ - `[train|val].top1` - top-1 accuracy
+ - `[train|val].top5` - top-5 accuracy
+ - `[train|val].data_time` - time spent waiting for data
+ - `[train|val].compute_time` - time spent on computation
+ - `[train|val].batch_time` - time spent on a mini-batch
+ - `[train|val].ips` - speed measured in images per second
+
+Metrics gathered during training only:
+ - `train.lr` - learning rate
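These timing metrics relate to each other roughly as follows (a simplified single-worker sketch with illustrative names, not the repository's actual logging code):

```python
def derive_step_metrics(batch_size, data_time, compute_time):
    """batch_time is the sum of time spent waiting for data and computing;
    ips is the number of images processed per second of batch_time."""
    batch_time = data_time + compute_time
    return {"batch_time": batch_time, "ips": batch_size / batch_time}
```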
+
+
+### Automatic SParsity training process
+To enable the automatic sparsity training workflow, turn on `--asp` and `--prune-model` when launching training. Refer to [Command-line options](#command-line-options).
+
+Note that automatic sparsity (ASP) requires a pretrained model to initialize parameters.
+
+You can use the provided script `scripts/training/train_resnet50_AMP_ASP_90E_DGXA100.sh` to launch ASP + AMP training.
+```bash
+# Default path to pretrained parameters is ./output/ResNet50/89/paddle_example
+bash scripts/training/train_resnet50_AMP_ASP_90E_DGXA100.sh <pretrained_parameters>
+```
+
+Or follow the steps below to manually launch ASP + AMP training.
+
+First, set `--from-pretrained-params` to a pretrained model file. For example, if you have trained ResNet50 for 90 epochs following the [Training process](#training-process), the final pretrained weights are stored in `./output/ResNet50/89/paddle_example.pdparams` by default, so set `--from-pretrained-params` to `./output/ResNet50/89/paddle_example`.
+
+Then run the following command to launch AMP + ASP training:
+```bash
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+  --from-pretrained-params ./output/ResNet50/89/paddle_example \
+  --epochs 90 \
+  --amp \
+  --scale-loss 128.0 \
+  --use-dynamic-loss-scaling \
+  --data-layout NHWC \
+  --asp \
+  --prune-model \
+  --mask-algo mask_1d
+```
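As a rough illustration of what `--prune-model` with `--mask-algo mask_1d` does: the 2:4 sparse pattern keeps the 2 largest-magnitude weights in every group of 4 consecutive weights. Below is a simplified, pure-Python sketch of this idea (Paddle's actual ASP implementation operates on tensors and is more involved):

```python
def mask_1d(weights):
    """Return a 0/1 mask enforcing the 2:4 sparse pattern on a flat list.

    For every group of 4 consecutive weights, the 2 entries with the
    largest absolute value are kept (mask 1) and the rest pruned (mask 0).
    Assumes len(weights) is a multiple of 4.
    """
    mask = [0] * len(weights)
    for g in range(0, len(weights), 4):
        group = weights[g:g + 4]
        # Indices of the two largest-magnitude entries within the group
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        for i in keep:
            mask[g + i] = 1
    return mask
```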
+
+### Inference process
+
+#### Inference on your own datasets
+To run evaluation on your own dataset with pretrained parameters:
+
+1. Set `--from-pretrained-params` to your pretrained parameters.
+2. Set `--image-root` to the root folder of your own dataset.
+  - Note that the validation dataset should be in `<image-root>/val`.
+3. Set `--run-scope` to `eval_only`.
+```bash
+# For single GPU evaluation
+python -m paddle.distributed.launch --gpus=0 train.py \
+  --from-pretrained-params <path_to_pretrained_params> \
+  --image-root <your_own_dataset> \
+  --run-scope eval_only
+
+# For 8 GPUs evaluation
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+  --from-pretrained-params <path_to_pretrained_params> \
+  --image-root <your_own_dataset> \
+  --run-scope eval_only
+```
+
+#### Inference with TensorRT
+To run inference with TensorRT for the best performance, use the scripts in `scripts/inference`.
+
+For example,
+1. Run `bash scripts/inference/export_resnet50_AMP.sh <your_checkpoint>` to export an inference model.
+  - The default checkpoint path is `./output/ResNet/89/paddle_example`.
+2. Run `bash scripts/inference/infer_resnet50_AMP.sh` to infer with TensorRT.
+
+Alternatively, you can manually run `export_model.py` and `inference.py` with specific arguments; refer to [Command-line options](#command-line-options).
+
+Note that the arguments passed to `export_model.py` and `inference.py` should be the same as the arguments used in training.
+
+## Performance
+
+The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to [NVIDIA Data Center Deep Learning Product Performance](https://developer.nvidia.com/deep-learning-performance-training-inference).
+
+### Benchmarking
+
+The following section shows how to run benchmarks measuring the model performance in training and inference modes.
+
+#### Training performance benchmark
+
+To benchmark training (A100 GPUs only for now), set `--benchmark`, `--benchmark-steps` and `--benchmark-warmup-steps`, then run training with `--run-scope train_only`.
+Refer to [Command-line options](#command-line-options).
+
+Example:
+```bash
+# For 8 GPUs benchmark for AMP
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+  --run-scope train_only \
+  --amp \
+  --scale-loss 128.0 \
+  --use-dynamic-loss-scaling \
+  --data-layout NHWC \
+  --benchmark \
+  --benchmark-steps 100 \
+  --benchmark-warmup-steps 300
+```
+
+The benchmark runs 300 warmup iterations and 100 measured iterations, then saves the results to the file specified by `--report-file`.
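The reported multi-GPU throughput corresponds roughly to the following computation (an illustrative sketch, not the repository's actual reporting code):

```python
def images_per_second(batch_size, num_gpus, batch_times):
    """Aggregate throughput over the measured steps.

    batch_times are per-step wall-clock times (seconds) for one worker;
    each step processes batch_size images on each of num_gpus workers.
    """
    total_images = batch_size * num_gpus * len(batch_times)
    return total_images / sum(batch_times)
```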
+
+#### Inference performance benchmark
+
+##### Benchmark
+
+To benchmark evaluation (A100 GPUs only for now), set `--benchmark`, `--benchmark-steps` and `--benchmark-warmup-steps`, then run training with `--run-scope eval_only`.
+Refer to [Command-line options](#command-line-options).
+
+Example:
+```bash
+# For 8 GPUs benchmark for AMP
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+  --run-scope eval_only \
+  --amp \
+  --data-layout NHWC \
+  --benchmark \
+  --benchmark-steps 100 \
+  --benchmark-warmup-steps 300
+```
+
+The benchmark runs 300 warmup iterations and 100 measured iterations, then saves the results to the file specified by `--report-file`.
+
+You can also set the benchmark batch size by adding `--batch-size <batch_size>` to the launch command.
+```bash
+# For 8 GPUs benchmark for AMP
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+  --run-scope eval_only \
+  --batch-size 32 \
+  --amp \
+  --data-layout NHWC \
+  --benchmark \
+  --benchmark-steps 100 \
+  --benchmark-warmup-steps 300
+```
+
+##### Benchmark with TensorRT
+
+To benchmark the inference performance with TensorRT on a specific batch size, run:
+
+* FP32 / TF32
+```bash
+python inference.py \
+    --trt-inference-dir <path_to_exported_model> \
+    --trt-precision FP32 \
+    --batch-size <batch_size> \
+    --benchmark-steps 1024 \
+    --benchmark-warmup-steps 16
+```
+
+* FP16
+```bash
+python inference.py \
+    --trt-inference-dir <path_to_exported_model> \
+    --trt-precision FP16 \
+    --batch-size <batch_size> \
+    --benchmark-steps 1024 \
+    --benchmark-warmup-steps 16
+```
+
+Note that the arguments passed to `inference.py` should be the same as the arguments used in training.
+
+The benchmark uses the validation dataset by default, which should be put in `<image-root>/val`.
+To benchmark the raw model performance, a synthetic dataset can be used instead by adding `--trt-use-synthat True` as a command-line option.
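The average and 90/95/99% latency columns in the tables below can be derived from per-batch timings as in this sketch (using a nearest-rank percentile definition; the actual report script may compute percentiles differently):

```python
def latency_stats(latencies_ms):
    """Average and nearest-rank percentile latencies from per-batch timings."""
    s = sorted(latencies_ms)

    def pct(p):
        # Nearest-rank: the smallest value such that at least p% of
        # samples are less than or equal to it.
        k = max(0, min(len(s) - 1, int(round(p / 100.0 * len(s))) - 1))
        return s[k]

    return {
        "avg": sum(s) / len(s),
        "p90": pct(90),
        "p95": pct(95),
        "p99": pct(99),
    }
```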
+
+### Results
+
+#### Training accuracy results
+
+Our results were obtained by running the applicable training script in the PaddlePaddle NGC container.
+
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
+
+
+##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
+
+| **Epochs** | **Mixed Precision Top1** | **TF32 Top1**   |
+|:----------:|:------------------------:|:---------------:|
+|     50     |       75.96 +/- 0.09     |  76.17 +/- 0.11 |
+|     90     |       76.93 +/- 0.14     |  76.91 +/- 0.13 |
+
+##### Example plots
+
+The following images show the 90 epochs configuration on a DGX-A100.
+
+![ValidationLoss](./img/loss.png)
+![ValidationTop1](./img/top1.png)
+![ValidationTop5](./img/top5.png)
+
+##### Accuracy recovery of Automatic SParsity: NVIDIA DGX A100 (8x A100 80GB)
+
+| **Epochs** | **Mixed Precision Top1 (Baseline)** | **Mixed Precision+ASP Top1**  |
+|:----------:|:-----------------------------------:|:-----------------------------:|
+|     90     |              76.92                  |              76.72             |
+
+#### Training performance results
+
+Our results were obtained by running the applicable training script in the PaddlePaddle NGC container.
+
+To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
+
+##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
+
+| **GPUs** |  **Throughput - TF32**  | **Throughput - mixed precision** | **Throughput speedup (TF32 to mixed precision)** | **TF32 Scaling** | **Mixed Precision Scaling** | **Mixed Precision Training Time (90E)** | **TF32 Training Time (90E)** |
+|:--------:|:------------:|:-------------:|:------------:|:------:|:--------:|:--------:|:--------:|
+|    1     |    993 img/s |  2711 img/s   |    2.73 x    | 1.0 x  |  1.0 x   | ~13 hours| ~40 hours|
+|    8     |  7955 img/s  |   20267 img/s |    2.54 x    | 8.01 x | 7.47 x   | ~2 hours | ~4 hours |
+
+##### Training performance of Automatic SParsity: NVIDIA DGX A100 (8x A100 80GB)
+| **GPUs** |  **Throughput - mixed precision** | **Throughput - mixed precision+ASP** | **Overhead** |
+|:--------:|:---------------------------------:|:------------------------------------:|:------------:|
+|    1     |           2711 img/s              |               2686 img/s             | 1.0%         |
+|    8     |          20267 img/s              |              20144 img/s             | 0.6%         |
+
+
+Note that `train.py` enables CPU affinity binding to GPUs by default, which is designed to be optimal for NVIDIA DGX-series systems. You can disable binding by launching `train.py` with `--enable-cpu-affinity false`.
+
+
+### Inference performance results
+
+#### Inference performance: NVIDIA DGX A100 (1x A100 80GB)
+Our results were obtained by running the applicable training script with the `--run-scope eval_only` argument in the PaddlePaddle NGC container.
+
+**TF32 Inference Latency**
+
+|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
+|--------------|------------------|---------------|---------------|---------------|---------------|
+| 1 | 138.90 img/s | 7.19 ms | 7.25 ms | 7.70 ms | 17.05 ms |
+| 2 | 263.20 img/s | 7.59 ms | 7.61 ms | 8.27 ms | 18.17 ms |
+| 4 | 442.47 img/s | 9.04 ms | 9.31 ms | 10.10 ms | 20.41 ms |
+| 8 | 904.99 img/s | 8.83 ms | 9.27 ms | 10.08 ms | 18.16 ms |
+| 16 | 1738.12 img/s | 9.20 ms | 9.75 ms | 10.16 ms | 18.06 ms |
+| 32 | 2423.74 img/s | 13.20 ms | 16.09 ms | 18.10 ms | 28.01 ms |
+| 64 | 2890.31 img/s | 22.14 ms | 22.10 ms | 22.79 ms | 30.62 ms |
+| 128 | 2676.88 img/s | 47.81 ms | 68.94 ms | 77.97 ms | 92.41 ms |
+| 256 | 3283.94 img/s | 77.95 ms | 79.02 ms | 80.88 ms | 98.36 ms |
+
+**Mixed Precision Inference Latency**
+
+|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
+|--------------|------------------|---------------|---------------|---------------|---------------|
+| 1 | 127.12 img/s | 7.86 ms | 8.24 ms | 8.52 ms | 14.17 ms |
+| 2 | 239.49 img/s | 8.35 ms | 9.08 ms | 9.78 ms |  9.89 ms |
+| 4 | 519.19 img/s | 7.70 ms | 7.44 ms | 7.69 ms | 14.20 ms |
+| 8 | 918.01 img/s | 8.71 ms | 8.39 ms | 9.08 ms | 21.23 ms |
+| 16 | 1795.41 img/s | 8.91 ms | 9.73 ms | 10.36 ms | 11.39 ms |
+| 32 | 3201.59 img/s | 9.99 ms | 12.04 ms | 15.29 ms | 23.23 ms |
+| 64 | 4919.89 img/s | 13.00 ms | 13.66 ms | 14.06 ms | 24.75 ms |
+| 128 | 4361.36 img/s | 29.34 ms | 47.47 ms | 157.49 ms | 77.42 ms |
+| 256 | 5742.03 img/s | 44.58 ms | 52.78 ms | 356.58 ms | 78.99 ms |
+
+### Paddle-TRT performance results
+
+#### Paddle-TRT performance: NVIDIA DGX A100 (1x A100 80GB)
+Our results for Paddle-TRT were obtained by running the `inference.py` script on an NVIDIA DGX A100 with a single A100 80GB GPU.
+
+**TF32 Inference Latency**
+
+|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
+|--------------|------------------|---------------|---------------|---------------|---------------|
+| 1 | 716.49 img/s | 1.40 ms | 1.96 ms | 2.20 ms | 3.01 ms |
+| 2 | 1219.98 img/s | 1.64 ms | 2.26 ms | 2.90 ms | 5.04 ms |
+| 4 | 1880.12 img/s | 2.13 ms | 3.39 ms | 4.44 ms | 7.32 ms |
+| 8 | 2404.10 img/s | 3.33 ms | 4.51 ms | 5.90 ms | 10.39 ms |
+| 16 | 3101.28 img/s | 5.16 ms | 7.06 ms | 9.13 ms | 15.18 ms |
+| 32 | 3294.11 img/s | 9.71 ms | 21.42 ms | 26.94 ms | 35.79 ms |
+| 64 | 4327.38 img/s | 14.79 ms | 25.59 ms | 30.45 ms | 45.34 ms |
+| 128 | 4956.59 img/s | 25.82 ms | 33.74 ms | 40.36 ms | 56.06 ms |
+| 256 | 5244.29 img/s | 48.81 ms | 62.11 ms | 67.56 ms | 88.38 ms |
+
+**FP16 Inference Latency**
+
+|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
+|--------------|------------------|---------------|---------------|---------------|---------------|
+| 1 | 860.90 img/s | 1.16 ms | 1.81 ms | 2.06 ms | 2.98 ms |
+| 2 | 1464.06 img/s | 1.37 ms | 2.13 ms | 2.73 ms | 4.76 ms |
+| 4 | 2246.24 img/s | 1.78 ms | 3.17 ms | 4.20 ms | 7.39 ms |
+| 8 | 2457.44 img/s | 3.25 ms | 4.35 ms | 5.50 ms | 9.98 ms |
+| 16 | 3928.83 img/s | 4.07 ms | 6.26 ms | 8.50 ms | 15.10 ms |
+| 32 | 3853.13 img/s | 8.30 ms | 19.87 ms | 25.51 ms | 34.99 ms |
+| 64 | 5581.89 img/s | 11.46 ms | 22.32 ms | 30.75 ms | 43.35 ms |
+| 128 | 6846.77 img/s | 18.69 ms | 25.43 ms | 35.03 ms | 50.04 ms |
+| 256 | 7481.19 img/s | 34.22 ms | 40.92 ms | 51.10 ms | 65.68 ms |
+
+#### Paddle-TRT performance: NVIDIA A30 (1x A30 24GB)
+Our results for Paddle-TRT were obtained by running the `inference.py` script on a single NVIDIA A30 24GB GPU.
+
+**TF32 Inference Latency**
+
+|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
+|--------------|------------------|---------------|---------------|---------------|---------------|
+| 1 | 672.79 img/s | 1.49 ms | 2.01 ms | 2.29 ms | 3.04 ms |
+| 2 | 1041.47 img/s | 1.92 ms | 2.49 ms | 2.87 ms | 4.13 ms |
+| 4 | 1505.64 img/s | 2.66 ms | 3.43 ms | 4.06 ms | 6.85 ms |
+| 8 | 2001.13 img/s | 4.00 ms | 4.72 ms | 5.54 ms | 9.51 ms |
+| 16 | 2462.80 img/s | 6.50 ms | 7.71 ms | 9.32 ms | 15.54 ms |
+| 32 | 2474.34 img/s | 12.93 ms | 21.61 ms | 25.76 ms | 34.69 ms |
+| 64 | 2949.38 img/s | 21.70 ms | 29.58 ms | 34.63 ms | 47.11 ms |
+| 128 | 3278.67 img/s | 39.04 ms | 43.34 ms | 52.72 ms | 66.78 ms |
+| 256 | 3293.10 img/s | 77.74 ms | 90.51 ms | 99.71 ms | 110.80 ms |
+
+**FP16 Inference Latency**
+
+|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
+|--------------|------------------|---------------|---------------|---------------|---------------|
+| 1 | 804.56 img/s | 1.24 ms | 1.81 ms | 2.15 ms | 3.07 ms |
+| 2 | 1435.74 img/s | 1.39 ms | 2.05 ms | 2.48 ms | 3.86 ms |
+| 4 | 2169.87 img/s | 1.84 ms | 2.72 ms | 3.39 ms | 5.94 ms |
+| 8 | 2395.13 img/s | 3.34 ms | 4.46 ms | 5.11 ms | 9.49 ms |
+| 16 | 3779.82 img/s | 4.23 ms | 5.83 ms | 7.66 ms | 14.44 ms |
+| 32 | 3620.18 img/s | 8.84 ms | 17.90 ms | 22.31 ms | 30.91 ms |
+| 64 | 4592.08 img/s | 13.94 ms | 24.00 ms | 29.38 ms | 41.41 ms |
+| 128 | 5064.06 img/s | 25.28 ms | 31.73 ms | 37.79 ms | 53.01 ms |
+| 256 | 4774.61 img/s | 53.62 ms | 59.04 ms | 67.29 ms | 80.51 ms |
+
+
+#### Paddle-TRT performance: NVIDIA A10 (1x A10 24GB)
+Our results for Paddle-TRT were obtained by running the `inference.py` script on a single NVIDIA A10 24GB GPU.
+
+**TF32 Inference Latency**
+
+|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
+|--------------|------------------|---------------|---------------|---------------|---------------|
+| 1 | 372.04 img/s | 2.69 ms | 3.64 ms | 4.20 ms | 5.28 ms |
+| 2 | 615.93 img/s | 3.25 ms | 4.08 ms | 4.59 ms | 6.42 ms |
+| 4 | 1070.02 img/s | 3.74 ms | 3.90 ms | 4.35 ms | 7.48 ms |
+| 8 | 1396.88 img/s | 5.73 ms | 6.87 ms | 7.52 ms | 10.63 ms |
+| 16 | 1522.20 img/s | 10.51 ms | 12.73 ms | 13.84 ms | 17.84 ms |
+| 32 | 1674.39 img/s | 19.11 ms | 23.23 ms | 24.63 ms | 29.55 ms |
+| 64 | 1782.14 img/s | 35.91 ms | 41.84 ms | 44.53 ms | 48.94 ms |
+| 128 | 1722.33 img/s | 74.32 ms | 85.37 ms | 89.27 ms | 94.85 ms |
+| 256 | 1576.89 img/s | 162.34 ms | 181.01 ms | 185.92 ms | 194.42 ms |
+
+**FP16 Inference Latency**
+
+|**Batch Size**|**Avg throughput**|**Avg latency**|**90% Latency**|**95% Latency**|**99% Latency**|
+|--------------|------------------|---------------|---------------|---------------|---------------|
+| 1 | 365.38 img/s | 2.74 ms | 3.94 ms | 4.35 ms | 5.64 ms |
+| 2 | 612.52 img/s | 3.26 ms | 4.34 ms | 4.80 ms | 6.97 ms |
+| 4 | 1018.15 img/s | 3.93 ms | 4.95 ms | 5.55 ms | 9.16 ms |
+| 8 | 1924.26 img/s | 4.16 ms | 5.44 ms | 6.20 ms | 11.89 ms |
+| 16 | 2477.49 img/s | 6.46 ms | 8.07 ms | 9.21 ms | 15.05 ms |
+| 32 | 2896.01 img/s | 11.05 ms | 13.56 ms | 15.32 ms | 21.76 ms |
+| 64 | 3165.27 img/s | 20.22 ms | 24.20 ms | 25.94 ms | 33.18 ms |
+| 128 | 3176.46 img/s | 40.29 ms | 46.36 ms | 49.15 ms | 54.95 ms |
+| 256 | 3110.01 img/s | 82.31 ms | 93.21 ms | 96.06 ms | 99.97 ms |
+
+## Release notes
+
+### Changelog
+
+1. December 2021
+  * Initial release
+  * Cosine LR schedule
+  * DALI support
+  * DALI-CPU dataloader
+  * Added A100 scripts
+  * Paddle AMP
+
+
+2. January 2022
+  * Added label smoothing, fan-in initialization, and an option to skip weight decay on batch norm gamma and bias
+  * Updated README
+  * A100 convergence benchmark
+
+
+### Known issues
+  * An allreduce issue affects top1 and top5 accuracy in evaluation. Workaround: set `build_strategy.fix_op_run_order = True` for the eval program (refer to [Paddle-issue-39567](https://github.com/PaddlePaddle/Paddle/issues/39567) for details)

+ 238 - 0
PaddlePaddle/Classification/RN50v1.5/dali.py

@@ -0,0 +1,238 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+from dataclasses import dataclass
+import paddle
+import nvidia.dali.ops as ops
+import nvidia.dali.types as types
+from nvidia.dali.pipeline import Pipeline
+from nvidia.dali.plugin.paddle import DALIGenericIterator
+from utils.mode import Mode
+from utils.utility import get_num_trainers, get_trainer_id
+
+
+@dataclass
+class PipeOpMeta:
+    crop: int
+    resize_shorter: int
+    min_area: float
+    max_area: float
+    lower: float
+    upper: float
+    interp: types.DALIInterpType
+    mean: float
+    std: float
+    output_dtype: types.DALIDataType
+    output_layout: str
+    pad_output: bool
+
+
+class HybridPipeBase(Pipeline):
+    def __init__(self,
+                 file_root,
+                 batch_size,
+                 device_id,
+                 ops_meta,
+                 num_threads=4,
+                 seed=42,
+                 shard_id=0,
+                 num_shards=1,
+                 random_shuffle=True,
+                 dont_use_mmap=True):
+        super().__init__(batch_size, num_threads, device_id, seed=seed)
+
+        self.input = ops.readers.File(
+            file_root=file_root,
+            shard_id=shard_id,
+            num_shards=num_shards,
+            random_shuffle=random_shuffle,
+            dont_use_mmap=dont_use_mmap)
+
+        self.build_ops(ops_meta)
+
+    def build_ops(self, ops_meta):
+        pass
+
+    def __len__(self):
+        return self.epoch_size("Reader")
+
+
+class HybridTrainPipe(HybridPipeBase):
+    def build_ops(self, ops_meta):
+        # Set internal nvJPEG buffers size to handle full-sized ImageNet images
+        # without additional reallocations
+        device_memory_padding = 211025920
+        host_memory_padding = 140544512
+        self.decode = ops.decoders.ImageRandomCrop(
+            device='mixed',
+            output_type=types.DALIImageType.RGB,
+            device_memory_padding=device_memory_padding,
+            host_memory_padding=host_memory_padding,
+            random_aspect_ratio=[ops_meta.lower, ops_meta.upper],
+            random_area=[ops_meta.min_area, ops_meta.max_area],
+            num_attempts=100)
+        self.res = ops.Resize(
+            device='gpu',
+            resize_x=ops_meta.crop,
+            resize_y=ops_meta.crop,
+            interp_type=ops_meta.interp)
+        self.cmnp = ops.CropMirrorNormalize(
+            device="gpu",
+            dtype=ops_meta.output_dtype,
+            output_layout=ops_meta.output_layout,
+            crop=(ops_meta.crop, ops_meta.crop),
+            mean=ops_meta.mean,
+            std=ops_meta.std,
+            pad_output=ops_meta.pad_output)
+        self.coin = ops.random.CoinFlip(probability=0.5)
+        self.to_int64 = ops.Cast(dtype=types.DALIDataType.INT64, device="gpu")
+
+    def define_graph(self):
+        rng = self.coin()
+        jpegs, labels = self.input(name="Reader")
+        images = self.decode(jpegs)
+        images = self.res(images)
+        output = self.cmnp(images.gpu(), mirror=rng)
+        return [output, self.to_int64(labels.gpu())]
+
+
+class HybridValPipe(HybridPipeBase):
+    def build_ops(self, ops_meta):
+        self.decode = ops.decoders.Image(device="mixed")
+        self.res = ops.Resize(
+            device="gpu",
+            resize_shorter=ops_meta.resize_shorter,
+            interp_type=ops_meta.interp)
+        self.cmnp = ops.CropMirrorNormalize(
+            device="gpu",
+            dtype=ops_meta.output_dtype,
+            output_layout=ops_meta.output_layout,
+            crop=(ops_meta.crop, ops_meta.crop),
+            mean=ops_meta.mean,
+            std=ops_meta.std,
+            pad_output=ops_meta.pad_output)
+        self.to_int64 = ops.Cast(dtype=types.DALIDataType.INT64, device="gpu")
+
+    def define_graph(self):
+        jpegs, labels = self.input(name="Reader")
+        images = self.decode(jpegs)
+        images = self.res(images)
+        output = self.cmnp(images)
+        return [output, self.to_int64(labels.gpu())]
+
+
+def dali_dataloader(args, mode, device):
+    """
+    Define a DALI dataloader configured to operate on the dataset.
+
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        mode(utils.Mode): Train or eval mode.
+        device(int): Id of GPU to load data.
+    Outputs:
+        DALIGenericIterator(nvidia.dali.plugin.paddle.DALIGenericIterator)
+            Iterable output of the DALI pipeline,
+            including "data" and "label" as Paddle Tensors.
+    """
+    assert "gpu" in device, "gpu training is required for DALI"
+    assert mode in Mode, "Dataset mode should be in supported Modes"
+
+    device_id = int(device.split(':')[1])
+
+    seed = args.dali_random_seed
+    num_threads = args.dali_num_threads
+    batch_size = args.batch_size
+
+    interp = 1  # settings.interpolation or 1  # default to linear
+    interp_map = {
+        # cv2.INTER_NEAREST
+        0: types.DALIInterpType.INTERP_NN,
+        # cv2.INTER_LINEAR
+        1: types.DALIInterpType.INTERP_LINEAR,
+        # cv2.INTER_CUBIC
+        2: types.DALIInterpType.INTERP_CUBIC,
+        # LANCZOS3 for cv2.INTER_LANCZOS4
+        3: types.DALIInterpType.INTERP_LANCZOS3
+    }
+    assert interp in interp_map, "interpolation method not supported by DALI"
+    interp = interp_map[interp]
+
+    normalize_scale = args.normalize_scale
+    normalize_mean = args.normalize_mean
+    normalize_std = args.normalize_std
+    normalize_mean = [v / normalize_scale for v in normalize_mean]
+    normalize_std = [v / normalize_scale for v in normalize_std]
+
+    output_layout = args.data_layout[1:]  # NCHW -> CHW or NHWC -> HWC
+    pad_output = args.image_channel == 4
+    output_dtype = types.FLOAT16 if args.dali_output_fp16 else types.FLOAT
+
+    shard_id = get_trainer_id()
+    num_shards = get_num_trainers()
+
+    scale = args.rand_crop_scale
+    ratio = args.rand_crop_ratio
+
+    ops_meta = PipeOpMeta(
+        crop=args.crop_size,
+        resize_shorter=args.resize_short,
+        min_area=scale[0],
+        max_area=scale[1],
+        lower=ratio[0],
+        upper=ratio[1],
+        interp=interp,
+        mean=normalize_mean,
+        std=normalize_std,
+        output_dtype=output_dtype,
+        output_layout=output_layout,
+        pad_output=pad_output)
+
+    file_root = args.image_root
+    pipe_class = None
+
+    if mode == Mode.TRAIN:
+        file_root = os.path.join(file_root, 'train')
+        pipe_class = HybridTrainPipe
+    else:
+        file_root = os.path.join(file_root, 'val')
+        pipe_class = HybridValPipe
+
+    pipe = pipe_class(
+        file_root,
+        batch_size,
+        device_id,
+        ops_meta,
+        num_threads=num_threads,
+        seed=seed + shard_id,
+        shard_id=shard_id,
+        num_shards=num_shards)
+    pipe.build()
+    return DALIGenericIterator([pipe], ['data', 'label'], reader_name='Reader')
+
+
+def build_dataloader(args, mode):
+    """
+    Build a dataloader to process datasets. Only DALI dataloader is supported now.
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        mode(utils.Mode): Train or eval mode.
+
+    Returns:
+        dataloader(nvidia.dali.plugin.paddle.DALIGenericIterator):
+            Iterable output of the DALI pipeline,
+            including "data" and "label" as Paddle Tensors.
+    """
+    assert mode in Mode, "Dataset mode should be in supported Modes (train or eval)"
+    return dali_dataloader(args, mode, paddle.device.get_device())

+ 75 - 0
PaddlePaddle/Classification/RN50v1.5/export_model.py

@@ -0,0 +1,75 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import logging
+import paddle
+import program
+from dali import build_dataloader
+from utils.mode import Mode
+from utils.save_load import init_ckpt
+from utils.logger import setup_dllogger
+from utils.config import parse_args, print_args
+
+
+def main(args):
+    '''
+    Export saved model params to paddle inference model
+    '''
+    setup_dllogger(args.trt_export_log_path)
+    if args.show_config:
+        print_args(args)
+
+    eval_dataloader = build_dataloader(args, Mode.EVAL)
+
+    startup_prog = paddle.static.Program()
+    eval_prog = paddle.static.Program()
+
+    eval_fetchs, _, eval_feeds, _ = program.build(
+        args,
+        eval_prog,
+        startup_prog,
+        step_each_epoch=len(eval_dataloader),
+        is_train=False)
+    eval_prog = eval_prog.clone(for_test=True)
+
+    device = paddle.set_device('gpu')
+    exe = paddle.static.Executor(device)
+    exe.run(startup_prog)
+
+    path_to_ckpt = args.from_checkpoint
+
+    if path_to_ckpt is None:
+        logging.warning(
+            'The --from-checkpoint is not set, model weights will not be initialized.'
+        )
+    else:
+        init_ckpt(path_to_ckpt, eval_prog, exe)
+        logging.info('Checkpoint path is %s', path_to_ckpt)
+
+    save_inference_dir = args.trt_inference_dir
+    paddle.static.save_inference_model(
+        path_prefix=os.path.join(save_inference_dir, args.model_arch_name),
+        feed_vars=[eval_feeds['data']],
+        fetch_vars=[eval_fetchs['label'][0]],
+        executor=exe,
+        program=eval_prog)
+
+    logging.info('Successfully exported inference model to %s',
+                 save_inference_dir)
+
+
+if __name__ == '__main__':
+    paddle.enable_static()
+    main(parse_args(including_trt=True))

BIN
PaddlePaddle/Classification/RN50v1.5/img/loss.png


BIN
PaddlePaddle/Classification/RN50v1.5/img/top1.png


BIN
PaddlePaddle/Classification/RN50v1.5/img/top5.png


+ 213 - 0
PaddlePaddle/Classification/RN50v1.5/inference.py

@@ -0,0 +1,213 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import time
+import glob
+
+import numpy as np
+import dllogger
+
+from paddle.fluid import LoDTensor
+from paddle.inference import Config, PrecisionType, create_predictor
+
+from dali import dali_dataloader
+from utils.config import parse_args, print_args
+from utils.mode import Mode
+from utils.logger import setup_dllogger
+
+
+def init_predictor(args):
+    infer_dir = args.trt_inference_dir
+    assert os.path.isdir(
+        infer_dir), f'inference_dir = "{infer_dir}" is not a directory'
+    pdiparams_path = glob.glob(os.path.join(infer_dir, '*.pdiparams'))
+    pdmodel_path = glob.glob(os.path.join(infer_dir, '*.pdmodel'))
+    assert len(pdiparams_path) == 1, \
+        f'There should be only 1 pdiparams in {infer_dir}, but there are {len(pdiparams_path)}'
+    assert len(pdmodel_path) == 1, \
+        f'There should be only 1 pdmodel in {infer_dir}, but there are {len(pdmodel_path)}'
+    predictor_config = Config(pdmodel_path[0], pdiparams_path[0])
+    predictor_config.enable_memory_optim()
+    predictor_config.enable_use_gpu(0, 0)
+    precision = args.trt_precision
+    max_batch_size = args.batch_size
+    assert precision in ['FP32', 'FP16', 'INT8'], \
+        'precision should be FP32/FP16/INT8'
+    if precision == 'INT8':
+        precision_mode = PrecisionType.Int8
+    elif precision == 'FP16':
+        precision_mode = PrecisionType.Half
+    elif precision == 'FP32':
+        precision_mode = PrecisionType.Float32
+    else:
+        raise NotImplementedError
+    predictor_config.enable_tensorrt_engine(
+        workspace_size=args.trt_workspace_size,
+        max_batch_size=max_batch_size,
+        min_subgraph_size=args.trt_min_subgraph_size,
+        precision_mode=precision_mode,
+        use_static=args.trt_use_static,
+        use_calib_mode=args.trt_use_calib_mode)
+    predictor = create_predictor(predictor_config)
+    return predictor
+
+
+def predict(predictor, input_data):
+    '''
+    Args:
+        predictor: Paddle inference predictor
+        input_data: A list of inputs
+    Returns:
+        output_data: A list of outputs
+    '''
+    # copy image data to input tensor
+    input_names = predictor.get_input_names()
+    for i, name in enumerate(input_names):
+        input_tensor = predictor.get_input_handle(name)
+
+        if isinstance(input_data[i], LoDTensor):
+            input_tensor.share_external_data(input_data[i])
+        else:
+            input_tensor.reshape(input_data[i].shape)
+            input_tensor.copy_from_cpu(input_data[i])
+
+    # do the inference
+    predictor.run()
+
+    results = []
+    # get out data from output tensor
+    output_names = predictor.get_output_names()
+    for i, name in enumerate(output_names):
+        output_tensor = predictor.get_output_handle(name)
+        output_data = output_tensor.copy_to_cpu()
+        results.append(output_data)
+    return results
+
+
+def benchmark_dataset(args):
+    """
+    Benchmark a DALI-format dataset, which reflects the real pipeline throughput, including:
+    1. Read images
+    2. Pre-processing
+    3. Inference
+    4. H2D, D2H
+    """
+    predictor = init_predictor(args)
+
+    dali_iter = dali_dataloader(args, Mode.EVAL, 'gpu:0')
+
+    # Warmup some samples for the stable performance number
+    batch_size = args.batch_size
+    image_shape = args.image_shape
+    image = np.zeros((batch_size, *image_shape)).astype(np.single)
+    for _ in range(args.benchmark_warmup_steps):
+        predict(predictor, [image])[0]
+
+    total_images = 0
+    correct_predict = 0
+
+    latency = []
+
+    start = time.perf_counter()
+    last_time_step = time.perf_counter()
+    for dali_data in dali_iter:
+        for data in dali_data:
+            label = np.asarray(data['label'])
+            total_images += label.shape[0]
+            label = label.flatten()
+            image = data['data']
+            predict_label = predict(predictor, [image])[0]
+            correct_predict += (label == predict_label).sum()
+        batch_end_time_step = time.perf_counter()
+        batch_latency = batch_end_time_step - last_time_step
+        latency.append(batch_latency)
+        last_time_step = time.perf_counter()
+    end = time.perf_counter()
+
+    latency = np.array(latency) * 1000
+    quantile = np.quantile(latency, [0.9, 0.95, 0.99])
+
+    statistics = {
+        'precision': args.trt_precision,
+        'batch_size': batch_size,
+        'throughput': total_images / (end - start),
+        'accuracy': correct_predict / total_images,
+        'eval_latency_avg': np.mean(latency),
+        'eval_latency_p90': quantile[0],
+        'eval_latency_p95': quantile[1],
+        'eval_latency_p99': quantile[2],
+    }
+    return statistics
+
+
+def benchmark_synthat(args):
+    """
+    Benchmark on the synthatic data and bypass all pre-processing.
+    The host to device copy is still included.
+    This used to find the upper throughput bound when tunning the full input pipeline.
+    """
+
+    predictor = init_predictor(args)
+    batch_size = args.batch_size
+    image_shape = args.image_shape
+    image = np.random.random((batch_size, *image_shape)).astype(np.single)
+
+    latency = []
+
+    # warmup
+    for _ in range(args.benchmark_warmup_steps):
+        predict(predictor, [image])[0]
+
+    # benchmark
+    start = time.perf_counter()
+    last_time_step = time.perf_counter()
+    for _ in range(args.benchmark_steps):
+        predict(predictor, [image])[0]
+        batch_end_time_step = time.perf_counter()
+        batch_latency = batch_end_time_step - last_time_step
+        latency.append(batch_latency)
+        last_time_step = time.perf_counter()
+    end = time.perf_counter()
+
+    latency = np.array(latency) * 1000
+    quantile = np.quantile(latency, [0.9, 0.95, 0.99])
+
+    statistics = {
+        'precision': args.trt_precision,
+        'batch_size': batch_size,
+        'throughput': args.benchmark_steps * batch_size / (end - start),
+        'eval_latency_avg': np.mean(latency),
+        'eval_latency_p90': quantile[0],
+        'eval_latency_p95': quantile[1],
+        'eval_latency_p99': quantile[2],
+    }
+    return statistics
+
+
+def main(args):
+    setup_dllogger(args.trt_log_path)
+    if args.show_config:
+        print_args(args)
+
+    if args.trt_use_synthat:
+        statistics = benchmark_synthat(args)
+    else:
+        statistics = benchmark_dataset(args)
+
+    dllogger.log(step=tuple(), data=statistics)
+
+
+if __name__ == '__main__':
+    main(parse_args(including_trt=True))
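The latency statistics assembled in the benchmark functions above boil down to `np.quantile` over per-batch wall-clock times. A minimal sketch with hypothetical latency values (not measured numbers):

```python
import numpy as np

# Hypothetical per-batch latencies in seconds, as collected around predict() above.
latency_s = np.array([0.010, 0.012, 0.011, 0.013, 0.050, 0.012])

latency_ms = latency_s * 1000  # the script reports milliseconds
p90, p95, p99 = np.quantile(latency_ms, [0.9, 0.95, 0.99])

statistics = {
    'eval_latency_avg': float(np.mean(latency_ms)),
    'eval_latency_p90': float(p90),
    'eval_latency_p95': float(p95),
    'eval_latency_p99': float(p99),
}
```

Note how a single slow batch (50 ms) barely moves the average but dominates the tail percentiles, which is why the script logs both.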

+ 77 - 0
PaddlePaddle/Classification/RN50v1.5/lr_scheduler.py

@@ -0,0 +1,77 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import sys
+import logging
+import paddle
+
+
+class Cosine:
+    """
+    Cosine learning rate decay.
+    lr = eta_min + 0.5 * (learning_rate - eta_min) * (cos(epoch * (PI / epochs)) + 1)
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        step_each_epoch(int): The number of steps in each epoch.
+        last_epoch (int, optional):  The index of last epoch. Can be set to restart training.
+            Default: -1, meaning initial learning rate.
+    """
+
+    def __init__(self, args, step_each_epoch, last_epoch=-1):
+        super().__init__()
+        if args.warmup_epochs >= args.epochs:
+            args.warmup_epochs = args.epochs
+        self.learning_rate = args.lr
+        self.T_max = (args.epochs - args.warmup_epochs) * step_each_epoch
+        self.eta_min = 0.0
+        self.last_epoch = last_epoch
+        self.warmup_steps = round(args.warmup_epochs * step_each_epoch)
+        self.warmup_start_lr = args.warmup_start_lr
+
+    def __call__(self):
+        learning_rate = paddle.optimizer.lr.CosineAnnealingDecay(
+            learning_rate=self.learning_rate,
+            T_max=self.T_max,
+            eta_min=self.eta_min,
+            last_epoch=self.last_epoch) if self.T_max > 0 else self.learning_rate
+        if self.warmup_steps > 0:
+            learning_rate = paddle.optimizer.lr.LinearWarmup(
+                learning_rate=learning_rate,
+                warmup_steps=self.warmup_steps,
+                start_lr=self.warmup_start_lr,
+                end_lr=self.learning_rate,
+                last_epoch=self.last_epoch)
+        return learning_rate
+
+
+def build_lr_scheduler(args, step_each_epoch):
+    """
+    Build a learning rate scheduler.
+
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        step_each_epoch(int): The number of steps in each epoch.
+    return:
+        lr(paddle.optimizer.lr.LRScheduler): A learning rate scheduler.
+    """
+    # Turn last_epoch to last_step, since we update lr each step instead of each epoch.
+    last_step = args.start_epoch * step_each_epoch - 1
+    learning_rate_mod = sys.modules[__name__]
+    lr = getattr(learning_rate_mod, args.lr_scheduler)(args, step_each_epoch,
+                                                       last_step)
+    if not isinstance(lr, paddle.optimizer.lr.LRScheduler):
+        lr = lr()
+    logging.info("build lr %s success..", lr)
+    return lr
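The warmup-then-cosine schedule that `Cosine` composes from `LinearWarmup` and `CosineAnnealingDecay` can be sketched in pure Python (a hypothetical stand-in for illustration, not the Paddle implementation):

```python
import math

def cosine_with_warmup(base_lr, warmup_start_lr, warmup_steps, total_steps, step):
    """Linear warmup to base_lr, then cosine decay toward eta_min=0 (sketch)."""
    if step < warmup_steps:
        # LinearWarmup: interpolate from warmup_start_lr up to base_lr.
        return warmup_start_lr + (base_lr - warmup_start_lr) * step / warmup_steps
    # CosineAnnealingDecay over the remaining T_max steps.
    t = step - warmup_steps
    t_max = total_steps - warmup_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t / t_max))

# e.g. base lr 0.256 with 5 warmup steps out of 100 total
lrs = [cosine_with_warmup(0.256, 0.0, 5, 100, s) for s in range(100)]
```

The schedule rises linearly to the peak at the end of warmup, then decays monotonically, matching the per-step update performed by `build_lr_scheduler`.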

+ 15 - 0
PaddlePaddle/Classification/RN50v1.5/models/__init__.py

@@ -0,0 +1,15 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from .resnet import ResNet50

+ 222 - 0
PaddlePaddle/Classification/RN50v1.5/models/resnet.py

@@ -0,0 +1,222 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import math
+import paddle
+from paddle import ParamAttr
+import paddle.nn as nn
+from paddle.nn import Conv2D, BatchNorm, Linear
+from paddle.nn import AdaptiveAvgPool2D, MaxPool2D, AvgPool2D
+from paddle.nn.initializer import Uniform, Constant, KaimingNormal
+
+MODELS = ["ResNet50"]
+
+__all__ = MODELS
+
+
+class ConvBNLayer(nn.Layer):
+    def __init__(self,
+                 num_channels,
+                 num_filters,
+                 filter_size,
+                 stride=1,
+                 groups=1,
+                 act=None,
+                 lr_mult=1.0,
+                 data_format="NCHW",
+                 bn_weight_decay=True):
+        super().__init__()
+        self.act = act
+        self.conv = Conv2D(
+            in_channels=num_channels,
+            out_channels=num_filters,
+            kernel_size=filter_size,
+            stride=stride,
+            padding=(filter_size - 1) // 2,
+            groups=groups,
+            weight_attr=ParamAttr(
+                learning_rate=lr_mult, initializer=KaimingNormal()),
+            bias_attr=False,
+            data_format=data_format)
+        self.bn = BatchNorm(
+            num_filters,
+            param_attr=ParamAttr(
+                learning_rate=lr_mult,
+                regularizer=None
+                if bn_weight_decay else paddle.regularizer.L2Decay(0.0),
+                initializer=Constant(1.0)),
+            bias_attr=ParamAttr(
+                learning_rate=lr_mult,
+                regularizer=None
+                if bn_weight_decay else paddle.regularizer.L2Decay(0.0),
+                initializer=Constant(0.0)),
+            data_layout=data_format)
+        self.relu = nn.ReLU()
+
+    def forward(self, x):
+        x = self.conv(x)
+        x = self.bn(x)
+        if self.act:
+            x = self.relu(x)
+        return x
+
+
+class BottleneckBlock(nn.Layer):
+    def __init__(self,
+                 num_channels,
+                 num_filters,
+                 stride,
+                 shortcut=True,
+                 lr_mult=1.0,
+                 data_format="NCHW",
+                 bn_weight_decay=True):
+        super().__init__()
+
+        self.conv0 = ConvBNLayer(
+            num_channels=num_channels,
+            num_filters=num_filters,
+            filter_size=1,
+            act="relu",
+            lr_mult=lr_mult,
+            data_format=data_format,
+            bn_weight_decay=bn_weight_decay)
+        self.conv1 = ConvBNLayer(
+            num_channels=num_filters,
+            num_filters=num_filters,
+            filter_size=3,
+            stride=stride,
+            act="relu",
+            lr_mult=lr_mult,
+            data_format=data_format,
+            bn_weight_decay=bn_weight_decay)
+        self.conv2 = ConvBNLayer(
+            num_channels=num_filters,
+            num_filters=num_filters * 4,
+            filter_size=1,
+            act=None,
+            lr_mult=lr_mult,
+            data_format=data_format,
+            bn_weight_decay=bn_weight_decay)
+
+        if not shortcut:
+            self.short = ConvBNLayer(
+                num_channels=num_channels,
+                num_filters=num_filters * 4,
+                filter_size=1,
+                stride=stride,
+                lr_mult=lr_mult,
+                data_format=data_format,
+                bn_weight_decay=bn_weight_decay)
+        self.relu = nn.ReLU()
+        self.shortcut = shortcut
+
+    def forward(self, x):
+        identity = x
+        x = self.conv0(x)
+        x = self.conv1(x)
+        x = self.conv2(x)
+
+        if self.shortcut:
+            short = identity
+        else:
+            short = self.short(identity)
+        x = paddle.add(x=x, y=short)
+        x = self.relu(x)
+        return x
+
+
+class ResNet(nn.Layer):
+    def __init__(self,
+                 class_num=1000,
+                 data_format="NCHW",
+                 input_image_channel=3,
+                 use_pure_fp16=False,
+                 bn_weight_decay=True):
+        super().__init__()
+
+        self.class_num = class_num
+        self.num_filters = [64, 128, 256, 512]
+        self.block_depth = [3, 4, 6, 3]
+        self.num_channels = [64, 256, 512, 1024]
+        self.channels_mult = 1 if self.num_channels[-1] == 256 else 4
+        self.use_pure_fp16 = use_pure_fp16
+
+        self.stem_cfg = {
+            #num_channels, num_filters, filter_size, stride
+            "vb": [[input_image_channel, 64, 7, 2]],
+        }
+        self.stem = nn.Sequential(*[
+            ConvBNLayer(
+                num_channels=in_c,
+                num_filters=out_c,
+                filter_size=k,
+                stride=s,
+                act="relu",
+                data_format=data_format,
+                bn_weight_decay=bn_weight_decay)
+            for in_c, out_c, k, s in self.stem_cfg['vb']
+        ])
+
+        self.max_pool = MaxPool2D(
+            kernel_size=3, stride=2, padding=1, data_format=data_format)
+        block_list = []
+        for block_idx in range(len(self.block_depth)):
+            shortcut = False
+            for i in range(self.block_depth[block_idx]):
+                block_list.append(
+                    BottleneckBlock(
+                        num_channels=self.num_channels[block_idx] if i == 0
+                        else self.num_filters[block_idx] * self.channels_mult,
+                        num_filters=self.num_filters[block_idx],
+                        stride=2 if i == 0 and block_idx != 0 else 1,
+                        shortcut=shortcut,
+                        data_format=data_format,
+                        bn_weight_decay=bn_weight_decay))
+                shortcut = True
+        self.blocks = nn.Sequential(*block_list)
+
+        self.avg_pool = AdaptiveAvgPool2D(1, data_format=data_format)
+        self.flatten = nn.Flatten()
+        self.avg_pool_channels = self.num_channels[-1] * 2
+        stdv = 1.0 / math.sqrt(self.avg_pool_channels * 1.0)
+        self.fc = Linear(
+            self.avg_pool_channels,
+            self.class_num,
+            weight_attr=ParamAttr(initializer=Uniform(-stdv, stdv)))
+
+    def forward(self, x):
+        if self.use_pure_fp16:
+            with paddle.static.amp.fp16_guard():
+                x = self.stem(x)
+                x = self.max_pool(x)
+                x = self.blocks(x)
+                x = self.avg_pool(x)
+                x = self.flatten(x)
+                x = self.fc(x)
+        else:
+            x = self.stem(x)
+            x = self.max_pool(x)
+            x = self.blocks(x)
+            x = self.avg_pool(x)
+            x = self.flatten(x)
+            x = self.fc(x)
+
+        return x
+
+
+def ResNet50(**kwargs):
+    model = ResNet(**kwargs)
+    return model
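The block loop in `ResNet.__init__` encodes the standard ResNet50 stage layout. A small pure-Python trace of that bookkeeping (a sketch, not Paddle code) confirms the 3-4-6-3 depths and the 2048-channel output that feeds `avg_pool`/`fc`:

```python
# Mirror the loop in ResNet.__init__ above, recording (in_channels, out_channels, stride).
num_filters = [64, 128, 256, 512]
block_depth = [3, 4, 6, 3]
num_channels = [64, 256, 512, 1024]
channels_mult = 4  # bottleneck expansion factor

blocks = []
for block_idx in range(len(block_depth)):
    for i in range(block_depth[block_idx]):
        in_c = (num_channels[block_idx] if i == 0
                else num_filters[block_idx] * channels_mult)
        out_c = num_filters[block_idx] * channels_mult
        stride = 2 if i == 0 and block_idx != 0 else 1
        blocks.append((in_c, out_c, stride))

avg_pool_channels = num_channels[-1] * 2  # the fc input width
```

Only the first block of stages 2-4 downsamples (stride 2), and every stage's first block widens the channels, which is exactly when the `shortcut=False` projection branch is created.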

+ 64 - 0
PaddlePaddle/Classification/RN50v1.5/optimizer.py

@@ -0,0 +1,64 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import sys
+import logging
+from paddle import optimizer as optim
+
+
+class Momentum:
+    """
+    Simple Momentum optimizer with velocity state.
+
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        learning_rate(float|LRScheduler): The learning rate used to update parameters.
+            Can be a float value or a paddle.optimizer.lr.LRScheduler.
+    """
+
+    def __init__(self, args, learning_rate):
+        super().__init__()
+        self.learning_rate = learning_rate
+        self.momentum = args.momentum
+        self.weight_decay = args.weight_decay
+        self.grad_clip = None
+        self.multi_precision = args.amp
+
+    def __call__(self):
+        # model_list is None in static graph
+        parameters = None
+        opt = optim.Momentum(
+            learning_rate=self.learning_rate,
+            momentum=self.momentum,
+            weight_decay=self.weight_decay,
+            grad_clip=self.grad_clip,
+            multi_precision=self.multi_precision,
+            parameters=parameters)
+        return opt
+
+
+def build_optimizer(args, lr):
+    """
+    Build a raw optimizer with learning rate scheduler.
+
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        lr(paddle.optimizer.lr.LRScheduler): A learning rate scheduler used for training.
+    return:
+        optim(paddle.optimizer.Optimizer): A plain optimizer.
+    """
+    optimizer_mod = sys.modules[__name__]
+    opt = getattr(optimizer_mod, args.optimizer)(args, learning_rate=lr)()
+    logging.info("build optimizer %s success..", opt)
+    return opt
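`build_optimizer` resolves the optimizer class by its string name via `sys.modules[__name__]`, the same reflection pattern used in `build_lr_scheduler`. A self-contained sketch of that pattern, using a toy class rather than the real `Momentum`:

```python
import sys

class Toy:
    """Stands in for an optimizer wrapper such as Momentum above."""

    def __init__(self, lr):
        self.lr = lr

    def __call__(self):
        # The wrapper is callable and returns the configured object.
        return ('toy-optimizer', self.lr)

def build(name, lr):
    # Look the class up by name in the current module, instantiate, then call it.
    mod = sys.modules[__name__]
    return getattr(mod, name)(lr)()

opt = build('Toy', 0.1)
```

This lets `--optimizer Momentum` (or any future class added to the module) select the implementation without an explicit registry.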

+ 69 - 0
PaddlePaddle/Classification/RN50v1.5/profile.py

@@ -0,0 +1,69 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import logging
+from contextlib import contextmanager
+from utils.cuda_bind import cuda_profile_start, cuda_profile_stop
+from utils.cuda_bind import cuda_nvtx_range_push, cuda_nvtx_range_pop
+
+
+class Profiler:
+    def __init__(self):
+        super().__init__()
+        self._enable_profile = int(os.environ.get('ENABLE_PROFILE', 0))
+        self._start_step = int(os.environ.get('PROFILE_START_STEP', 0))
+        self._stop_step = int(os.environ.get('PROFILE_STOP_STEP', 0))
+
+        if self._enable_profile:
+            log_msg = f"Profiling start at {self._start_step}-th and stop at {self._stop_step}-th iteration"
+            logging.info(log_msg)
+
+    def profile_setup(self, step):
+        """
+        Setup profiling related status.
+
+        Args:
+            step (int): the index of iteration.
+        Return:
+            stop (bool): a signal to indicate whether profiling should stop or not.
+        """
+
+        if self._enable_profile and step == self._start_step:
+            cuda_profile_start()
+            logging.info("Profiling start at %d-th iteration",
+                         self._start_step)
+
+        if self._enable_profile and step == self._stop_step:
+            cuda_profile_stop()
+            logging.info("Profiling stop at %d-th iteration", self._stop_step)
+            return True
+        return False
+
+    def profile_tag_push(self, step, msg):
+        if self._enable_profile and \
+           step >= self._start_step and \
+           step < self._stop_step:
+            tag_msg = f"Iter-{step}-{msg}"
+            cuda_nvtx_range_push(tag_msg)
+
+    def profile_tag_pop(self):
+        if self._enable_profile:
+            cuda_nvtx_range_pop()
+
+    @contextmanager
+    def profile_tag(self, step, msg):
+        self.profile_tag_push(step, msg)
+        yield
+        self.profile_tag_pop()
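The Profiler is driven entirely by environment variables. A minimal stand-in that records events instead of making the CUDA calls (hypothetical, to illustrate the start/stop contract of `profile_setup`):

```python
class FakeProfiler:
    """Mirrors Profiler.profile_setup above, with an event log instead of CUDA calls."""

    def __init__(self, env):
        self.enabled = int(env.get('ENABLE_PROFILE', 0))
        self.start_step = int(env.get('PROFILE_START_STEP', 0))
        self.stop_step = int(env.get('PROFILE_STOP_STEP', 0))
        self.events = []

    def profile_setup(self, step):
        if self.enabled and step == self.start_step:
            self.events.append(('start', step))
        if self.enabled and step == self.stop_step:
            self.events.append(('stop', step))
            return True  # signal the training loop to break out
        return False

prof = FakeProfiler({'ENABLE_PROFILE': '1',
                     'PROFILE_START_STEP': '2',
                     'PROFILE_STOP_STEP': '4'})
stopped_at = None
for step in range(10):
    if prof.profile_setup(step):
        stopped_at = step
        break
```

The training loop treats a `True` return as a request to terminate early, which is how `scripts/nsys_profiling.sh` limits the capture window.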

+ 449 - 0
PaddlePaddle/Classification/RN50v1.5/program.py

@@ -0,0 +1,449 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import time
+import logging
+
+from profile import Profiler
+import numpy as np
+from optimizer import build_optimizer
+from lr_scheduler import build_lr_scheduler
+from utils.misc import AverageMeter
+from utils.mode import Mode, RunScope
+from utils.utility import get_num_trainers
+import models
+
+import dllogger
+
+import paddle
+import paddle.nn.functional as F
+from paddle.distributed import fleet
+from paddle.distributed.fleet import DistributedStrategy
+from paddle.static import sparsity
+from paddle.distributed.fleet.meta_optimizers.common import CollectiveHelper
+
+
+def create_feeds(image_shape):
+    """
+    Create a feeds mapping for the inputs of Program execution.
+
+    Args:
+        image_shape(list[int]): Model input shape, such as [4, 224, 224].
+    Returns:
+        feeds(dict): A dict mapping variables' names to their values.
+                     key (string): Name of the variable to feed.
+                     Value (paddle.static.data): The input placeholder.
+    """
+    feeds = dict()
+    feeds['data'] = paddle.static.data(
+        name="data", shape=[None] + image_shape, dtype="float32")
+    feeds['label'] = paddle.static.data(
+        name="label", shape=[None, 1], dtype="int64")
+
+    return feeds
+
+
+def create_fetchs(out, feeds, class_num, label_smoothing=0, mode=Mode.TRAIN):
+    """
+    Create fetchs to obtain specific outputs from Program execution (including loss and metrics).
+
+    Args:
+        out(variable): The model output variable.
+        feeds(dict): A dict mapping variables' names to their values
+                     (the inputs of Program execution).
+        class_num(int): The number of classes.
+        label_smoothing(float, optional): Epsilon of label smoothing. Default: 0.
+        mode(utils.Mode, optional): Train or eval mode. Default: Mode.TRAIN
+    Returns:
+        fetchs(dict): A dict of outputs from Program execution (including loss and metrics).
+                      key (string): Name of the variable to fetch.
+                      Value (tuple): (variable, AverageMeter).
+    """
+    fetchs = dict()
+    target = paddle.reshape(feeds['label'], [-1, 1])
+
+    if mode == Mode.TRAIN:
+        if label_smoothing == 0:
+            loss = F.cross_entropy(out, target)
+        else:
+            label_one_hot = F.one_hot(target, class_num)
+            soft_target = F.label_smooth(
+                label_one_hot, epsilon=label_smoothing)
+            soft_target = paddle.reshape(soft_target, shape=[-1, class_num])
+            log_softmax = -F.log_softmax(out, axis=-1)
+            loss = paddle.sum(log_softmax * soft_target, axis=-1)
+    else:
+        loss = F.cross_entropy(out, target)
+        label = paddle.argmax(out, axis=-1, dtype='int32')
+        fetchs['label'] = (label, None)
+
+    loss = loss.mean()
+
+    fetchs['loss'] = (loss, AverageMeter('loss', '7.4f', need_avg=True))
+
+    acc_top1 = paddle.metric.accuracy(input=out, label=target, k=1)
+    acc_top5 = paddle.metric.accuracy(input=out, label=target, k=5)
+    metric_dict = dict()
+    metric_dict["top1"] = acc_top1
+    metric_dict["top5"] = acc_top5
+
+    for key in metric_dict:
+        if mode != Mode.TRAIN and paddle.distributed.get_world_size() > 1:
+            paddle.distributed.all_reduce(
+                metric_dict[key], op=paddle.distributed.ReduceOp.SUM)
+            metric_dict[key] = metric_dict[
+                key] / paddle.distributed.get_world_size()
+
+        fetchs[key] = (metric_dict[key], AverageMeter(
+            key, '7.4f', need_avg=True))
+
+    return fetchs
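The label-smoothing branch in `create_fetchs` computes cross entropy against a soft target. A NumPy sketch of the same math for a single logits vector (an illustration, not the Paddle ops):

```python
import numpy as np

def smoothed_cross_entropy(logits, label, class_num, epsilon):
    # Soft target: one-hot smoothed toward uniform, as F.label_smooth does.
    one_hot = np.eye(class_num)[label]
    soft_target = one_hot * (1.0 - epsilon) + epsilon / class_num
    # Numerically stable log-softmax.
    shifted = logits - np.max(logits)
    log_softmax = shifted - np.log(np.sum(np.exp(shifted)))
    # Loss is the soft-target-weighted negative log-softmax, as in the code above.
    return -np.sum(soft_target * log_softmax)

# With epsilon=0 this reduces to ordinary cross entropy.
plain = smoothed_cross_entropy(np.array([2.0, 0.0]), 0, 2, 0.0)
smooth = smoothed_cross_entropy(np.array([2.0, 0.0]), 0, 2, 0.2)
```

For confident (peaked) logits the smoothed loss is strictly larger than the plain one, which is the regularizing effect smoothing is meant to have.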
+
+
+def create_strategy(args, is_train=True):
+    """
+    Create paddle.static.BuildStrategy and paddle.static.ExecutionStrategy with arguments.
+
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        is_train(bool, optional): Indicate whether the strategy is for training
+                                  or not. Default is True.
+    Returns:
+        build_strategy(paddle.static.BuildStrategy): An instance of BuildStrategy.
+        exec_strategy(paddle.static.ExecutionStrategy): An instance of ExecutionStrategy.
+    """
+    build_strategy = paddle.static.BuildStrategy()
+    exec_strategy = paddle.static.ExecutionStrategy()
+
+    exec_strategy.num_threads = 1
+    exec_strategy.num_iteration_per_drop_scope = (10000 if args.amp and
+                                                  args.use_pure_fp16 else 10)
+
+    paddle.set_flags({
+        'FLAGS_cudnn_exhaustive_search': True,
+        'FLAGS_conv_workspace_size_limit': 4096
+    })
+
+    if not is_train:
+        build_strategy.fix_op_run_order = True
+
+    if args.amp:
+        build_strategy.fuse_bn_act_ops = True
+        build_strategy.fuse_elewise_add_act_ops = True
+        build_strategy.fuse_bn_add_act_ops = True
+        build_strategy.enable_addto = True
+
+    return build_strategy, exec_strategy
+
+
+def dist_optimizer(args, optimizer):
+    """
+    Create a distributed optimizer based on a given optimizer.
+
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        optimizer(paddle.optimizer): A normal optimizer.
+    Returns:
+        optimizer(fleet.distributed_optimizer): A distributed optimizer.
+    """
+    build_strategy, exec_strategy = create_strategy(args)
+
+    dist_strategy = DistributedStrategy()
+    dist_strategy.execution_strategy = exec_strategy
+    dist_strategy.build_strategy = build_strategy
+
+    dist_strategy.fuse_all_reduce_ops = True
+    all_reduce_size = 16
+    dist_strategy.fuse_grad_size_in_MB = all_reduce_size
+    dist_strategy.nccl_comm_num = 1
+    dist_strategy.sync_nccl_allreduce = True
+
+    if args.amp:
+        dist_strategy.cudnn_batchnorm_spatial_persistent = True
+        dist_strategy.amp = True
+        dist_strategy.amp_configs = {
+            "init_loss_scaling": args.scale_loss,
+            "use_dynamic_loss_scaling": args.use_dynamic_loss_scaling,
+            "use_pure_fp16": args.use_pure_fp16
+        }
+
+    dist_strategy.asp = args.asp
+
+    optimizer = fleet.distributed_optimizer(optimizer, strategy=dist_strategy)
+
+    return optimizer
+
+
+def build(args, main_prog, startup_prog, step_each_epoch, is_train=True):
+    """
+    Build an executable paddle.static.Program via the following four steps:
+        1. Create feeds.
+        2. Create a model.
+        3. Create fetchs.
+        4. Create an optimizer if is_train==True.
+
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        main_prog(paddle.static.Program):The main program.
+        startup_prog(paddle.static.Program):The startup program.
+        step_each_epoch(int): The number of steps in each epoch.
+        is_train(bool, optional): Whether the main program created is for training. Default: True.
+    Returns:
+        fetchs(dict): A dict of outputs from Program execution (including loss and metrics).
+        lr_scheduler(paddle.optimizer.lr.LRScheduler): A learning rate scheduler.
+        feeds(dict): A dict mapping variables' names to their values.
+        optimizer(Optimizer): An optimizer with distributed/AMP/ASP strategy.
+    """
+    with paddle.static.program_guard(main_prog, startup_prog):
+        with paddle.utils.unique_name.guard():
+            mode = Mode.TRAIN if is_train else Mode.EVAL
+            feeds = create_feeds(args.image_shape)
+
+            model_name = args.model_arch_name
+            class_num = args.num_of_class
+            input_image_channel = args.image_channel
+            data_format = args.data_layout
+            use_pure_fp16 = args.use_pure_fp16
+            bn_weight_decay = args.bn_weight_decay
+            model = models.__dict__[model_name](
+                class_num=class_num,
+                input_image_channel=input_image_channel,
+                data_format=data_format,
+                use_pure_fp16=use_pure_fp16,
+                bn_weight_decay=bn_weight_decay)
+            out = model(feeds["data"])
+
+            fetchs = create_fetchs(
+                out, feeds, class_num, args.label_smoothing, mode=mode)
+
+            if args.asp:
+                sparsity.set_excluded_layers(main_prog, [model.fc.weight.name])
+
+            lr_scheduler = None
+            optimizer = None
+            if is_train:
+                lr_scheduler = build_lr_scheduler(args, step_each_epoch)
+                optimizer = build_optimizer(args, lr_scheduler)
+
+                optimizer = dist_optimizer(args, optimizer)
+                optimizer.minimize(fetchs['loss'][0], startup_prog)
+
+    # This is a workaround for "Communicator of ring id 0 has not been initialized.".
+    # By Paddle's design, the initialization is done inside the train program,
+    # so eval-only runs need to trigger the initialization manually.
+    if args.run_scope == RunScope.EVAL_ONLY and \
+       paddle.distributed.get_world_size() > 1:
+        collective_helper = CollectiveHelper(
+            role_maker=fleet.PaddleCloudRoleMaker(is_collective=True))
+        collective_helper.update_startup_program(startup_prog)
+
+    return fetchs, lr_scheduler, feeds, optimizer
+
+
+def compile_prog(args, program, loss_name=None, is_train=True):
+    """
+    Compile the given program, which would fuse computing ops or optimize memory footprint
+    based building strategy in config.
+
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        program(paddle.static.Program): The main program to be compiled.
+        loss_name(str, optional): The name of loss variable. Default: None.
+        is_train(bool, optional): Indicate the prupose of strategy is for
+                                  training of not. Default is True.
+    Returns:
+        compiled_program(paddle.static.CompiledProgram): A compiled program.
+    """
+    build_strategy, exec_strategy = create_strategy(args, is_train)
+
+    compiled_program = paddle.static.CompiledProgram(
+        program).with_data_parallel(
+            loss_name=loss_name,
+            build_strategy=build_strategy,
+            exec_strategy=exec_strategy)
+
+    return compiled_program
+
+
+def run(args,
+        dataloader,
+        exe,
+        program,
+        fetchs,
+        epoch,
+        mode=Mode.TRAIN,
+        lr_scheduler=None):
+    """
+    Execute program.
+
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        dataloader(nvidia.dali.plugin.paddle.DALIGenericIterator):
+                Iteratable output of NVIDIA DALI pipeline,
+                please refer to dali_dataloader in dali.py for details.
+        exe(paddle.static.Executor): A executor to run program.
+        program(paddle.static.Program): The program to be executed.
+        fetchs(dict): A dict of outputs from Program execution (included loss and measures).
+        epoch(int): Current epoch id to run.
+        mode(utils.Mode, optional): Train or eval mode. Default: Mode.TRAIN.
+        lr_scheduler(paddle.optimizer.lr.LRScheduler, optional): A learning rate scheduler.
+                                                                 Default: None.
+    Returns:
+        metrics (dict): A dictionary to collect values of metrics.
+    """
+    num_trainers = get_num_trainers()
+    fetch_list = [f[0] for f in fetchs.values()]
+    metric_dict = {"lr": AverageMeter('lr', 'f', postfix=",", need_avg=False)}
+
+    for k in fetchs:
+        if fetchs[k][1] is not None:
+            metric_dict[k] = fetchs[k][1]
+
+    metric_dict["batch_time"] = AverageMeter(
+        'batch_time', '.5f', postfix=" s,")
+    metric_dict["data_time"] = AverageMeter('data_time', '.5f', postfix=" s,")
+    metric_dict["compute_time"] = AverageMeter(
+        'compute_time', '.5f', postfix=" s,")
+
+    for m in metric_dict.values():
+        m.reset()
+
+    profiler = Profiler()
+    tic = time.perf_counter()
+
+    idx = 0
+    batch_size = None
+    latency = []
+
+    total_benchmark_steps = \
+        args.benchmark_steps + args.benchmark_warmup_steps
+
+    dataloader.reset()
+    while True:
+        # profiler.profile_setup returns True only when
+        # profiling is enabled and idx equals the stop step.
+        if profiler.profile_setup(idx):
+            break
+
+        idx += 1
+        try:
+            batch = next(dataloader)
+        except StopIteration:
+            # Reset the dataloader when benchmarking to fill the required number of steps.
+            if args.benchmark and (idx < total_benchmark_steps):
+                dataloader.reset()
+                # Reset tic timestamp to ignore exception handling time.
+                tic = time.perf_counter()
+                continue
+            break
+        except RuntimeError:
+            logging.warning(
+                "Caught RuntimeError while reading data from dataloader, trying to read once again..."
+            )
+            continue
+
+        reader_toc = time.perf_counter()
+        metric_dict['data_time'].update(reader_toc - tic)
+
+        batch_size = batch[0]["data"].shape()[0]
+        feed_dict = batch[0]
+
+        with profiler.profile_tag(idx, "Training"
+                                  if mode == Mode.TRAIN else "Evaluation"):
+            results = exe.run(program=program,
+                              feed=feed_dict,
+                              fetch_list=fetch_list)
+
+        for name, m in zip(fetchs.keys(), results):
+            if name in metric_dict:
+                metric_dict[name].update(np.mean(m), batch_size)
+        metric_dict["compute_time"].update(time.perf_counter() - reader_toc)
+        metric_dict["batch_time"].update(time.perf_counter() - tic)
+        if mode == Mode.TRAIN:
+            metric_dict['lr'].update(lr_scheduler.get_lr())
+
+        if lr_scheduler is not None:
+            with profiler.profile_tag(idx, "LR Step"):
+                lr_scheduler.step()
+
+        tic = time.perf_counter()
+
+        if idx % args.print_interval == 0:
+            log_msg = dict()
+            log_msg['loss'] = metric_dict['loss'].val.item()
+            log_msg['top1'] = metric_dict['top1'].val.item()
+            log_msg['top5'] = metric_dict['top5'].val.item()
+            log_msg['data_time'] = metric_dict['data_time'].val
+            log_msg['compute_time'] = metric_dict['compute_time'].val
+            log_msg['batch_time'] = metric_dict['batch_time'].val
+            log_msg['ips'] = \
+                batch_size * num_trainers / metric_dict['batch_time'].val
+            if mode == Mode.TRAIN:
+                log_msg['lr'] = metric_dict['lr'].val
+            log_info((epoch, idx), log_msg, mode)
+
+        if args.benchmark:
+            latency.append(metric_dict['batch_time'].val)
+            # Ignore the warmup iters
+            if idx == args.benchmark_warmup_steps:
+                metric_dict["compute_time"].reset()
+                metric_dict["data_time"].reset()
+                metric_dict["batch_time"].reset()
+                latency.clear()
+                logging.info("Begin benchmark at step %d", idx + 1)
+
+            if idx == total_benchmark_steps:
+                benchmark_data = dict()
+                benchmark_data[
+                    'ips'] = batch_size * num_trainers / metric_dict[
+                        'batch_time'].avg
+                if mode == Mode.EVAL:
+                    latency = np.array(latency) * 1000
+                    quantile = np.quantile(latency, [0.9, 0.95, 0.99])
+
+                    benchmark_data['latency_avg'] = np.mean(latency)
+                    benchmark_data['latency_p90'] = quantile[0]
+                    benchmark_data['latency_p95'] = quantile[1]
+                    benchmark_data['latency_p99'] = quantile[2]
+
+                logging.info("End benchmark at epoch step %d", idx)
+                return benchmark_data
+
+    epoch_data = dict()
+    epoch_data['loss'] = metric_dict['loss'].avg.item()
+    epoch_data['epoch_time'] = metric_dict['batch_time'].total
+    epoch_data['ips'] = batch_size * num_trainers * \
+            metric_dict["batch_time"].count / metric_dict["batch_time"].sum
+    if mode == Mode.EVAL:
+        epoch_data['top1'] = metric_dict['top1'].avg.item()
+        epoch_data['top5'] = metric_dict['top5'].avg.item()
+    log_info((epoch, ), epoch_data, mode)
+
+    return epoch_data
+
+
+def log_info(step, metrics, mode):
+    """
+    Log metrics with step and mode information.
+
+    Args:
+        step(tuple): Step id, could be (epoch-id, iter-id). Use an empty tuple() for summaries.
+        metrics(dict): A dictionary of collected metric values.
+        mode(utils.Mode): Train or eval mode.
+    """
+    prefix = 'train' if mode == Mode.TRAIN else 'val'
+    dllogger_iter_data = dict()
+    for key in metrics:
+        dllogger_iter_data[f"{prefix}.{key}"] = metrics[key]
+    dllogger.log(step=step, data=dllogger_iter_data)
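The benchmark branch of `run` above reduces per-step wall times to a throughput figure (`batch_size * num_trainers / batch_time`) and, in eval mode, to p90/p95/p99 tail latencies via `np.quantile`. A minimal, repo-independent sketch of that reduction (the function name and dict keys here mirror the code above but are otherwise illustrative):

```python
import numpy as np

def summarize_benchmark(latencies_s, batch_size, num_trainers):
    """Reduce per-step latencies (seconds) to ips and tail-latency stats (ms)."""
    lat_ms = np.asarray(latencies_s) * 1000.0  # convert to milliseconds
    p90, p95, p99 = np.quantile(lat_ms, [0.9, 0.95, 0.99])
    return {
        # images/sec across all trainers, from the mean step time
        "ips": batch_size * num_trainers / float(np.mean(latencies_s)),
        "latency_avg": float(np.mean(lat_ms)),
        "latency_p90": float(p90),
        "latency_p95": float(p95),
        "latency_p99": float(p99),
    }
```

Note that warmup steps are excluded before this reduction in the code above (`latency.clear()` at `benchmark_warmup_steps`), so the quantiles are not skewed by startup cost.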

+ 1 - 0
PaddlePaddle/Classification/RN50v1.5/requirements.txt

@@ -0,0 +1 @@
+git+https://github.com/NVIDIA/[email protected]#egg=dllogger

+ 21 - 0
PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_AMP.sh

@@ -0,0 +1,21 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+CKPT=${1:-"./output/ResNet50/89/paddle_example"}
+
+python -m paddle.distributed.launch --gpus=0 export_model.py \
+    --amp \
+    --data-layout NHWC \
+    --trt-inference-dir ./inference_amp \
+    --from-checkpoint $CKPT

+ 19 - 0
PaddlePaddle/Classification/RN50v1.5/scripts/inference/export_resnet50_TF32.sh

@@ -0,0 +1,19 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+CKPT=${1:-"./output/ResNet50/89/paddle_example"}
+
+python -m paddle.distributed.launch --gpus=0 export_model.py \
+    --trt-inference-dir ./inference_tf32 \
+    --from-checkpoint $CKPT

+ 21 - 0
PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_AMP.sh

@@ -0,0 +1,21 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+python inference.py \
+    --data-layout NHWC \
+    --trt-inference-dir ./inference_amp \
+    --trt-precision FP16 \
+    --batch-size 256 \
+    --benchmark-steps 1024 \
+    --benchmark-warmup-steps 16

+ 21 - 0
PaddlePaddle/Classification/RN50v1.5/scripts/inference/infer_resnet50_TF32.sh

@@ -0,0 +1,21 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+python inference.py \
+    --trt-inference-dir ./inference_tf32 \
+    --trt-precision FP32 \
+    --dali-num-threads 8 \
+    --batch-size 256 \
+    --benchmark-steps 1024 \
+    --benchmark-warmup-steps 16

+ 47 - 0
PaddlePaddle/Classification/RN50v1.5/scripts/nsys_profiling.sh

@@ -0,0 +1,47 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Nsys Profile Flags
+export ENABLE_PROFILE=1
+export PROFILE_START_STEP=100
+export PROFILE_STOP_STEP=105
+
+# Affinity Flags
+export GPUS_PER_NODE=8
+
+NSYS_CMD=" \
+        nsys profile --stats=true \
+        --output ./log/%p.qdrep \
+        --force-overwrite true \
+        -t cuda,nvtx,osrt,cudnn,cublas \
+        --capture-range=cudaProfilerApi \
+        --capture-range-end=stop \
+        --gpu-metrics-device=0 \
+        --sample=cpu \
+        -d 60 \
+        --kill=sigkill \
+        -x true"
+
+PADDLE_CMD=" \
+        python -m paddle.distributed.launch \
+        --gpus=0,1,2,3,4,5,6,7 \
+        train.py \
+        --epochs 1"
+
+if [[ ${ENABLE_PROFILE} -ge 1 ]]; then
+        ${NSYS_CMD} ${PADDLE_CMD}
+else
+        ${PADDLE_CMD}
+fi
+export ENABLE_PROFILE=0

+ 20 - 0
PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_90E_DGXA100.sh

@@ -0,0 +1,20 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+    --epochs 90 \
+    --amp \
+    --scale-loss 128.0 \
+    --use-dynamic-loss-scaling \
+    --data-layout NHWC

+ 26 - 0
PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_AMP_ASP_90E_DGXA100.sh

@@ -0,0 +1,26 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+CKPT=${1:-"./output/ResNet50/89/paddle_example"}
+
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py \
+  --from-pretrained-params $CKPT \
+  --epochs 90 \
+  --amp \
+  --scale-loss 128.0 \
+  --use-dynamic-loss-scaling \
+  --data-layout NHWC \
+  --asp \
+  --prune-model \
+  --mask-algo mask_1d

+ 15 - 0
PaddlePaddle/Classification/RN50v1.5/scripts/training/train_resnet50_TF32_90E_DGXA100.sh

@@ -0,0 +1,15 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+# 
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# 
+#     http://www.apache.org/licenses/LICENSE-2.0
+# 
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+python -m paddle.distributed.launch --gpus=0,1,2,3,4,5,6,7 train.py --epochs 90

+ 167 - 0
PaddlePaddle/Classification/RN50v1.5/train.py

@@ -0,0 +1,167 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import logging
+import paddle
+from paddle.distributed import fleet
+from paddle.static import sparsity
+from paddle.fluid.contrib.mixed_precision.fp16_utils import rewrite_program
+from paddle.fluid.contrib.mixed_precision.fp16_lists import AutoMixedPrecisionLists
+from dali import build_dataloader
+from utils.config import parse_args, print_args
+from utils.logger import setup_dllogger
+from utils.save_load import init_program, save_model
+from utils.affinity import set_cpu_affinity
+from utils.mode import Mode, RunScope
+import program
+
+
+class MetricSummary:
+    def __init__(self):
+        super().__init__()
+        self.metric_dict = None
+
+    def update(self, new_metrics):
+        if not self.is_updated:
+            self.metric_dict = dict()
+
+        for key in new_metrics:
+            if key in self.metric_dict:
+                # top1, top5 and ips are "larger is better"
+                if key in ['top1', 'top5', 'ips']:
+                    self.metric_dict[key] = new_metrics[key] if new_metrics[
+                        key] > self.metric_dict[key] else self.metric_dict[key]
+                # Others are "Smaller is better"
+                else:
+                    self.metric_dict[key] = new_metrics[key] if new_metrics[
+                        key] < self.metric_dict[key] else self.metric_dict[key]
+            else:
+                self.metric_dict[key] = new_metrics[key]
+
+    @property
+    def is_updated(self):
+        return self.metric_dict is not None
+
+
+def main(args):
+    """
+    An entry point to train and evaluate a ResNet50 model, in six steps.
+        1. Parse arguments from command line.
+        2. Initialize distributed training related setting, including CPU affinity.
+        3. Build dataloader via DALI.
+        4. Create training and evaluating Paddle.static.Program.
+        5. Load checkpoint or pretrained model if given.
+        6. Run program (train and evaluate with datasets, then save model if necessary).
+    """
+    setup_dllogger(args.report_file)
+    if args.show_config:
+        print_args(args)
+
+    fleet.init(is_collective=True)
+    if args.enable_cpu_affinity:
+        set_cpu_affinity()
+
+    device = paddle.set_device('gpu')
+    startup_prog = paddle.static.Program()
+
+    train_dataloader = None
+    train_prog = None
+    optimizer = None
+    if args.run_scope in [RunScope.TRAIN_EVAL, RunScope.TRAIN_ONLY]:
+        train_dataloader = build_dataloader(args, Mode.TRAIN)
+        train_step_each_epoch = len(train_dataloader)
+        train_prog = paddle.static.Program()
+
+        train_fetchs, lr_scheduler, _, optimizer = program.build(
+            args,
+            train_prog,
+            startup_prog,
+            step_each_epoch=train_step_each_epoch,
+            is_train=True)
+
+    eval_dataloader = None
+    eval_prog = None
+    if args.run_scope in [RunScope.TRAIN_EVAL, RunScope.EVAL_ONLY]:
+        eval_dataloader = build_dataloader(args, Mode.EVAL)
+        eval_step_each_epoch = len(eval_dataloader)
+        eval_prog = paddle.static.Program()
+
+        eval_fetchs, _, _, _ = program.build(
+            args,
+            eval_prog,
+            startup_prog,
+            step_each_epoch=eval_step_each_epoch,
+            is_train=False)
+        # clone to prune some content which is irrelevant in eval_prog
+        eval_prog = eval_prog.clone(for_test=True)
+
+    exe = paddle.static.Executor(device)
+    exe.run(startup_prog)
+
+    init_program(
+        args,
+        exe=exe,
+        program=train_prog if train_prog is not None else eval_prog)
+
+    if args.amp:
+        if args.run_scope == RunScope.EVAL_ONLY:
+            rewrite_program(eval_prog, amp_lists=AutoMixedPrecisionLists())
+        else:
+            optimizer.amp_init(
+                device,
+                scope=paddle.static.global_scope(),
+                test_program=eval_prog,
+                use_fp16_test=True)
+
+    if args.asp and args.prune_model:
+        logging.info("Pruning model to 2:4 sparse pattern...")
+        sparsity.prune_model(train_prog, mask_algo=args.mask_algo)
+        logging.info("Pruning model done.")
+
+    if eval_prog is not None:
+        eval_prog = program.compile_prog(args, eval_prog, is_train=False)
+
+    train_summary = MetricSummary()
+    eval_summary = MetricSummary()
+    for epoch_id in range(args.start_epoch, args.epochs):
+        # Training
+        if train_prog is not None:
+            metric_summary = program.run(args, train_dataloader, exe,
+                                         train_prog, train_fetchs, epoch_id,
+                                         Mode.TRAIN, lr_scheduler)
+            train_summary.update(metric_summary)
+
+            # Save a checkpoint
+            if epoch_id % args.save_interval == 0:
+                model_path = os.path.join(args.output_dir,
+                                          args.model_arch_name)
+                save_model(train_prog, model_path, epoch_id)
+
+        # Evaluation
+        if (eval_prog is not None) and \
+            (epoch_id % args.eval_interval == 0):
+            metric_summary = program.run(args, eval_dataloader, exe, eval_prog,
+                                         eval_fetchs, epoch_id, Mode.EVAL)
+            eval_summary.update(metric_summary)
+
+    if train_summary.is_updated:
+        program.log_info(tuple(), train_summary.metric_dict, Mode.TRAIN)
+    if eval_summary.is_updated:
+        program.log_info(tuple(), eval_summary.metric_dict, Mode.EVAL)
+
+
+if __name__ == '__main__':
+    paddle.enable_static()
+    main(parse_args())
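`MetricSummary` above keeps the best value seen per metric across epochs: larger wins for `top1`, `top5`, and `ips`; smaller wins for everything else (loss, times). The merge rule can be exercised in isolation with a plain dict (names here are illustrative, not part of the repo):

```python
def merge_best(summary, new_metrics, larger_is_better=('top1', 'top5', 'ips')):
    """Fold one epoch's metrics into a running best-value summary (in place)."""
    for key, value in new_metrics.items():
        if key not in summary:
            summary[key] = value          # first observation
        elif key in larger_is_better:
            summary[key] = max(summary[key], value)
        else:
            summary[key] = min(summary[key], value)
    return summary
```

This is why the final `program.log_info(tuple(), ...)` calls report each metric's best epoch value rather than the last epoch's.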

+ 13 - 0
PaddlePaddle/Classification/RN50v1.5/utils/__init__.py

@@ -0,0 +1,13 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

+ 214 - 0
PaddlePaddle/Classification/RN50v1.5/utils/affinity.py

@@ -0,0 +1,214 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import logging
+import paddle
+
+
+def _get_gpu_affinity_table():
+    """
+    Generate three dict objects: gpu_cpu_affinity_map, cpu_socket_gpus_list, cpu_core_groups.
+    gpu_cpu_affinity_map (dict): Key is a GPU ID and value is its cpu_affinity string.
+    cpu_socket_gpus_list (dict): Key is a cpu_affinity string and value is a list
+                                 of all GPU IDs with affinity to this cpu socket.
+    cpu_core_groups (dict):      Key is a cpu_affinity string and value is the cpu core groups.
+                                 The cpu core groups contain #GPUs groups, each with a
+                                 nearly equal number of cpu cores.
+
+    Example:
+        $ nvidia-smi topo -m
+            GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
+        GPU0     X     SYS     SYS     SYS      0-9,20-29           0
+        GPU1   SYS       X     SYS     SYS      0-9,20-29           0
+        GPU2   SYS      SYS      X     SYS      10-19,30-39         1
+        GPU3   SYS      SYS    SYS       X      10-19,30-39         1
+
+        gpu_cpu_affinity_map =
+            { 0: '0-9,20-29', # GPU0's cpu affinity is '0-9,20-29'
+              1: '0-9,20-29', # GPU1's cpu affinity is '0-9,20-29'
+              2: '10-19,30-39', # GPU2's cpu affinity is '10-19,30-39'
+              3: '10-19,30-39' } # GPU3's cpu affinity is '10-19,30-39'
+        cpu_socket_gpus_list =
+            { '0-9,20-29': [0, 1], # There are 2 GPUs, 0 and 1, belong to cpu affinity '0-9,20-29'.
+              '10-19,30-39': [2, 3] # There are 2 GPUs, 2 and 3, belong to cpu affinity '10-19,30-39'.
+            }
+        cpu_core_groups =
+            # There are 2 GPUs belong to cpu affinity '0-9,20-29', then
+            # cores [0, 1, ..., 8, 9] would be split to two groups every
+            # 2-th elements
+            # [0, 2, 4, 6, 8] and [1, 3, 5, 7, 9]
+            # The same for cores [20, 21, ..., 28, 29].
+            {'0-9,20-29': [
+                               [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]],
+                               [[20, 22, 24, 26, 28], [21, 23, 25, 27, 29]]
+                              ],
+            # The same as '0-9,20-29'
+            '10-19,30-39': [
+                            [[10, 12, 14, 16, 18], [11, 13, 15, 17, 19]],
+                            [[30, 32, 34, 36, 38], [31, 33, 35, 37, 39]]
+                           ]}
+
+    """
+    lines = os.popen('nvidia-smi topo -m').readlines()
+
+    cpu_affinity_idx = -1
+    titles = lines[0].split('\t')
+    for idx in range(len(titles)):
+        if 'CPU Affinity' in titles[idx]:
+            cpu_affinity_idx = idx
+    assert cpu_affinity_idx > 0, \
+        "Can not obtain correct CPU affinity column index via nvidia-smi!"
+
+    gpu_cpu_affinity_map = dict()
+    cpu_socket_gpus_list = dict()
+    # Skip title
+    for idx in range(1, len(lines)):
+        line = lines[idx]
+        items = line.split('\t')
+
+        if 'GPU' in items[0]:
+            gpu_id = int(items[0][3:])
+            affinity = items[cpu_affinity_idx]
+            gpu_cpu_affinity_map[gpu_id] = affinity
+            if affinity in cpu_socket_gpus_list:
+                cpu_socket_gpus_list[affinity].append(gpu_id)
+            else:
+                cpu_socket_gpus_list[affinity] = [gpu_id]
+
+    cpu_core_groups = _group_cpu_cores(cpu_socket_gpus_list)
+    return gpu_cpu_affinity_map, cpu_socket_gpus_list, cpu_core_groups
+
+
+def _group_cpu_cores(cpu_socket_gpus_list):
+    """
+    Generate a dictionary whose key is a cpu_affinity string and whose value is the cpu core groups.
+    The cpu core groups contain #GPUs groups, each with a nearly equal number of cpu cores.
+    The grouping collects cpu cores at every #GPUs-th element, to respect hyperthreading indexing.
+    For example, with 4 physical cores and 8 logical cores under hyperthreading, the CPU indices
+    [0, 1, 2, 3] are physical cores and [4, 5, 6, 7] are hyperthreads. In this case, distributing
+    physical cores first, then hyperthreads, reaches better performance.
+    Args:
+        cpu_socket_gpus_list (dict): a dict that maps a cpu_affinity string to all GPUs that belong to it.
+    Return:
+        cpu_core_groups (dict): a dict that maps a cpu_affinity string to cpu core groups.
+    Example:
+        cpu_socket_gpus_list = { '0-9,20-29': [0, 1], '10-19,30-39': [2, 3] },
+        which means there are 2 GPUs, 0 and 1, belong to '0-9,20-29' and
+        2 GPUs, 2 and 3, belong to '10-19,30-39'
+        therefore, cpu_core_groups =
+                {'0-9,20-29': [
+                               [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]],
+                               [[20, 22, 24, 26, 28], [21, 23, 25, 27, 29]]
+                              ],
+                 '10-19,30-39': [
+                                 [[10, 12, 14, 16, 18], [11, 13, 15, 17, 19]],
+                                 [[30, 32, 34, 36, 38], [31, 33, 35, 37, 39]]
+                                ]}
+
+    """
+    cpu_core_groups = dict()
+    for cpu_socket in cpu_socket_gpus_list:
+        cpu_core_groups[cpu_socket] = list()
+        gpu_count = len(cpu_socket_gpus_list[cpu_socket])
+        cores = cpu_socket.split(',')
+        for core in cores:
+            core_indices = _get_core_indices(core)
+            core_group = list()
+            for i in range(gpu_count):
+                start = i % len(core_indices)
+                sub_core_set = core_indices[start::gpu_count]
+                core_group.append(sub_core_set)
+            cpu_core_groups[cpu_socket].append(core_group)
+    return cpu_core_groups
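The strided slice `core_indices[start::gpu_count]` above deals cores out round-robin, so with one contiguous physical-core block each GPU gets an interleaved, near-equal share. A standalone rendering of that split (same slicing logic; the function name is illustrative, not from the repo):

```python
def split_cores(core_indices, gpu_count):
    """Deal cores out round-robin: GPU i takes every gpu_count-th core from offset i."""
    return [core_indices[i % len(core_indices)::gpu_count]
            for i in range(gpu_count)]
```

For the docstring's example socket `'0-9,20-29'` with 2 GPUs, `split_cores(list(range(10)), 2)` yields `[[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]`, matching the `cpu_core_groups` shown above.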
+
+
+def _get_core_indices(cores_str):
+    """
+    Generate a list of cpu core indices.
+    Args:
+        cores_str (str): a string with format "start_idx-end_idx".
+    Return:
+        cpu_core_indices (list): a list of all indices in [start_idx, end_idx].
+    Example:
+        cores_str = '0-20'
+        cpu_core_indices = [0, 1, 2, ..., 18, 19, 20]
+    """
+    start, end = cores_str.split('-')
+    return [*range(int(start), int(end) + 1)]
+
+
+def set_cpu_affinity():
+    """
+    Setup CPU affinity.
+    Each GPU would be bound to a specific set of CPU cores for optimal and stable performance.
+    This function would obtain GPU-CPU affinity via "nvidia-smi topo -m", then equally distribute
+    CPU cores to each GPU.
+    """
+
+    gpu_cpu_affinity_map, cpu_socket_gpus_list, cpu_core_groups = \
+        _get_gpu_affinity_table()
+
+    node_num = paddle.distributed.fleet.node_num()
+    gpu_per_node = paddle.distributed.get_world_size() // node_num
+    local_rank = paddle.distributed.get_rank() % gpu_per_node
+
+    # gpu_cpu_affinity_map (dict): Key is a GPU ID and value is its cpu_affinity string.
+    # cpu_socket_gpus_list (dict): Key is a cpu_affinity string and value is a list
+    #                              of all GPU IDs with affinity to this cpu socket.
+    # cpu_core_groups (dict):      Key is a cpu_affinity string and value is the cpu core groups.
+    #                              The cpu core groups contain #GPUs groups, each with a
+    #                              nearly equal number of cpu cores.
+    # Example:
+    # $ nvidia-smi topo -m
+    #        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity
+    # GPU0     X     SYS     SYS     SYS      0-9,20-29           0
+    # GPU1   SYS       X     SYS     SYS      0-9,20-29           0
+    # GPU2   SYS      SYS      X     SYS      10-19,30-39         1
+    # GPU3   SYS      SYS    SYS       X      10-19,30-39         1
+    #
+    # gpu_cpu_affinity_map =
+    #     { 0: '0-9,20-29',
+    #       1: '0-9,20-29',
+    #       2: '10-19,30-39',
+    #       3: '10-19,30-39' }
+    # cpu_socket_gpus_list =
+    #     { '0-9,20-29': [0, 1],
+    #       '10-19,30-39': [2, 3] }
+    # cpu_core_groups =
+    #     {'0-9,20-29': [
+    #                     [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]],
+    #                     [[20, 22, 24, 26, 28], [21, 23, 25, 27, 29]]
+    #                    ],
+    #       '10-19,30-39': [
+    #                        [[10, 12, 14, 16, 18], [11, 13, 15, 17, 19]],
+    #                        [[30, 32, 34, 36, 38], [31, 33, 35, 37, 39]]
+    #                       ]}
+    #
+    # Rank-0 belongs to the '0-9,20-29' cpu_affinity_key and sits at
+    # index 0 of cpu_socket_gpus_list['0-9,20-29']; therefore its
+    # affinity_mask collects all CPU cores at index 0 of each group in
+    # cpu_core_groups['0-9,20-29'], i.e. [0, 2, 4, 6, 8] and
+    # [20, 22, 24, 26, 28]:
+    # affinity_mask = [0, 2, 4, 6, 8, 20, 22, 24, 26, 28]
+    affinity_mask = list()
+    cpu_affinity_key = gpu_cpu_affinity_map[local_rank]
+    cpu_core_idx = cpu_socket_gpus_list[cpu_affinity_key].index(local_rank)
+    for cpu_core_group in cpu_core_groups[cpu_affinity_key]:
+        affinity_mask.extend(cpu_core_group[cpu_core_idx])
+
+    pid = os.getpid()
+    os.sched_setaffinity(pid, affinity_mask)
+    logging.info("Set CPU affinity of rank-%d (Process %d) "
+                 "to %s.", local_rank, pid, str(os.sched_getaffinity(pid)))
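The interleaved slice `core_indices[start::gpu_count]` is what splits a socket's cores evenly among its GPUs. A standalone sketch of the grouping logic above (the socket string and GPU count below are hypothetical example values):

```python
def get_core_indices(cores_str):
    # Expand a "start-end" range string into a list of core indices.
    start, end = cores_str.split('-')
    return list(range(int(start), int(end) + 1))


def split_cores(cpu_affinity_str, gpu_count):
    # For each core range on a socket, GPU i takes every gpu_count-th
    # core starting at offset i, giving nearly equal interleaved groups.
    groups = []
    for core_range in cpu_affinity_str.split(','):
        indices = get_core_indices(core_range)
        groups.append([indices[i % len(indices)::gpu_count]
                       for i in range(gpu_count)])
    return groups


print(split_cores('0-9,20-29', 2))
# -> [[[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]],
#     [[20, 22, 24, 26, 28], [21, 23, 25, 27, 29]]]
```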

+ 424 - 0
PaddlePaddle/Classification/RN50v1.5/utils/config.py

@@ -0,0 +1,424 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import copy
+import argparse
+import distutils.util
+import dllogger
+from utils.mode import RunScope
+from utils.utility import get_num_trainers
+
+
+def print_args(args):
+    args_for_log = copy.deepcopy(args)
+
+    # dllogger cannot serialize an Enum into JSON.
+    args_for_log.run_scope = args_for_log.run_scope.value
+
+    dllogger.log(step='PARAMETER', data=vars(args_for_log))
+
+
+def check_and_process_args(args):
+    run_scope = None
+    for scope in RunScope:
+        if args.run_scope == scope.value:
+            run_scope = scope
+            break
+    assert run_scope is not None, \
+           f"only support {[scope.value for scope in RunScope]} as run_scope"
+    args.run_scope = run_scope
+
+    args.image_channel = args.image_shape[0]
+    if args.data_layout == "NHWC":
+        args.image_shape = [
+            args.image_shape[1], args.image_shape[2], args.image_shape[0]
+        ]
+
+    args.lr = get_num_trainers() * args.lr
+
+    assert not (args.from_checkpoint is not None and \
+                args.from_pretrained_params is not None), \
+           "--from-pretrained-params and --from-checkpoint should " \
+           "not be set simultaneously."
+    args.last_epoch_of_checkpoint = -1 if args.from_checkpoint is None \
+                                     else args.last_epoch_of_checkpoint
+    args.start_epoch = 1 + args.last_epoch_of_checkpoint
+
+    if args.benchmark:
+        assert args.run_scope in [
+            RunScope.TRAIN_ONLY, RunScope.EVAL_ONLY
+        ], "If benchmark enabled, run_scope must be `train_only` or `eval_only`"
+
+    # Only run one epoch when benchmark on eval_only.
+    if args.benchmark or \
+      (args.run_scope == RunScope.EVAL_ONLY):
+        args.epochs = args.start_epoch + 1
+
+    if args.run_scope == RunScope.EVAL_ONLY:
+        args.eval_interval = 1
+
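`check_and_process_args` scales the base learning rate linearly with the number of trainers, the usual linear-scaling rule for data-parallel training. A minimal sketch, assuming the `PADDLE_TRAINERS_NUM` environment variable that `get_num_trainers` reads:

```python
import os


def scaled_lr(base_lr):
    # Mirrors args.lr = get_num_trainers() * args.lr in check_and_process_args.
    num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
    return num_trainers * base_lr


os.environ['PADDLE_TRAINERS_NUM'] = '8'   # e.g. one 8-GPU node
print(scaled_lr(0.256))  # -> 2.048
```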
+
+def add_global_args(parser):
+    group = parser.add_argument_group('Global')
+    group.add_argument(
+        '--output-dir',
+        type=str,
+        default='./output/',
+        help='A path to store trained models.')
+    group.add_argument(
+        '--run-scope',
+        default='train_eval',
+        choices=('train_eval', 'train_only', 'eval_only'),
+        help='Running scope. It should be one of {train_eval, train_only, eval_only}.'
+    )
+    group.add_argument(
+        '--epochs',
+        type=int,
+        default=90,
+        help='The number of epochs for training.')
+    group.add_argument(
+        '--save-interval',
+        type=int,
+        default=1,
+        help='The iteration interval to save checkpoints.')
+    group.add_argument(
+        '--eval-interval',
+        type=int,
+        default=1,
+        help='The iteration interval to test trained models on a given validation dataset. ' \
+             'Ignored when --run-scope is train_only.'
+    )
+    group.add_argument(
+        '--print-interval',
+        type=int,
+        default=10,
+        help='The iteration interval to show training/evaluation message.')
+    group.add_argument(
+        '--report-file',
+        type=str,
+        default='./report.json',
+        help='A file in which to store JSON experiment report.')
+    group.add_argument(
+        '--data-layout',
+        default='NCHW',
+        choices=('NCHW', 'NHWC'),
+        help='Data format. It should be one of {NCHW, NHWC}.')
+    group.add_argument(
+        '--benchmark', action='store_true', help='To enable benchmark mode.')
+    group.add_argument(
+        '--benchmark-steps',
+        type=int,
+        default=100,
+        help='Steps for the benchmark run; only applied when --benchmark is set.'
+    )
+    group.add_argument(
+        '--benchmark-warmup-steps',
+        type=int,
+        default=100,
+        help='Warmup steps for the benchmark run; only applied when --benchmark is set.'
+    )
+    group.add_argument(
+        '--from-pretrained-params',
+        type=str,
+        default=None,
+        help='A path to pretrained parameters: a file name without the .pdparams suffix. ' \
+             'It should not be set together with --from-checkpoint.'
+    )
+    group.add_argument(
+        '--from-checkpoint',
+        type=str,
+        default=None,
+        help='A checkpoint path to resume training. It should not be set ' \
+             'with --from-pretrained-params at the same time.'
+    )
+    group.add_argument(
+        '--last-epoch-of-checkpoint',
+        type=int,
+        default=-1,
+        help='The epoch id of the checkpoint given by --from-checkpoint. ' \
+             'The default -1 means training starts from the 0-th epoch.'
+    )
+    group.add_argument(
+        '--show-config',
+        type=distutils.util.strtobool,
+        default=True,
+        help='To show arguments.')
+    group.add_argument(
+        '--enable-cpu-affinity',
+        type=distutils.util.strtobool,
+        default=True,
+        help='To enable in-built GPU-CPU affinity.')
+    return parser
+
+
+def add_advance_args(parser):
+    group = parser.add_argument_group('Advanced Training')
+    # AMP
+    group.add_argument(
+        '--amp',
+        action='store_true',
+        help='Enable automatic mixed precision training (AMP).')
+    group.add_argument(
+        '--scale-loss',
+        type=float,
+        default=1.0,
+        help='The loss scaling factor for AMP training; only applied when --amp is set.'
+    )
+    group.add_argument(
+        '--use-dynamic-loss-scaling',
+        action='store_true',
+        help='Enable dynamic loss scaling in AMP training; only applied when --amp is set.'
+    )
+    group.add_argument(
+        '--use-pure-fp16',
+        action='store_true',
+        help='Enable pure FP16 training; only applied when --amp is set.')
+
+    # ASP
+    group.add_argument(
+        '--asp',
+        action='store_true',
+        help='Enable automatic sparse training (ASP).')
+    group.add_argument(
+        '--prune-model',
+        action='store_true',
+        help='Prune the model to a 2:4 sparse pattern; only applied when --asp is set.'
+    )
+    group.add_argument(
+        '--mask-algo',
+        default='mask_1d',
+        choices=('mask_1d', 'mask_2d_greedy', 'mask_2d_best'),
+        help='The algorithm to generate sparse masks. It should be one of ' \
+             '{mask_1d, mask_2d_greedy, mask_2d_best}. It is only applied ' \
+             'when both --asp and --prune-model are set.'
+    )
+    return parser
+
+
+def add_dataset_args(parser):
+    def float_list(x):
+        return list(map(float, x.split(',')))
+
+    def int_list(x):
+        return list(map(int, x.split(',')))
+
+    dataset_group = parser.add_argument_group('Dataset')
+    dataset_group.add_argument(
+        '--image-root',
+        type=str,
+        default='/imagenet',
+        help='A root folder of train/val images. It should contain train and val folders, ' \
+             'which store corresponding images.'
+    )
+    dataset_group.add_argument(
+        '--image-shape',
+        type=int_list,
+        default=[4, 224, 224],
+        help='The image shape, in [channel, height, width] order.')
+
+    # Data Loader
+    dataset_group.add_argument(
+        '--batch-size',
+        type=int,
+        default=256,
+        help='The batch size for both training and evaluation.')
+    dataset_group.add_argument(
+        '--dali-random-seed',
+        type=int,
+        default=42,
+        help='The random seed for DALI data loader.')
+    dataset_group.add_argument(
+        '--dali-num-threads',
+        type=int,
+        default=4,
+        help='The number of threads applied to DALI data loader.')
+    dataset_group.add_argument(
+        '--dali-output-fp16',
+        action='store_true',
+        help='Output FP16 data from DALI data loader.')
+
+    # Augmentation
+    augmentation_group = parser.add_argument_group('Data Augmentation')
+    augmentation_group.add_argument(
+        '--crop-size',
+        type=int,
+        default=224,
+        help='The size to crop input images.')
+    augmentation_group.add_argument(
+        '--rand-crop-scale',
+        type=float_list,
+        default=[0.08, 1.],
+        help='Range from which to choose a random area fraction.')
+    augmentation_group.add_argument(
+        '--rand-crop-ratio',
+        type=float_list,
+        default=[3.0 / 4, 4.0 / 3],
+        help='Range from which to choose a random aspect ratio (width/height).')
+    augmentation_group.add_argument(
+        '--normalize-scale',
+        type=float,
+        default=1.0 / 255.0,
+        help='A scalar to normalize images.')
+    augmentation_group.add_argument(
+        '--normalize-mean',
+        type=float_list,
+        default=[0.485, 0.456, 0.406],
+        help='The mean values to normalize RGB images.')
+    augmentation_group.add_argument(
+        '--normalize-std',
+        type=float_list,
+        default=[0.229, 0.224, 0.225],
+        help='The std values to normalize RGB images.')
+    augmentation_group.add_argument(
+        '--resize-short',
+        type=int,
+        default=256,
+        help='The length of the shorter dimension of the resized image.')
+    return parser
+
+
+def add_model_args(parser):
+    group = parser.add_argument_group('Model')
+    group.add_argument(
+        '--model-arch-name',
+        type=str,
+        default='ResNet50',
+        help='The model architecture name. It should be one of {ResNet50}.')
+    group.add_argument(
+        '--num-of-class',
+        type=int,
+        default=1000,
+        help='The number of image classes.')
+    group.add_argument(
+        '--bn-weight-decay',
+        action='store_true',
+        help='Apply weight decay to BatchNorm shift and scale.')
+    return parser
+
+
+def add_training_args(parser):
+    group = parser.add_argument_group('Training')
+    group.add_argument(
+        '--label-smoothing',
+        type=float,
+        default=0.1,
+        help='The ratio of label smoothing.')
+    group.add_argument(
+        '--optimizer',
+        default='Momentum',
+        metavar="OPTIMIZER",
+        choices=('Momentum',),
+        help='The name of optimizer. It should be one of {Momentum}.')
+    group.add_argument(
+        '--momentum',
+        type=float,
+        default=0.875,
+        help='The momentum value of optimizer.')
+    group.add_argument(
+        '--weight-decay',
+        type=float,
+        default=3.0517578125e-05,
+        help='The coefficient of weight decay.')
+    group.add_argument(
+        '--lr-scheduler',
+        default='Cosine',
+        metavar="LR_SCHEDULER",
+        choices=('Cosine',),
+        help='The name of learning rate scheduler. It should be one of {Cosine}.'
+    )
+    group.add_argument(
+        '--lr', type=float, default=0.256, help='The initial learning rate.')
+    group.add_argument(
+        '--warmup-epochs',
+        type=int,
+        default=5,
+        help='The number of epochs for learning rate warmup.')
+    group.add_argument(
+        '--warmup-start-lr',
+        type=float,
+        default=0.0,
+        help='The initial learning rate for warmup.')
+    return parser
+
+
+def add_trt_args(parser):
+    group = parser.add_argument_group('Paddle-TRT')
+    group.add_argument(
+        '--trt-inference-dir',
+        type=str,
+        default='./inference',
+        help='A path to store/load inference models. ' \
+             'export_model.py exports models to this folder; ' \
+             'inference.py then loads them from here.'
+    )
+    group.add_argument(
+        '--trt-precision',
+        default='FP32',
+        choices=('FP32', 'FP16', 'INT8'),
+        help='The precision of TensorRT. It should be one of {FP32, FP16, INT8}.'
+    )
+    group.add_argument(
+        '--trt-workspace-size',
+        type=int,
+        default=(1 << 30),
+        help='The memory workspace size of TensorRT in bytes.')
+    group.add_argument(
+        '--trt-min-subgraph-size',
+        type=int,
+        default=3,
+        help='The minimal subgraph size to enable PaddleTRT.')
+    group.add_argument(
+        '--trt-use-static',
+        type=distutils.util.strtobool,
+        default=False,
+        help='Fix TensorRT engine at first running.')
+    group.add_argument(
+        '--trt-use-calib-mode',
+        type=distutils.util.strtobool,
+        default=False,
+        help='Use the PTQ calibration of PaddleTRT int8.')
+    group.add_argument(
+        '--trt-export-log-path',
+        type=str,
+        default='./export.json',
+        help='A file in which to store JSON model exporting report.')
+    group.add_argument(
+        '--trt-log-path',
+        type=str,
+        default='./inference.json',
+        help='A file in which to store JSON inference report.')
+    group.add_argument(
+        '--trt-use-synthat',
+        type=distutils.util.strtobool,
+        default=False,
+        help='Use synthetic data for benchmarking.')
+    return parser
+
+
+def parse_args(including_trt=False):
+    parser = argparse.ArgumentParser(
+        description="PaddlePaddle RN50v1.5 training script",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+    parser = add_global_args(parser)
+    parser = add_dataset_args(parser)
+    parser = add_model_args(parser)
+    parser = add_training_args(parser)
+    parser = add_advance_args(parser)
+
+    if including_trt:
+        parser = add_trt_args(parser)
+
+    args = parser.parse_args()
+    check_and_process_args(args)
+    return args

+ 39 - 0
PaddlePaddle/Classification/RN50v1.5/utils/cuda_bind.py

@@ -0,0 +1,39 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import ctypes
+
+_cuda_home = os.environ.get('CUDA_HOME', '/usr/local/cuda')
+
+_cudart = ctypes.CDLL(os.path.join(_cuda_home, 'lib64/libcudart.so'))
+
+
+def cuda_profile_start():
+    _cudart.cudaProfilerStart()
+
+
+def cuda_profile_stop():
+    _cudart.cudaProfilerStop()
+
+
+_nvtx = ctypes.CDLL(os.path.join(_cuda_home, 'lib64/libnvToolsExt.so'))
+
+
+def cuda_nvtx_range_push(name):
+    _nvtx.nvtxRangePushW(ctypes.c_wchar_p(name))
+
+
+def cuda_nvtx_range_pop():
+    _nvtx.nvtxRangePop()

+ 59 - 0
PaddlePaddle/Classification/RN50v1.5/utils/logger.py

@@ -0,0 +1,59 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import logging
+import paddle.distributed as dist
+import dllogger
+
+
+def format_step(step):
+    """
+    Format a dllogger step into a log-message prefix.
+    Args:
+        step(str|tuple): Dllogger step format.
+    Returns:
+        s(str): String to print in log.
+    """
+    if isinstance(step, str):
+        return step
+    s = ""
+    if len(step) > 0:
+        s += f"Epoch: {step[0]} "
+    if len(step) > 1:
+        s += f"Iteration: {step[1]} "
+    if len(step) > 2:
+        s += f"Validation Iteration: {step[2]} "
+    if len(step) == 0:
+        s = "Summary:"
+    return s
+
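Reproduced standalone, `format_step` maps dllogger step tuples to human-readable prefixes:

```python
def format_step(step):
    # Mirrors utils/logger.py: build a prefix from a dllogger step.
    if isinstance(step, str):
        return step
    s = ""
    if len(step) > 0:
        s += f"Epoch: {step[0]} "
    if len(step) > 1:
        s += f"Iteration: {step[1]} "
    if len(step) > 2:
        s += f"Validation Iteration: {step[2]} "
    if len(step) == 0:
        s = "Summary:"
    return s


print(repr(format_step((1, 100))))  # -> 'Epoch: 1 Iteration: 100 '
print(format_step(()))              # -> Summary:
```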
+
+def setup_dllogger(log_file):
+    """
+    Setup logging and dllogger.
+    Args:
+        log_file(str): Path to log file.
+    """
+    logging.basicConfig(
+        level=logging.DEBUG,
+        format='{asctime}:{levelname}: {message}',
+        style='{')
+    if dist.get_rank() == 0:
+        dllogger.init(backends=[
+            dllogger.StdOutBackend(
+                dllogger.Verbosity.DEFAULT, step_format=format_step),
+            dllogger.JSONStreamBackend(dllogger.Verbosity.VERBOSE, log_file),
+        ])
+    else:
+        dllogger.init([])

+ 47 - 0
PaddlePaddle/Classification/RN50v1.5/utils/misc.py

@@ -0,0 +1,47 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+__all__ = ['AverageMeter']
+
+
+class AverageMeter:
+    """
+    A container to keep running sum, mean and last value.
+    """
+
+    def __init__(self, name='', fmt='f', postfix="", need_avg=True):
+        self.name = name
+        self.fmt = fmt
+        self.postfix = postfix
+        self.need_avg = need_avg
+        self.val = 0
+        self.avg = 0
+        self.sum = 0
+        self.count = 0
+
+    def reset(self):
+        self.val = 0
+        self.avg = 0
+        self.sum = 0
+        self.count = 0
+
+    def update(self, val, n=1):
+        self.val = val
+        self.sum += val * n
+        self.count += n
+        self.avg = self.sum / self.count
+
+    @property
+    def total(self):
+        return '{self.sum:{self.fmt}}{self.postfix}'.format(self=self)
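A usage sketch of `AverageMeter` (the loss values and batch sizes below are made up): updates are weighted by `n`, so `avg` is a per-sample running mean.

```python
class AverageMeter:
    # Condensed from utils/misc.py for illustration.
    def __init__(self):
        self.val = self.avg = self.sum = self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


meter = AverageMeter()
meter.update(1.0, n=2)   # first batch of 2 samples, mean loss 1.0
meter.update(0.5, n=2)   # second batch
print(meter.val)  # -> 0.5 (latest batch value)
print(meter.avg)  # -> 0.75 (running per-sample mean)
```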

+ 26 - 0
PaddlePaddle/Classification/RN50v1.5/utils/mode.py

@@ -0,0 +1,26 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from enum import Enum
+
+
+class Mode(Enum):
+    TRAIN = 'Train'
+    EVAL = 'Eval'
+
+
+class RunScope(Enum):
+    TRAIN_ONLY = 'train_only'
+    EVAL_ONLY = 'eval_only'
+    TRAIN_EVAL = 'train_eval'

+ 164 - 0
PaddlePaddle/Classification/RN50v1.5/utils/save_load.py

@@ -0,0 +1,164 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import errno
+import os
+import re
+import shutil
+import tempfile
+import logging
+import paddle
+
+_PDOPT_SUFFIX = '.pdopt'
+_PDPARAMS_SUFFIX = '.pdparams'
+
+
+def _mkdir_if_not_exist(path):
+    """
+    Make the directory if it does not exist, ignoring the error raised
+    when multiple processes create it at the same time.
+    """
+    if not os.path.exists(path):
+        try:
+            os.makedirs(path)
+        except OSError as e:
+            if e.errno == errno.EEXIST and os.path.isdir(path):
+                logging.warning(
+                    '%s already created by another process', path)
+            else:
+                raise OSError(f'Failed to mkdir {path}')
+
+
+def _load_state(path):
+    """
+    Load model parameters from .pdparams file.
+    Args:
+        path(str): Path to .pdparams file.
+    Returns:
+        state(dict): Dict of parameters loaded from file.
+    """
+    if os.path.exists(path + _PDOPT_SUFFIX):
+        tmp = tempfile.mkdtemp()
+        dst = os.path.join(tmp, os.path.basename(os.path.normpath(path)))
+        shutil.copy(path + _PDPARAMS_SUFFIX, dst + _PDPARAMS_SUFFIX)
+        state = paddle.static.load_program_state(dst)
+        shutil.rmtree(tmp)
+    else:
+        state = paddle.static.load_program_state(path)
+    return state
+
+
+def load_params(prog, path, ignore_params=None):
+    """
+    Load model from the given path.
+    Args:
+        prog (paddle.static.Program): Load weight to which Program object.
+        path (string): Model path.
+        ignore_params (list): Ignore variable to load when finetuning.
+    """
+    if not (os.path.isdir(path) or os.path.exists(path + _PDPARAMS_SUFFIX)):
+        raise ValueError(f"Model pretrain path {path} does not exist.")
+
+    logging.info("Loading parameters from %s...", path)
+
+    ignore_set = set()
+    state = _load_state(path)
+
+    # ignore the parameter which mismatch the shape
+    # between the model and pretrain weight.
+    all_var_shape = {}
+    for block in prog.blocks:
+        for param in block.all_parameters():
+            all_var_shape[param.name] = param.shape
+    ignore_set.update([
+        name for name, shape in all_var_shape.items()
+        if name in state and shape != state[name].shape
+    ])
+
+    if ignore_params:
+        all_var_names = [var.name for var in prog.list_vars()]
+        ignore_list = filter(
+            lambda var: any([re.match(name, var) for name in ignore_params]),
+            all_var_names)
+        ignore_set.update(list(ignore_list))
+
+    if len(ignore_set) > 0:
+        for k in ignore_set:
+            if k in state:
+                logging.warning(
+                    'variable %s is excluded and will not be loaded', k)
+                del state[k]
+
+    paddle.static.set_program_state(prog, state)
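The shape-mismatch filter in `load_params` can be isolated as follows (the parameter names and shapes are hypothetical):

```python
def mismatched_params(model_shapes, state_shapes):
    # Names present in both dicts but with differing shapes are excluded,
    # mirroring the ignore_set built in load_params.
    return {name for name, shape in model_shapes.items()
            if name in state_shapes and shape != state_shapes[name]}


model = {'fc.weight': (2048, 1000), 'fc.bias': (1000,)}
state = {'fc.weight': (2048, 10), 'fc.bias': (1000,)}  # fine-tune checkpoint
print(mismatched_params(model, state))  # -> {'fc.weight'}
```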
+
+
+def init_ckpt(path_to_ckpt, program, exe):
+    """
+    Initialize from a checkpoint at the given path.
+    Args:
+        path_to_ckpt(str): The path to files of checkpoints,
+                           including '.pdparams' and '.pdopt'.
+        program(paddle.static.Program): The program to init model.
+        exe(paddle.static.Executor): The executor to run program.
+    """
+    paddle.static.load(program, path_to_ckpt, exe)
+    logging.info("Finished initializing from the checkpoint %s", path_to_ckpt)
+
+
+def init_pretrained(path_to_pretrained, program):
+    """
+    Initialize from pretrained parameters at the given path.
+    Args:
+        path_to_pretrained(str): The path to file of pretrained model.
+        program(paddle.static.Program): The program to init model.
+    """
+    if not isinstance(path_to_pretrained, list):
+        pretrained_model = [path_to_pretrained]
+    else:
+        pretrained_model = path_to_pretrained
+    for pretrain in pretrained_model:
+        load_params(program, pretrain)
+    logging.info("Finished initializing pretrained parameters from %s",
+                 pretrained_model)
+
+
+def init_program(args, program, exe):
+    """
+    Initialize from a given checkpoint or pretrained parameters.
+    Args:
+        args(Namespace): Arguments obtained from ArgumentParser.
+        program(paddle.static.Program): The program to init model.
+        exe(paddle.static.Executor): The executor to run program.
+    """
+    if args.from_checkpoint is not None:
+        init_ckpt(args.from_checkpoint, program, exe)
+        logging.info("Training will start at the %d-th epoch",
+                     args.start_epoch)
+    elif args.from_pretrained_params is not None:
+        init_pretrained(args.from_pretrained_params, program)
+
+
+def save_model(program, model_path, epoch_id, prefix='paddle_example'):
+    """
+    Save a model to given path.
+    Args:
+        program(paddle.static.Program): The program to be saved.
+        model_path(str): The path to save model.
+        epoch_id(int): The current epoch id.
+        prefix(str): The prefix of model files.
+    """
+    if paddle.distributed.get_rank() != 0:
+        return
+    model_path = os.path.join(model_path, str(epoch_id))
+    _mkdir_if_not_exist(model_path)
+    model_prefix = os.path.join(model_path, prefix)
+    paddle.static.save(program, model_prefix)
+    logging.info("Saved model to %s", model_path)
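`save_model` writes checkpoints under `<model_path>/<epoch_id>/<prefix>`, and the rank guard ensures only rank 0 touches the filesystem. The path construction alone can be sketched like this (the output directory name below is made up):

```python
import os


def model_prefix(model_path, epoch_id, prefix='paddle_example'):
    # Mirrors the path logic in save_model, without paddle.static.save.
    return os.path.join(model_path, str(epoch_id), prefix)


print(model_prefix('./output/ResNet50', 3))
# -> ./output/ResNet50/3/paddle_example
```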

+ 25 - 0
PaddlePaddle/Classification/RN50v1.5/utils/utility.py

@@ -0,0 +1,25 @@
+# Copyright (c) 2022 NVIDIA Corporation.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+
+
+def get_num_trainers():
+    num_trainers = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
+    return num_trainers
+
+
+def get_trainer_id():
+    trainer_id = int(os.environ.get('PADDLE_TRAINER_ID', 0))
+    return trainer_id