
[ConvNets/PyT] Adding support for Ampere and 20.06 container

Przemek Strzelczyk, 5 years ago
parent
commit
46ff3707e0
49 changed files, with 1590 additions and 1240 deletions
  1. PyTorch/Classification/ConvNets/Dockerfile (+1, -1)
  2. PyTorch/Classification/ConvNets/README.md (+41, -32)
  3. PyTorch/Classification/ConvNets/checkpoint2model.py (+1, -1)
  4. PyTorch/Classification/ConvNets/classify.py (+6, -3)
  5. PyTorch/Classification/ConvNets/image_classification/dataloaders.py (+245, -105)
  6. PyTorch/Classification/ConvNets/image_classification/logger.py (+36, -35)
  7. PyTorch/Classification/ConvNets/image_classification/mixup.py (+14, -12)
  8. PyTorch/Classification/ConvNets/image_classification/resnet.py (+173, -122)
  9. PyTorch/Classification/ConvNets/image_classification/smoothing.py (+2, -0)
  10. PyTorch/Classification/ConvNets/image_classification/training.py (+264, -217)
  11. PyTorch/Classification/ConvNets/image_classification/utils.py (+16, -14)
  12. PyTorch/Classification/ConvNets/main.py (+336, -270)
  13. PyTorch/Classification/ConvNets/multiproc.py (+57, -33)
  14. PyTorch/Classification/ConvNets/requirements.txt (+1, -0)
  15. PyTorch/Classification/ConvNets/resnet50v1.5/README.md (+126, -124)
  16. PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGX1_RN50_AMP_250E.sh (+1, -1)
  17. PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGX1_RN50_AMP_50E.sh (+1, -1)
  18. PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGX1_RN50_AMP_90E.sh (+1, -1)
  19. PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGX2_RN50_AMP_250E.sh (+1, -1)
  20. PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGX2_RN50_AMP_50E.sh (+1, -1)
  21. PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGX2_RN50_AMP_90E.sh (+1, -1)
  22. PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGXA100_RN50_AMP_90E.sh (+1, -0)
  23. PyTorch/Classification/ConvNets/resnet50v1.5/training/FP16/DGX1_RN50_FP16_250E.sh (+0, -1)
  24. PyTorch/Classification/ConvNets/resnet50v1.5/training/FP16/DGX1_RN50_FP16_50E.sh (+0, -1)
  25. PyTorch/Classification/ConvNets/resnet50v1.5/training/FP16/DGX1_RN50_FP16_90E.sh (+0, -1)
  26. PyTorch/Classification/ConvNets/resnet50v1.5/training/FP16/DGX2_RN50_FP16_250E.sh (+0, -1)
  27. PyTorch/Classification/ConvNets/resnet50v1.5/training/FP16/DGX2_RN50_FP16_50E.sh (+0, -1)
  28. PyTorch/Classification/ConvNets/resnet50v1.5/training/FP16/DGX2_RN50_FP16_90E.sh (+0, -1)
  29. PyTorch/Classification/ConvNets/resnet50v1.5/training/FP32/DGX1_RN50_FP32_250E.sh (+1, -1)
  30. PyTorch/Classification/ConvNets/resnet50v1.5/training/FP32/DGX1_RN50_FP32_50E.sh (+1, -1)
  31. PyTorch/Classification/ConvNets/resnet50v1.5/training/FP32/DGX1_RN50_FP32_90E.sh (+1, -1)
  32. PyTorch/Classification/ConvNets/resnet50v1.5/training/FP32/DGX2_RN50_FP32_250E.sh (+1, -1)
  33. PyTorch/Classification/ConvNets/resnet50v1.5/training/FP32/DGX2_RN50_FP32_50E.sh (+1, -1)
  34. PyTorch/Classification/ConvNets/resnet50v1.5/training/FP32/DGX2_RN50_FP32_90E.sh (+1, -1)
  35. PyTorch/Classification/ConvNets/resnet50v1.5/training/TF32/DGXA100_RN50_TF32_90E.sh (+1, -0)
  36. PyTorch/Classification/ConvNets/resnext101-32x4d/README.md (+123, -123)
  37. PyTorch/Classification/ConvNets/resnext101-32x4d/training/AMP/DGX1_RNXT101-32x4d_AMP_250E.sh (+1, -1)
  38. PyTorch/Classification/ConvNets/resnext101-32x4d/training/AMP/DGX1_RNXT101-32x4d_AMP_90E.sh (+1, -1)
  39. PyTorch/Classification/ConvNets/resnext101-32x4d/training/AMP/DGXA100_RNXT101-32x4d_AMP_90E.sh (+1, -0)
  40. PyTorch/Classification/ConvNets/resnext101-32x4d/training/FP32/DGX1_RNXT101-32x4d_FP32_250E.sh (+1, -1)
  41. PyTorch/Classification/ConvNets/resnext101-32x4d/training/FP32/DGX1_RNXT101-32x4d_FP32_90E.sh (+1, -1)
  42. PyTorch/Classification/ConvNets/resnext101-32x4d/training/TF32/DGXA100_RNXT101-32x4d_TF32_90E.sh (+1, -0)
  43. PyTorch/Classification/ConvNets/se-resnext101-32x4d/README.md (+122, -122)
  44. PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/AMP/DGX1_SE-RNXT101-32x4d_AMP_250E.sh (+1, -1)
  45. PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/AMP/DGX1_SE-RNXT101-32x4d_AMP_90E.sh (+1, -1)
  46. PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/AMP/DGXA100_SE-RNXT101-32x4d_AMP_90E.sh (+1, -0)
  47. PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/FP32/DGX1_SE-RNXT101-32x4d_FP32_250E.sh (+1, -1)
  48. PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/FP32/DGX1_SE-RNXT101-32x4d_FP32_90E.sh (+1, -1)
  49. PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/TF32/DGXA100_SE-RNXT101-32x4d_TF32_90E.sh (+1, -0)

+ 1 - 1
PyTorch/Classification/ConvNets/Dockerfile

@@ -1,4 +1,4 @@
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.10-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.06-py3
 FROM ${FROM_IMAGE_NAME}
 
 ADD requirements.txt /workspace/

+ 41 - 32
PyTorch/Classification/ConvNets/README.md

@@ -2,13 +2,16 @@
 
 In this repository you will find implementations of various image classification models.
 
+Detailed information on each model can be found here:
+
 ## Table Of Contents
 
 * [Models](#models)
 * [Validation accuracy results](#validation-accuracy-results)
 * [Training performance results](#training-performance-results)
-  * [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-(8x-v100-16G))
-  * [Training performance: NVIDIA DGX-2 (16x V100 32G)](#training-performance-nvidia-dgx-2-(16x-v100-32G))
+  * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
+  * [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
+  * [Training performance: NVIDIA DGX-2 (16x V100 32GB)](#training-performance-nvidia-dgx-2-16x-v100-32gb)
 * [Model comparison](#model-comparison)
   * [Accuracy vs FLOPS](#accuracy-vs-flops)
   * [Latency vs Throughput on different batch sizes](#latency-vs-throughput-on-different-batch-sizes)
@@ -25,14 +28,14 @@ The following table provides links to where you can find additional information
 
 ## Validation accuracy results
 
-Our results were obtained by running the applicable 
-training scripts in the [framework-container-name] NGC container 
-on NVIDIA DGX-1 with (8x V100 16G) GPUs. 
-The specific training script that was run is documented 
+Our results were obtained by running the applicable
+training scripts in the [framework-container-name] NGC container
+on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
+The specific training script that was run is documented
 in the corresponding model's README.
 
 
-The following table shows the validation accuracy results of the 
+The following table shows the validation accuracy results of the
 three classification models side-by-side.
 
 
@@ -45,48 +48,54 @@ three classification models side-by-side.
 
 ## Training performance results
 
-
-### Training performance: NVIDIA DGX-1 (8x V100 16G)
+### Training performance: NVIDIA DGX A100 (8x A100 40GB)
 
 
-Our results were obtained by running the applicable 
-training scripts in the pytorch-19.10 NGC container 
-on NVIDIA DGX-1 with (8x V100 16G) GPUs. 
-Performance numbers (in images per second) 
+Our results were obtained by running the applicable
+training scripts in the pytorch-20.06 NGC container
+on NVIDIA DGX A100 with (8x A100 40GB) GPUs.
+Performance numbers (in images per second)
 were averaged over an entire training epoch.
-The specific training script that was run is documented 
+The specific training script that was run is documented
 in the corresponding model's README.
 
-The following table shows the training accuracy results of the 
+The following table shows the training accuracy results of the
 three classification models side-by-side.
 
 
-| **arch** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** |
-|:-:|:-:|:-:|:-:|
-| resnet50 | 6888.75 img/s | 2945.37 img/s | 2.34x |
-| resnext101-32x4d | 2384.85 img/s | 1116.58 img/s | 2.14x |
-| se-resnext101-32x4d | 2031.17 img/s | 977.45 img/s | 2.08x |
+|      **arch**       | **Mixed Precision** |   **TF32**    | **Mixed Precision Speedup** |
+|:-------------------:|:-------------------:|:-------------:|:---------------------------:|
+|      resnet50       |    9488.39 img/s    | 5322.10 img/s |            1.78x            |
+|  resnext101-32x4d   |    6758.98 img/s    | 2353.25 img/s |            2.87x            |
+| se-resnext101-32x4d |    4670.72 img/s    | 2011.21 img/s |            2.32x            |
+
+ResNeXt and SE-ResNeXt use [NHWC data layout](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) when training using Mixed Precision,
+which improves the model performance. We are currently working on adding it for ResNet.
 
-### Training performance: NVIDIA DGX-2 (16x V100 32G)
 
+### Training performance: NVIDIA DGX-1 16G (8x V100 16GB)
 
-Our results were obtained by running the applicable 
-training scripts in the pytorch-19.10 NGC container 
-on NVIDIA DGX-2 with (16x V100 32G) GPUs. 
-Performance numbers (in images per second) 
+
+Our results were obtained by running the applicable
+training scripts in the pytorch-20.06 NGC container
+on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
+Performance numbers (in images per second)
 were averaged over an entire training epoch.
-The specific training script that was run is documented 
+The specific training script that was run is documented
 in the corresponding model's README.
 
-The following table shows the training accuracy results of the 
+The following table shows the training accuracy results of the
 three classification models side-by-side.
 
 
-| **arch** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** |
-|:-:|:-:|:-:|:-:|
-| resnet50 | 13443.82 img/s | 6263.41 img/s | 2.15x |
-| resnext101-32x4d | 4473.37 img/s | 2261.97 img/s | 1.98x |
-| se-resnext101-32x4d | 3776.03 img/s | 1953.13 img/s | 1.93x |
+|      **arch**       | **Mixed Precision** |   **FP32**    | **Mixed Precision Speedup** |
+|:-------------------:|:-------------------:|:-------------:|:---------------------------:|
+|      resnet50       |    6565.61 img/s    | 2869.19 img/s |            2.29x            |
+|  resnext101-32x4d   |    3922.74 img/s    | 1136.30 img/s |            3.45x            |
+| se-resnext101-32x4d |    2651.13 img/s    | 982.78 img/s  |            2.70x            |
+
+ResNeXt and SE-ResNeXt use [NHWC data layout](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) when training using Mixed Precision,
+which improves the model performance. We are currently working on adding it for ResNet.
 
 
 ## Model Comparison
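The NHWC note added to the README can be illustrated without the full training stack. Below is a toy sketch of the layout idea using NumPy strides; NumPy stands in for `tensor.contiguous(memory_format=torch.channels_last)` here only to keep the sketch dependency-free (the linked PyTorch tutorial covers the real API).

```python
import numpy as np

# Toy illustration of NCHW vs NHWC ("channels last") layouts via strides.
x_nchw = np.zeros((2, 3, 4, 4), dtype=np.float32)            # N, C, H, W
x_nhwc = np.ascontiguousarray(x_nchw.transpose(0, 2, 3, 1))  # N, H, W, C

# In NHWC the channel dimension is the fastest-varying one, the layout that
# mixed-precision convolution kernels prefer.
print(x_nchw.strides)  # (192, 64, 16, 4): channel stride spans a full H*W plane
print(x_nhwc.strides)  # (192, 48, 12, 4): channel stride is one element
```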

+ 1 - 1
PyTorch/Classification/ConvNets/checkpoint2model.py

@@ -33,7 +33,7 @@ if __name__ == "__main__":
     checkpoint = torch.load(args.checkpoint_path)
 
     model_state_dict = {
-        k[len("module.1.") :] if "module.1." in k else k: v
+        k[len("module.") :] if "module." in k else k: v
         for k, v in checkpoint["state_dict"].items()
     }
 
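The one-line change above (dropping the extra `.1` from the key prefix) is easier to see in isolation. A minimal, torch-free sketch of the same renaming, with made-up keys standing in for real checkpoint entries:

```python
# Strip a DataParallel-style "module." prefix from checkpoint keys, mirroring
# the dict comprehension in checkpoint2model.py (plain ints stand in for the
# weight tensors of checkpoint["state_dict"]).
def strip_module_prefix(state_dict, prefix="module."):
    return {
        k[len(prefix):] if prefix in k else k: v
        for k, v in state_dict.items()
    }

ckpt = {"module.conv1.weight": 1, "fc.bias": 2}
print(strip_module_prefix(ckpt))  # {'conv1.weight': 1, 'fc.bias': 2}
```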

+ 6 - 3
PyTorch/Classification/ConvNets/classify.py

@@ -59,7 +59,7 @@ def add_parser_arguments(parser):
 
 def main(args):
     imgnet_classes = np.array(json.load(open("./LOC_synset_mapping.json", "r")))
-    model = models.build_resnet(args.arch, args.model_config, verbose=False)
+    model = models.build_resnet(args.arch, args.model_config, 1000, verbose=False)
 
     if args.weights is not None:
         weights = torch.load(args.weights)
@@ -67,13 +67,16 @@ def main(args):
 
     model = model.cuda()
 
-    if args.precision == "FP16":
+    if args.precision in ["AMP", "FP16"]:
         model = network_to_half(model)
 
+
     model.eval()
 
     with torch.no_grad():
-        input = load_jpeg_from_file(args.image, cuda=True, fp16=args.precision!='FP32')
+        input = load_jpeg_from_file(
+            args.image, cuda=True, fp16=args.precision != "FP32"
+        )
 
         output = torch.nn.functional.softmax(model(input), dim=1).cpu().view(-1).numpy()
         top5 = np.argsort(output)[-5:][::-1]
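The top-5 selection in the hunk above is plain NumPy once the softmax output is on the CPU. A standalone sketch with a made-up probability vector:

```python
import numpy as np

# argsort sorts ascending, so the five largest entries are the last five
# indices; reversing puts the best class first, as classify.py expects.
output = np.array([0.05, 0.4, 0.1, 0.2, 0.15, 0.02, 0.08])
top5 = np.argsort(output)[-5:][::-1]
print(top5)  # [1 3 4 2 6]
```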

+ 245 - 105
PyTorch/Classification/ConvNets/image_classification/dataloaders.py

@@ -33,17 +33,21 @@ import numpy as np
 import torchvision.datasets as datasets
 import torchvision.transforms as transforms
 from PIL import Image
+from functools import partial
 
-DATA_BACKEND_CHOICES = ['pytorch', 'syntetic']
+DATA_BACKEND_CHOICES = ["pytorch", "syntetic"]
 try:
     from nvidia.dali.plugin.pytorch import DALIClassificationIterator
     from nvidia.dali.pipeline import Pipeline
     import nvidia.dali.ops as ops
     import nvidia.dali.types as types
-    DATA_BACKEND_CHOICES.append('dali-gpu')
-    DATA_BACKEND_CHOICES.append('dali-cpu')
+
+    DATA_BACKEND_CHOICES.append("dali-gpu")
+    DATA_BACKEND_CHOICES.append("dali-cpu")
 except ImportError:
-    print("Please install DALI from https://www.github.com/NVIDIA/DALI to run this example.")
+    print(
+        "Please install DALI from https://www.github.com/NVIDIA/DALI to run this example."
+    )
 
 
 def load_jpeg_from_file(path, cuda=True, fp16=False):
@@ -76,8 +80,12 @@ def load_jpeg_from_file(path, cuda=True, fp16=False):
 
 
 class HybridTrainPipe(Pipeline):
-    def __init__(self, batch_size, num_threads, device_id, data_dir, crop, dali_cpu=False):
-        super(HybridTrainPipe, self).__init__(batch_size, num_threads, device_id, seed = 12 + device_id)
+    def __init__(
+        self, batch_size, num_threads, device_id, data_dir, crop, dali_cpu=False
+    ):
+        super(HybridTrainPipe, self).__init__(
+            batch_size, num_threads, device_id, seed=12 + device_id
+        )
         if torch.distributed.is_initialized():
             rank = torch.distributed.get_rank()
             world_size = torch.distributed.get_world_size()
@@ -86,10 +94,11 @@ class HybridTrainPipe(Pipeline):
             world_size = 1
 
         self.input = ops.FileReader(
-                file_root = data_dir,
-                shard_id = rank,
-                num_shards = world_size,
-                random_shuffle = True)
+            file_root=data_dir,
+            shard_id=rank,
+            num_shards=world_size,
+            random_shuffle=True,
+        )
 
         if dali_cpu:
             dali_device = "cpu"
@@ -98,37 +107,47 @@ class HybridTrainPipe(Pipeline):
             dali_device = "gpu"
             # This padding sets the size of the internal nvJPEG buffers to be able to handle all images from full-sized ImageNet
             # without additional reallocations
-            self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB, device_memory_padding=211025920, host_memory_padding=140544512)
+            self.decode = ops.ImageDecoder(
+                device="mixed",
+                output_type=types.RGB,
+                device_memory_padding=211025920,
+                host_memory_padding=140544512,
+            )
 
         self.res = ops.RandomResizedCrop(
-                device=dali_device,
-                size=[crop, crop],
-                interp_type=types.INTERP_LINEAR,
-                random_aspect_ratio=[0.75, 4./3.],
-                random_area=[0.08, 1.0],
-                num_attempts=100)
-
-        self.cmnp = ops.CropMirrorNormalize(device = "gpu",
-                                            output_dtype = types.FLOAT,
-                                            output_layout = types.NCHW,
-                                            crop = (crop, crop),
-                                            image_type = types.RGB,
-                                            mean = [0.485 * 255,0.456 * 255,0.406 * 255],
-                                            std = [0.229 * 255,0.224 * 255,0.225 * 255])
-        self.coin = ops.CoinFlip(probability = 0.5)
+            device=dali_device,
+            size=[crop, crop],
+            interp_type=types.INTERP_LINEAR,
+            random_aspect_ratio=[0.75, 4.0 / 3.0],
+            random_area=[0.08, 1.0],
+            num_attempts=100,
+        )
+
+        self.cmnp = ops.CropMirrorNormalize(
+            device="gpu",
+            output_dtype=types.FLOAT,
+            output_layout=types.NCHW,
+            crop=(crop, crop),
+            image_type=types.RGB,
+            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
+            std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
+        )
+        self.coin = ops.CoinFlip(probability=0.5)
 
     def define_graph(self):
         rng = self.coin()
-        self.jpegs, self.labels = self.input(name = "Reader")
+        self.jpegs, self.labels = self.input(name="Reader")
         images = self.decode(self.jpegs)
         images = self.res(images)
-        output = self.cmnp(images.gpu(), mirror = rng)
+        output = self.cmnp(images.gpu(), mirror=rng)
         return [output, self.labels]
 
 
 class HybridValPipe(Pipeline):
     def __init__(self, batch_size, num_threads, device_id, data_dir, crop, size):
-        super(HybridValPipe, self).__init__(batch_size, num_threads, device_id, seed = 12 + device_id)
+        super(HybridValPipe, self).__init__(
+            batch_size, num_threads, device_id, seed=12 + device_id
+        )
         if torch.distributed.is_initialized():
             rank = torch.distributed.get_rank()
             world_size = torch.distributed.get_world_size()
@@ -137,23 +156,26 @@ class HybridValPipe(Pipeline):
             world_size = 1
 
         self.input = ops.FileReader(
-                file_root = data_dir,
-                shard_id = rank,
-                num_shards = world_size,
-                random_shuffle = False)
-
-        self.decode = ops.ImageDecoder(device = "mixed", output_type = types.RGB)
-        self.res = ops.Resize(device = "gpu", resize_shorter = size)
-        self.cmnp = ops.CropMirrorNormalize(device = "gpu",
-                output_dtype = types.FLOAT,
-                output_layout = types.NCHW,
-                crop = (crop, crop),
-                image_type = types.RGB,
-                mean = [0.485 * 255,0.456 * 255,0.406 * 255],
-                std = [0.229 * 255,0.224 * 255,0.225 * 255])
+            file_root=data_dir,
+            shard_id=rank,
+            num_shards=world_size,
+            random_shuffle=False,
+        )
+
+        self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
+        self.res = ops.Resize(device="gpu", resize_shorter=size)
+        self.cmnp = ops.CropMirrorNormalize(
+            device="gpu",
+            output_dtype=types.FLOAT,
+            output_layout=types.NCHW,
+            crop=(crop, crop),
+            image_type=types.RGB,
+            mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
+            std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
+        )
 
     def define_graph(self):
-        self.jpegs, self.labels = self.input(name = "Reader")
+        self.jpegs, self.labels = self.input(name="Reader")
         images = self.decode(self.jpegs)
         images = self.res(images)
         output = self.cmnp(images)
@@ -161,25 +183,39 @@ class HybridValPipe(Pipeline):
 
 
 class DALIWrapper(object):
-    def gen_wrapper(dalipipeline, num_classes, one_hot):
+    def gen_wrapper(dalipipeline, num_classes, one_hot, memory_format):
         for data in dalipipeline:
-            input = data[0]["data"]
+            input = data[0]["data"].contiguous(memory_format=memory_format)
             target = torch.reshape(data[0]["label"], [-1]).cuda().long()
             if one_hot:
                 target = expand(num_classes, torch.float, target)
             yield input, target
         dalipipeline.reset()
 
-    def __init__(self, dalipipeline, num_classes, one_hot):
+    def __init__(self, dalipipeline, num_classes, one_hot, memory_format):
         self.dalipipeline = dalipipeline
-        self.num_classes =  num_classes
+        self.num_classes = num_classes
         self.one_hot = one_hot
+        self.memory_format = memory_format
 
     def __iter__(self):
-        return DALIWrapper.gen_wrapper(self.dalipipeline, self.num_classes, self.one_hot)
+        return DALIWrapper.gen_wrapper(
+                self.dalipipeline, self.num_classes, self.one_hot, self.memory_format
+        )
+
 
 def get_dali_train_loader(dali_cpu=False):
-    def gdtl(data_path, batch_size, num_classes, one_hot, workers=5, _worker_init_fn=None, fp16=False):
+    def gdtl(
+        data_path,
+        batch_size,
+        num_classes,
+        one_hot,
+        start_epoch=0,
+        workers=5,
+        _worker_init_fn=None,
+        fp16=False,
+        memory_format=torch.contiguous_format,
+    ):
         if torch.distributed.is_initialized():
             rank = torch.distributed.get_rank()
             world_size = torch.distributed.get_world_size()
@@ -187,22 +223,41 @@ def get_dali_train_loader(dali_cpu=False):
             rank = 0
             world_size = 1
 
-        traindir = os.path.join(data_path, 'train')
+        traindir = os.path.join(data_path, "train")
 
-        pipe = HybridTrainPipe(batch_size=batch_size, num_threads=workers,
-                device_id = rank % torch.cuda.device_count(),
-                data_dir = traindir, crop = 224, dali_cpu=dali_cpu)
+        pipe = HybridTrainPipe(
+            batch_size=batch_size,
+            num_threads=workers,
+            device_id=rank % torch.cuda.device_count(),
+            data_dir=traindir,
+            crop=224,
+            dali_cpu=dali_cpu,
+        )
 
         pipe.build()
-        train_loader = DALIClassificationIterator(pipe, size = int(pipe.epoch_size("Reader") / world_size))
+        train_loader = DALIClassificationIterator(
+            pipe, size=int(pipe.epoch_size("Reader") / world_size)
+        )
 
-        return DALIWrapper(train_loader, num_classes, one_hot), int(pipe.epoch_size("Reader") / (world_size * batch_size))
+        return (
+            DALIWrapper(train_loader, num_classes, one_hot, memory_format),
+            int(pipe.epoch_size("Reader") / (world_size * batch_size)),
+        )
 
     return gdtl
 
 
 def get_dali_val_loader():
-    def gdvl(data_path, batch_size, num_classes, one_hot, workers=5, _worker_init_fn=None, fp16=False):
+    def gdvl(
+        data_path,
+        batch_size,
+        num_classes,
+        one_hot,
+        workers=5,
+        _worker_init_fn=None,
+        fp16=False,
+        memory_format=torch.contiguous_format,
+    ):
         if torch.distributed.is_initialized():
             rank = torch.distributed.get_rank()
             world_size = torch.distributed.get_world_size()
@@ -210,30 +265,41 @@ def get_dali_val_loader():
             rank = 0
             world_size = 1
 
-        valdir = os.path.join(data_path, 'val')
+        valdir = os.path.join(data_path, "val")
 
-        pipe = HybridValPipe(batch_size=batch_size, num_threads=workers,
-                device_id = rank % torch.cuda.device_count(),
-                data_dir = valdir,
-                crop = 224, size = 256)
+        pipe = HybridValPipe(
+            batch_size=batch_size,
+            num_threads=workers,
+            device_id=rank % torch.cuda.device_count(),
+            data_dir=valdir,
+            crop=224,
+            size=256,
+        )
 
         pipe.build()
-        val_loader = DALIClassificationIterator(pipe, size = int(pipe.epoch_size("Reader") / world_size))
+        val_loader = DALIClassificationIterator(
+            pipe, size=int(pipe.epoch_size("Reader") / world_size)
+        )
+
+        return (
+            DALIWrapper(val_loader, num_classes, one_hot, memory_format),
+            int(pipe.epoch_size("Reader") / (world_size * batch_size)),
+        )
 
-        return DALIWrapper(val_loader, num_classes, one_hot), int(pipe.epoch_size("Reader") / (world_size * batch_size))
     return gdvl
 
 
-def fast_collate(batch):
+def fast_collate(memory_format, batch):
     imgs = [img[0] for img in batch]
     targets = torch.tensor([target[1] for target in batch], dtype=torch.int64)
     w = imgs[0].size[0]
     h = imgs[0].size[1]
-    tensor = torch.zeros( (len(imgs), 3, h, w), dtype=torch.uint8 )
+    tensor = torch.zeros((len(imgs), 3, h, w), dtype=torch.uint8).contiguous(
+        memory_format=memory_format
+    )
     for i, img in enumerate(imgs):
         nump_array = np.asarray(img, dtype=np.uint8)
-        tens = torch.from_numpy(nump_array)
-        if(nump_array.ndim < 3):
+        if nump_array.ndim < 3:
             nump_array = np.expand_dims(nump_array, axis=-1)
         nump_array = np.rollaxis(nump_array, 2)
 
@@ -243,14 +309,25 @@ def fast_collate(batch):
 
 
 def expand(num_classes, dtype, tensor):
-    e = torch.zeros(tensor.size(0), num_classes, dtype=dtype, device=torch.device('cuda'))
+    e = torch.zeros(
+        tensor.size(0), num_classes, dtype=dtype, device=torch.device("cuda")
+    )
     e = e.scatter(1, tensor.unsqueeze(1), 1.0)
     return e
 
+
 class PrefetchedWrapper(object):
     def prefetched_loader(loader, num_classes, fp16, one_hot):
-        mean = torch.tensor([0.485 * 255, 0.456 * 255, 0.406 * 255]).cuda().view(1,3,1,1)
-        std = torch.tensor([0.229 * 255, 0.224 * 255, 0.225 * 255]).cuda().view(1,3,1,1)
+        mean = (
+            torch.tensor([0.485 * 255, 0.456 * 255, 0.406 * 255])
+            .cuda()
+            .view(1, 3, 1, 1)
+        )
+        std = (
+            torch.tensor([0.229 * 255, 0.224 * 255, 0.225 * 255])
+            .cuda()
+            .view(1, 3, 1, 1)
+        )
         if fp16:
             mean = mean.half()
             std = std.half()
@@ -284,30 +361,46 @@ class PrefetchedWrapper(object):
 
         yield input, target
 
-    def __init__(self, dataloader, num_classes, fp16, one_hot):
+    def __init__(self, dataloader, start_epoch, num_classes, fp16, one_hot):
         self.dataloader = dataloader
         self.fp16 = fp16
-        self.epoch = 0
+        self.epoch = start_epoch
         self.one_hot = one_hot
         self.num_classes = num_classes
 
     def __iter__(self):
-        if (self.dataloader.sampler is not None and
-            isinstance(self.dataloader.sampler,
-                       torch.utils.data.distributed.DistributedSampler)):
+        if self.dataloader.sampler is not None and isinstance(
+            self.dataloader.sampler, torch.utils.data.distributed.DistributedSampler
+        ):
 
             self.dataloader.sampler.set_epoch(self.epoch)
         self.epoch += 1
-        return PrefetchedWrapper.prefetched_loader(self.dataloader, self.num_classes, self.fp16, self.one_hot)
-
-def get_pytorch_train_loader(data_path, batch_size, num_classes, one_hot, workers=5, _worker_init_fn=None, fp16=False):
-    traindir = os.path.join(data_path, 'train')
+        return PrefetchedWrapper.prefetched_loader(
+            self.dataloader, self.num_classes, self.fp16, self.one_hot
+        )
+
+    def __len__(self):
+        return len(self.dataloader)
+
+
+def get_pytorch_train_loader(
+    data_path,
+    batch_size,
+    num_classes,
+    one_hot,
+    start_epoch=0,
+    workers=5,
+    _worker_init_fn=None,
+    fp16=False,
+    memory_format=torch.contiguous_format,
+):
+    traindir = os.path.join(data_path, "train")
     train_dataset = datasets.ImageFolder(
-            traindir,
-            transforms.Compose([
-                transforms.RandomResizedCrop(224),
-                transforms.RandomHorizontalFlip(),
-                ]))
+        traindir,
+        transforms.Compose(
+            [transforms.RandomResizedCrop(224), transforms.RandomHorizontalFlip()]
+        ),
+    )
 
     if torch.distributed.is_initialized():
         train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
@@ -315,18 +408,37 @@ def get_pytorch_train_loader(data_path, batch_size, num_classes, one_hot, worker
         train_sampler = None
 
     train_loader = torch.utils.data.DataLoader(
-            train_dataset, batch_size=batch_size, shuffle=(train_sampler is None),
-            num_workers=workers, worker_init_fn=_worker_init_fn, pin_memory=True, sampler=train_sampler, collate_fn=fast_collate, drop_last=True)
+        train_dataset,
+        batch_size=batch_size,
+        shuffle=(train_sampler is None),
+        num_workers=workers,
+        worker_init_fn=_worker_init_fn,
+        pin_memory=True,
+        sampler=train_sampler,
+        collate_fn=partial(fast_collate, memory_format),
+        drop_last=True,
+    )
+
+    return (
+        PrefetchedWrapper(train_loader, start_epoch, num_classes, fp16, one_hot),
+        len(train_loader),
+    )
 
-    return PrefetchedWrapper(train_loader, num_classes, fp16, one_hot), len(train_loader)
 
-def get_pytorch_val_loader(data_path, batch_size, num_classes, one_hot, workers=5, _worker_init_fn=None, fp16=False):
-    valdir = os.path.join(data_path, 'val')
+def get_pytorch_val_loader(
+    data_path,
+    batch_size,
+    num_classes,
+    one_hot,
+    workers=5,
+    _worker_init_fn=None,
+    fp16=False,
+    memory_format=torch.contiguous_format,
+):
+    valdir = os.path.join(data_path, "val")
     val_dataset = datasets.ImageFolder(
-            valdir, transforms.Compose([
-                transforms.Resize(256),
-                transforms.CenterCrop(224),
-                ]))
+        valdir, transforms.Compose([transforms.Resize(256), transforms.CenterCrop(224)])
+    )
 
     if torch.distributed.is_initialized():
         val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset)
@@ -334,23 +446,40 @@ def get_pytorch_val_loader(data_path, batch_size, num_classes, one_hot, workers=
         val_sampler = None
 
     val_loader = torch.utils.data.DataLoader(
-            val_dataset,
-            sampler=val_sampler,
-            batch_size=batch_size, shuffle=False,
-            num_workers=workers, worker_init_fn=_worker_init_fn, pin_memory=True,
-            collate_fn=fast_collate)
+        val_dataset,
+        sampler=val_sampler,
+        batch_size=batch_size,
+        shuffle=False,
+        num_workers=workers,
+        worker_init_fn=_worker_init_fn,
+        pin_memory=True,
+        collate_fn=partial(fast_collate, memory_format),
+    )
+
+    return PrefetchedWrapper(val_loader, 0, num_classes, fp16, one_hot), len(val_loader)
 
-    return PrefetchedWrapper(val_loader, num_classes, fp16, one_hot), len(val_loader)
 
 class SynteticDataLoader(object):
-    def __init__(self, fp16, batch_size, num_classes, num_channels, height, width, one_hot):
-        input_data = torch.empty(batch_size, num_channels, height, width).cuda().normal_(0, 1.0)
+    def __init__(
+        self,
+        fp16,
+        batch_size,
+        num_classes,
+        num_channels,
+        height,
+        width,
+        one_hot,
+        memory_format=torch.contiguous_format,
+    ):
+        input_data = (
+            torch.empty(batch_size, num_channels, height, width).contiguous(memory_format=memory_format).cuda().normal_(0, 1.0)
+        )
         if one_hot:
             input_target = torch.empty(batch_size, num_classes).cuda()
             input_target[:, 0] = 1.0
         else:
             input_target = torch.randint(0, num_classes, (batch_size,))
-        input_target=input_target.cuda()
+        input_target = input_target.cuda()
         if fp16:
             input_data = input_data.half()
 
@@ -361,5 +490,16 @@ class SynteticDataLoader(object):
         while True:
             yield self.input_data, self.input_target
 
-def get_syntetic_loader(data_path, batch_size, num_classes, one_hot, workers=None, _worker_init_fn=None, fp16=False):
-    return SynteticDataLoader(fp16, batch_size, 1000, 3, 224, 224, one_hot), -1
+
+def get_syntetic_loader(
+    data_path,
+    batch_size,
+    num_classes,
+    one_hot,
+    start_epoch=0,
+    workers=None,
+    _worker_init_fn=None,
+    fp16=False,
+    memory_format=torch.contiguous_format,
+):
+    return SynteticDataLoader(fp16, batch_size, num_classes, 3, 224, 224, one_hot, memory_format=memory_format), -1
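The `SynteticDataLoader` above materializes a single random batch once (now honoring the new `memory_format` argument) and yields it forever, which isolates data-pipeline cost from compute when benchmarking. A framework-free sketch of the same idea, with NumPy standing in for CUDA tensors and all names illustrative:

```python
import numpy as np

class SyntheticLoader:
    """Yield the same random (input, target) batch indefinitely."""

    def __init__(self, batch_size, num_classes, num_channels=3,
                 height=224, width=224, seed=0):
        rng = np.random.default_rng(seed)
        # Generated once up front -- iteration then has zero data cost.
        self.input_data = rng.normal(
            0.0, 1.0, size=(batch_size, num_channels, height, width)
        ).astype(np.float32)
        self.input_target = rng.integers(0, num_classes, size=batch_size)

    def __iter__(self):
        while True:
            yield self.input_data, self.input_target

x, y = next(iter(SyntheticLoader(batch_size=4, num_classes=10)))
```

Because every iteration returns the same in-memory batch, throughput measured with this loader reflects model compute alone.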

+ 36 - 35
PyTorch/Classification/ConvNets/image_classification/logger.py

@@ -56,6 +56,7 @@ LAT_100 = lambda: Meter(QuantileMeter(1), QuantileMeter(1), QuantileMeter(1))
 LAT_99 = lambda: Meter(QuantileMeter(0.99), QuantileMeter(0.99), QuantileMeter(0.99))
 LAT_95 = lambda: Meter(QuantileMeter(0.95), QuantileMeter(0.95), QuantileMeter(0.95))
 
+
 class Meter(object):
     def __init__(self, iteration_aggregator, epoch_aggregator, run_aggregator):
         self.run_aggregator = run_aggregator
@@ -113,7 +114,7 @@ class QuantileMeter(object):
     def get_val(self):
         if not self.vals:
             return None, self.n
-        return np.quantile(self.vals, self.q, interpolation='nearest'), self.n
+        return np.quantile(self.vals, self.q, interpolation="nearest"), self.n
 
     def get_data(self):
         return self.vals, self.n
@@ -206,8 +207,8 @@ class AverageMeter(object):
 
 
 class Logger(object):
-    def __init__(self, print_interval, backends, verbose=False):
-        self.epoch = -1
+    def __init__(self, print_interval, backends, start_epoch=-1, verbose=False):
+        self.epoch = start_epoch
         self.iteration = -1
         self.val_iteration = -1
         self.metrics = OrderedDict()
@@ -222,11 +223,11 @@ class Logger(object):
     def register_metric(self, metric_name, meter, verbosity=0, metadata={}):
         if self.verbose:
             print("Registering metric: {}".format(metric_name))
-        self.metrics[metric_name] = {'meter': meter, 'level': verbosity}
+        self.metrics[metric_name] = {"meter": meter, "level": verbosity}
         dllogger.metadata(metric_name, metadata)
 
     def log_metric(self, metric_name, val, n=1):
-        self.metrics[metric_name]['meter'].record(val, n=n)
+        self.metrics[metric_name]["meter"].record(val, n=n)
 
     def start_iteration(self, val=False):
         if val:
@@ -236,29 +237,28 @@ class Logger(object):
 
     def end_iteration(self, val=False):
         it = self.val_iteration if val else self.iteration
-        if (it % self.print_interval == 0):
+        if it % self.print_interval == 0:
             metrics = {
-                n: m
-                for n, m in self.metrics.items() if n.startswith('val') == val
+                n: m for n, m in self.metrics.items() if n.startswith("val") == val
             }
-            step = (self.epoch,
-                    self.iteration) if not val else (self.epoch,
-                                                     self.iteration,
-                                                     self.val_iteration)
+            step = (
+                (self.epoch, self.iteration)
+                if not val
+                else (self.epoch, self.iteration, self.val_iteration)
+            )
 
-            verbositys = {m['level'] for _, m in metrics.items()}
+            verbositys = {m["level"] for _, m in metrics.items()}
             for ll in verbositys:
-                llm = {n: m for n, m in metrics.items() if m['level'] == ll}
+                llm = {n: m for n, m in metrics.items() if m["level"] == ll}
 
-                dllogger.log(step=step,
-                         data={
-                             n: m['meter'].get_iteration()
-                             for n, m in llm.items()
-                         },
-                         verbosity=ll)
+                dllogger.log(
+                    step=step,
+                    data={n: m["meter"].get_iteration() for n, m in llm.items()},
+                    verbosity=ll,
+                )
 
             for n, m in metrics.items():
-                m['meter'].reset_iteration()
+                m["meter"].reset_iteration()
 
             dllogger.flush()
 
@@ -268,32 +268,33 @@ class Logger(object):
         self.val_iteration = 0
 
         for n, m in self.metrics.items():
-            m['meter'].reset_epoch()
+            m["meter"].reset_epoch()
 
     def end_epoch(self):
         for n, m in self.metrics.items():
-            m['meter'].reset_iteration()
+            m["meter"].reset_iteration()
 
-        verbositys = {m['level'] for _, m in self.metrics.items()}
+        verbositys = {m["level"] for _, m in self.metrics.items()}
         for ll in verbositys:
-            llm = {n: m for n, m in self.metrics.items() if m['level'] == ll}
-            dllogger.log(step=(self.epoch, ),
-                     data={n: m['meter'].get_epoch()
-                           for n, m in llm.items()})
+            llm = {n: m for n, m in self.metrics.items() if m["level"] == ll}
+            dllogger.log(
+                step=(self.epoch,),
+                data={n: m["meter"].get_epoch() for n, m in llm.items()},
+            )
 
     def end(self):
         for n, m in self.metrics.items():
-            m['meter'].reset_epoch()
+            m["meter"].reset_epoch()
 
-        verbositys = {m['level'] for _, m in self.metrics.items()}
+        verbositys = {m["level"] for _, m in self.metrics.items()}
         for ll in verbositys:
-            llm = {n: m for n, m in self.metrics.items() if m['level'] == ll}
-            dllogger.log(step=tuple(),
-                     data={n: m['meter'].get_run()
-                           for n, m in llm.items()})
+            llm = {n: m for n, m in self.metrics.items() if m["level"] == ll}
+            dllogger.log(
+                step=tuple(), data={n: m["meter"].get_run() for n, m in llm.items()}
+            )
 
         for n, m in self.metrics.items():
-            m['meter'].reset_epoch()
+            m["meter"].reset_epoch()
 
         dllogger.flush()
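The `QuantileMeter` backing the `LAT_95`/`LAT_99`/`LAT_100` meters reduces recorded latencies with `np.quantile(..., interpolation="nearest")`; note that NumPy 1.22+ renames this keyword to `method=`. A keyword-free sketch of the nearest-rank reduction it performs (illustrative, not the meter itself):

```python
import numpy as np

def nearest_quantile(vals, q):
    """Nearest-rank quantile over a list of recorded values."""
    s = np.sort(np.asarray(vals))
    idx = int(round(q * (len(s) - 1)))  # nearest index into the sorted sample
    return s[idx]

lat = [0.10, 0.11, 0.12, 0.13, 0.50]  # per-batch latencies in seconds, illustrative
p95 = nearest_quantile(lat, 0.95)
p50 = nearest_quantile(lat, 0.5)
```

The tail quantiles (p95/p99) surface latency outliers that a plain mean would hide, which is why the logger tracks them separately.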
 

+ 14 - 12
PyTorch/Classification/ConvNets/image_classification/mixup.py

@@ -16,35 +16,37 @@ import torch.nn as nn
 import numpy as np
 
 
-def mixup(alpha, num_classes, data, target):
+def mixup(alpha, data, target):
     with torch.no_grad():
         bs = data.size(0)
         c = np.random.beta(alpha, alpha)
 
         perm = torch.randperm(bs).cuda()
 
-        md = c * data + (1-c) * data[perm, :]
-        mt = c * target + (1-c) * target[perm, :]
+        md = c * data + (1 - c) * data[perm, :]
+        mt = c * target + (1 - c) * target[perm, :]
         return md, mt
 
 
 class MixUpWrapper(object):
-    def __init__(self, alpha, num_classes, dataloader):
+    def __init__(self, alpha, dataloader):
         self.alpha = alpha
         self.dataloader = dataloader
-        self.num_classes = num_classes
 
     def mixup_loader(self, loader):
         for input, target in loader:
-            i, t = mixup(self.alpha, self.num_classes, input, target)
+            i, t = mixup(self.alpha, input, target)
             yield i, t
 
     def __iter__(self):
         return self.mixup_loader(self.dataloader)
 
+    def __len__(self):
+        return len(self.dataloader)
+
 
 class NLLMultiLabelSmooth(nn.Module):
-    def __init__(self, smoothing = 0.0):
+    def __init__(self, smoothing=0.0):
         super(NLLMultiLabelSmooth, self).__init__()
         self.confidence = 1.0 - smoothing
         self.smoothing = smoothing
@@ -53,15 +55,15 @@ class NLLMultiLabelSmooth(nn.Module):
         if self.training:
             x = x.float()
             target = target.float()
-            logprobs = torch.nn.functional.log_softmax(x, dim = -1)
-    
+            logprobs = torch.nn.functional.log_softmax(x, dim=-1)
+
             nll_loss = -logprobs * target
             nll_loss = nll_loss.sum(-1)
-    
+
             smooth_loss = -logprobs.mean(dim=-1)
-    
+
             loss = self.confidence * nll_loss + self.smoothing * smooth_loss
-    
+
             return loss.mean()
         else:
             return torch.nn.functional.cross_entropy(x, target)
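The `mixup` function above draws a coefficient `c ~ Beta(alpha, alpha)`, permutes the batch, and blends both inputs and one-hot targets; `num_classes` was dropped from the signature because it was unused. A NumPy sketch of the same blend (a seeded `rng` replaces the CUDA `randperm` for determinism; names are illustrative):

```python
import numpy as np

def mixup(alpha, data, target, rng=None):
    """Blend each sample with a random partner from the same batch."""
    rng = rng or np.random.default_rng(0)
    c = rng.beta(alpha, alpha)           # mixing coefficient
    perm = rng.permutation(len(data))    # random partner for each sample
    mixed_data = c * data + (1 - c) * data[perm]
    mixed_target = c * target + (1 - c) * target[perm]  # expects one-hot targets
    return mixed_data, mixed_target

x = np.random.default_rng(1).normal(size=(8, 3, 4, 4)).astype(np.float32)
t = np.eye(10)[np.arange(8) % 10]        # one-hot labels
mx, mt = mixup(0.2, x, t)
```

Since each mixed target is a convex combination of two one-hot vectors, every row of `mt` still sums to 1, which is why the soft-target `NLLMultiLabelSmooth` loss above is used instead of plain cross-entropy.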

+ 173 - 122
PyTorch/Classification/ConvNets/image_classification/resnet.py

@@ -32,32 +32,43 @@ import torch
 import torch.nn as nn
 import numpy as np
 
-__all__ = ['ResNet', 'build_resnet', 'resnet_versions', 'resnet_configs']
+__all__ = ["ResNet", "build_resnet", "resnet_versions", "resnet_configs"]
 
 # ResNetBuilder {{{
 
+
 class ResNetBuilder(object):
     def __init__(self, version, config):
-        self.conv3x3_cardinality = 1 if 'cardinality' not in version.keys() else version['cardinality']
+        self.conv3x3_cardinality = (
+            1 if "cardinality" not in version.keys() else version["cardinality"]
+        )
         self.config = config
 
     def conv(self, kernel_size, in_planes, out_planes, groups=1, stride=1):
         conv = nn.Conv2d(
-                in_planes, out_planes,
-                kernel_size=kernel_size, groups=groups,
-                stride=stride, padding=int((kernel_size - 1)/2),
-                bias=False)
-
-        if self.config['nonlinearity'] == 'relu': 
-            nn.init.kaiming_normal_(conv.weight,
-                    mode=self.config['conv_init'],
-                    nonlinearity=self.config['nonlinearity'])
+            in_planes,
+            out_planes,
+            kernel_size=kernel_size,
+            groups=groups,
+            stride=stride,
+            padding=int((kernel_size - 1) / 2),
+            bias=False,
+        )
+
+        if self.config["nonlinearity"] == "relu":
+            nn.init.kaiming_normal_(
+                conv.weight,
+                mode=self.config["conv_init"],
+                nonlinearity=self.config["nonlinearity"],
+            )
 
         return conv
 
     def conv3x3(self, in_planes, out_planes, stride=1):
         """3x3 convolution with padding"""
-        c = self.conv(3, in_planes, out_planes, groups=self.conv3x3_cardinality, stride=stride)
+        c = self.conv(
+            3, in_planes, out_planes, groups=self.conv3x3_cardinality, stride=stride
+        )
         return c
 
     def conv1x1(self, in_planes, out_planes, stride=1):
@@ -77,14 +88,15 @@ class ResNetBuilder(object):
 
     def batchnorm(self, planes, last_bn=False):
         bn = nn.BatchNorm2d(planes)
-        gamma_init_val = 0 if last_bn and self.config['last_bn_0_init'] else 1
+        gamma_init_val = 0 if last_bn and self.config["last_bn_0_init"] else 1
         nn.init.constant_(bn.weight, gamma_init_val)
         nn.init.constant_(bn.bias, 0)
 
         return bn
 
     def activation(self):
-        return self.config['activation']()
+        return self.config["activation"]()
+
 
 # ResNetBuilder }}}
 
@@ -95,8 +107,8 @@ class BasicBlock(nn.Module):
         self.conv1 = builder.conv3x3(inplanes, planes, stride)
         self.bn1 = builder.batchnorm(planes)
         self.relu = builder.activation()
-        self.conv2 = builder.conv3x3(planes, planes*expansion)
-        self.bn2 = builder.batchnorm(planes*expansion, last_bn=True)
+        self.conv2 = builder.conv3x3(planes, planes * expansion)
+        self.bn2 = builder.batchnorm(planes * expansion, last_bn=True)
         self.downsample = downsample
         self.stride = stride
 
@@ -121,6 +133,8 @@ class BasicBlock(nn.Module):
         out = self.relu(out)
 
         return out
+
+
 # BasicBlock }}}
 
 # SqueezeAndExcitation {{{
@@ -142,11 +156,22 @@ class SqueezeAndExcitation(nn.Module):
 
         return out
 
+
 # }}}
 
 # Bottleneck {{{
 class Bottleneck(nn.Module):
-    def __init__(self, builder, inplanes, planes, expansion, stride=1, se=False, se_squeeze=16, downsample=None):
+    def __init__(
+        self,
+        builder,
+        inplanes,
+        planes,
+        expansion,
+        stride=1,
+        se=False,
+        se_squeeze=16,
+        downsample=None,
+    ):
         super(Bottleneck, self).__init__()
         self.conv1 = builder.conv1x1(inplanes, planes)
         self.bn1 = builder.batchnorm(planes)
@@ -157,7 +182,9 @@ class Bottleneck(nn.Module):
         self.relu = builder.activation()
         self.downsample = downsample
         self.stride = stride
-        self.squeeze = SqueezeAndExcitation(planes*expansion, se_squeeze) if se else None
+        self.squeeze = (
+            SqueezeAndExcitation(planes * expansion, se_squeeze) if se else None
+        )
 
     def forward(self, x):
         residual = x
@@ -185,8 +212,20 @@ class Bottleneck(nn.Module):
 
         return out
 
+
 def SEBottleneck(builder, inplanes, planes, expansion, stride=1, downsample=None):
-    return Bottleneck(builder, inplanes, planes, expansion, stride=stride, se=True, se_squeeze=16, downsample=downsample)
+    return Bottleneck(
+        builder,
+        inplanes,
+        planes,
+        expansion,
+        stride=stride,
+        se=True,
+        se_squeeze=16,
+        downsample=downsample,
+    )
+
+
 # Bottleneck }}}
 
 # ResNet {{{
@@ -199,17 +238,22 @@ class ResNet(nn.Module):
         self.relu = builder.activation()
         self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
         self.layer1 = self._make_layer(builder, block, expansion, widths[0], layers[0])
-        self.layer2 = self._make_layer(builder, block, expansion, widths[1], layers[1], stride=2)
-        self.layer3 = self._make_layer(builder, block, expansion, widths[2], layers[2], stride=2)
-        self.layer4 = self._make_layer(builder, block, expansion, widths[3], layers[3], stride=2)
+        self.layer2 = self._make_layer(
+            builder, block, expansion, widths[1], layers[1], stride=2
+        )
+        self.layer3 = self._make_layer(
+            builder, block, expansion, widths[2], layers[2], stride=2
+        )
+        self.layer4 = self._make_layer(
+            builder, block, expansion, widths[3], layers[3], stride=2
+        )
         self.avgpool = nn.AdaptiveAvgPool2d(1)
         self.fc = nn.Linear(widths[3] * expansion, num_classes)
 
     def _make_layer(self, builder, block, expansion, planes, blocks, stride=1):
         downsample = None
         if stride != 1 or self.inplanes != planes * expansion:
-            dconv = builder.conv1x1(self.inplanes, planes * expansion,
-                                    stride=stride)
+            dconv = builder.conv1x1(self.inplanes, planes * expansion, stride=stride)
             dbn = builder.batchnorm(planes * expansion)
             if dbn is not None:
                 downsample = nn.Sequential(dconv, dbn)
@@ -217,7 +261,16 @@ class ResNet(nn.Module):
                 downsample = dconv
 
         layers = []
-        layers.append(block(builder, self.inplanes, planes, expansion, stride=stride, downsample=downsample))
+        layers.append(
+            block(
+                builder,
+                self.inplanes,
+                planes,
+                expansion,
+                stride=stride,
+                downsample=downsample,
+            )
+        )
         self.inplanes = planes * expansion
         for i in range(1, blocks):
             layers.append(block(builder, self.inplanes, planes, expansion))
@@ -241,102 +294,97 @@ class ResNet(nn.Module):
         x = self.fc(x)
 
         return x
+
+
 # ResNet }}}
 
 resnet_configs = {
-        'classic' : {
-            'conv' : nn.Conv2d,
-            'conv_init' : 'fan_out',
-            'nonlinearity' : 'relu',
-            'last_bn_0_init' : False,
-            'activation' : lambda: nn.ReLU(inplace=True),
-            },
-        'fanin' : {
-            'conv' : nn.Conv2d,
-            'conv_init' : 'fan_in',
-            'nonlinearity' : 'relu',
-            'last_bn_0_init' : False,
-            'activation' : lambda: nn.ReLU(inplace=True),
-            },
-        'grp-fanin' : {
-            'conv' : nn.Conv2d,
-            'conv_init' : 'fan_in',
-            'nonlinearity' : 'relu',
-            'last_bn_0_init' : False,
-            'activation' : lambda: nn.ReLU(inplace=True),
-            },
-        'grp-fanout' : {
-            'conv' : nn.Conv2d,
-            'conv_init' : 'fan_out',
-            'nonlinearity' : 'relu',
-            'last_bn_0_init' : False,
-            'activation' : lambda: nn.ReLU(inplace=True),
-            },
-        }
+    "classic": {
+        "conv": nn.Conv2d,
+        "conv_init": "fan_out",
+        "nonlinearity": "relu",
+        "last_bn_0_init": False,
+        "activation": lambda: nn.ReLU(inplace=True),
+    },
+    "fanin": {
+        "conv": nn.Conv2d,
+        "conv_init": "fan_in",
+        "nonlinearity": "relu",
+        "last_bn_0_init": False,
+        "activation": lambda: nn.ReLU(inplace=True),
+    },
+    "grp-fanin": {
+        "conv": nn.Conv2d,
+        "conv_init": "fan_in",
+        "nonlinearity": "relu",
+        "last_bn_0_init": False,
+        "activation": lambda: nn.ReLU(inplace=True),
+    },
+    "grp-fanout": {
+        "conv": nn.Conv2d,
+        "conv_init": "fan_out",
+        "nonlinearity": "relu",
+        "last_bn_0_init": False,
+        "activation": lambda: nn.ReLU(inplace=True),
+    },
+}
 
 resnet_versions = {
-        'resnet18' : {
-            'net' : ResNet,
-            'block' : BasicBlock,
-            'layers' : [2, 2, 2, 2],
-            'widths' : [64, 128, 256, 512],
-            'expansion' : 1,
-            'num_classes' : 1000,
-            },
-         'resnet34' : {
-            'net' : ResNet,
-            'block' : BasicBlock,
-            'layers' : [3, 4, 6, 3],
-            'widths' : [64, 128, 256, 512],
-            'expansion' : 1,
-            'num_classes' : 1000,
-            },
-         'resnet50' : {
-            'net' : ResNet,
-            'block' : Bottleneck,
-            'layers' : [3, 4, 6, 3],
-            'widths' : [64, 128, 256, 512],
-            'expansion' : 4,
-            'num_classes' : 1000,
-            },
-        'resnet101' : {
-            'net' : ResNet,
-            'block' : Bottleneck,
-            'layers' : [3, 4, 23, 3],
-            'widths' : [64, 128, 256, 512],
-            'expansion' : 4,
-            'num_classes' : 1000,
-            },
-        'resnet152' : {
-            'net' : ResNet,
-            'block' : Bottleneck,
-            'layers' : [3, 8, 36, 3],
-            'widths' : [64, 128, 256, 512],
-            'expansion' : 4,
-            'num_classes' : 1000,
-            },
-        'resnext101-32x4d' : {
-            'net' : ResNet,
-            'block' : Bottleneck,
-            'cardinality' : 32,
-            'layers' : [3, 4, 23, 3],
-            'widths' : [128, 256, 512, 1024],
-            'expansion' : 2,
-            'num_classes' : 1000,
-            },
-        'se-resnext101-32x4d' : {
-            'net' : ResNet,
-            'block' : SEBottleneck,
-            'cardinality' : 32,
-            'layers' : [3, 4, 23, 3],
-            'widths' : [128, 256, 512, 1024],
-            'expansion' : 2,
-            'num_classes' : 1000,
-            },
-        }
-
-
-def build_resnet(version, config, verbose=True):
+    "resnet18": {
+        "net": ResNet,
+        "block": BasicBlock,
+        "layers": [2, 2, 2, 2],
+        "widths": [64, 128, 256, 512],
+        "expansion": 1,
+    },
+    "resnet34": {
+        "net": ResNet,
+        "block": BasicBlock,
+        "layers": [3, 4, 6, 3],
+        "widths": [64, 128, 256, 512],
+        "expansion": 1,
+    },
+    "resnet50": {
+        "net": ResNet,
+        "block": Bottleneck,
+        "layers": [3, 4, 6, 3],
+        "widths": [64, 128, 256, 512],
+        "expansion": 4,
+    },
+    "resnet101": {
+        "net": ResNet,
+        "block": Bottleneck,
+        "layers": [3, 4, 23, 3],
+        "widths": [64, 128, 256, 512],
+        "expansion": 4,
+    },
+    "resnet152": {
+        "net": ResNet,
+        "block": Bottleneck,
+        "layers": [3, 8, 36, 3],
+        "widths": [64, 128, 256, 512],
+        "expansion": 4,
+    },
+    "resnext101-32x4d": {
+        "net": ResNet,
+        "block": Bottleneck,
+        "cardinality": 32,
+        "layers": [3, 4, 23, 3],
+        "widths": [128, 256, 512, 1024],
+        "expansion": 2,
+    },
+    "se-resnext101-32x4d": {
+        "net": ResNet,
+        "block": SEBottleneck,
+        "cardinality": 32,
+        "layers": [3, 4, 23, 3],
+        "widths": [128, 256, 512, 1024],
+        "expansion": 2,
+    },
+}
+
+
+def build_resnet(version, config, num_classes, verbose=True):
     version = resnet_versions[version]
     config = resnet_configs[config]
 
@@ -344,11 +392,14 @@ def build_resnet(version, config, verbose=True):
     if verbose:
         print("Version: {}".format(version))
         print("Config: {}".format(config))
-    model = version['net'](builder,
-                           version['block'],
-                           version['expansion'],
-                           version['layers'],
-                           version['widths'],
-                           version['num_classes'])
+        print("Num classes: {}".format(num_classes))
+    model = version["net"](
+        builder,
+        version["block"],
+        version["expansion"],
+        version["layers"],
+        version["widths"],
+        num_classes,
+    )
 
     return model
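With `num_classes` now passed to `build_resnet` rather than stored per version, the final `nn.Linear` is sized as `widths[3] * expansion` in-features by `num_classes` out-features. A quick check of that arithmetic against the version table above (only a few entries copied for illustration):

```python
# widths and expansion copied from the resnet_versions table above
resnet_fc_dims = {
    "resnet18": ([64, 128, 256, 512], 1),
    "resnet50": ([64, 128, 256, 512], 4),
    "resnext101-32x4d": ([128, 256, 512, 1024], 2),
}

def fc_in_features(version):
    """In-features of the final classifier: widths[3] * expansion."""
    widths, expansion = resnet_fc_dims[version]
    return widths[3] * expansion
```

So ResNet-18 feeds a 512-wide feature into the classifier, while ResNet-50 and ResNeXt101-32x4d both feed 2048, and only the classifier's out-features change with the dataset.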

+ 2 - 0
PyTorch/Classification/ConvNets/image_classification/smoothing.py

@@ -14,10 +14,12 @@
 import torch
 import torch.nn as nn
 
+
 class LabelSmoothing(nn.Module):
     """
     NLL loss with label smoothing.
     """
+
     def __init__(self, smoothing=0.0):
         """
         Constructor for the LabelSmoothing module.
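The `LabelSmoothing` body is not shown in this hunk, but the computation matches the soft-target variant in `NLLMultiLabelSmooth` above: blend the true-class NLL with a uniform penalty over all classes. A NumPy sketch of that formula (illustrative names, not the module itself):

```python
import numpy as np

def smoothed_nll(logits, target_idx, smoothing=0.1):
    """NLL with label smoothing: (1-s) * true-class NLL + s * uniform penalty."""
    shifted = logits - logits.max(axis=-1, keepdims=True)      # stable log-softmax
    logprobs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -logprobs[np.arange(len(target_idx)), target_idx]    # true-class term
    smooth = -logprobs.mean(axis=-1)                           # uniform term
    return ((1.0 - smoothing) * nll + smoothing * smooth).mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))
targets = np.array([1, 2, 3, 4])
loss0 = smoothed_nll(logits, targets, smoothing=0.0)           # reduces to plain NLL
```

With `smoothing=0` the uniform term vanishes and the loss reduces to standard cross-entropy, which is the sanity check worth keeping in mind when tuning the smoothing factor.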

+ 264 - 217
PyTorch/Classification/ConvNets/image_classification/training.py

@@ -37,6 +37,7 @@ from . import logger as log
 from . import resnet as models
 from . import utils
 import dllogger
+
 try:
     from apex.parallel import DistributedDataParallel as DDP
     from apex.fp16_utils import *
@@ -46,30 +47,33 @@ except ImportError:
         "Please install apex from https://www.github.com/nvidia/apex to run this example."
     )
 
-ACC_METADATA = {'unit': '%','format': ':.2f'}
-IPS_METADATA = {'unit': 'img/s', 'format': ':.2f'}
-TIME_METADATA = {'unit': 's', 'format': ':.5f'}
-LOSS_METADATA = {'format': ':.5f'}
+ACC_METADATA = {"unit": "%", "format": ":.2f"}
+IPS_METADATA = {"unit": "img/s", "format": ":.2f"}
+TIME_METADATA = {"unit": "s", "format": ":.5f"}
+LOSS_METADATA = {"format": ":.5f"}
 
 
 class ModelAndLoss(nn.Module):
-    def __init__(self,
-                 arch,
-                 loss,
-                 pretrained_weights=None,
-                 cuda=True,
-                 fp16=False):
+    def __init__(
+        self,
+        arch,
+        loss,
+        pretrained_weights=None,
+        cuda=True,
+        fp16=False,
+        memory_format=torch.contiguous_format,
+    ):
         super(ModelAndLoss, self).__init__()
         self.arch = arch
 
         print("=> creating model '{}'".format(arch))
-        model = models.build_resnet(arch[0], arch[1])
+        model = models.build_resnet(arch[0], arch[1], arch[2])
         if pretrained_weights is not None:
             print("=> using pre-trained model from a file '{}'".format(arch))
             model.load_state_dict(pretrained_weights)
 
         if cuda:
-            model = model.cuda()
+            model = model.cuda().to(memory_format=memory_format)
         if fp16:
             model = network_to_half(model)
 
@@ -96,46 +100,51 @@ class ModelAndLoss(nn.Module):
             self.model.load_state_dict(state)
 
 
-def get_optimizer(parameters,
-                  fp16,
-                  lr,
-                  momentum,
-                  weight_decay,
-                  nesterov=False,
-                  state=None,
-                  static_loss_scale=1.,
-                  dynamic_loss_scale=False,
-                  bn_weight_decay=False):
+def get_optimizer(
+    parameters,
+    fp16,
+    lr,
+    momentum,
+    weight_decay,
+    nesterov=False,
+    state=None,
+    static_loss_scale=1.0,
+    dynamic_loss_scale=False,
+    bn_weight_decay=False,
+):
 
     if bn_weight_decay:
         print(" ! Weight decay applied to BN parameters ")
-        optimizer = torch.optim.SGD([v for n, v in parameters],
-                                    lr,
-                                    momentum=momentum,
-                                    weight_decay=weight_decay,
-                                    nesterov=nesterov)
+        optimizer = torch.optim.SGD(
+            [v for n, v in parameters],
+            lr,
+            momentum=momentum,
+            weight_decay=weight_decay,
+            nesterov=nesterov,
+        )
     else:
         print(" ! Weight decay NOT applied to BN parameters ")
-        bn_params = [v for n, v in parameters if 'bn' in n]
-        rest_params = [v for n, v in parameters if not 'bn' in n]
+        bn_params = [v for n, v in parameters if "bn" in n]
+        rest_params = [v for n, v in parameters if not "bn" in n]
         print(len(bn_params))
         print(len(rest_params))
-        optimizer = torch.optim.SGD([{
-            'params': bn_params,
-            'weight_decay': 0
-        }, {
-            'params': rest_params,
-            'weight_decay': weight_decay
-        }],
-                                    lr,
-                                    momentum=momentum,
-                                    weight_decay=weight_decay,
-                                    nesterov=nesterov)
+        optimizer = torch.optim.SGD(
+            [
+                {"params": bn_params, "weight_decay": 0},
+                {"params": rest_params, "weight_decay": weight_decay},
+            ],
+            lr,
+            momentum=momentum,
+            weight_decay=weight_decay,
+            nesterov=nesterov,
+        )
     if fp16:
-        optimizer = FP16_Optimizer(optimizer,
-                                   static_loss_scale=static_loss_scale,
-                                   dynamic_loss_scale=dynamic_loss_scale,
-                                   verbose=False)
+        optimizer = FP16_Optimizer(
+            optimizer,
+            static_loss_scale=static_loss_scale,
+            dynamic_loss_scale=dynamic_loss_scale,
+            verbose=False,
+        )
 
     if not state is None:
         optimizer.load_state_dict(state)
@@ -145,17 +154,17 @@ def get_optimizer(parameters,
 
 def lr_policy(lr_fn, logger=None):
     if logger is not None:
-        logger.register_metric('lr',
-                               log.LR_METER(),
-                               verbosity=dllogger.Verbosity.VERBOSE)
+        logger.register_metric(
+            "lr", log.LR_METER(), verbosity=dllogger.Verbosity.VERBOSE
+        )
 
     def _alr(optimizer, iteration, epoch):
         lr = lr_fn(iteration, epoch)
 
         if logger is not None:
-            logger.log_metric('lr', lr)
+            logger.log_metric("lr", lr)
         for param_group in optimizer.param_groups:
-            param_group['lr'] = lr
+            param_group["lr"] = lr
 
     return _alr
 
@@ -200,11 +209,9 @@ def lr_cosine_policy(base_lr, warmup_length, epochs, logger=None):
     return lr_policy(_lr_fn, logger=logger)
 
 
-def lr_exponential_policy(base_lr,
-                          warmup_length,
-                          epochs,
-                          final_multiplier=0.001,
-                          logger=None):
+def lr_exponential_policy(
+    base_lr, warmup_length, epochs, final_multiplier=0.001, logger=None
+):
     es = epochs - warmup_length
     epoch_decay = np.power(2, np.log2(final_multiplier) / es)
 
@@ -213,17 +220,15 @@ def lr_exponential_policy(base_lr,
             lr = base_lr * (epoch + 1) / warmup_length
         else:
             e = epoch - warmup_length
-            lr = base_lr * (epoch_decay**e)
+            lr = base_lr * (epoch_decay ** e)
         return lr
 
     return lr_policy(_lr_fn, logger=logger)
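`lr_exponential_policy` (reformatted above) warms up linearly for `warmup_length` epochs, then decays by a constant per-epoch factor chosen so that the total decay over the remaining epochs equals `final_multiplier`. A standalone sketch of that schedule, stripped of the logger plumbing:

```python
import numpy as np

def lr_exponential(base_lr, warmup_length, epochs, final_multiplier=0.001):
    """Linear warmup, then fixed per-epoch exponential decay (illustrative names)."""
    es = epochs - warmup_length
    epoch_decay = np.power(2, np.log2(final_multiplier) / es)  # decay**es == final_multiplier

    def _lr_fn(epoch):
        if epoch < warmup_length:
            return base_lr * (epoch + 1) / warmup_length
        return base_lr * epoch_decay ** (epoch - warmup_length)

    return _lr_fn

fn = lr_exponential(base_lr=1.0, warmup_length=5, epochs=95)
```

The `2 ** (log2(m) / es)` form is just `m ** (1 / es)`: the per-epoch factor whose `es`-th power lands exactly on the requested final multiplier.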
 
 
-def get_train_step(model_and_loss,
-                   optimizer,
-                   fp16,
-                   use_amp=False,
-                   batch_size_multiplier=1):
+def get_train_step(
+    model_and_loss, optimizer, fp16, use_amp=False, batch_size_multiplier=1
+):
     def _step(input, target, optimizer_step=True):
         input_var = Variable(input)
         target_var = Variable(target)
@@ -242,10 +247,13 @@ def get_train_step(model_and_loss,
             loss.backward()
 
         if optimizer_step:
-            opt = optimizer.optimizer if isinstance(
-                optimizer, FP16_Optimizer) else optimizer
+            opt = (
+                optimizer.optimizer
+                if isinstance(optimizer, FP16_Optimizer)
+                else optimizer
+            )
             for param_group in opt.param_groups:
-                for param in param_group['params']:
+                for param in param_group["params"]:
                     param.grad /= batch_size_multiplier
 
             optimizer.step()
@@ -258,45 +266,59 @@ def get_train_step(model_and_loss,
     return _step
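The train step above divides every accumulated gradient by `batch_size_multiplier` before calling `optimizer.step()`, so several backward passes emulate one large batch. A framework-free sketch of that accumulation loop, using SGD on a single scalar weight (names illustrative):

```python
def accumulate_and_step(grads, weight, lr, batch_size_multiplier):
    """Sum `batch_size_multiplier` gradients, average, then take one SGD step."""
    accum = 0.0
    for i, g in enumerate(grads):
        accum += g                                    # loss.backward() adds into .grad
        if (i + 1) % batch_size_multiplier == 0:
            weight -= lr * (accum / batch_size_multiplier)  # param.grad /= multiplier
            accum = 0.0                               # optimizer.zero_grad()
    return weight

# Four micro-batch gradients, stepped every 2 -> two averaged SGD updates.
w = accumulate_and_step([1.0, 3.0, 2.0, 6.0], weight=0.0, lr=0.1,
                        batch_size_multiplier=2)
```

Averaging (rather than summing) the accumulated gradients keeps the effective learning rate independent of the multiplier, which is what makes the large-batch emulation faithful.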
 
 
-def train(train_loader,
-          model_and_loss,
-          optimizer,
-          lr_scheduler,
-          fp16,
-          logger,
-          epoch,
-          use_amp=False,
-          prof=-1,
-          batch_size_multiplier=1,
-          register_metrics=True):
+def train(
+    train_loader,
+    model_and_loss,
+    optimizer,
+    lr_scheduler,
+    fp16,
+    logger,
+    epoch,
+    use_amp=False,
+    prof=-1,
+    batch_size_multiplier=1,
+    register_metrics=True,
+):
 
     if register_metrics and logger is not None:
-        logger.register_metric('train.loss',
-                               log.LOSS_METER(),
-                               verbosity=dllogger.Verbosity.DEFAULT,
-                               metadata=LOSS_METADATA)
-        logger.register_metric('train.compute_ips',
-                               log.PERF_METER(),
-                               verbosity=dllogger.Verbosity.VERBOSE,
-                               metadata=IPS_METADATA)
-        logger.register_metric('train.total_ips',
-                               log.PERF_METER(),
-                               verbosity=dllogger.Verbosity.DEFAULT,
-                               metadata=IPS_METADATA)
-        logger.register_metric('train.data_time',
-                               log.PERF_METER(),
-                               verbosity=dllogger.Verbosity.VERBOSE,
-                               metadata=TIME_METADATA)
-        logger.register_metric('train.compute_time',
-                               log.PERF_METER(),
-                               verbosity=dllogger.Verbosity.VERBOSE,
-                               metadata=TIME_METADATA)
-
-    step = get_train_step(model_and_loss,
-                          optimizer,
-                          fp16,
-                          use_amp=use_amp,
-                          batch_size_multiplier=batch_size_multiplier)
+        logger.register_metric(
+            "train.loss",
+            log.LOSS_METER(),
+            verbosity=dllogger.Verbosity.DEFAULT,
+            metadata=LOSS_METADATA,
+        )
+        logger.register_metric(
+            "train.compute_ips",
+            log.PERF_METER(),
+            verbosity=dllogger.Verbosity.VERBOSE,
+            metadata=IPS_METADATA,
+        )
+        logger.register_metric(
+            "train.total_ips",
+            log.PERF_METER(),
+            verbosity=dllogger.Verbosity.DEFAULT,
+            metadata=IPS_METADATA,
+        )
+        logger.register_metric(
+            "train.data_time",
+            log.PERF_METER(),
+            verbosity=dllogger.Verbosity.VERBOSE,
+            metadata=TIME_METADATA,
+        )
+        logger.register_metric(
+            "train.compute_time",
+            log.PERF_METER(),
+            verbosity=dllogger.Verbosity.VERBOSE,
+            metadata=TIME_METADATA,
+        )
+
+    step = get_train_step(
+        model_and_loss,
+        optimizer,
+        fp16,
+        use_amp=use_amp,
+        batch_size_multiplier=batch_size_multiplier,
+    )
 
     model_and_loss.train()
     end = time.time()
@@ -320,12 +342,11 @@ def train(train_loader,
         it_time = time.time() - end
 
         if logger is not None:
-            logger.log_metric('train.loss', to_python_float(loss), bs)
-            logger.log_metric('train.compute_ips',
-                              calc_ips(bs, it_time - data_time))
-            logger.log_metric('train.total_ips', calc_ips(bs, it_time))
-            logger.log_metric('train.data_time', data_time)
-            logger.log_metric('train.compute_time', it_time - data_time)
+            logger.log_metric("train.loss", to_python_float(loss), bs)
+            logger.log_metric("train.compute_ips", calc_ips(bs, it_time - data_time))
+            logger.log_metric("train.total_ips", calc_ips(bs, it_time))
+            logger.log_metric("train.data_time", data_time)
+            logger.log_metric("train.compute_time", it_time - data_time)
 
         end = time.time()
 
@@ -354,55 +375,70 @@ def get_val_step(model_and_loss):
     return _step
 
 
-def validate(val_loader,
-             model_and_loss,
-             fp16,
-             logger,
-             epoch,
-             prof=-1,
-             register_metrics=True):
+def validate(
+    val_loader, model_and_loss, fp16, logger, epoch, prof=-1, register_metrics=True
+):
     if register_metrics and logger is not None:
-        logger.register_metric('val.top1',
-                               log.ACC_METER(),
-                               verbosity=dllogger.Verbosity.DEFAULT,
-                               metadata=ACC_METADATA)
-        logger.register_metric('val.top5',
-                               log.ACC_METER(),
-                               verbosity=dllogger.Verbosity.DEFAULT,
-                               metadata=ACC_METADATA)
-        logger.register_metric('val.loss',
-                               log.LOSS_METER(),
-                               verbosity=dllogger.Verbosity.DEFAULT,
-                               metadata=LOSS_METADATA)
-        logger.register_metric('val.compute_ips',
-                               log.PERF_METER(),
-                               verbosity=dllogger.Verbosity.VERBOSE,
-                               metadata=IPS_METADATA)
-        logger.register_metric('val.total_ips',
-                               log.PERF_METER(),
-                               verbosity=dllogger.Verbosity.DEFAULT,
-                               metadata=IPS_METADATA)
-        logger.register_metric('val.data_time',
-                               log.PERF_METER(),
-                               verbosity=dllogger.Verbosity.VERBOSE,
-                               metadata=TIME_METADATA)
-        logger.register_metric('val.compute_latency',
-                               log.PERF_METER(),
-                               verbosity=dllogger.Verbosity.VERBOSE,
-                               metadata=TIME_METADATA)
-        logger.register_metric('val.compute_latency_at100',
-                               log.LAT_100(),
-                               verbosity=dllogger.Verbosity.VERBOSE,
-                               metadata=TIME_METADATA)
-        logger.register_metric('val.compute_latency_at99',
-                               log.LAT_99(),
-                               verbosity=dllogger.Verbosity.VERBOSE,
-                               metadata=TIME_METADATA)
-        logger.register_metric('val.compute_latency_at95',
-                               log.LAT_95(),
-                               verbosity=dllogger.Verbosity.VERBOSE,
-                               metadata=TIME_METADATA)
-
+        logger.register_metric(
+            "val.top1",
+            log.ACC_METER(),
+            verbosity=dllogger.Verbosity.DEFAULT,
+            metadata=ACC_METADATA,
+        )
+        logger.register_metric(
+            "val.top5",
+            log.ACC_METER(),
+            verbosity=dllogger.Verbosity.DEFAULT,
+            metadata=ACC_METADATA,
+        )
+        logger.register_metric(
+            "val.loss",
+            log.LOSS_METER(),
+            verbosity=dllogger.Verbosity.DEFAULT,
+            metadata=LOSS_METADATA,
+        )
+        logger.register_metric(
+            "val.compute_ips",
+            log.PERF_METER(),
+            verbosity=dllogger.Verbosity.VERBOSE,
+            metadata=IPS_METADATA,
+        )
+        logger.register_metric(
+            "val.total_ips",
+            log.PERF_METER(),
+            verbosity=dllogger.Verbosity.DEFAULT,
+            metadata=IPS_METADATA,
+        )
+        logger.register_metric(
+            "val.data_time",
+            log.PERF_METER(),
+            verbosity=dllogger.Verbosity.VERBOSE,
+            metadata=TIME_METADATA,
+        )
+        logger.register_metric(
+            "val.compute_latency",
+            log.PERF_METER(),
+            verbosity=dllogger.Verbosity.VERBOSE,
+            metadata=TIME_METADATA,
+        )
+        logger.register_metric(
+            "val.compute_latency_at100",
+            log.LAT_100(),
+            verbosity=dllogger.Verbosity.VERBOSE,
+            metadata=TIME_METADATA,
+        )
+        logger.register_metric(
+            "val.compute_latency_at99",
+            log.LAT_99(),
+            verbosity=dllogger.Verbosity.VERBOSE,
+            metadata=TIME_METADATA,
+        )
+        logger.register_metric(
+            "val.compute_latency_at95",
+            log.LAT_95(),
+            verbosity=dllogger.Verbosity.VERBOSE,
+            metadata=TIME_METADATA,
+        )
 
     step = get_val_step(model_and_loss)
 
@@ -428,17 +464,16 @@ def validate(val_loader,
 
         top1.record(to_python_float(prec1), bs)
         if logger is not None:
-            logger.log_metric('val.top1', to_python_float(prec1), bs)
-            logger.log_metric('val.top5', to_python_float(prec5), bs)
-            logger.log_metric('val.loss', to_python_float(loss), bs)
-            logger.log_metric('val.compute_ips',
-                              calc_ips(bs, it_time - data_time))
-            logger.log_metric('val.total_ips', calc_ips(bs, it_time))
-            logger.log_metric('val.data_time', data_time)
-            logger.log_metric('val.compute_latency', it_time - data_time)
-            logger.log_metric('val.compute_latency_at95', it_time - data_time)
-            logger.log_metric('val.compute_latency_at99', it_time - data_time)
-            logger.log_metric('val.compute_latency_at100', it_time - data_time)
+            logger.log_metric("val.top1", to_python_float(prec1), bs)
+            logger.log_metric("val.top5", to_python_float(prec5), bs)
+            logger.log_metric("val.loss", to_python_float(loss), bs)
+            logger.log_metric("val.compute_ips", calc_ips(bs, it_time - data_time))
+            logger.log_metric("val.total_ips", calc_ips(bs, it_time))
+            logger.log_metric("val.data_time", data_time)
+            logger.log_metric("val.compute_latency", it_time - data_time)
+            logger.log_metric("val.compute_latency_at95", it_time - data_time)
+            logger.log_metric("val.compute_latency_at99", it_time - data_time)
+            logger.log_metric("val.compute_latency_at100", it_time - data_time)
 
         end = time.time()
 
@@ -447,86 +482,98 @@ def validate(val_loader,
 
 # Train loop {{{
 def calc_ips(batch_size, time):
-    world_size = torch.distributed.get_world_size(
-    ) if torch.distributed.is_initialized() else 1
+    world_size = (
+        torch.distributed.get_world_size() if torch.distributed.is_initialized() else 1
+    )
     tbs = world_size * batch_size
     return tbs / time
 
 
-def train_loop(model_and_loss,
-               optimizer,
-               lr_scheduler,
-               train_loader,
-               val_loader,
-               epochs,
-               fp16,
-               logger,
-               should_backup_checkpoint,
-               use_amp=False,
-               batch_size_multiplier=1,
-               best_prec1=0,
-               start_epoch=0,
-               prof=-1,
-               skip_training=False,
-               skip_validation=False,
-               save_checkpoints=True,
-               checkpoint_dir='./'):
+def train_loop(
+    model_and_loss,
+    optimizer,
+    lr_scheduler,
+    train_loader,
+    val_loader,
+    fp16,
+    logger,
+    should_backup_checkpoint,
+    use_amp=False,
+    batch_size_multiplier=1,
+    best_prec1=0,
+    start_epoch=0,
+    end_epoch=0,
+    prof=-1,
+    skip_training=False,
+    skip_validation=False,
+    save_checkpoints=True,
+    checkpoint_dir="./",
+    checkpoint_filename="checkpoint.pth.tar",
+):
 
     prec1 = -1
 
-    epoch_iter = range(start_epoch, epochs)
-    for epoch in epoch_iter:
+    print(f"RUNNING EPOCHS FROM {start_epoch} TO {end_epoch}")
+    for epoch in range(start_epoch, end_epoch):
         if logger is not None:
             logger.start_epoch()
         if not skip_training:
-            train(train_loader,
-                  model_and_loss,
-                  optimizer,
-                  lr_scheduler,
-                  fp16,
-                  logger,
-                  epoch,
-                  use_amp=use_amp,
-                  prof=prof,
-                  register_metrics=epoch == start_epoch,
-                  batch_size_multiplier=batch_size_multiplier)
+            train(
+                train_loader,
+                model_and_loss,
+                optimizer,
+                lr_scheduler,
+                fp16,
+                logger,
+                epoch,
+                use_amp=use_amp,
+                prof=prof,
+                register_metrics=epoch == start_epoch,
+                batch_size_multiplier=batch_size_multiplier,
+            )
 
         if not skip_validation:
-            prec1, nimg = validate(val_loader,
-                                   model_and_loss,
-                                   fp16,
-                                   logger,
-                                   epoch,
-                                   prof=prof,
-                                   register_metrics=epoch == start_epoch)
+            prec1, nimg = validate(
+                val_loader,
+                model_and_loss,
+                fp16,
+                logger,
+                epoch,
+                prof=prof,
+                register_metrics=epoch == start_epoch,
+            )
         if logger is not None:
             logger.end_epoch()
 
-        if save_checkpoints and (not torch.distributed.is_initialized()
-                                 or torch.distributed.get_rank() == 0):
+        if save_checkpoints and (
+            not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0
+        ):
             if not skip_validation:
-                is_best = logger.metrics['val.top1']['meter'].get_epoch() > best_prec1
-                best_prec1 = max(logger.metrics['val.top1']['meter'].get_epoch(),
-                                 best_prec1)
+                is_best = logger.metrics["val.top1"]["meter"].get_epoch() > best_prec1
+                best_prec1 = max(
+                    logger.metrics["val.top1"]["meter"].get_epoch(), best_prec1
+                )
             else:
                 is_best = False
                 best_prec1 = 0
 
             if should_backup_checkpoint(epoch):
-                backup_filename = 'checkpoint-{}.pth.tar'.format(epoch + 1)
+                backup_filename = "checkpoint-{}.pth.tar".format(epoch + 1)
             else:
                 backup_filename = None
             utils.save_checkpoint(
                 {
-                    'epoch': epoch + 1,
-                    'arch': model_and_loss.arch,
-                    'state_dict': model_and_loss.model.state_dict(),
-                    'best_prec1': best_prec1,
-                    'optimizer': optimizer.state_dict(),
+                    "epoch": epoch + 1,
+                    "arch": model_and_loss.arch,
+                    "state_dict": model_and_loss.model.state_dict(),
+                    "best_prec1": best_prec1,
+                    "optimizer": optimizer.state_dict(),
                 },
                 is_best,
                 checkpoint_dir=checkpoint_dir,
-                backup_filename=backup_filename)
+                backup_filename=backup_filename,
+                filename=checkpoint_filename,
+            )
 
 
 # }}}

PyTorch/Classification/ConvNets/image_classification/utils.py (+16 −14)

@@ -41,22 +41,23 @@ def should_backup_checkpoint(args):
     return _sbc
 
 
-def save_checkpoint(state,
-                    is_best,
-                    filename='checkpoint.pth.tar',
-                    checkpoint_dir='./',
-                    backup_filename=None):
-    if (not torch.distributed.is_initialized()
-        ) or torch.distributed.get_rank() == 0:
+def save_checkpoint(
+    state,
+    is_best,
+    filename="checkpoint.pth.tar",
+    checkpoint_dir="./",
+    backup_filename=None,
+):
+    if (not torch.distributed.is_initialized()) or torch.distributed.get_rank() == 0:
         filename = os.path.join(checkpoint_dir, filename)
         print("SAVING {}".format(filename))
         torch.save(state, filename)
         if is_best:
-            shutil.copyfile(filename,
-                            os.path.join(checkpoint_dir, 'model_best.pth.tar'))
+            shutil.copyfile(
+                filename, os.path.join(checkpoint_dir, "model_best.pth.tar")
+            )
         if backup_filename is not None:
-            shutil.copyfile(filename,
-                            os.path.join(checkpoint_dir, backup_filename))
+            shutil.copyfile(filename, os.path.join(checkpoint_dir, backup_filename))
 
 
 def timed_generator(gen):
@@ -77,7 +78,7 @@ def timed_function(f):
     return _timed_function
 
 
-def accuracy(output, target, topk=(1, )):
+def accuracy(output, target, topk=(1,)):
     """Computes the precision@k for the specified values of k"""
     maxk = max(topk)
     batch_size = target.size(0)
@@ -96,8 +97,9 @@ def accuracy(output, target, topk=(1, )):
 def reduce_tensor(tensor):
     rt = tensor.clone()
     dist.all_reduce(rt, op=dist.ReduceOp.SUM)
-    rt /= torch.distributed.get_world_size(
-    ) if torch.distributed.is_initialized() else 1
+    rt /= (
+        torch.distributed.get_world_size() if torch.distributed.is_initialized() else 1
+    )
     return rt
 
 

PyTorch/Classification/ConvNets/main.py (+336 −270)

@@ -71,187 +71,226 @@ def add_parser_arguments(parser):
     model_names = models.resnet_versions.keys()
     model_configs = models.resnet_configs.keys()
 
-    parser.add_argument('data', metavar='DIR', help='path to dataset')
-    parser.add_argument('--data-backend',
-                        metavar='BACKEND',
-                        default='dali-cpu',
-                        choices=DATA_BACKEND_CHOICES,
-                        help='data backend: ' +
-                        ' | '.join(DATA_BACKEND_CHOICES) +
-                        ' (default: dali-cpu)')
-
-    parser.add_argument('--arch',
-                        '-a',
-                        metavar='ARCH',
-                        default='resnet50',
-                        choices=model_names,
-                        help='model architecture: ' + ' | '.join(model_names) +
-                        ' (default: resnet50)')
-
-    parser.add_argument('--model-config',
-                        '-c',
-                        metavar='CONF',
-                        default='classic',
-                        choices=model_configs,
-                        help='model configs: ' + ' | '.join(model_configs) +
-                        '(default: classic)')
-
-    parser.add_argument('-j',
-                        '--workers',
-                        default=5,
-                        type=int,
-                        metavar='N',
-                        help='number of data loading workers (default: 5)')
-    parser.add_argument('--epochs',
-                        default=90,
-                        type=int,
-                        metavar='N',
-                        help='number of total epochs to run')
-    parser.add_argument('-b',
-                        '--batch-size',
-                        default=256,
-                        type=int,
-                        metavar='N',
-                        help='mini-batch size (default: 256) per gpu')
-
-    parser.add_argument(
-        '--optimizer-batch-size',
+    parser.add_argument("data", metavar="DIR", help="path to dataset")
+    parser.add_argument(
+        "--data-backend",
+        metavar="BACKEND",
+        default="dali-cpu",
+        choices=DATA_BACKEND_CHOICES,
+        help="data backend: "
+        + " | ".join(DATA_BACKEND_CHOICES)
+        + " (default: dali-cpu)",
+    )
+
+    parser.add_argument(
+        "--arch",
+        "-a",
+        metavar="ARCH",
+        default="resnet50",
+        choices=model_names,
+        help="model architecture: " + " | ".join(model_names) + " (default: resnet50)",
+    )
+
+    parser.add_argument(
+        "--model-config",
+        "-c",
+        metavar="CONF",
+        default="classic",
+        choices=model_configs,
+        help="model configs: " + " | ".join(model_configs) + " (default: classic)",
+    )
+
+    parser.add_argument(
+        "--num-classes",
+        metavar="N",
+        default=1000,
+        type=int,
+        help="number of classes in the dataset",
+    )
+
+    parser.add_argument(
+        "-j",
+        "--workers",
+        default=5,
+        type=int,
+        metavar="N",
+        help="number of data loading workers (default: 5)",
+    )
+    parser.add_argument(
+        "--epochs",
+        default=90,
+        type=int,
+        metavar="N",
+        help="number of total epochs to run",
+    )
+    parser.add_argument(
+        "--run-epochs",
         default=-1,
         type=int,
-        metavar='N',
-        help=
-        'size of a total batch size, for simulating bigger batches using gradient accumulation'
-    )
-
-    parser.add_argument('--lr',
-                        '--learning-rate',
-                        default=0.1,
-                        type=float,
-                        metavar='LR',
-                        help='initial learning rate')
-    parser.add_argument('--lr-schedule',
-                        default='step',
-                        type=str,
-                        metavar='SCHEDULE',
-                        choices=['step', 'linear', 'cosine'],
-                        help='Type of LR schedule: {}, {}, {}'.format(
-                            'step', 'linear', 'cosine'))
-
-    parser.add_argument('--warmup',
-                        default=0,
-                        type=int,
-                        metavar='E',
-                        help='number of warmup epochs')
-
-    parser.add_argument('--label-smoothing',
-                        default=0.0,
-                        type=float,
-                        metavar='S',
-                        help='label smoothing')
-    parser.add_argument('--mixup',
-                        default=0.0,
-                        type=float,
-                        metavar='ALPHA',
-                        help='mixup alpha')
-
-    parser.add_argument('--momentum',
-                        default=0.9,
-                        type=float,
-                        metavar='M',
-                        help='momentum')
-    parser.add_argument('--weight-decay',
-                        '--wd',
-                        default=1e-4,
-                        type=float,
-                        metavar='W',
-                        help='weight decay (default: 1e-4)')
-    parser.add_argument(
-        '--bn-weight-decay',
-        action='store_true',
-        help=
-        'use weight_decay on batch normalization learnable parameters, (default: false)'
-    )
-    parser.add_argument('--nesterov',
-                        action='store_true',
-                        help='use nesterov momentum, (default: false)')
-
-    parser.add_argument('--print-freq',
-                        '-p',
-                        default=10,
-                        type=int,
-                        metavar='N',
-                        help='print frequency (default: 10)')
-    parser.add_argument('--resume',
-                        default='',
-                        type=str,
-                        metavar='PATH',
-                        help='path to latest checkpoint (default: none)')
-    parser.add_argument('--pretrained-weights',
-                        default='',
-                        type=str,
-                        metavar='PATH',
-                        help='load weights from here')
-
-    parser.add_argument('--fp16',
-                        action='store_true',
-                        help='Run model fp16 mode.')
-    parser.add_argument(
-        '--static-loss-scale',
+        metavar="N",
+        help="run only N epochs, used for checkpointing runs",
+    )
+    parser.add_argument(
+        "-b",
+        "--batch-size",
+        default=256,
+        type=int,
+        metavar="N",
+        help="mini-batch size (default: 256) per gpu",
+    )
+
+    parser.add_argument(
+        "--optimizer-batch-size",
+        default=-1,
+        type=int,
+        metavar="N",
+        help="total batch size; used to simulate larger batches via gradient accumulation",
+    )
+
+    parser.add_argument(
+        "--lr",
+        "--learning-rate",
+        default=0.1,
+        type=float,
+        metavar="LR",
+        help="initial learning rate",
+    )
+    parser.add_argument(
+        "--lr-schedule",
+        default="step",
+        type=str,
+        metavar="SCHEDULE",
+        choices=["step", "linear", "cosine"],
+        help="Type of LR schedule: {}, {}, {}".format("step", "linear", "cosine"),
+    )
+
+    parser.add_argument(
+        "--warmup", default=0, type=int, metavar="E", help="number of warmup epochs"
+    )
+
+    parser.add_argument(
+        "--label-smoothing",
+        default=0.0,
+        type=float,
+        metavar="S",
+        help="label smoothing",
+    )
+    parser.add_argument(
+        "--mixup", default=0.0, type=float, metavar="ALPHA", help="mixup alpha"
+    )
+
+    parser.add_argument(
+        "--momentum", default=0.9, type=float, metavar="M", help="momentum"
+    )
+    parser.add_argument(
+        "--weight-decay",
+        "--wd",
+        default=1e-4,
+        type=float,
+        metavar="W",
+        help="weight decay (default: 1e-4)",
+    )
+    parser.add_argument(
+        "--bn-weight-decay",
+        action="store_true",
+        help="use weight_decay on batch normalization learnable parameters (default: false)",
+    )
+    parser.add_argument(
+        "--nesterov",
+        action="store_true",
+        help="use nesterov momentum (default: false)",
+    )
+
+    parser.add_argument(
+        "--print-freq",
+        "-p",
+        default=10,
+        type=int,
+        metavar="N",
+        help="print frequency (default: 10)",
+    )
+    parser.add_argument(
+        "--resume",
+        default=None,
+        type=str,
+        metavar="PATH",
+        help="path to latest checkpoint (default: none)",
+    )
+    parser.add_argument(
+        "--pretrained-weights",
+        default="",
+        type=str,
+        metavar="PATH",
+        help="load weights from here",
+    )
+
+    parser.add_argument("--fp16", action="store_true", help="Run model fp16 mode.")
+    parser.add_argument(
+        "--static-loss-scale",
         type=float,
         default=1,
-        help=
-        'Static loss scale, positive power of 2 values can improve fp16 convergence.'
+        help="Static loss scale; positive powers of 2 can improve fp16 convergence.",
+    )
+    parser.add_argument(
+        "--dynamic-loss-scale",
+        action="store_true",
+        help="Use dynamic loss scaling.  If supplied, this argument supersedes "
+        + "--static-loss-scale.",
+    )
+    parser.add_argument(
+        "--prof", type=int, default=-1, metavar="N", help="Run only N iterations"
     )
     parser.add_argument(
-        '--dynamic-loss-scale',
-        action='store_true',
-        help='Use dynamic loss scaling.  If supplied, this argument supersedes '
-        + '--static-loss-scale.')
-    parser.add_argument('--prof',
-                        type=int,
-                        default=-1,
-                        metavar='N',
-                        help='Run only N iterations')
-    parser.add_argument('--amp',
-                        action='store_true',
-                        help='Run model AMP (automatic mixed precision) mode.')
+        "--amp",
+        action="store_true",
+        help="Run model AMP (automatic mixed precision) mode.",
+    )
 
-    parser.add_argument('--seed',
-                        default=None,
-                        type=int,
-                        help='random seed used for numpy and pytorch')
+    parser.add_argument(
+        "--seed", default=None, type=int, help="random seed used for numpy and pytorch"
+    )
 
     parser.add_argument(
-        '--gather-checkpoints',
-        action='store_true',
-        help=
-        'Gather checkpoints throughout the training, without this flag only best and last checkpoints will be stored'
+        "--gather-checkpoints",
+        action="store_true",
+        help="Gather checkpoints throughout training; without this flag, only the best and last checkpoints will be stored",
     )
 
-    parser.add_argument('--raport-file',
-                        default='experiment_raport.json',
-                        type=str,
-                        help='file in which to store JSON experiment raport')
+    parser.add_argument(
+        "--raport-file",
+        default="experiment_raport.json",
+        type=str,
+        help="file in which to store the JSON experiment report",
+    )
 
-    parser.add_argument('--evaluate',
-                        action='store_true',
-                        help='evaluate checkpoint/model')
-    parser.add_argument('--training-only',
-                        action='store_true',
-                        help='do not evaluate')
+    parser.add_argument(
+        "--evaluate", action="store_true", help="evaluate checkpoint/model"
+    )
+    parser.add_argument("--training-only", action="store_true", help="do not evaluate")
 
     parser.add_argument(
-        '--no-checkpoints',
-        action='store_false',
-        dest='save_checkpoints',
-        help='do not store any checkpoints, useful for benchmarking')
+        "--no-checkpoints",
+        action="store_false",
+        dest="save_checkpoints",
+        help="do not store any checkpoints, useful for benchmarking",
+    )
 
+    parser.add_argument("--checkpoint-filename", default="checkpoint.pth.tar", type=str)
+
     parser.add_argument(
-        '--workspace',
+        "--workspace",
         type=str,
-        default='./',
-        metavar='DIR',
-        help='path to directory where checkpoints will be stored')
+        default="./",
+        metavar="DIR",
+        help="path to directory where checkpoints will be stored",
+    )
+    parser.add_argument(
+        "--memory-format",
+        type=str,
+        default="nchw",
+        choices=["nchw", "nhwc"],
+        help="memory layout, nchw or nhwc",
+    )
 
 
 def main(args):
@@ -260,9 +299,9 @@ def main(args):
     best_prec1 = 0
 
     args.distributed = False
-    if 'WORLD_SIZE' in os.environ:
-        args.distributed = int(os.environ['WORLD_SIZE']) > 1
-        args.local_rank = int(os.environ['LOCAL_RANK'])
+    if "WORLD_SIZE" in os.environ:
+        args.distributed = int(os.environ["WORLD_SIZE"]) > 1
+        args.local_rank = int(os.environ["LOCAL_RANK"])
 
     args.gpu = 0
     args.world_size = 1
@@ -270,7 +309,7 @@ def main(args):
     if args.distributed:
         args.gpu = args.local_rank % torch.cuda.device_count()
         torch.cuda.set_device(args.gpu)
-        dist.init_process_group(backend='nccl', init_method='env://')
+        dist.init_process_group(backend="nccl", init_method="env://")
         args.world_size = torch.distributed.get_world_size()
 
     if args.amp and args.fp16:
@@ -287,19 +326,20 @@ def main(args):
         def _worker_init_fn(id):
             np.random.seed(seed=args.seed + args.local_rank + id)
             random.seed(args.seed + args.local_rank + id)
+
     else:
 
         def _worker_init_fn(id):
             pass
 
     if args.fp16:
-        assert torch.backends.cudnn.enabled, "fp16 mode requires cudnn backend to be enabled."
+        assert (
+            torch.backends.cudnn.enabled
+        ), "fp16 mode requires cudnn backend to be enabled."
 
     if args.static_loss_scale != 1.0:
         if not args.fp16:
-            print(
-                "Warning:  if --fp16 is not used, static_loss_scale will be ignored."
-            )
+            print("Warning:  if --fp16 is not used, static_loss_scale will be ignored.")
 
     if args.optimizer_batch_size < 0:
         batch_size_multiplier = 1
@@ -307,34 +347,42 @@ def main(args):
         tbs = args.world_size * args.batch_size
         if args.optimizer_batch_size % tbs != 0:
             print(
-                "Warning: simulated batch size {} is not divisible by actual batch size {}"
-                .format(args.optimizer_batch_size, tbs))
+                "Warning: simulated batch size {} is not divisible by actual batch size {}".format(
+                    args.optimizer_batch_size, tbs
+                )
+            )
         batch_size_multiplier = int(args.optimizer_batch_size / tbs)
         print("BSM: {}".format(batch_size_multiplier))
 
     pretrained_weights = None
     if args.pretrained_weights:
         if os.path.isfile(args.pretrained_weights):
-            print("=> loading pretrained weights from '{}'".format(
-                args.pretrained_weights))
+            print(
+                "=> loading pretrained weights from '{}'".format(
+                    args.pretrained_weights
+                )
+            )
             pretrained_weights = torch.load(args.pretrained_weights)
         else:
             print("=> no pretrained weights found at '{}'".format(args.pretrained_weights))
 
     start_epoch = 0
     # optionally resume from a checkpoint
-    if args.resume:
+    if args.resume is not None:
         if os.path.isfile(args.resume):
             print("=> loading checkpoint '{}'".format(args.resume))
             checkpoint = torch.load(
-                args.resume,
-                map_location=lambda storage, loc: storage.cuda(args.gpu))
-            start_epoch = checkpoint['epoch']
-            best_prec1 = checkpoint['best_prec1']
-            model_state = checkpoint['state_dict']
-            optimizer_state = checkpoint['optimizer']
-            print("=> loaded checkpoint '{}' (epoch {})".format(
-                args.resume, checkpoint['epoch']))
+                args.resume, map_location=lambda storage, loc: storage.cuda(args.gpu)
+            )
+            start_epoch = checkpoint["epoch"]
+            best_prec1 = checkpoint["best_prec1"]
+            model_state = checkpoint["state_dict"]
+            optimizer_state = checkpoint["optimizer"]
+            print(
+                "=> loaded checkpoint '{}' (epoch {})".format(
+                    args.resume, checkpoint["epoch"]
+                )
+            )
         else:
             print("=> no checkpoint found at '{}'".format(args.resume))
             model_state = None
@@ -349,124 +397,142 @@ def main(args):
     elif args.label_smoothing > 0.0:
         loss = lambda: LabelSmoothing(args.label_smoothing)
 
-    model_and_loss = ModelAndLoss((args.arch, args.model_config),
-                                  loss,
-                                  pretrained_weights=pretrained_weights,
-                                  cuda=True,
-                                  fp16=args.fp16)
+    memory_format = (
+        torch.channels_last if args.memory_format == "nhwc" else torch.contiguous_format
+    )
+
+    model_and_loss = ModelAndLoss(
+        (args.arch, args.model_config, args.num_classes),
+        loss,
+        pretrained_weights=pretrained_weights,
+        cuda=True,
+        fp16=args.fp16,
+        memory_format=memory_format,
+    )
 
     # Create data loaders and optimizers as needed
-    if args.data_backend == 'pytorch':
+    if args.data_backend == "pytorch":
         get_train_loader = get_pytorch_train_loader
         get_val_loader = get_pytorch_val_loader
-    elif args.data_backend == 'dali-gpu':
+    elif args.data_backend == "dali-gpu":
         get_train_loader = get_dali_train_loader(dali_cpu=False)
         get_val_loader = get_dali_val_loader()
-    elif args.data_backend == 'dali-cpu':
+    elif args.data_backend == "dali-cpu":
         get_train_loader = get_dali_train_loader(dali_cpu=True)
         get_val_loader = get_dali_val_loader()
-    elif args.data_backend == 'syntetic':
+    elif args.data_backend == "syntetic":
         get_val_loader = get_syntetic_loader
         get_train_loader = get_syntetic_loader
 
-    train_loader, train_loader_len = get_train_loader(args.data,
-                                                      args.batch_size,
-                                                      1000,
-                                                      args.mixup > 0.0,
-                                                      workers=args.workers,
-                                                      fp16=args.fp16)
+    train_loader, train_loader_len = get_train_loader(
+        args.data,
+        args.batch_size,
+        args.num_classes,
+        args.mixup > 0.0,
+        start_epoch=start_epoch,
+        workers=args.workers,
+        fp16=args.fp16,
+        memory_format=memory_format,
+    )
     if args.mixup != 0.0:
-        train_loader = MixUpWrapper(args.mixup, 1000, train_loader)
-
-    val_loader, val_loader_len = get_val_loader(args.data,
-                                                args.batch_size,
-                                                1000,
-                                                False,
-                                                workers=args.workers,
-                                                fp16=args.fp16)
-
-    if not torch.distributed.is_initialized() or torch.distributed.get_rank(
-    ) == 0:
-        logger = log.Logger(args.print_freq, [
-            dllogger.StdOutBackend(dllogger.Verbosity.DEFAULT,
-                               step_format=log.format_step),
-            dllogger.JSONStreamBackend(
-                dllogger.Verbosity.VERBOSE,
-                os.path.join(args.workspace, args.raport_file))
-        ])
+        train_loader = MixUpWrapper(args.mixup, train_loader)
+
+    val_loader, val_loader_len = get_val_loader(
+        args.data,
+        args.batch_size,
+        args.num_classes,
+        False,
+        workers=args.workers,
+        fp16=args.fp16,
+        memory_format=memory_format,
+    )
+
+    if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
+        logger = log.Logger(
+            args.print_freq,
+            [
+                dllogger.StdOutBackend(
+                    dllogger.Verbosity.DEFAULT, step_format=log.format_step
+                ),
+                dllogger.JSONStreamBackend(
+                    dllogger.Verbosity.VERBOSE,
+                    os.path.join(args.workspace, args.raport_file),
+                ),
+            ],
+            start_epoch=start_epoch - 1,
+        )
 
     else:
-        logger = log.Logger(args.print_freq, [])
+        logger = log.Logger(args.print_freq, [], start_epoch=start_epoch - 1)
 
     logger.log_parameter(args.__dict__, verbosity=dllogger.Verbosity.DEFAULT)
 
-    optimizer = get_optimizer(list(model_and_loss.model.named_parameters()),
-                              args.fp16,
-                              args.lr,
-                              args.momentum,
-                              args.weight_decay,
-                              nesterov=args.nesterov,
-                              bn_weight_decay=args.bn_weight_decay,
-                              state=optimizer_state,
-                              static_loss_scale=args.static_loss_scale,
-                              dynamic_loss_scale=args.dynamic_loss_scale)
-
-    if args.lr_schedule == 'step':
-        lr_policy = lr_step_policy(args.lr, [30, 60, 80],
-                                   0.1,
-                                   args.warmup,
-                                   logger=logger)
-    elif args.lr_schedule == 'cosine':
-        lr_policy = lr_cosine_policy(args.lr,
-                                     args.warmup,
-                                     args.epochs,
-                                     logger=logger)
-    elif args.lr_schedule == 'linear':
-        lr_policy = lr_linear_policy(args.lr,
-                                     args.warmup,
-                                     args.epochs,
-                                     logger=logger)
+    optimizer = get_optimizer(
+        list(model_and_loss.model.named_parameters()),
+        args.fp16,
+        args.lr,
+        args.momentum,
+        args.weight_decay,
+        nesterov=args.nesterov,
+        bn_weight_decay=args.bn_weight_decay,
+        state=optimizer_state,
+        static_loss_scale=args.static_loss_scale,
+        dynamic_loss_scale=args.dynamic_loss_scale,
+    )
+
+    if args.lr_schedule == "step":
+        lr_policy = lr_step_policy(
+            args.lr, [30, 60, 80], 0.1, args.warmup, logger=logger
+        )
+    elif args.lr_schedule == "cosine":
+        lr_policy = lr_cosine_policy(args.lr, args.warmup, args.epochs, logger=logger)
+    elif args.lr_schedule == "linear":
+        lr_policy = lr_linear_policy(args.lr, args.warmup, args.epochs, logger=logger)
 
     if args.amp:
         model_and_loss, optimizer = amp.initialize(
             model_and_loss,
             optimizer,
-            opt_level="O2",
-            loss_scale="dynamic"
-            if args.dynamic_loss_scale else args.static_loss_scale)
+            opt_level="O1",
+            loss_scale="dynamic" if args.dynamic_loss_scale else args.static_loss_scale,
+        )
 
     if args.distributed:
         model_and_loss.distributed()
 
     model_and_loss.load_model_state(model_state)
 
-    train_loop(model_and_loss,
-               optimizer,
-               lr_policy,
-               train_loader,
-               val_loader,
-               args.epochs,
-               args.fp16,
-               logger,
-               should_backup_checkpoint(args),
-               use_amp=args.amp,
-               batch_size_multiplier=batch_size_multiplier,
-               start_epoch=start_epoch,
-               best_prec1=best_prec1,
-               prof=args.prof,
-               skip_training=args.evaluate,
-               skip_validation=args.training_only,
-               save_checkpoints=args.save_checkpoints and not args.evaluate,
-               checkpoint_dir=args.workspace)
+    train_loop(
+        model_and_loss,
+        optimizer,
+        lr_policy,
+        train_loader,
+        val_loader,
+        args.fp16,
+        logger,
+        should_backup_checkpoint(args),
+        use_amp=args.amp,
+        batch_size_multiplier=batch_size_multiplier,
+        start_epoch=start_epoch,
+        end_epoch=(start_epoch + args.run_epochs)
+        if args.run_epochs != -1
+        else args.epochs,
+        best_prec1=best_prec1,
+        prof=args.prof,
+        skip_training=args.evaluate,
+        skip_validation=args.training_only,
+        save_checkpoints=args.save_checkpoints and not args.evaluate,
+        checkpoint_dir=args.workspace,
+        checkpoint_filename=args.checkpoint_filename,
+    )
     exp_duration = time.time() - exp_start_time
-    if not torch.distributed.is_initialized() or torch.distributed.get_rank(
-    ) == 0:
+    if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
         logger.end()
     print("Experiment ended")
 
 
-if __name__ == '__main__':
-    parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="PyTorch ImageNet Training")
 
     add_parser_arguments(parser)
     args = parser.parse_args()

+ 57 - 33
PyTorch/Classification/ConvNets/multiproc.py

@@ -78,46 +78,70 @@ from argparse import ArgumentParser, REMAINDER
 
 import torch
 
+
 def parse_args():
     """
     Helper function parsing the command line options
     @retval ArgumentParser
     """
-    parser = ArgumentParser(description="PyTorch distributed training launch "
-                                        "helper utilty that will spawn up "
-                                        "multiple distributed processes")
+    parser = ArgumentParser(
+        description="PyTorch distributed training launch "
+        "helper utilty that will spawn up "
+        "multiple distributed processes"
+    )
 
     # Optional arguments for the launch helper
-    parser.add_argument("--nnodes", type=int, default=1,
-                        help="The number of nodes to use for distributed "
-                             "training")
-    parser.add_argument("--node_rank", type=int, default=0,
-                        help="The rank of the node for multi-node distributed "
-                             "training")
-    parser.add_argument("--nproc_per_node", type=int, default=1,
-                        help="The number of processes to launch on each node, "
-                             "for GPU training, this is recommended to be set "
-                             "to the number of GPUs in your system so that "
-                             "each process can be bound to a single GPU.")
-    parser.add_argument("--master_addr", default="127.0.0.1", type=str,
-                        help="Master node (rank 0)'s address, should be either "
-                             "the IP address or the hostname of node 0, for "
-                             "single node multi-proc training, the "
-                             "--master_addr can simply be 127.0.0.1")
-    parser.add_argument("--master_port", default=29500, type=int,
-                        help="Master node (rank 0)'s free port that needs to "
-                             "be used for communciation during distributed "
-                             "training")
+    parser.add_argument(
+        "--nnodes",
+        type=int,
+        default=1,
+        help="The number of nodes to use for distributed " "training",
+    )
+    parser.add_argument(
+        "--node_rank",
+        type=int,
+        default=0,
+        help="The rank of the node for multi-node distributed " "training",
+    )
+    parser.add_argument(
+        "--nproc_per_node",
+        type=int,
+        default=1,
+        help="The number of processes to launch on each node, "
+        "for GPU training, this is recommended to be set "
+        "to the number of GPUs in your system so that "
+        "each process can be bound to a single GPU.",
+    )
+    parser.add_argument(
+        "--master_addr",
+        default="127.0.0.1",
+        type=str,
+        help="Master node (rank 0)'s address, should be either "
+        "the IP address or the hostname of node 0, for "
+        "single node multi-proc training, the "
+        "--master_addr can simply be 127.0.0.1",
+    )
+    parser.add_argument(
+        "--master_port",
+        default=29500,
+        type=int,
+        help="Master node (rank 0)'s free port that needs to "
+        "be used for communciation during distributed "
+        "training",
+    )
 
     # positional
-    parser.add_argument("training_script", type=str,
-                        help="The full path to the single GPU training "
-                             "program/script to be launched in parallel, "
-                             "followed by all the arguments for the "
-                             "training script")
+    parser.add_argument(
+        "training_script",
+        type=str,
+        help="The full path to the single GPU training "
+        "program/script to be launched in parallel, "
+        "followed by all the arguments for the "
+        "training script",
+    )
 
     # rest from the training program
-    parser.add_argument('training_script_args', nargs=REMAINDER)
+    parser.add_argument("training_script_args", nargs=REMAINDER)
     return parser.parse_args()
 
 
@@ -142,13 +166,13 @@ def main():
         current_env["LOCAL_RANK"] = str(local_rank)
 
         # spawn the processes
-        cmd = [sys.executable,
-               "-u",
-               args.training_script] + args.training_script_args
+        cmd = [sys.executable, "-u", args.training_script] + args.training_script_args
 
         print(cmd)
 
-        stdout = None if local_rank == 0 else open("GPU_"+str(local_rank)+".log", "w")
+        stdout = (
+            None if local_rank == 0 else open("GPU_" + str(local_rank) + ".log", "w")
+        )
 
         process = subprocess.Popen(cmd, env=current_env, stdout=stdout)
         processes.append(process)

+ 1 - 0
PyTorch/Classification/ConvNets/requirements.txt

@@ -1 +1,2 @@
+pytorch-ignite
 git+git://github.com/NVIDIA/dllogger.git@26a0f8f1958de2c0c460925ff6102a4d2486d6cc#egg=dllogger

+ 126 - 124
PyTorch/Classification/ConvNets/resnet50v1.5/README.md

@@ -6,7 +6,6 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
 ## Table Of Contents
 
 * [Model overview](#model-overview)
-  * [Model architecture](#model-architecture)
   * [Default configuration](#default-configuration)
     * [Optimizer](#optimizer)
     * [Data augmentation](#data-augmentation)
@@ -15,33 +14,32 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
     * [Features](#features)
   * [Mixed precision training](#mixed-precision-training)
     * [Enabling mixed precision](#enabling-mixed-precision)
+    * [Enabling TF32](#enabling-tf32)
 * [Setup](#setup)
   * [Requirements](#requirements)
 * [Quick Start Guide](#quick-start-guide)
 * [Advanced](#advanced)
   * [Scripts and sample code](#scripts-and-sample-code)
-    * [Parameters](#parameters)
-    * [Command-line options](#command-line-options)
-    * [Getting the data](#getting-the-data)
-        * [Dataset guidelines](#dataset-guidelines)
-        * [Multi-dataset](#multi-dataset)
-    * [Training process](#training-process)
-    * [Inference process](#inference-process)
-
+  * [Command-line options](#command-line-options)
+  * [Dataset guidelines](#dataset-guidelines)
+  * [Training process](#training-process)
+  * [Inference process](#inference-process)
 * [Performance](#performance)
   * [Benchmarking](#benchmarking)
     * [Training performance benchmark](#training-performance-benchmark)
     * [Inference performance benchmark](#inference-performance-benchmark)
   * [Results](#results)
     * [Training accuracy results](#training-accuracy-results)
-      * [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-(8x-v100-16G))
-      * [Example plots](*example-plots)
+      * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
+      * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
+      * [Training accuracy: NVIDIA DGX-2 (16x V100 32GB)](#training-accuracy-nvidia-dgx-2-16x-v100-32gb)
+      * [Example plots](#example-plots)
     * [Training performance results](#training-performance-results)
-      * [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-(8x-v100-16G))
-    * [Training time for 90 epochs](#training-time-for-90-epochs)
-      * [Training time: NVIDIA DGX-1 (8x V100 16G)](#training-time-nvidia-dgx-1-(8x-v100-16G))
+      * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
+      * [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
+      * [Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)](#training-performance-nvidia-dgx-1-32gb-8x-v100-32gb)
   * [Inference performance results](#inference-performance-results)
-      * [Inference performance: NVIDIA DGX-1 (1x V100 16G)](#inference-performance-nvidia-dgx-1-(1x-v100-16G))
+      * [Inference performance: NVIDIA DGX-1 16GB (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
       * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
 * [Release notes](#release-notes)
   * [Changelog](#changelog)
@@ -57,6 +55,10 @@ This difference makes ResNet50 v1.5 slightly more accurate (~0.5% top1) than v1,
 
 The model is initialized as described in [Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification](https://arxiv.org/pdf/1502.01852.pdf)
 
+This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results over 2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
+
+We are currently working on adding [NHWC data layout](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) support for Mixed Precision training.
+
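
The NHWC layout mentioned above maps to PyTorch's `channels_last` memory format, which can be opted into per tensor and per module. A minimal sketch (assumes a recent PyTorch build; setting the layout needs no GPU, though the speedup requires Tensor Cores):

```python
import torch
import torch.nn as nn

# Convert module parameters and the input batch to channels_last (NHWC).
# The logical shape stays NCHW; only the underlying strides change, so
# channels become the fastest-varying dimension.
model = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(memory_format=torch.channels_last)
x = torch.randn(2, 3, 16, 16).to(memory_format=torch.channels_last)

print(x.shape)     # still torch.Size([2, 3, 16, 16])
print(x.stride())  # channel stride is now 1
```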
 ### Default configuration
 
 The following sections highlight the default configurations for the ResNet50 model.
@@ -66,31 +68,20 @@ The following sections highlight the default configurations for the ResNet50 mod
 This model uses SGD with momentum optimizer with the following hyperparameters:
 
 * Momentum (0.875)
-
-* Learning rate (LR) = 0.256 for 256 batch size, for other batch sizes we lineary
+* Learning rate (LR) = 0.256 for 256 batch size, for other batch sizes we linearly
 scale the learning rate.
-
 * Learning rate schedule - we use cosine LR schedule
-
 * For bigger batch sizes (512 and up) we use linear warmup of the learning rate
 during the first couple of epochs
 according to [Training ImageNet in 1 hour](https://arxiv.org/abs/1706.02677).
 Warmup length depends on the total training length.
-
 * Weight decay (WD)= 3.0517578125e-05 (1/32768).
-
 * We do not apply WD on Batch Norm trainable parameters (gamma/bias)
-
 * Label smoothing = 0.1
-
 * We train for:
-
     * 50 Epochs -> configuration that reaches 75.9% top1 accuracy
-
     * 90 Epochs -> 90 epochs is a standard for ImageNet networks
-
     * 250 Epochs -> best possible accuracy.
-
 * For 250 epoch training we also use [MixUp regularization](https://arxiv.org/pdf/1710.09412.pdf).
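
As an illustration only (not the repository's exact implementation), the linear scaling rule, warmup, and cosine schedule above combine roughly like this:

```python
import math

def learning_rate(epoch, batch_size, base_lr=0.256, base_bs=256,
                  warmup_epochs=5, total_epochs=90):
    """Linearly scaled peak LR with linear warmup and cosine decay."""
    peak_lr = base_lr * batch_size / base_bs      # linear scaling rule
    if epoch < warmup_epochs:                     # linear warmup
        return peak_lr * (epoch + 1) / warmup_epochs
    # cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

print(learning_rate(0, 512))    # warmup start: 0.1024
print(learning_rate(4, 512))    # end of warmup: peak LR 0.512
print(learning_rate(89, 512))   # near zero at the final epoch
```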
 
 
@@ -104,7 +95,6 @@ This model uses the following data augmentation:
     * Scale from 8% to 100%
     * Aspect ratio from 3/4 to 4/3
   * Random horizontal flip
-
 * For inference:
   * Normalization
   * Scale to 256x256
@@ -141,7 +131,7 @@ The following features are supported by this model:
 #### Features
 
 - NVIDIA DALI - DALI is a library accelerating data preparation pipeline. To accelerate your input pipeline, you only need to define your data loader
-with the DALI library. For more information about DALI, refer to the [DALI product documentation](https://docs.nvidia.com/deeplearning/sdk/index.html#data-loading).
+with the DALI library. For more information about DALI, refer to the [DALI product documentation](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html).
 
 - [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains utility libraries, such as [Automatic Mixed Precision (AMP)](https://nvidia.github.io/apex/amp.html), which require minimal network code changes to leverage Tensor Cores performance. Refer to the [Enabling mixed precision](#enabling-mixed-precision) section for more details.
 
@@ -152,20 +142,19 @@ which speeds up data loading when CPU becomes a bottleneck.
 DALI can use CPU or GPU, and outperforms the PyTorch native dataloader.
 
 Run training with `--data-backends dali-gpu` or `--data-backends dali-cpu` to enable DALI.
-For DGX1 we recommend `--data-backends dali-cpu`, for DGX2 we recommend `--data-backends dali-gpu`.
+For DGXA100 and DGX1 we recommend `--data-backends dali-cpu`, for DGX2 we recommend `--data-backends dali-gpu`.
 
 ### Mixed precision training
 
-Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
+Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
 1.  Porting the model to use the FP16 data type where appropriate.
 2.  Adding loss scaling to preserve small gradient values.
 
-The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
+The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
 
 For information about:
 -   How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
 -   Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
--   How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
 -   APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
 
 #### Enabling mixed precision
@@ -177,33 +166,34 @@ In PyTorch, loss scaling can be easily applied by using scale_loss() method prov
 For an in-depth walk through on AMP, check out sample usage [here](https://github.com/NVIDIA/apex/tree/master/apex/amp#usage-and-getting-started). [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains utility libraries, such as AMP, which require minimal network code changes to leverage tensor cores performance.
 
 To enable mixed precision, you can:
-- Import AMP from APEX, for example:
+- Import AMP from APEX:
 
-  ```
+  ```python
   from apex import amp
   ```
-- Initialize an AMP handle, for example:
 
-  ```
-  amp_handle = amp.init(enabled=True, verbose=True)
-  ```
-- Wrap your optimizer with the AMP handle, for example:
+- Wrap the model and optimizer with `amp.initialize`:
 
+  ```python
+  model, optimizer = amp.initialize(model, optimizer, opt_level="O1", loss_scale="dynamic")
   ```
-  optimizer = amp_handle.wrap_optimizer(optimizer)
+
+- Scale loss before backpropagation:
+  ```python
+  with amp.scale_loss(loss, optimizer) as scaled_loss:
+    scaled_loss.backward()
   ```
-- Scale loss before backpropagation (assuming loss is stored in a variable called losses)
-  - Default backpropagate for FP32:
 
-    ```
-    losses.backward()
-    ```
-  - Scale loss and backpropagate with AMP:
+#### Enabling TF32
+
+TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups over single-precision floating-point math (FP32) on Volta GPUs.
+
+TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
+
+For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
+
+TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
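
TF32 requires no model changes; in recent PyTorch builds (1.7 and later) it can be verified or overridden via global flags. A sketch (note that the default values of these flags have shifted across PyTorch releases, and they are no-ops on non-Ampere hardware):

```python
import torch

# Explicitly enable TF32 for matmuls and cuDNN convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# To force full FP32 when exact reproducibility matters, set both to False.
print(torch.backends.cuda.matmul.allow_tf32)
print(torch.backends.cudnn.allow_tf32)
```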
 
-    ```
-    with optimizer.scale_loss(losses) as scaled_losses:
-       scaled_losses.backward()
-    ```
 
 ## Setup
 
@@ -214,8 +204,11 @@ The following section lists the requirements that you need to meet in order to s
 This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
 
 * [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 19.10-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
-* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
+* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
+* Supported GPUs:
+    * [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
+    * [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
+    * [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
 
 For more information about how to get started with NGC containers, see the
 following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
@@ -240,6 +233,11 @@ The ResNet50 script operates on ImageNet 1k, a widely popular image classificati
 
 PyTorch can work directly on JPEGs, therefore, preprocessing/augmentation is not needed.
 
+To train your model using mixed or TF32 precision with Tensor Cores or using FP32,
+perform the following steps using the default parameters of the resnet50 model on the ImageNet dataset.
+For the specifics concerning training and inference, see the [Advanced](#advanced) section.
+
+
 1. [Download the images](http://image-net.org/download-images).
 
 2. Extract the training data:
@@ -269,16 +267,17 @@ docker build . -t nvidia_rn50
 nvidia-docker run --rm -it -v <path to imagenet>:/data/imagenet --ipc=host nvidia_rn50
 ```
 
+
 ### 5. Start training
 
-To run training for a standard configuration (DGX1V/DGX2V, FP16/FP32, 50/90/250 Epochs),
+To run training for a standard configuration (DGXA100/DGX1/DGX2, AMP/TF32/FP32, 50/90/250 Epochs),
 run one of the scripts in the `./resnet50v1.5/training` directory
-called `./resnet50v1.5/training/{DGX1, DGX2}_RN50_{AMP, FP16, FP32}_{50,90,250}E.sh`.
+called `./resnet50v1.5/training/{AMP, TF32, FP32}/{DGXA100, DGX1, DGX2}_RN50_{AMP, TF32, FP32}_{50,90,250}E.sh`.
 
 Ensure ImageNet is mounted in the `/data/imagenet` directory.
 
 Example:
-    `bash ./resnet50v1.5/training/DGX1_RN50_FP16_250E.sh <path were to store checkpoints and logs>`
+    `bash ./resnet50v1.5/training/AMP/DGX1_RN50_AMP_250E.sh <path where to store checkpoints and logs>`
 
 ### 6. Start inference
 
@@ -292,7 +291,7 @@ To run inference on JPEG image, you have to first extract the model weights from
 
 Then run classification script:
 
-`python classify.py --arch resnet50 -c fanin --weights <path to weights from previous step> --precision AMP|FP16|FP32 --image <path to JPEG image>`
+`python classify.py --arch resnet50 -c fanin --weights <path to weights from previous step> --precision AMP|FP32 --image <path to JPEG image>`
 
 
 ## Advanced
@@ -317,7 +316,7 @@ To run a non standard configuration use:
 Use `python ./main.py -h` to obtain the list of available options in the `main.py` script.
 
 
-### Commmand-line options:
+### Command-line options:
 
 To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example:
 
@@ -326,16 +325,17 @@ To see the full list of available options and their descriptions, use the `-h` o
 
 ```
 usage: main.py [-h] [--data-backend BACKEND] [--arch ARCH]
-               [--model-config CONF] [-j N] [--epochs N] [-b N]
-               [--optimizer-batch-size N] [--lr LR] [--lr-schedule SCHEDULE]
-               [--warmup E] [--label-smoothing S] [--mixup ALPHA]
-               [--momentum M] [--weight-decay W] [--bn-weight-decay]
-               [--nesterov] [--print-freq N] [--resume PATH]
-               [--pretrained-weights PATH] [--fp16]
+               [--model-config CONF] [--num-classes N] [-j N] [--epochs N]
+               [--run-epochs N] [-b N] [--optimizer-batch-size N] [--lr LR]
+               [--lr-schedule SCHEDULE] [--warmup E] [--label-smoothing S]
+               [--mixup ALPHA] [--momentum M] [--weight-decay W]
+               [--bn-weight-decay] [--nesterov] [--print-freq N]
+               [--resume PATH] [--pretrained-weights PATH] [--fp16]
                [--static-loss-scale STATIC_LOSS_SCALE] [--dynamic-loss-scale]
-               [--prof N] [--amp] [--local_rank LOCAL_RANK] [--seed SEED]
-               [--gather-checkpoints] [--raport-file RAPORT_FILE] [--evaluate]
-               [--training-only] [--no-checkpoints] [--workspace DIR]
+               [--prof N] [--amp] [--seed SEED] [--gather-checkpoints]
+               [--raport-file RAPORT_FILE] [--evaluate] [--training-only]
+               [--no-checkpoints] [--checkpoint-filename CHECKPOINT_FILENAME]
+               [--workspace DIR] [--memory-format {nchw,nhwc}]
                DIR
 
 PyTorch ImageNet Training
@@ -349,13 +349,15 @@ optional arguments:
                         data backend: pytorch | syntetic | dali-gpu | dali-cpu
                         (default: dali-cpu)
   --arch ARCH, -a ARCH  model architecture: resnet18 | resnet34 | resnet50 |
-                        resnet101 | resnet152 | resnet50 | se-
-                        resnet50 (default: resnet50)
+                        resnet101 | resnet152 | resnext101-32x4d | se-
+                        resnext101-32x4d (default: resnet50)
   --model-config CONF, -c CONF
                         model configs: classic | fanin | grp-fanin | grp-
                         fanout(default: classic)
+  --num-classes N       number of classes in the dataset
   -j N, --workers N     number of data loading workers (default: 5)
   --epochs N            number of total epochs to run
+  --run-epochs N        run only N epochs, used for checkpointing runs
   -b N, --batch-size N  mini-batch size (default: 256) per gpu
   --optimizer-batch-size N
                         size of a total batch size, for simulating bigger
@@ -385,9 +387,6 @@ optional arguments:
                         supersedes --static-loss-scale.
   --prof N              Run only N iterations
   --amp                 Run model AMP (automatic mixed precision) mode.
-  --local_rank LOCAL_RANK
-                        Local rank of python process. Set up by distributed
-                        launcher
   --seed SEED           random seed used for numpy and pytorch
   --gather-checkpoints  Gather checkpoints throughout the training, without
                         this flag only best and last checkpoints will be
@@ -397,7 +396,10 @@ optional arguments:
   --evaluate            evaluate checkpoint/model
   --training-only       do not evaluate
   --no-checkpoints      do not store any checkpoints, useful for benchmarking
+  --checkpoint-filename CHECKPOINT_FILENAME
   --workspace DIR       path to directory where checkpoints will be stored
+  --memory-format {nchw,nhwc}
+                        memory layout, nchw or nhwc
 ```
 
 
@@ -466,9 +468,7 @@ To run inference on JPEG image, you have to first extract the model weights from
 
 Then run classification script:
 
-`python classify.py --arch resnet50 -c fanin --weights <path to weights from previous step> --precision AMP|FP16|FP32 --image <path to JPEG image>`
-
-Example output:
+`python classify.py --arch resnet50 -c fanin --weights <path to weights from previous step> --precision AMP|FP32 --image <path to JPEG image>`
 
 
 
@@ -484,21 +484,26 @@ To benchmark training, run:
 
 * For 1 GPU
     * FP32
-`python ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
-    * FP16
-`python ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --fp16 --static-loss-scale 256 <path to imagenet>`
+`python ./main.py --arch resnet50 -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
     * AMP
-`python ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 <path to imagenet>`
+`python ./main.py --arch resnet50 -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 <path to imagenet>`
 * For multiple GPUs
     * FP32
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
-    * FP16
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --fp16 --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
     * AMP
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnet50 -b <batch_size> --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
 
 Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
 
+The batch size should be chosen to match the hardware configuration:
+
+| *Platform* | *Precision* | *Batch Size* |
+|:----------:|:-----------:|:------------:|
+| DGXA100    | AMP         | 256          |
+| DGXA100    | TF32        | 256          |
+| DGX-1      | AMP         | 256          |
+| DGX-1      | FP32        | 128          |
+
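As a sketch, the table above can be turned into a small helper that picks the recommended batch size and assembles the benchmark command. The `RECOMMENDED_BS` mapping and `benchmark_cmd` helper are illustrative and not part of the repository:

```python
# Hypothetical helper mapping (platform, precision) to the recommended
# batch size from the table above; falls back to a conservative 128.
RECOMMENDED_BS = {
    ("DGXA100", "AMP"): 256,
    ("DGXA100", "TF32"): 256,
    ("DGX-1", "AMP"): 256,
    ("DGX-1", "FP32"): 128,
}

def benchmark_cmd(platform, precision, imagenet="/imagenet"):
    bs = RECOMMENDED_BS.get((platform, precision), 128)
    return (f"python ./main.py --arch resnet50 -b {bs} --training-only -p 1 "
            f"--raport-file benchmark.json --epochs 1 --prof 100 {imagenet}")
```

For example, `benchmark_cmd("DGX-1", "FP32")` yields the FP32 single-GPU benchmark command with `-b 128`.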
 #### Inference performance benchmark
 
 To benchmark inference, run:
@@ -507,34 +512,45 @@ To benchmark inference, run:
 
 `python ./main.py --arch resnet50 -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
 
-* FP16
-
-`python ./main.py --arch resnet50 -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --fp16 <path to imagenet>`
-
 * AMP
 
 `python ./main.py --arch resnet50 -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp <path to imagenet>`
 
 Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
 
+The batch size should be chosen to match the hardware configuration:
+
+| *Platform* | *Precision* | *Batch Size* |
+|:----------:|:-----------:|:------------:|
+| DGXA100    | AMP         | 256          |
+| DGXA100    | TF32        | 256          |
+| DGX-1      | AMP         | 256          |
+| DGX-1      | FP32        | 128          |
 
 ### Results
 
-Our results were obtained by running the applicable training script     in the pytorch-19.10 NGC container.
+Our results were obtained by running the applicable training script in the pytorch-20.06 NGC container.
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
 #### Training accuracy results
 
-##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
+##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
+
+| **epochs** | **Mixed Precision Top1** | **TF32 Top1** |
+|:------:|:--------------------:|:--------------:|
+|     90 |    76.93 +/- 0.23    | 76.85 +/- 0.30 |
+
+
+##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
 
 | **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
 |:-:|:-:|:-:|
 | 50 | 76.25 +/- 0.04 | 76.26 +/- 0.07 |
-| 90 | 77.23 +/- 0.04 | 77.08 +/- 0.08 |
+|     90 |    77.09 +/- 0.10    | 77.01 +/- 0.16 |
 | 250 | 78.42 +/- 0.04 | 78.30 +/- 0.16 |
 
-##### Training accuracy: NVIDIA DGX-2 (16x V100 32G)
+##### Training accuracy: NVIDIA DGX-2 (16x V100 32GB)
 
 | **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
 |:-:|:-:|:-:|
@@ -556,48 +572,31 @@ The following images show a 250 epochs configuration on a DGX-1V.
 
 #### Training performance results
 
-##### Traininig performance: NVIDIA DGX1-16G (8x V100 16G)
+##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
 
-| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 893.09 img/s | 380.44 img/s | 2.35x | 1.00x | 1.00x |
-| 8 | 6888.75 img/s | 2945.37 img/s | 2.34x | 7.71x | 7.74x |
+|**GPUs**|**Mixed Precision**|  **TF32**   |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**TF32 Strong Scaling**|**TF32 Training Time (90E)**|
+|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
+|   1    |   1240.81 img/s   |680.15 img/s |           1.82x           |              1.00x               |               ~27 hours               |         1.00x         |         ~49 hours          |
+|   8    |   9604.92 img/s   |5379.82 img/s|           1.79x           |              7.74x               |               ~4 hours                |         7.91x         |          ~6 hours          |
 
-##### Traininig performance: NVIDIA DGX1-32G (8x V100 32G)
+##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
 
-| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 849.63 img/s | 373.93 img/s | 2.27x | 1.00x | 1.00x |
-| 8 | 6614.15 img/s | 2911.22 img/s | 2.27x | 7.78x | 7.79x |
-
-##### Traininig performance: NVIDIA DGX2 (16x V100 32G)
-
-| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 894.41 img/s | 402.23 img/s | 2.22x | 1.00x | 1.00x |
-| 16 | 13443.82 img/s | 6263.41 img/s | 2.15x | 15.03x | 15.57x |
-
-#### Training Time for 90 Epochs
+|**GPUs**|**Mixed Precision**|  **FP32**   |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
+|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
+|   1    |   856.52 img/s    |373.21 img/s |           2.30x           |              1.00x               |               ~39 hours               |         1.00x         |         ~89 hours          |
+|   8    |   6635.90 img/s   |2899.62 img/s|           2.29x           |              7.75x               |               ~5 hours                |         7.77x         |         ~12 hours          |
 
-##### Training time: NVIDIA DGX-1 (8x V100 16G)
-
-| **GPUs** | **Mixed Precision training time** | **FP32 training time** |
-|:-:|:-:|:-:|
-| 1 | ~ 41 h | ~ 95 h |
-| 8 | ~ 7 h | ~ 14 h |
-
-##### Training time: NVIDIA DGX-2 (16x V100 32G)
-
-| **GPUs** | **Mixed Precision training time** | **FP32 training time** |
-|:-:|:-:|:-:|
-| 1 | ~ 41 h | ~ 90 h |
-| 16 | ~ 5 h | ~ 8 h |
+##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
 
+|**GPUs**|**Mixed Precision**|  **FP32**   |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
+|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
+|   1    |   816.00 img/s    |359.76 img/s |           2.27x           |              1.00x               |               ~41 hours               |         1.00x         |         ~93 hours          |
+|   8    |   6347.26 img/s   |2813.23 img/s|           2.26x           |              7.78x               |               ~5 hours                |         7.82x         |         ~12 hours          |
 
 
 #### Inference performance results
 
-##### Inference performance: NVIDIA DGX-1 (1x V100 16G)
+##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
 
 ###### FP32 Inference Latency
 
@@ -680,6 +679,9 @@ The following images show a 250 epochs configuration on a DGX-1V.
 4. July 2019
   * DALI-CPU dataloader
   * Updated README
+5. July 2020
+  * Added A100 scripts
+  * Updated README
 
 ### Known issues
 

+ 1 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGX1_RN50_AMP_250E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 250 --mixup 0.2
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 250 --mixup 0.2

+ 1 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGX1_RN50_AMP_50E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 50
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 50

+ 1 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGX1_RN50_AMP_90E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90

+ 1 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGX2_RN50_AMP_250E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 250 --mixup 0.2
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 250 --mixup 0.2

+ 1 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGX2_RN50_AMP_50E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 50
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 50

+ 1 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGX2_RN50_AMP_90E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/AMP/DGXA100_RN50_AMP_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --amp --static-loss-scale 128 --epochs 90

+ 0 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/FP16/DGX1_RN50_FP16_250E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --fp16 --static-loss-scale 128 --epochs 250 --mixup 0.2

+ 0 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/FP16/DGX1_RN50_FP16_50E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --fp16 --static-loss-scale 128 --epochs 50

+ 0 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/FP16/DGX1_RN50_FP16_90E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --fp16 --static-loss-scale 128 --epochs 90

+ 0 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/FP16/DGX2_RN50_FP16_250E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --fp16 --static-loss-scale 128 --epochs 250 --mixup 0.2

+ 0 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/FP16/DGX2_RN50_FP16_50E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --fp16 --static-loss-scale 128 --epochs 50

+ 0 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/FP16/DGX2_RN50_FP16_90E.sh

@@ -1 +0,0 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --fp16 --static-loss-scale 128 --epochs 90

+ 1 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/FP32/DGX1_RN50_FP32_250E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 250 --mixup 0.2
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 250 --mixup 0.2

+ 1 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/FP32/DGX1_RN50_FP32_50E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 50
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 50

+ 1 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/FP32/DGX1_RN50_FP32_90E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j5 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 90
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j8 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 90

+ 1 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/FP32/DGX2_RN50_FP32_250E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 250 --mixup 0.2
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 250 --mixup 0.2

+ 1 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/FP32/DGX2_RN50_FP32_50E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 50
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 50

+ 1 - 1
PyTorch/Classification/ConvNets/resnet50v1.5/training/FP32/DGX2_RN50_FP32_90E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j5 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 90
+python ./multiproc.py --nproc_per_node 16 ./main.py /imagenet --data-backend dali-gpu --raport-file raport.json -j8 -p 100 --lr 4.096 --optimizer-batch-size 4096 --warmup 16 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 128 --epochs 90

+ 1 - 0
PyTorch/Classification/ConvNets/resnet50v1.5/training/TF32/DGXA100_RN50_TF32_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --data-backend dali-cpu --raport-file raport.json -j16 -p 100 --lr 2.048 --optimizer-batch-size 2048 --warmup 8 --arch resnet50 -c fanin --label-smoothing 0.1 --lr-schedule cosine --mom 0.875 --wd 3.0517578125e-05 --workspace ${1:-./} -b 256 --epochs 90

+ 123 - 123
PyTorch/Classification/ConvNets/resnext101-32x4d/README.md

@@ -15,33 +15,31 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
     * [Features](#features)
   * [Mixed precision training](#mixed-precision-training)
     * [Enabling mixed precision](#enabling-mixed-precision)
+    * [Enabling TF32](#enabling-tf32)
 * [Setup](#setup)
   * [Requirements](#requirements)
 * [Quick Start Guide](#quick-start-guide)
 * [Advanced](#advanced)
   * [Scripts and sample code](#scripts-and-sample-code)
-    * [Parameters](#parameters)
-    * [Command-line options](#command-line-options)
-    * [Getting the data](#getting-the-data)
-        * [Dataset guidelines](#dataset-guidelines)
-        * [Multi-dataset](#multi-dataset)
-    * [Training process](#training-process)
-    * [Inference process](#inference-process)
-
+  * [Command-line options](#command-line-options)
+  * [Dataset guidelines](#dataset-guidelines)
+  * [Training process](#training-process)
+  * [Inference process](#inference-process)
 * [Performance](#performance)
   * [Benchmarking](#benchmarking)
     * [Training performance benchmark](#training-performance-benchmark)
     * [Inference performance benchmark](#inference-performance-benchmark)
   * [Results](#results)
     * [Training accuracy results](#training-accuracy-results)
-      * [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-(8x-v100-16G))
-      * [Example plots](*example-plots)
+      * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
+      * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
+      * [Example plots](#example-plots)
     * [Training performance results](#training-performance-results)
-      * [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-(8x-v100-16G))
-    * [Training time for 90 epochs](#training-time-for-90-epochs)
-      * [Training time: NVIDIA DGX-1 (8x V100 16G)](#training-time-nvidia-dgx-1-(8x-v100-16G))
+      * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
+      * [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
+      * [Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)](#training-performance-nvidia-dgx-1-32gb-8x-v100-32gb)
   * [Inference performance results](#inference-performance-results)
-      * [Inference performance: NVIDIA DGX-1 (1x V100 16G)](#inference-performance-nvidia-dgx-1-(1x-v100-16G))
+      * [Inference performance: NVIDIA DGX-1 16GB (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
       * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
 * [Release notes](#release-notes)
   * [Changelog](#changelog)
@@ -53,11 +51,15 @@ The ResNeXt101-32x4d is a model introduced in the [Aggregated Residual Transform
 
 It is based on regular ResNet model, substituting 3x3 convolutions inside the bottleneck block for 3x3 grouped convolutions.
 
+This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 3x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
+
+We use the [NHWC data layout](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) when training with mixed precision.
+
 ### Model architecture
 
 ![ResNextArch](./img/ResNeXtArch.png)
 
-_ Image source: [Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/pdf/1611.05431.pdf) _
+_Image source: [Aggregated Residual Transformations for Deep Neural Networks](https://arxiv.org/pdf/1611.05431.pdf)_
 
 The image shows the difference between the ResNet bottleneck block and the ResNeXt bottleneck block.
 
@@ -72,29 +74,19 @@ The following sections highlight the default configurations for the ResNeXt101-3
 This model uses SGD with momentum optimizer with the following hyperparameters:
 
 * Momentum (0.875)
-
-* Learning rate (LR) = 0.256 for 256 batch size, for other batch sizes we lineary
+* Learning rate (LR) = 0.256 for batch size 256; for other batch sizes we linearly
 scale the learning rate.
-
 * Learning rate schedule - we use cosine LR schedule
-
 * For bigger batch sizes (512 and up) we use linear warmup of the learning rate
 during the first couple of epochs
 according to [Training ImageNet in 1 hour](https://arxiv.org/abs/1706.02677).
 Warmup length depends on the total training length.
-
 * Weight decay (WD)= 6.103515625e-05 (1/16384).
-
 * We do not apply WD on Batch Norm trainable parameters (gamma/bias)
-
 * Label smoothing = 0.1
-
 * We train for:
-
     * 90 Epochs -> 90 epochs is a standard for ImageNet networks
-
     * 250 Epochs -> best possible accuracy.
-
 * For 250 epoch training we also use [MixUp regularization](https://arxiv.org/pdf/1710.09412.pdf).
 
 
@@ -108,7 +100,6 @@ This model uses the following data augmentation:
     * Scale from 8% to 100%
     * Aspect ratio from 3/4 to 4/3
   * Random horizontal flip
-
 * For inference:
   * Normalization
   * Scale to 256x256
@@ -120,13 +111,13 @@ The following features are supported by this model:
 
 | Feature               | ResNeXt101-32x4d
 |-----------------------|--------------------------
-|[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html)   |   Yes
+|[DALI](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html)   |   Yes
 |[APEX AMP](https://nvidia.github.io/apex/amp.html) | Yes |
 
 #### Features
 
 - NVIDIA DALI - DALI is a library accelerating data preparation pipeline. To accelerate your input pipeline, you only need to define your data loader
-with the DALI library. For more information about DALI, refer to the [DALI product documentation](https://docs.nvidia.com/deeplearning/sdk/index.html#data-loading).
+with the DALI library. For more information about DALI, refer to the [DALI product documentation](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html).
 
 - [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains utility libraries, such as [Automatic Mixed Precision (AMP)](https://nvidia.github.io/apex/amp.html), which require minimal network code changes to leverage Tensor Cores performance. Refer to the [Enabling mixed precision](#enabling-mixed-precision) section for more details.
 
@@ -137,7 +128,7 @@ which speeds up data loading when CPU becomes a bottleneck.
 DALI can use CPU or GPU, and outperforms the PyTorch native dataloader.
 
 Run training with `--data-backends dali-gpu` or `--data-backends dali-cpu` to enable DALI.
-For ResNeXt101-32x4d, for DGX1 and DGX2 we recommend `--data-backends dali-cpu`.
+For ResNeXt101-32x4d, we recommend `--data-backends dali-cpu` on DGX A100, DGX-1 and DGX-2.
 
 ### Mixed precision training
 
@@ -145,12 +136,11 @@ Mixed precision is the combined use of different numerical precisions in a compu
 1.  Porting the model to use the FP16 data type where appropriate.
 2.  Adding loss scaling to preserve small gradient values.
 
-The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
+The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
 
 For information about:
 -   How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
 -   Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
--   How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
 -   APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
 
 #### Enabling mixed precision
@@ -162,33 +152,34 @@ In PyTorch, loss scaling can be easily applied by using scale_loss() method prov
 For an in-depth walk through on AMP, check out sample usage [here](https://github.com/NVIDIA/apex/tree/master/apex/amp#usage-and-getting-started). [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains utility libraries, such as AMP, which require minimal network code changes to leverage tensor cores performance.
 
 To enable mixed precision, you can:
-- Import AMP from APEX, for example:
+- Import AMP from APEX:
 
-  ```
+  ```python
   from apex import amp
   ```
-- Initialize an AMP handle, for example:
 
-  ```
-  amp_handle = amp.init(enabled=True, verbose=True)
-  ```
-- Wrap your optimizer with the AMP handle, for example:
+- Wrap model and optimizer in amp.initialize:
 
+  ```python
+  model, optimizer = amp.initialize(model, optimizer, opt_level="O1", loss_scale="dynamic")
   ```
-  optimizer = amp_handle.wrap_optimizer(optimizer)
+
+- Scale loss before backpropagation:
+  ```python
+  with amp.scale_loss(loss, optimizer) as scaled_loss:
+    scaled_loss.backward()
   ```
-- Scale loss before backpropagation (assuming loss is stored in a variable called losses)
-  - Default backpropagate for FP32:
 
-    ```
-    losses.backward()
-    ```
-  - Scale loss and backpropagate with AMP:
+#### Enabling TF32
+
+TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
+
+TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
+
+For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
+
+TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
 
-    ```
-    with optimizer.scale_loss(losses) as scaled_losses:
-       scaled_losses.backward()
-    ```
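
TF32 keeps FP32's 8-bit exponent (so the dynamic range is unchanged) but shortens the mantissa to 10 explicit bits. A pure-Python sketch of the precision loss (using truncation for simplicity; the hardware rounds):

```python
import struct

def tf32_truncate(x: float) -> float:
    # Reinterpret as IEEE-754 binary32 and zero the 13 low mantissa
    # bits, leaving the 10 explicit mantissa bits that TF32 keeps.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(tf32_truncate(1.0))  # exactly representable, unchanged
print(tf32_truncate(0.1))  # close to 0.1, but not bit-identical
```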
 
 ## Setup
 
@@ -199,8 +190,11 @@ The following section lists the requirements that you need to meet in order to s
 This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
 
 * [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 19.10-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
-* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
+* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
+* Supported GPUs:
+    * [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
+    * [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
+    * [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
 
 For more information about how to get started with NGC containers, see the
 following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
@@ -225,6 +219,11 @@ The ResNeXt101-32x4d script operates on ImageNet 1k, a widely popular image clas
 
 PyTorch can work directly on JPEGs; therefore, preprocessing/augmentation is not needed.
 
+To train your model using mixed or TF32 precision with Tensor Cores or using FP32,
+perform the following steps using the default parameters of the resnext101-32x4d model on the ImageNet dataset.
+For the specifics concerning training and inference, see the [Advanced](#advanced) section.
+
+
 1. [Download the images](http://image-net.org/download-images).
 
 2. Extract the training data:
@@ -256,14 +255,14 @@ nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_rnx
 
 ### 5. Start training
 
-To run training for a standard configuration (DGX1V, AMP/FP32, 90/250 Epochs),
+To run training for a standard configuration (DGXA100/DGX1/DGX2, AMP/TF32/FP32, 90/250 Epochs),
 run one of the scripts in the `./resnext101-32x4d/training` directory
-called `./resnext101-32x4d/training/{AMP,FP32}/{DGX1}_RNXT101-32x4d_{AMP, FP32}_{90,250}E.sh`.
+called `./resnext101-32x4d/training/{AMP, TF32, FP32}/{DGXA100, DGX1, DGX2}_RNXT101-32x4d_{AMP, TF32, FP32}_{90,250}E.sh`.
 
 Ensure ImageNet is mounted in the `/imagenet` directory.
 
 Example:
-    `bash ./resnext101-32x4d/training/DGX1_RNXT101-32x4d_FP16_250E.sh <path were to store checkpoints and logs>`
+    `bash ./resnext101-32x4d/training/AMP/DGX1_RNXT101-32x4d_AMP_250E.sh <path where to store checkpoints and logs>`
 
 ### 6. Start inference
 
@@ -277,7 +276,7 @@ To run inference on JPEG image, you have to first extract the model weights from
 
 Then run classification script:
 
-`python classify.py --arch resnext101-32x4d -c fanin --weights <path to weights from previous step> --precision AMP|FP16|FP32 --image <path to JPEG image>`
+`python classify.py --arch resnext101-32x4d -c fanin --weights <path to weights from previous step> --precision AMP|FP32 --image <path to JPEG image>`
 
 
 ## Advanced
@@ -302,7 +301,7 @@ To run a non standard configuration use:
 Use `python ./main.py -h` to obtain the list of available options in the `main.py` script.
 
 
-### Commmand-line options:
+### Command-line options:
 
 To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example:
 
@@ -311,16 +310,17 @@ To see the full list of available options and their descriptions, use the `-h` o
 
 ```
 usage: main.py [-h] [--data-backend BACKEND] [--arch ARCH]
-               [--model-config CONF] [-j N] [--epochs N] [-b N]
-               [--optimizer-batch-size N] [--lr LR] [--lr-schedule SCHEDULE]
-               [--warmup E] [--label-smoothing S] [--mixup ALPHA]
-               [--momentum M] [--weight-decay W] [--bn-weight-decay]
-               [--nesterov] [--print-freq N] [--resume PATH]
-               [--pretrained-weights PATH] [--fp16]
+               [--model-config CONF] [--num-classes N] [-j N] [--epochs N]
+               [--run-epochs N] [-b N] [--optimizer-batch-size N] [--lr LR]
+               [--lr-schedule SCHEDULE] [--warmup E] [--label-smoothing S]
+               [--mixup ALPHA] [--momentum M] [--weight-decay W]
+               [--bn-weight-decay] [--nesterov] [--print-freq N]
+               [--resume PATH] [--pretrained-weights PATH] [--fp16]
                [--static-loss-scale STATIC_LOSS_SCALE] [--dynamic-loss-scale]
-               [--prof N] [--amp] [--local_rank LOCAL_RANK] [--seed SEED]
-               [--gather-checkpoints] [--raport-file RAPORT_FILE] [--evaluate]
-               [--training-only] [--no-checkpoints] [--workspace DIR]
+               [--prof N] [--amp] [--seed SEED] [--gather-checkpoints]
+               [--raport-file RAPORT_FILE] [--evaluate] [--training-only]
+               [--no-checkpoints] [--checkpoint-filename CHECKPOINT_FILENAME]
+               [--workspace DIR] [--memory-format {nchw,nhwc}]
                DIR
 
 PyTorch ImageNet Training
@@ -339,8 +339,10 @@ optional arguments:
   --model-config CONF, -c CONF
                         model configs: classic | fanin | grp-fanin | grp-
                         fanout(default: classic)
+  --num-classes N       number of classes in the dataset
   -j N, --workers N     number of data loading workers (default: 5)
   --epochs N            number of total epochs to run
+  --run-epochs N        run only N epochs, used for checkpointing runs
   -b N, --batch-size N  mini-batch size (default: 256) per gpu
   --optimizer-batch-size N
                         size of a total batch size, for simulating bigger
@@ -370,9 +372,6 @@ optional arguments:
                         supersedes --static-loss-scale.
   --prof N              Run only N iterations
   --amp                 Run model AMP (automatic mixed precision) mode.
-  --local_rank LOCAL_RANK
-                        Local rank of python process. Set up by distributed
-                        launcher
   --seed SEED           random seed used for numpy and pytorch
   --gather-checkpoints  Gather checkpoints throughout the training, without
                         this flag only best and last checkpoints will be
@@ -382,7 +381,10 @@ optional arguments:
   --evaluate            evaluate checkpoint/model
   --training-only       do not evaluate
   --no-checkpoints      do not store any checkpoints, useful for benchmarking
+  --checkpoint-filename CHECKPOINT_FILENAME
   --workspace DIR       path to directory where checkpoints will be stored
+  --memory-format {nchw,nhwc}
+                        memory layout, nchw or nhwc
 ```
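
The `--optimizer-batch-size` option above simulates a larger batch by accumulating gradients over several forward/backward passes before each optimizer step (our reading of the option; `accumulation_steps` below is a hypothetical helper, not part of `main.py`):

```python
def accumulation_steps(optimizer_batch: int, per_gpu_batch: int, n_gpus: int) -> int:
    # Number of forward/backward passes per optimizer step so that
    # n_gpus * per_gpu_batch * steps == optimizer_batch.
    global_batch = per_gpu_batch * n_gpus
    if optimizer_batch % global_batch != 0:
        raise ValueError("optimizer batch must be a multiple of the global batch")
    return optimizer_batch // global_batch

print(accumulation_steps(1024, 128, 8))  # 1 -- e.g. the 8-GPU AMP scripts (-b 128)
print(accumulation_steps(1024, 64, 8))   # 2 -- e.g. the 8-GPU FP32 scripts (-b 64)
```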
 
 
@@ -452,9 +454,7 @@ To run inference on JPEG image, you have to first extract the model weights from
 
 Then run classification script:
 
-`python classify.py --arch resnext101-32x4d -c fanin --weights <path to weights from previous step> --precision AMP|FP16|FP32 --image <path to JPEG image>`
-
-Example output:
+`python classify.py --arch resnext101-32x4d -c fanin --weights <path to weights from previous step> --precision AMP|FP32 --image <path to JPEG image>`
 
 
 
@@ -470,53 +470,68 @@ To benchmark training, run:
 
 * For 1 GPU
     * FP32
-`python ./main.py --arch resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
-    * FP16
-`python ./main.py --arch resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --fp16 --static-loss-scale 256 <path to imagenet>`
+`python ./main.py --arch resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
     * AMP
-`python ./main.py --arch resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 <path to imagenet>`
+`python ./main.py --arch resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 --memory-format nhwc <path to imagenet>`
 * For multiple GPUs
     * FP32
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
-    * FP16
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --fp16 --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
     * AMP
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --epochs 1 --prof 100 --memory-format nhwc <path to imagenet>`
 
 Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
 
+Batch size should be picked appropriately depending on the hardware configuration.
+
+| *Platform* | *Precision* | *Batch Size* |
+|:----------:|:-----------:|:------------:|
+| DGXA100    | AMP         | 128          |
+| DGXA100    | TF32        | 128          |
+| DGX-1      | AMP         | 128          |
+| DGX-1      | FP32        | 64           |
+
 #### Inference performance benchmark
 
 To benchmark inference, run:
 
 * FP32
 
-`python ./main.py --arch resnext101-32x4d -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
-
-* FP16
-
-`python ./main.py --arch resnext101-32x4d -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --fp16 <path to imagenet>`
+`python ./main.py --arch resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
 
 * AMP
 
-`python ./main.py --arch resnext101-32x4d -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp <path to imagenet>`
+`python ./main.py --arch resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp --memory-format nhwc <path to imagenet>`
 
 Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
 
+Batch size should be picked appropriately depending on the hardware configuration.
+
+| *Platform* | *Precision* | *Batch Size* |
+|:----------:|:-----------:|:------------:|
+| DGXA100    | AMP         | 128          |
+| DGXA100    | TF32        | 128          |
+| DGX-1      | AMP         | 128          |
+| DGX-1      | FP32        | 64           |
 
 ### Results
 
-Our results were obtained by running the applicable training script     in the pytorch-19.10 NGC container.
+Our results were obtained by running the applicable training script in the pytorch-20.06 NGC container.
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
 #### Training accuracy results
 
-##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
+##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
+
+| **epochs** | **Mixed Precision Top1** | **TF32 Top1** |
+|:------:|:--------------------:|:--------------:|
+|   90   |    79.37 +/- 0.13    | 79.38 +/- 0.13 |
+
+##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
 
 | **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
 |:-:|:-:|:-:|
-| 90 | 79.23 +/- 0.09 | 79.23 +/- 0.09 |
+|   90   |    79.43 +/- 0.04    | 79.40 +/- 0.10 |
 | 250 | 79.92 +/- 0.13 | 80.06 +/- 0.06 |
 
 
@@ -533,48 +548,30 @@ The following images show a 250 epochs configuration on a DGX-1V.
 
 #### Training performance results
 
-##### Traininig performance: NVIDIA DGX1-16G (8x V100 16G)
-
-| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 313.43 img/s | 146.66 img/s | 2.14x | 1.00x | 1.00x |
-| 8 | 2384.85 img/s | 1116.58 img/s | 2.14x | 7.61x | 7.61x |
-
-##### Traininig performance: NVIDIA DGX1-32G (8x V100 32G)
-
-| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 297.83 img/s | 143.27 img/s | 2.08x | 1.00x | 1.00x |
-| 8 | 2270.85 img/s | 1104.62 img/s | 2.06x | 7.62x | 7.71x |
-
-##### Traininig performance: NVIDIA DGX2 (16x V100 32G)
+##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
 
-| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 308.42 img/s | 151.67 img/s | 2.03x | 1.00x | 1.00x |
-| 16 | 4473.37 img/s | 2261.97 img/s | 1.98x | 14.50x | 14.91x |
+|**GPUs**|**Mixed Precision**|  **TF32**   |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**TF32 Strong Scaling**|**TF32 Training Time (90E)**|
+|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
+|   1    |   908.40 img/s    |300.42 img/s |           3.02x           |              1.00x               |               ~37 hours               |         1.00x         |         ~111 hours         |
+|   8    |   6887.59 img/s   |2380.51 img/s|           2.89x           |              7.58x               |               ~5 hours                |         7.92x         |         ~14 hours          |
 
-#### Training Time for 90 Epochs
+##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
 
-##### Training time: NVIDIA DGX-1 (8x V100 16G)
-
-| **GPUs** | **Mixed Precision training time** | **FP32 training time** |
-|:-:|:-:|:-:|
-| 1 | ~ 114 h | ~ 242 h |
-| 8 | ~ 17 h | ~ 34 h |
-
-##### Training time: NVIDIA DGX-2 (16x V100 32G)
-
-| **GPUs** | **Mixed Precision training time** | **FP32 training time** |
-|:-:|:-:|:-:|
-| 1 | ~ 116 h | ~ 234 h |
-| 16 | ~ 10 h | ~ 18 h |
+|**GPUs**|**Mixed Precision**|  **FP32**   |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
+|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
+|   1    |   534.91 img/s    |150.05 img/s |           3.56x           |              1.00x               |               ~62 hours               |         1.00x         |         ~222 hours         |
+|   8    |   4000.79 img/s   |1151.01 img/s|           3.48x           |              7.48x               |               ~9 hours                |         7.67x         |         ~29 hours          |
 
+##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
 
+|**GPUs**|**Mixed Precision**|  **FP32**   |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
+|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
+|   1    |   516.07 img/s    |139.80 img/s |           3.69x           |              1.00x               |               ~65 hours               |         1.00x         |         ~238 hours         |
+|   8    |   3861.95 img/s   |1070.94 img/s|           3.61x           |              7.48x               |               ~9 hours                |         7.66x         |         ~31 hours          |
 
 #### Inference performance results
 
-##### Inference performance: NVIDIA DGX-1 (1x V100 16G)
+##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
 
 ###### FP32 Inference Latency
 
@@ -642,6 +639,9 @@ The following images show a 250 epochs configuration on a DGX-1V.
 
 1. October 2019
   * Initial release
+2. July 2020
+  * Added A100 scripts
+  * Updated README
 
 ### Known issues
 

+ 1 - 1
PyTorch/Classification/ConvNets/resnext101-32x4d/training/AMP/DGX1_RNXT101-32x4d_AMP_250E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2 --memory-format nhwc

+ 1 - 1
PyTorch/Classification/ConvNets/resnext101-32x4d/training/AMP/DGX1_RNXT101-32x4d_AMP_90E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05 --memory-format nhwc

+ 1 - 0
PyTorch/Classification/ConvNets/resnext101-32x4d/training/AMP/DGXA100_RNXT101-32x4d_AMP_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j16 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05 --memory-format nhwc

+ 1 - 1
PyTorch/Classification/ConvNets/resnext101-32x4d/training/FP32/DGX1_RNXT101-32x4d_FP32_250E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2

+ 1 - 1
PyTorch/Classification/ConvNets/resnext101-32x4d/training/FP32/DGX1_RNXT101-32x4d_FP32_90E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05

+ 1 - 0
PyTorch/Classification/ConvNets/resnext101-32x4d/training/TF32/DGXA100_RNXT101-32x4d_TF32_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j16 -p 100 --data-backend dali-cpu --arch resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05

+ 122 - 122
PyTorch/Classification/ConvNets/se-resnext101-32x4d/README.md

@@ -15,33 +15,31 @@ achieve state-of-the-art accuracy, and is tested and maintained by NVIDIA.
     * [Features](#features)
   * [Mixed precision training](#mixed-precision-training)
     * [Enabling mixed precision](#enabling-mixed-precision)
+    * [Enabling TF32](#enabling-tf32)
 * [Setup](#setup)
   * [Requirements](#requirements)
 * [Quick Start Guide](#quick-start-guide)
 * [Advanced](#advanced)
   * [Scripts and sample code](#scripts-and-sample-code)
-    * [Parameters](#parameters)
-    * [Command-line options](#command-line-options)
-    * [Getting the data](#getting-the-data)
-        * [Dataset guidelines](#dataset-guidelines)
-        * [Multi-dataset](#multi-dataset)
-    * [Training process](#training-process)
-    * [Inference process](#inference-process)
-
+  * [Command-line options](#command-line-options)
+  * [Dataset guidelines](#dataset-guidelines)
+  * [Training process](#training-process)
+  * [Inference process](#inference-process)
 * [Performance](#performance)
   * [Benchmarking](#benchmarking)
     * [Training performance benchmark](#training-performance-benchmark)
     * [Inference performance benchmark](#inference-performance-benchmark)
   * [Results](#results)
     * [Training accuracy results](#training-accuracy-results)
-      * [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-(8x-v100-16G))
-      * [Example plots](*example-plots)
+      * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
+      * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
+      * [Example plots](#example-plots)
     * [Training performance results](#training-performance-results)
-      * [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-(8x-v100-16G))
-    * [Training time for 90 epochs](#training-time-for-90-epochs)
-      * [Training time: NVIDIA DGX-1 (8x V100 16G)](#training-time-nvidia-dgx-1-(8x-v100-16G))
+      * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
+      * [Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)](#training-performance-nvidia-dgx-1-16gb-8x-v100-16gb)
+      * [Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)](#training-performance-nvidia-dgx-1-32gb-8x-v100-32gb)
   * [Inference performance results](#inference-performance-results)
-      * [Inference performance: NVIDIA DGX-1 (1x V100 16G)](#inference-performance-nvidia-dgx-1-(1x-v100-16G))
+      * [Inference performance: NVIDIA DGX-1 16GB (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
       * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
 * [Release notes](#release-notes)
   * [Changelog](#changelog)
@@ -56,11 +54,15 @@ in [Squeeze-and-Excitation Networks](https://arxiv.org/pdf/1709.01507.pdf) paper
 
 Squeeze and Excitation module architecture for ResNet-type models:
 
+This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 3x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
+
+We use [NHWC data layout](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html) when training using Mixed Precision.
+
 ### Model architecture
 
 ![SEArch](./img/SEArch.png)
 
-_ Image source: [Squeeze-and-Excitation Networks](https://arxiv.org/pdf/1709.01507.pdf) _
+_Image source: [Squeeze-and-Excitation Networks](https://arxiv.org/pdf/1709.01507.pdf)_
 
 The image shows the architecture of the SE block and where it is placed in the ResNet bottleneck block.
 
@@ -73,29 +75,19 @@ The following sections highlight the default configurations for the SE-ResNeXt10
 This model uses SGD with momentum optimizer with the following hyperparameters:
 
 * Momentum (0.875)
-
-* Learning rate (LR) = 0.256 for 256 batch size, for other batch sizes we lineary
+* Learning rate (LR) = 0.256 for 256 batch size, for other batch sizes we linearly
 scale the learning rate.
-
 * Learning rate schedule - we use cosine LR schedule
-
 * For bigger batch sizes (512 and up) we use linear warmup of the learning rate
 during the first couple of epochs
 according to [Training ImageNet in 1 hour](https://arxiv.org/abs/1706.02677).
 Warmup length depends on the total training length.
-
 * Weight decay (WD)= 6.103515625e-05 (1/16384).
-
 * We do not apply WD on Batch Norm trainable parameters (gamma/bias)
-
 * Label smoothing = 0.1
-
 * We train for:
-
     * 90 Epochs -> 90 epochs is a standard for ImageNet networks
-
     * 250 Epochs -> best possible accuracy.
-
 * For 250 epoch training we also use [MixUp regularization](https://arxiv.org/pdf/1710.09412.pdf).
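
The linear scaling rule and the weight decay above are simple arithmetic; checking them against the values used in the training scripts:

```python
def scaled_lr(batch_size: int, base_lr: float = 0.256, base_batch: int = 256) -> float:
    # Linearly scale the learning rate with the global batch size.
    return base_lr * batch_size / base_batch

print(scaled_lr(1024))  # 1.024, the --lr used with an optimizer batch size of 1024
print(1 / 16384)        # 6.103515625e-05, the weight decay above
```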
 
 
@@ -109,7 +101,6 @@ This model uses the following data augmentation:
     * Scale from 8% to 100%
     * Aspect ratio from 3/4 to 4/3
   * Random horizontal flip
-
 * For inference:
   * Normalization
   * Scale to 256x256
@@ -121,13 +112,13 @@ The following features are supported by this model:
 
 | Feature               | ResNeXt101-32x4d
 |-----------------------|--------------------------
-|[DALI](https://docs.nvidia.com/deeplearning/sdk/dali-release-notes/index.html)   |   Yes
+|[DALI](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html)   |   Yes
 |[APEX AMP](https://nvidia.github.io/apex/amp.html) | Yes |
 
 #### Features
 
 - NVIDIA DALI - DALI is a library accelerating data preparation pipeline. To accelerate your input pipeline, you only need to define your data loader
-with the DALI library. For more information about DALI, refer to the [DALI product documentation](https://docs.nvidia.com/deeplearning/sdk/index.html#data-loading).
+with the DALI library. For more information about DALI, refer to the [DALI product documentation](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html).
 
 - [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains utility libraries, such as [Automatic Mixed Precision (AMP)](https://nvidia.github.io/apex/amp.html), which require minimal network code changes to leverage Tensor Cores performance. Refer to the [Enabling mixed precision](#enabling-mixed-precision) section for more details.
 
@@ -146,12 +137,11 @@ Mixed precision is the combined use of different numerical precisions in a compu
 1.  Porting the model to use the FP16 data type where appropriate.
 2.  Adding loss scaling to preserve small gradient values.
 
-The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
+The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.
 
 For information about:
 -   How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
 -   Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
--   How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
 -   APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
 
 #### Enabling mixed precision
@@ -163,33 +153,34 @@ In PyTorch, loss scaling can be easily applied by using scale_loss() method prov
 For an in-depth walk through on AMP, check out sample usage [here](https://github.com/NVIDIA/apex/tree/master/apex/amp#usage-and-getting-started). [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains utility libraries, such as AMP, which require minimal network code changes to leverage tensor cores performance.
 
 To enable mixed precision, you can:
-- Import AMP from APEX, for example:
+- Import AMP from APEX:
 
-  ```
+  ```python
   from apex import amp
   ```
-- Initialize an AMP handle, for example:
 
-  ```
-  amp_handle = amp.init(enabled=True, verbose=True)
-  ```
-- Wrap your optimizer with the AMP handle, for example:
+- Wrap model and optimizer in amp.initialize:
 
+  ```python
+  model, optimizer = amp.initialize(model, optimizer, opt_level="O1", loss_scale="dynamic")
   ```
-  optimizer = amp_handle.wrap_optimizer(optimizer)
+
+- Scale loss before backpropagation:
+  ```python
+  with amp.scale_loss(loss, optimizer) as scaled_loss:
+    scaled_loss.backward()
   ```
-- Scale loss before backpropagation (assuming loss is stored in a variable called losses)
-  - Default backpropagate for FP32:
 
-    ```
-    losses.backward()
-    ```
-  - Scale loss and backpropagate with AMP:
+#### Enabling TF32
+
+TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
+
+TF32 Tensor Cores can speed up networks that use FP32, typically with no loss of accuracy. TF32 is more robust than FP16 for models that require a high dynamic range for weights or activations.
+
+For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
+
+TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
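TF32 keeps FP32's 8-bit exponent but reduces the mantissa to 10 bits (FP16's precision). A back-of-the-envelope sketch of that trade-off, assuming simple mantissa truncation rather than the hardware's actual round-to-nearest conversion:

```python
import math
import struct

def to_tf32(x: float) -> float:
    """Approximate a float at TF32 precision: FP32 exponent, 10-bit mantissa."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= ~((1 << 13) - 1)  # drop the 13 low mantissa bits (23 -> 10)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Precision drops to roughly 3 decimal digits (relative error < 2**-10) ...
assert abs(to_tf32(3.14159) - 3.14159) < 3.14159 * 2 ** -10
# ... but the FP32 exponent range survives: 1e38 would overflow FP16 (max 65504)
assert math.isfinite(to_tf32(1e38))
```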
 
-    ```
-    with optimizer.scale_loss(losses) as scaled_losses:
-       scaled_losses.backward()
-    ```
 
 ## Setup
 
@@ -200,8 +191,11 @@ The following section lists the requirements that you need to meet in order to s
 This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
 
 * [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 19.10-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
-* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
+* [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
+* Supported GPUs:
+    * [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
+    * [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
+    * [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
 
 For more information about how to get started with NGC containers, see the
 following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
@@ -226,6 +220,11 @@ The ResNeXt101-32x4d script operates on ImageNet 1k, a widely popular image clas
 
 PyTorch can work directly on JPEGs, therefore, preprocessing/augmentation is not needed.
 
+To train your model using mixed precision or TF32 with Tensor Cores, or using FP32,
+perform the following steps using the default parameters of the se-resnext101-32x4d model on the ImageNet dataset.
+For the specifics concerning training and inference, see the [Advanced](#advanced) section.
+
+
 1. [Download the images](http://image-net.org/download-images).
 
 2. Extract the training data:
@@ -257,14 +256,14 @@ nvidia-docker run --rm -it -v <path to imagenet>:/imagenet --ipc=host nvidia_se-
 
 ### 5. Start training
 
-To run training for a standard configuration (DGX1V/DGX2V, AMP/FP32, 90/250 Epochs),
+To run training for a standard configuration (DGXA100/DGX1/DGX2, AMP/TF32/FP32, 90/250 Epochs),
 run one of the scripts in the `./se-resnext101-32x4d/training` directory
-called `./se-resnext101-32x4d/training/{DGX1, DGX2}_SE-RNXT101-32x4d_{AMP, FP32}_{90,250}E.sh`.
+called `./se-resnext101-32x4d/training/{AMP, TF32, FP32}/{DGXA100, DGX1, DGX2}_SE-RNXT101-32x4d_{AMP, TF32, FP32}_{90,250}E.sh`.
 
 Ensure ImageNet is mounted in the `/imagenet` directory.
 
 Example:
-    `bash ./se-resnext101-32x4d/training/DGX1_SE-RNXT101-32x4d_FP16_250E.sh`
+    `bash ./se-resnext101-32x4d/training/AMP/DGX1_SE-RNXT101-32x4d_AMP_250E.sh <path where checkpoints and logs will be stored>`
 
 ### 6. Start inference
 
@@ -278,7 +277,7 @@ To run inference on JPEG image, you have to first extract the model weights from
 
 Then run the classification script:
 
-`python classify.py --arch se-resnext101-32x4d -c fanin --weights <path to weights from previous step> --precision AMP|FP16|FP32 --image <path to JPEG image>`
+`python classify.py --arch se-resnext101-32x4d -c fanin --weights <path to weights from previous step> --precision AMP|FP32 --image <path to JPEG image>`
 
 
 ## Advanced
@@ -303,7 +302,7 @@ To run a non standard configuration use:
 Use `python ./main.py -h` to obtain the list of available options in the `main.py` script.
 
 
-### Commmand-line options:
+### Command-line options:
 
 To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example:
 
@@ -312,16 +311,17 @@ To see the full list of available options and their descriptions, use the `-h` o
 
 ```
 usage: main.py [-h] [--data-backend BACKEND] [--arch ARCH]
-               [--model-config CONF] [-j N] [--epochs N] [-b N]
-               [--optimizer-batch-size N] [--lr LR] [--lr-schedule SCHEDULE]
-               [--warmup E] [--label-smoothing S] [--mixup ALPHA]
-               [--momentum M] [--weight-decay W] [--bn-weight-decay]
-               [--nesterov] [--print-freq N] [--resume PATH]
-               [--pretrained-weights PATH] [--fp16]
+               [--model-config CONF] [--num-classes N] [-j N] [--epochs N]
+               [--run-epochs N] [-b N] [--optimizer-batch-size N] [--lr LR]
+               [--lr-schedule SCHEDULE] [--warmup E] [--label-smoothing S]
+               [--mixup ALPHA] [--momentum M] [--weight-decay W]
+               [--bn-weight-decay] [--nesterov] [--print-freq N]
+               [--resume PATH] [--pretrained-weights PATH] [--fp16]
                [--static-loss-scale STATIC_LOSS_SCALE] [--dynamic-loss-scale]
-               [--prof N] [--amp] [--local_rank LOCAL_RANK] [--seed SEED]
-               [--gather-checkpoints] [--raport-file RAPORT_FILE] [--evaluate]
-               [--training-only] [--no-checkpoints] [--workspace DIR]
+               [--prof N] [--amp] [--seed SEED] [--gather-checkpoints]
+               [--raport-file RAPORT_FILE] [--evaluate] [--training-only]
+               [--no-checkpoints] [--checkpoint-filename CHECKPOINT_FILENAME]
+               [--workspace DIR] [--memory-format {nchw,nhwc}]
                DIR
 
 PyTorch ImageNet Training
@@ -340,8 +340,10 @@ optional arguments:
   --model-config CONF, -c CONF
                         model configs: classic | fanin | grp-fanin | grp-
                         fanout(default: classic)
+  --num-classes N       number of classes in the dataset
   -j N, --workers N     number of data loading workers (default: 5)
   --epochs N            number of total epochs to run
+  --run-epochs N        run only N epochs, used for checkpointing runs
   -b N, --batch-size N  mini-batch size (default: 256) per gpu
   --optimizer-batch-size N
                         size of a total batch size, for simulating bigger
@@ -371,9 +373,6 @@ optional arguments:
                         supersedes --static-loss-scale.
   --prof N              Run only N iterations
   --amp                 Run model AMP (automatic mixed precision) mode.
-  --local_rank LOCAL_RANK
-                        Local rank of python process. Set up by distributed
-                        launcher
   --seed SEED           random seed used for numpy and pytorch
   --gather-checkpoints  Gather checkpoints throughout the training, without
                         this flag only best and last checkpoints will be
@@ -383,7 +382,10 @@ optional arguments:
   --evaluate            evaluate checkpoint/model
   --training-only       do not evaluate
   --no-checkpoints      do not store any checkpoints, useful for benchmarking
+  --checkpoint-filename CHECKPOINT_FILENAME
   --workspace DIR       path to directory where checkpoints will be stored
+  --memory-format {nchw,nhwc}
+                        memory layout, nchw or nhwc
 ```
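The `--optimizer-batch-size` option simulates a larger batch by accumulating gradients over several per-GPU mini-batches (and across GPUs in distributed runs) before each optimizer step. A minimal plain-Python sketch of the idea, not the actual `main.py` code:

```python
# Gradient accumulation: average the gradients of accumulation_steps
# micro-batches, then perform one optimizer update, so the effective
# batch equals optimizer_batch even when per-GPU memory only fits
# per_gpu_batch samples.
def train_with_accumulation(batch_grads, per_gpu_batch, optimizer_batch):
    accumulation_steps = optimizer_batch // per_gpu_batch
    grad, updates = 0.0, 0
    for i, batch_grad in enumerate(batch_grads, start=1):
        grad += batch_grad / accumulation_steps  # average over micro-batches
        if i % accumulation_steps == 0:
            updates += 1   # optimizer.step() would run here
            grad = 0.0     # optimizer.zero_grad()
    return updates

# 8 micro-batches of 128 with --optimizer-batch-size 1024 -> one update
assert train_with_accumulation([1.0] * 8, 128, 1024) == 1
```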
 
 
@@ -453,9 +455,7 @@ To run inference on JPEG image, you have to first extract the model weights from
 
 Then run the classification script:
 
-`python classify.py --arch se-resnext101-32x4d -c fanin --weights <path to weights from previous step> --precision AMP|FP16|FP32 --image <path to JPEG image>`
-
-Example output:
+`python classify.py --arch se-resnext101-32x4d -c fanin --weights <path to weights from previous step> --precision AMP|FP32 --image <path to JPEG image>`
 
 
 
@@ -471,53 +471,68 @@ To benchmark training, run:
 
 * For 1 GPU
     * FP32
-`python ./main.py --arch se-resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
-    * FP16
-`python ./main.py --arch se-resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --fp16 --static-loss-scale 256 <path to imagenet>`
+`python ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
     * AMP
-`python ./main.py --arch se-resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 <path to imagenet>`
+`python ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --amp --static-loss-scale 256 --memory-format nhwc <path to imagenet>`
 * For multiple GPUs
     * FP32
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
-    * FP16
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --fp16 --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --epochs 1 --prof 100 <path to imagenet>`
     * AMP
-`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --epochs 1 --prof 100 <path to imagenet>`
+`python ./multiproc.py --nproc_per_node 8 ./main.py --arch se-resnext101-32x4d -b <batch_size> --training-only -p 1 --raport-file benchmark.json --amp --static-loss-scale 256 --memory-format nhwc --epochs 1 --prof 100 <path to imagenet>`
 
 Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
 
+Choose a batch size appropriate for the hardware configuration:
+
+| *Platform* | *Precision* | *Batch Size* |
+|:----------:|:-----------:|:------------:|
+| DGXA100    | AMP         | 128          |
+| DGXA100    | TF32        | 128          |
+| DGX-1      | AMP         | 128          |
+| DGX-1      | FP32        | 64           |
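The `--memory-format nhwc` flag used in the AMP benchmark commands selects the channels-last layout. The two layouts store the same tensor and differ only in how a `(n, c, h, w)` element maps to a flat memory offset; a sketch with hypothetical helper names:

```python
# NCHW: all values of one channel are contiguous (PyTorch's default).
def offset_nchw(n, c, h, w, C, H, W):
    return ((n * C + c) * H + h) * W + w

# NHWC (channels-last): the channel values of one pixel are contiguous,
# which is the layout Tensor Core convolutions prefer.
def offset_nhwc(n, c, h, w, C, H, W):
    return ((n * H + h) * W + w) * C + c

# Same element, different offsets, for a tiny 1x3x2x2 tensor
assert offset_nchw(0, 1, 0, 0, C=3, H=2, W=2) == 4
assert offset_nhwc(0, 1, 0, 0, C=3, H=2, W=2) == 1
```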
+
 #### Inference performance benchmark
 
 To benchmark inference, run:
 
 * FP32
 
-`python ./main.py --arch se-resnext101-32x4d -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
-
-* FP16
-
-`python ./main.py --arch se-resnext101-32x4d -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --fp16 <path to imagenet>`
+`python ./main.py --arch se-resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate <path to imagenet>`
 
 * AMP
 
-`python ./main.py --arch se-resnext101-32x4d -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp <path to imagenet>`
+`python ./main.py --arch se-resnext101-32x4d -b <batch_size> -p 1 --raport-file benchmark.json --epochs 1 --prof 100 --evaluate --amp --memory-format nhwc <path to imagenet>`
 
 Each of these scripts will run 100 iterations and save results in the `benchmark.json` file.
 
+Choose a batch size appropriate for the hardware configuration:
+
+| *Platform* | *Precision* | *Batch Size* |
+|:----------:|:-----------:|:------------:|
+| DGXA100    | AMP         | 128          |
+| DGXA100    | TF32        | 128          |
+| DGX-1      | AMP         | 128          |
+| DGX-1      | FP32        | 64           |
 
 ### Results
 
-Our results were obtained by running the applicable training script     in the pytorch-19.10 NGC container.
+Our results were obtained by running the applicable training script in the pytorch-20.06 NGC container.
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
 #### Training accuracy results
 
-##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
+##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
+
+| **epochs** | **Mixed Precision Top1** | **TF32 Top1** |
+|:------:|:--------------------:|:--------------:|
+|   90   |    79.95 +/- 0.09    | 79.97 +/- 0.08 |
+
+##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
 
 | **epochs** | **Mixed Precision Top1** | **FP32 Top1** |
 |:-:|:-:|:-:|
-| 90 | 80.03 +/- 0.10 | 79.86 +/- 0.13 |
+|   90   |    80.04 +/- 0.10    | 79.93 +/- 0.10 |
 | 250 | 80.96 +/- 0.04 | 80.97 +/- 0.09 |
 
 
@@ -534,48 +549,30 @@ The following images show a 250 epochs configuration on a DGX-1V.
 
 #### Training performance results
 
-##### Traininig performance: NVIDIA DGX1-16G (8x V100 16G)
-
-| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 266.65 img/s | 128.23 img/s | 2.08x | 1.00x | 1.00x |
-| 8 | 2031.17 img/s | 977.45 img/s | 2.08x | 7.62x | 7.62x |
-
-##### Traininig performance: NVIDIA DGX1-32G (8x V100 32G)
-
-| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 255.22 img/s | 125.13 img/s | 2.04x | 1.00x | 1.00x |
-| 8 | 1959.35 img/s | 963.21 img/s | 2.03x | 7.68x | 7.70x |
-
-##### Traininig performance: NVIDIA DGX2 (16x V100 32G)
+##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
 
-| **GPUs** | **Mixed Precision** | **FP32** | **Mixed Precision speedup** | **Mixed Precision Strong Scaling** | **FP32 Strong Scaling** |
-|:-:|:-:|:-:|:-:|:-:|:-:|
-| 1 | 261.58 img/s | 130.85 img/s | 2.00x | 1.00x | 1.00x |
-| 16 | 3776.03 img/s | 1953.13 img/s | 1.93x | 14.44x | 14.93x |
+|**GPUs**|**Mixed Precision**|  **TF32**   |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**TF32 Strong Scaling**|**TF32 Training Time (90E)**|
+|:------:|:-----------------:|:-----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
+|   1    |   641.57 img/s    |258.75 img/s |           2.48x           |              1.00x               |               ~52 hours               |         1.00x         |         ~129 hours         |
+|   8    |   4758.40 img/s   |2038.03 img/s|           2.33x           |              7.42x               |               ~7 hours                |         7.88x         |         ~17 hours          |
 
-#### Training Time for 90 Epochs
+##### Training performance: NVIDIA DGX-1 16GB (8x V100 16GB)
 
-##### Training time: NVIDIA DGX-1 (8x V100 16G)
-
-| **GPUs** | **Mixed Precision training time** | **FP32 training time** |
-|:-:|:-:|:-:|
-| 1 | ~ 134 h | ~ 277 h |
-| 8 | ~ 19 h | ~ 38 h |
-
-##### Training time: NVIDIA DGX-2 (16x V100 32G)
-
-| **GPUs** | **Mixed Precision training time** | **FP32 training time** |
-|:-:|:-:|:-:|
-| 1 | ~ 137 h | ~ 271 h |
-| 16 | ~ 11 h | ~ 20 h |
+|**GPUs**|**Mixed Precision**|  **FP32**  |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
+|:------:|:-----------------:|:----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
+|   1    |   383.15 img/s    |130.48 img/s|           2.94x           |              1.00x               |               ~87 hours               |         1.00x         |         ~255 hours         |
+|   8    |   2695.10 img/s   |996.04 img/s|           2.71x           |              7.03x               |               ~13 hours               |         7.63x         |         ~34 hours          |
 
+##### Training performance: NVIDIA DGX-1 32GB (8x V100 32GB)
 
+|**GPUs**|**Mixed Precision**|  **FP32**  |**Mixed Precision Speedup**|**Mixed Precision Strong Scaling**|**Mixed Precision Training Time (90E)**|**FP32 Strong Scaling**|**FP32 Training Time (90E)**|
+|:------:|:-----------------:|:----------:|:-------------------------:|:--------------------------------:|:-------------------------------------:|:---------------------:|:--------------------------:|
+|   1    |   364.65 img/s    |123.46 img/s|           2.95x           |              1.00x               |               ~92 hours               |         1.00x         |         ~270 hours         |
+|   8    |   2540.49 img/s   |959.94 img/s|           2.65x           |              6.97x               |               ~13 hours               |         7.78x         |         ~35 hours          |
 
 #### Inference performance results
 
-##### Inference performance: NVIDIA DGX-1 (1x V100 16G)
+##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
 
 ###### FP32 Inference Latency
 
@@ -643,6 +640,9 @@ The following images show a 250 epochs configuration on a DGX-1V.
 
 1. October 2019
   * Initial release
+2. July 2020
+  * Added A100 scripts
+  * Updated README
 
 ### Known issues
 

+ 1 - 1
PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/AMP/DGX1_SE-RNXT101-32x4d_AMP_250E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2 --memory-format nhwc

+ 1 - 1
PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/AMP/DGX1_SE-RNXT101-32x4d_AMP_90E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05 --memory-format nhwc

+ 1 - 0
PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/AMP/DGXA100_SE-RNXT101-32x4d_AMP_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j16 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --amp --static-loss-scale 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05 --memory-format nhwc

+ 1 - 1
PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/FP32/DGX1_SE-RNXT101-32x4d_FP32_250E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs 250 --warmup 8 --wd 6.103515625e-05 --mixup 0.2

+ 1 - 1
PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/FP32/DGX1_SE-RNXT101-32x4d_FP32_90E.sh

@@ -1 +1 @@
-python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j5 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j8 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 64 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05

+ 1 - 0
PyTorch/Classification/ConvNets/se-resnext101-32x4d/training/TF32/DGXA100_SE-RNXT101-32x4d_TF32_90E.sh

@@ -0,0 +1 @@
+python ./multiproc.py --nproc_per_node 8 ./main.py /imagenet --raport-file raport.json -j16 -p 100 --data-backend pytorch --arch se-resnext101-32x4d -c fanin --label-smoothing 0.1 --workspace $1 -b 128 --optimizer-batch-size 1024 --lr 1.024 --mom 0.875 --lr-schedule cosine --epochs  90 --warmup 8 --wd 6.103515625e-05
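The training scripts above share the same learning-rate policy (`--lr-schedule cosine --warmup 8`): linear warmup over the first epochs, then cosine decay to zero. An illustrative sketch of that schedule, not the exact `main.py` implementation:

```python
import math

def cosine_lr(epoch, base_lr=1.024, warmup=8, total_epochs=250):
    """Per-epoch learning rate: linear warmup, then cosine decay to zero."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    progress = (epoch - warmup) / (total_epochs - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

assert abs(cosine_lr(0) - 0.128) < 1e-12   # warmup starts at base_lr / warmup
assert cosine_lr(8) == 1.024               # peak right after warmup
```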