Browse files

[UNet(med)/TF] New perf results, cross-validation, optimizer and README

Przemek Strzelczyk 6 years ago
parent
commit
dee0fe36c5
39 changed files with 1414 additions and 1595 deletions
  1. TensorFlow/Segmentation/UNet_Medical/.gitmodules (+0 -0)
  2. TensorFlow/Segmentation/UNet_Medical/Dockerfile (+2 -1)
  3. TensorFlow/Segmentation/UNet_Medical/README.md (+481 -362)
  4. TensorFlow/Segmentation/UNet_Medical/dllogger/__init__.py (+0 -19)
  5. TensorFlow/Segmentation/UNet_Medical/dllogger/autologging.py (+0 -60)
  6. TensorFlow/Segmentation/UNet_Medical/dllogger/logger.py (+128 -484)
  7. TensorFlow/Segmentation/UNet_Medical/dllogger/tags.py (+0 -255)
  8. TensorFlow/Segmentation/UNet_Medical/download_dataset.py (+1 -1)
  9. TensorFlow/Segmentation/UNet_Medical/examples/unet_FP32_1GPU.sh (+3 -12)
  10. TensorFlow/Segmentation/UNet_Medical/examples/unet_FP32_8GPU.sh (+3 -22)
  11. TensorFlow/Segmentation/UNet_Medical/examples/unet_INFER_BENCHMARK_FP32.sh (+3 -3)
  12. TensorFlow/Segmentation/UNet_Medical/examples/unet_INFER_BENCHMARK_TF-AMP.sh (+3 -3)
  13. TensorFlow/Segmentation/UNet_Medical/examples/unet_INFER_BENCHMARK_TF-TRT.sh (+1 -1)
  14. TensorFlow/Segmentation/UNet_Medical/examples/unet_INFER_FP32.sh (+3 -3)
  15. TensorFlow/Segmentation/UNet_Medical/examples/unet_INFER_TF-AMP.sh (+3 -3)
  16. TensorFlow/Segmentation/UNet_Medical/examples/unet_INFER_TF-TRT.sh (+1 -1)
  17. TensorFlow/Segmentation/UNet_Medical/examples/unet_TF-AMP_1GPU.sh (+4 -14)
  18. TensorFlow/Segmentation/UNet_Medical/examples/unet_TF-AMP_8GPU.sh (+3 -23)
  19. TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_BENCHMARK_FP32_1GPU.sh (+3 -3)
  20. TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_BENCHMARK_FP32_8GPU.sh (+3 -13)
  21. TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_BENCHMARK_TF-AMP_1GPU.sh (+3 -3)
  22. TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_BENCHMARK_TF-AMP_8GPU.sh (+3 -13)
  23. TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_FP32_1GPU.sh (+24 -0)
  24. TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_FP32_8GPU.sh (+24 -0)
  25. TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_TF-AMP_1GPU.sh (+24 -0)
  26. TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_TF-AMP_8GPU.sh (+24 -0)
  27. TensorFlow/Segmentation/UNet_Medical/export.py (+86 -0)
  28. TensorFlow/Segmentation/UNet_Medical/main.py (+66 -56)
  29. TensorFlow/Segmentation/UNet_Medical/model/layers.py (+2 -2)
  30. TensorFlow/Segmentation/UNet_Medical/model/unet.py (+3 -3)
  31. TensorFlow/Segmentation/UNet_Medical/requirements.txt (+3 -2)
  32. TensorFlow/Segmentation/UNet_Medical/tf_exports/tf_export.py (+270 -0)
  33. TensorFlow/Segmentation/UNet_Medical/utils/cmd_util.py (+55 -45)
  34. TensorFlow/Segmentation/UNet_Medical/utils/data_loader.py (+53 -40)
  35. TensorFlow/Segmentation/UNet_Medical/utils/hooks/profiling_hook.py (+14 -12)
  36. TensorFlow/Segmentation/UNet_Medical/utils/hooks/training_hook.py (+11 -9)
  37. TensorFlow/Segmentation/UNet_Medical/utils/model_fn.py (+24 -21)
  38. TensorFlow/Segmentation/UNet_Medical/utils/parse_results.py (+80 -0)
  39. TensorFlow/Segmentation/UNet_Medical/utils/var_storage.py (+0 -106)

+ 0 - 0
TensorFlow/Segmentation/UNet_Medical/.gitmodules


+ 2 - 1
TensorFlow/Segmentation/UNet_Medical/Dockerfile

@@ -1,4 +1,5 @@
-FROM nvcr.io/nvidia/tensorflow:19.06-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.01-tf1-py3
+FROM ${FROM_IMAGE_NAME}
 
 ADD . /workspace/unet
 WORKDIR /workspace/unet

+ 481 - 362
TensorFlow/Segmentation/UNet_Medical/README.md

@@ -1,435 +1,554 @@
-# UNet Medical Image Segmentation for TensorFlow
-
+# U-Net Medical Image Segmentation for TensorFlow 1.x
+ 
 This repository provides a script and recipe to train U-Net Medical to achieve state of the art accuracy, and is tested and maintained by NVIDIA.
-
+ 
 ## Table of contents
-
-* [Model overview](#model-overview)
-    * [Default configuration](#default-configuration)
-    * [Model architecture](#model-architecture)
-    * [Feature support matrix](#feature-support-matrix)
-        *  [Features](#features)
-    * [Mixed precision training](#mixed-precision-training)
-        * [Enabling mixed precision](#enabling-mixed-precision) 
-* [Setup](#setup)
-    * [Requirements](#requirements)
-* [Quick Start Guide](#quick-start-guide)
-* [Advanced](#advanced)
-    * [Scripts and sample code](#scripts-and-sample-code)
-    * [Parameters](#parameters)
-    * [Command line options](#command-line-options)
-    * [Getting the data](#getting-the-data)
-        * [Dataset guidelines](#dataset-guidelines)
-    * [Training process](#training-process)
-        * [Optimizer](#optimizer)
-        * [Augmentation](#augmentation)
-    * [Inference process](#inference-process) 
-* [Performance](#performance)
-    * [Benchmarking](#benchmarking)
-        * [Training performance benchmark](#training-performance-benchmark)
-        * [Inference performance benchmark](#inference-performance-benchmark)
-    * [Results](#results)
-        * [Training accuracy results](#training-accuracy-results)
-            * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g) 
-        * [Training performance results](#training-performance-results)
-            * [NVIDIA DGX-1 (1x V100 16G)](#nvidia-dgx-1-1x-v100-16g)
-            * [NVIDIA DGX-1 (8x V100 16G)](#nvidia-dgx-1-8x-v100-16g)
-        * [Inference performance results](#inference-performance-results)
-            * [NVIDIA DGX-1 (1x V100 16G)](#nvidia-dgx-1-1x-v100-16g)
-* [Release notes](#release-notes)
-    * [Changelog](#changelog)
-    * [Known issues](#known-issues)
-
+ 
+- [Model overview](#model-overview)
+   * [Model architecture](#model-architecture)
+   * [Default configuration](#default-configuration)
+   * [Feature support matrix](#feature-support-matrix)
+     * [Features](#features)
+   * [Mixed precision training](#mixed-precision-training)
+     * [Enabling mixed precision](#enabling-mixed-precision)
+- [Setup](#setup)
+   * [Requirements](#requirements)
+- [Quick Start Guide](#quick-start-guide)
+- [Advanced](#advanced)
+   * [Scripts and sample code](#scripts-and-sample-code)
+   * [Parameters](#parameters)
+   * [Command-line options](#command-line-options)
+   * [Getting the data](#getting-the-data)
+     * [Dataset guidelines](#dataset-guidelines)
+     * [Multi-dataset](#multi-dataset)
+   * [Training process](#training-process)
+   * [Inference process](#inference-process)
+- [Performance](#performance)   
+   * [Benchmarking](#benchmarking)
+     * [Training performance benchmark](#training-performance-benchmark)
+     * [Inference performance benchmark](#inference-performance-benchmark)
+   * [Results](#results)
+     * [Training accuracy results](#training-accuracy-results)
+       * [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-8x-v100-16g)
+     * [Training performance results](#training-performance-results)
+       * [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16g)
+     * [Inference performance results](#inference-performance-results)
+        * [Inference performance: NVIDIA DGX-1 (1x V100 16G)](#inference-performance-nvidia-dgx-1-1x-v100-16g)
+- [Release notes](#release-notes)
+   * [Changelog](#changelog)
+   * [Known issues](#known-issues)
+ 
 ## Model overview
-
-The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597), without any alteration.
-
-This model is trained with mixed precision using tensor cores on NVIDIA Volta GPUs. Therefore, researchers can get results much faster than training without Tensor Cores, while experiencing the benefits of mixed precision training (for example, up to 3.5x performance boost). This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
-
+ 
+The U-Net model is a convolutional neural network for 2D image segmentation. This repository contains a U-Net implementation as described in the original paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597), without any alteration.
+ 
+This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results up to 2.2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
+ 
 ### Model architecture
-
-U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the paper: U-Net: Convolutional Networks for Biomedical Image Segmentation.  U-Net allows for seamless segmentation of 2D images, with high accuracy and performance, and can be adapted to solve many different segmentation problems.
-
-The following figure shows the construction of the UNet model and its different components. UNet is composed of a contractive and an expanding path, that aims at building a bottleneck in its centermost part through a combination of convolution and pooling operations. After this bottleneck, the image is reconstructed through a combination of convolutions and upsampling. Skip connections are added with the goal of helping the backward flow of gradients in order to improve the training.
-
-![UNet](images/unet.png)
-
+ 
+U-Net was first introduced by Olaf Ronneberger, Philip Fischer, and Thomas Brox in the paper: [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597). U-Net allows for seamless segmentation of 2D images, with high accuracy and performance, and can be adapted to solve many different segmentation problems.
+ 
+The following figure shows the construction of the U-Net model and its different components. U-Net is composed of a contractive and an expanding path that aim at building a bottleneck in its centermost part through a combination of convolution and pooling operations. After this bottleneck, the image is reconstructed through a combination of convolutions and upsampling. Skip connections are added with the goal of helping the backward flow of gradients in order to improve the training.
+ 
+![U-Net](images/unet.png)
+ 
 ### Default configuration
-
+ 
 U-Net consists of a contractive (left-side) and expanding (right-side) path. It repeatedly applies unpadded convolutions followed by max pooling for downsampling. Every step in the expanding path consists of an upsampling of the feature maps and a concatenation with the correspondingly cropped feature map from the contractive path.
-
-The following features were implemented in this model:
-* Data-parallel multi-GPU training with Horovod.
-* Mixed precision support with TensorFlow Automatic Mixed Precision (TF-AMP), which enables mixed precision training without any changes to the code-base by performing automatic graph rewrites and loss scaling controlled by an environmental variable.
-* Tensor Core operations to maximize throughput using NVIDIA Volta GPUs.
-* Static loss scaling for tensor cores (mixed precision) training.
-
-The following performance optimizations were implemented in this model:
-* XLA support (experimental). For TensorFlow, easily adding mixed-precision support is available from NVIDIA’s APEX, a TensorFlow extension that contains utility libraries, such as AMP, which require minimal network code changes to leverage tensor cores performance.
-
+ 
 ### Feature support matrix
-
+ 
 The following features are supported by this model.
-
-| **Feature** | **UNet_Medical_TF** |
-|:---:|:--------:|
-| Horovod Multi-GPU (NCCL) | Yes |
-
+ 
+| **Feature** | **U-Net Medical** |
+|---------------------------------|-----|
+| Automatic mixed precision (AMP) | Yes |
+| Horovod Multi-GPU (NCCL)        | Yes |
+| Accelerated Linear Algebra (XLA)| Yes |
+ 
 #### Features
-
-**Horovod** - Horovod is a distributed training framework for TensorFlow, Keras, PyTorch and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.  For more information about how to get started with Horovod, see the [Horovod: Official repository](https://github.com/horovod/horovod).
-
+ 
+**Automatic Mixed Precision (AMP)**
+ 
+This implementation of U-Net uses AMP to implement mixed precision training. It allows us to use FP16 training with FP32 master weights by modifying just a few lines of code.
+ 
+**Horovod**
+ 
+Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, see the [Horovod: Official repository](https://github.com/horovod/horovod).
+ 
+**Multi-GPU training with Horovod**
+ 
+Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, see example sources in this repository or see the [TensorFlow tutorial](https://github.com/horovod/horovod/#usage).
+ 
+**XLA support (experimental)**
+ 
+XLA is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. The results are improvements in speed and memory usage: most internal benchmarks run ~1.1-1.5x faster after XLA is enabled.
+ 
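As a minimal illustration (an assumption, since this repository exposes XLA through its own `--use_xla` flag rather than through this variable), TensorFlow's XLA auto-clustering can also be requested globally via the `TF_XLA_FLAGS` environment variable, set before TensorFlow builds its session:

```python
import os

# Hypothetical alternative to the repository's --use_xla flag: ask TensorFlow
# to auto-cluster eligible ops with XLA. The variable must be set before the
# TensorFlow session is created for it to take effect.
os.environ['TF_XLA_FLAGS'] = '--tf_xla_auto_jit=2'
```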
 ### Mixed precision training
-
+ 
 Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [tensor cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures.  Using mixed precision training requires two steps:
 1. Porting the model to use the FP16 data type where appropriate.
 2. Adding loss scaling to preserve small gradient values.
-
+ 
 The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
-
+ 
 For information about:
 - How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.
 - Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog.
 - How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide.
-- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).
-
+ 
 #### Enabling mixed precision
-
-In order to enable mixed precision training, the following environment variables must be defined with the correct value before the training starts:
+ 
+This implementation exploits the TensorFlow Automatic Mixed Precision feature. In order to enable mixed precision training, the following environment variables must be defined with the correct value before the training starts:
 ```
 TF_ENABLE_AUTO_MIXED_PRECISION=1
 ```
-Exporting these variables ensures that loss scaling is performed correctly and automatically. 
-By supplying the `--use_amp` flag to the `main.py` script while training in FP32, the following variables are set to their correct value for mixed precision training inside the `./utils/runner.py` script:
+Exporting these variables ensures that loss scaling is performed correctly and automatically.
+By supplying the `--use_amp` flag to the `main.py` script while training in FP32, the following variables are set to their correct value for mixed precision training:
 ```
-if params['use_amp']:
-   LOGGER.log("TF AMP is activated")
-   os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
+if params.use_amp:
+  os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
 ```
-
+ 
 ## Setup
-
-The following section lists the requirements in order to start training the U-Net model.
-
+ 
+The following section lists the requirements in order to start training the U-Net Medical model.
+ 
 ### Requirements
-
-This repository contains a `Dockerfile` which extends the TensorFlow NGC container and encapsulates some additional dependencies. Aside from these dependencies, ensure you have the following components:
-* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [tensorflow:19.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
-* [NVIDIA Volta based GPU](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
-
-For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning DGX Documentation:
-
-* [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
-* [Accessing And Pulling From The NGC container registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
-* [Running Tensorflow](https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/running.html#running)
-
+ 
+This repository contains a `Dockerfile` which extends the TensorFlow NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
+- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+- TensorFlow 20.02-tf1-py3 [NGC container](https://ngc.nvidia.com/registry/nvidia-tensorflow)
+- [NVIDIA Volta GPU](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
+ 
+For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
+- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
+- [Accessing And Pulling From The NGC container registry](https://docs.nvidia.com/deeplearning/dgx/user-guide/index.html#accessing_registry)
+- [Running TensorFlow](https://docs.nvidia.com/deeplearning/dgx/tensorflow-release-notes/running.html#running)
+ 
+For those unable to use the TensorFlow NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
+ 
 ## Quick Start Guide
-
-To train your model using mixed precision with tensor cores or using FP32, perform the following steps using the default parameters of the U-Net model on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home).
-
-### Clone the repository
-```
-git clone https://github.com/NVIDIA/DeepLearningExamples
-cd DeepLearningExamples/TensorFlow/Segmentation/UNet_Medical
-```
-
-### Download and preprocess the dataset
-
-The U-Net script  main.py operates on data from the [ISBI Challenge](http://brainiac2.mit.edu/isbi_challenge/home), the dataset originally employed in the [U-Net paper](https://arxiv.org/abs/1505.04597). Upon registration, the challenge's data is made available through the following links:
-
-* [train-volume.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/train-volume.tif)
-* [train-labels.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/train-labels.tif)
-* [train-volume.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/test-volume.tif)
-
-The script `download_dataset.py` is provided for data download. It is possible to select the destination folder when downloading the files by using the `--data_dir` flag.  For example: 
-```
-python download_dataset.py --data_dir ./dataset
-```
-Training and test data are composed of 3 multi-page `TIF` files, each containing 30 2D-images. The training and test datasets are given as stacks of 30 2D-images provided as a multi-page `TIF` that can be read using the Pillow library and NumPy (both Python packages are installed by the `Dockerfile`):
-```
-From PIL import Image, ImageSequence
-Import numpy as np
-
-im = Image.open(path)
-slices = [np.array(i) for i in ImageSequence.Iterator(im)]
-```
-Once downloaded the data using the `download_dataset.py` script, it can be used to run the training and benchmark scripts described below, by pointing `main.py` to its location using the `--data_dir` flag.
-
-**Note:** Masks are only provided for training data.
-
-### Build the U-Net TensorFlow container
-
-After Docker is correctly set up, the U-Net TensorFlow container can be built with:
-```
-user@~/Documents/unet_medical_tf # docker build -t unet_tf .
-```
-
-### Start an interactive session in the NGC container to run training/inference.
-
-Run the previously built Docker container:
-```
-user@~/path/to/unet_medical_tf # docker run --runtime=nvidia --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v /path/to/dataset:/data unet_tf:latest bash
-```
-**Note:** Ensure to mount your dataset using the -v flag to make it available for training inside the NVIDIA Docker container.
-
-### Start training
-
-To run training for a default configuration (for example 1/8 GPUs FP32/TF-AMP), run one of the scripts in the `./examples` directory, as follows:
-```
-bash examples/unet_{FP32, TF-AMP}_{1,8}.sh <path to main.py> <path to dataset> <path to results directory>
-```
-For example:
-```
-root@8e522945990f:/workspace/unet# bash examples/unet_FP32_1GPU.sh . /data results
-```
-
-### Start inference/predictions
-To run inference on a checkpointed model, run:
-```
-bash examples/unet_INFER_{FP32, TF-AMP}.sh <path to main.py> <path to dataset> <path to results directory>
-```
-For example:
-```
-root@8e522945990f:/workspace/unet# bash examples/unet_INFER_FP32.sh . /data results
-```
-
+ 
+To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the U-Net model on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). These steps enable you to build the U-Net TensorFlow NGC container, train and evaluate your model, and generate predictions on the test data. Furthermore, you can then choose to:
+* compare your evaluation accuracy with our [Training accuracy results](#training-accuracy-results),
+* compare your training performance with our [Training performance benchmark](#training-performance-benchmark),
+* compare your inference performance with our [Inference performance benchmark](#inference-performance-benchmark).
+ 
+For the specifics concerning training and inference, see the [Advanced](#advanced) section.
+ 
+1. Clone the repository.
+ 
+   Executing this command will create your local repository with all the code to run U-Net.
+  
+   ```bash
+   git clone https://github.com/NVIDIA/DeepLearningExamples
+   cd DeepLearningExamples/TensorFlow/Segmentation/UNet_Medical
+   ```
+ 
+2. Build the U-Net TensorFlow NGC container.
+ 
+   This command will use the `Dockerfile` to create a Docker image named `unet_tf`, downloading all the required components automatically.
+  
+   ```
+   docker build -t unet_tf .
+   ```
+  
+   The NGC container contains all the components optimized for usage on NVIDIA hardware.
+ 
+3. Start an interactive session in the NGC container to run preprocessing/training/inference.
+ 
+   The following command will launch the container and mount the `./data` directory as a volume to the `/data` directory inside the container, and `./results` directory to the `/results` directory in the container.
+  
+   ```bash
+   mkdir data
+   mkdir results
+   docker run --runtime=nvidia -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --rm --ipc=host -v ${PWD}/data:/data -v ${PWD}/results:/results unet_tf:latest /bin/bash
+   ```
+  
+   Any datasets and experiment results (logs, checkpoints, etc.) saved to `/data` or `/results` will be accessible
+   in the `./data` or `./results` directory on the host, respectively.
+ 
+4. Download and preprocess the data.
+  
+   The U-Net script `main.py` operates on data from the [ISBI Challenge](http://brainiac2.mit.edu/isbi_challenge/home), the dataset originally employed in the [U-Net paper](https://arxiv.org/abs/1505.04597).
+  
+   The script `download_dataset.py` is provided for data download. It is possible to select the destination folder when downloading the files by using the `--data_dir` flag.  For example:
+   ```bash
+   python download_dataset.py --data_dir /data
+   ```
+  
+   Training and test data are composed of 3 multi-page `TIF` files, each containing 30 2D-images (around 30 MB in total). Once downloaded with the `download_dataset.py` script, the data can be used to run the training and benchmark scripts described below, by pointing `main.py` to its location using the `--data_dir` flag.
+  
+   **Note:** Masks are only provided for training data.
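The previous version of this README read these stacks with Pillow and NumPy (its snippet mis-capitalized the import statements); a corrected, self-contained sketch, with the file path as a placeholder:

```python
import numpy as np
from PIL import Image, ImageSequence

def load_tif_stack(path):
    """Read a multi-page TIF file into a list of 2D NumPy arrays, one per slice."""
    image = Image.open(path)
    return [np.array(page) for page in ImageSequence.Iterator(image)]

# e.g. slices = load_tif_stack('/data/train-volume.tif')  # 30 slices per volume
```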
+ 
+5. Start training.
+  
+   After the Docker container is launched, the training with the [default configuration](#default-configuration) (for example 1/8 GPUs FP32/TF-AMP) can be started with:
+  
+   ```bash
+   bash examples/unet_{FP32, TF-AMP}_{1,8}GPU.sh <path/to/dataset> <path/to/checkpoint>
+   ```
+  
+   For example, to run with full precision (FP32) on 1 GPU from the project’s folder, simply use:
+  
+   ```bash
+   bash examples/unet_FP32_1GPU.sh /data /results
+   ```
+  
+   This script will launch training on a single fold and store the model’s checkpoint in the `<path/to/checkpoint>` directory.
+  
+   The script can be run directly, modifying flags if necessary; in particular, the number of GPUs is defined after the `-np` flag. Since the test volume does not have labels, 20% of the training data is used for validation in a 5-fold cross-validation manner. The fold number can be changed using `--crossvalidation_idx` with an integer in the range 0-4. For example, to run with 4 GPUs using fold 1, use:
+  
+   ```bash
+   horovodrun -np 4 python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode train --crossvalidation_idx 1 --use_xla --use_amp
+   ```
+  
+   Training will result in a checkpoint file being written to `./results` on the host machine.
+ 
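The actual split logic lives in `utils/data_loader.py`; purely to illustrate the fold arithmetic described above, a hypothetical helper (not the repository's code) that carves the 30 training slices into 5 folds could look like:

```python
import numpy as np

def fold_split(num_samples=30, fold_idx=0, num_folds=5, seed=0):
    """Hypothetical sketch: pick one of `num_folds` folds (20% of the data
    for 5 folds) as the validation subset and use the rest for training."""
    indices = np.random.RandomState(seed).permutation(num_samples)
    folds = np.array_split(indices, num_folds)
    val_idx = folds[fold_idx]
    train_idx = np.concatenate([f for i, f in enumerate(folds) if i != fold_idx])
    return train_idx, val_idx
```

With 30 slices and 5 folds, each validation subset holds 6 images, matching the 20% split mentioned above.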
+6. Start validation/evaluation.
+  
+   The trained model can be evaluated by passing the `--exec_mode evaluate` flag. Since evaluation is carried out on a validation dataset, the `--crossvalidation_idx` parameter should be set. For example:
+  
+   ```bash
+   python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode evaluate --crossvalidation_idx 0 --use_xla --use_amp
+   ```
+  
+   Evaluation can also be triggered jointly after training by passing the `--exec_mode train_and_evaluate` flag.
+ 
+7. Start inference/predictions.
+   To run inference on a checkpointed model, run:
+   ```bash
+   bash examples/unet_INFER_{FP32, TF-AMP}.sh <path/to/dataset> <path/to/checkpoint>
+   ```
+   For example:
+   ```bash
+   bash examples/unet_INFER_FP32.sh /data /results
+   ```
+  
+   Now that you have your model trained and evaluated, you can choose to compare your training results with our [Training accuracy results](#training-accuracy-results). You can also choose to benchmark your training performance with the [Training performance benchmark](#training-performance-benchmark), or your inference performance with the [Inference performance benchmark](#inference-performance-benchmark). Following the steps in these sections will ensure that you achieve the same accuracy and performance results as stated in the [Results](#results) section.
+ 
 ## Advanced
-
+ 
 The following sections provide greater details of the dataset, running training and inference, and the training results.
-
+ 
 ### Scripts and sample code
-
+ 
 In the root directory, the most important files are:
 * `main.py`: Serves as the entry point to the application.
-* `Dockerfile`: Container with the basic set of dependencies to run UNet
-* `requirements.txt`: Set of extra requirements for running UNet
-* `download_data.py`: Automatically downloads the dataset for training
-
-The `utils/` folder encapsulates the necessary tools to train and perform inference using UNet. Its main components are:
-* `runner.py`: Implements the logic for training and inference
-* `data_loader.py`: Implements the data loading and augmentation
-* `hooks/profiler.py`: Collects different metrics to be used for benchmarking and testing
-* `var_storage.py`: Helper functions for TF-AMP
-
-The `model/` folder contains information about the building blocks of UNet and the way they are assembled. Its contents are:
-* `layers.py`: Defines the different blocks that are used to assemble UNet
+* `Dockerfile`: Container with the basic set of dependencies to run U-Net.
+* `requirements.txt`: Set of extra requirements for running U-Net.
+* `download_data.py`: Automatically downloads the dataset for training.
+ 
+The `utils/` folder encapsulates the necessary tools to train and perform inference using U-Net. Its main components are:
+* `cmd_util.py`: Implements the command-line arguments parsing.
+* `data_loader.py`: Implements the data loading and augmentation.
+* `model_fn.py`: Implements the logic for training and inference.
+* `hooks/training_hook.py`: Collects different metrics during training.
+* `hooks/profiling_hook.py`: Collects different metrics to be used for benchmarking and testing.
+* `parse_results.py`: Implements the intermediate results parsing.
+ 
+The `model/` folder contains information about the building blocks of U-Net and the way they are assembled. Its contents are:
+* `layers.py`: Defines the different blocks that are used to assemble U-Net
 * `unet.py`: Defines the model architecture using the blocks from the `layers.py` script
-
+ 
 Other folders included in the root directory are:
 * `dllogger/`: Contains the utils for logging
-* `examples/`: Provides examples for training and benchmarking UNet
+* `examples/`: Provides examples for training and benchmarking U-Net
 * `images/`: Contains a model diagram
-
+ 
 ### Parameters
+ 
 The complete list of the available parameters for the main.py script contains:
-* `--exec_mode`: Select the execution mode to run the model (default: train_and_predict)
-* `--model_dir`: Set the output directory for information related to the model (default: result/)
-* `--data_dir`: Set the input directory containing the dataset (defaut: None)
-* `--batch_size`: Size of each minibatch per GPU (default: 1)
-* `--max_steps`: Maximum number of steps (batches) for training (default: 1000)
-* `--seed`: Set random seed for reproducibility (default: 0)
-* `--weight_decay`: Weight decay coefficient (default: 0.0005)
-* `--log_every`: Log performance every n steps (default: 100)
-* `--warmup_steps`: Skip logging during the first n steps (default: 200)
-* `--learning_rate`: Model’s learning rate (default: 0.01)
-* `--momentum`: Momentum coefficient for model’s optimizer (default: 0.99)
-* `--decay_steps`: Number of steps before learning rate decay (default: 5000)
-* `--decay_rate`: Decay rate for polynomial learning rate decay (default 0.95)
-* `--augment`: Enable data augmentation (default: False)
-* `--benchmark`: Enable performance benchmarking (default: False)
-* `--use_amp`: Enable automatic mixed precision (default: False)
-
+* `--exec_mode`: Select the execution mode to run the model (default: `train`). Modes available:
+  * `evaluate` - loads a checkpoint (if available) and performs evaluation on the validation subset (requires `--crossvalidation_idx` other than `None`).
+  * `train_and_evaluate` - trains the model from scratch and performs validation at the end (requires `--crossvalidation_idx` other than `None`).
+  * `predict` - loads a checkpoint (if available) and runs inference on the test set. Stores the results in the `--model_dir` directory.
+  * `train_and_predict` - trains the model from scratch and performs inference.
+* `--model_dir`: Set the output directory for information related to the model (default: `/results`).
+* `--log_dir`: Set the output directory for logs (default: None).
+* `--data_dir`: Set the input directory containing the dataset (default: `None`).
+* `--batch_size`: Size of each minibatch per GPU (default: `1`).
+* `--crossvalidation_idx`: Selected fold for cross-validation (default: `None`).
+* `--max_steps`: Maximum number of steps (batches) for training (default: `1000`).
+* `--seed`: Set random seed for reproducibility (default: `0`).
+* `--weight_decay`: Weight decay coefficient (default: `0.0005`).
+* `--log_every`: Log performance every n steps (default: `100`).
+* `--learning_rate`: Model’s learning rate (default: `0.0001`).
+* `--augment`: Enable data augmentation (default: `False`).
+* `--benchmark`: Enable performance benchmarking (default: `False`). If the flag is set, the script runs in a benchmark mode - each iteration is timed and the performance result (in images per second) is printed at the end. Works for both `train` and `predict` execution modes.
+* `--warmup_steps`: Used during benchmarking - the number of steps to skip (default: `200`). First iterations are usually much slower since the graph is being constructed. Skipping the initial iterations is required for a fair performance assessment.
+* `--use_xla`: Enable accelerated linear algebra optimization (default: `False`).
+* `--use_amp`: Enable automatic mixed precision (default: `False`).
+ 
 ### Command line options
-
-To see the full list of available options and their descriptions, use the `-h` or `--help` command line option, for example: 
+ 
+To see the full list of available options and their descriptions, use the `-h` or `--help` command-line option, for example:
+```bash
+python main.py --help
 ```
-root@ac1c9afe0a0b:/workspace/unet# python main.py
-usage: main.py [-h] 
-            [--exec_mode {train,train_and_predict,predict,benchmark}]
-            [--model_dir MODEL_DIR] 
-            --data_dir DATA_DIR 
-            [--batch_size BATCH_SIZE] 
-            [--max_steps MAX_STEPS]
-            [--seed SEED]
-            [--weight_decay WEIGHT_DECAY]
-            [--log_every LOG_EVERY]
-            [--warmup_steps WARMUP_STEPS]
-            [--learning_rate LEARNING_RATE]
-            [--momentum MOMENTUM]
-            [--decay_steps DECAY_STEPS]
-            [--decay_rate DECAY_RATE]
-            [--augment]
-            [--no-augment]
-            [--benchmark]
-            [--no-benchmark]
-            [--use_amp]
+ 
+The following example output is printed when running `python main.py --help`:
+```
+usage: main.py [-h]
+              [--exec_mode {train,train_and_predict,predict,evaluate,train_and_evaluate}]
+              [--model_dir MODEL_DIR] --data_dir DATA_DIR [--log_dir LOG_DIR]
+              [--batch_size BATCH_SIZE] [--learning_rate LEARNING_RATE]
+              [--crossvalidation_idx CROSSVALIDATION_IDX]
+              [--max_steps MAX_STEPS] [--weight_decay WEIGHT_DECAY]
+              [--log_every LOG_EVERY] [--warmup_steps WARMUP_STEPS]
+              [--seed SEED] [--augment] [--no-augment] [--benchmark]
+              [--no-benchmark] [--use_amp] [--use_xla]
+ 
+U-Net-medical
+ 
+optional arguments:
+ -h, --help            show this help message and exit
+ --exec_mode {train,train_and_predict,predict,evaluate,train_and_evaluate}
+                       Execution mode of running the model
+ --model_dir MODEL_DIR
+                       Output directory for information related to the model
+ --data_dir DATA_DIR   Input directory containing the dataset for training
+                       the model
+ --log_dir LOG_DIR     Output directory for training logs
+ --batch_size BATCH_SIZE
+                       Size of each minibatch per GPU
+ --learning_rate LEARNING_RATE
+                       Learning rate coefficient for AdamOptimizer
+ --crossvalidation_idx CROSSVALIDATION_IDX
+                       Chosen fold for cross-validation. Use None to disable
+                       cross-validation
+ --max_steps MAX_STEPS
+                       Maximum number of steps (batches) used for training
+ --weight_decay WEIGHT_DECAY
+                       Weight decay coefficient
+ --log_every LOG_EVERY
+                       Log performance every n steps
+ --warmup_steps WARMUP_STEPS
+                       Number of warmup steps
+ --seed SEED           Random seed
+ --augment             Perform data augmentation during training
+ --no-augment
+ --benchmark           Collect performance metrics during training
+ --no-benchmark
+ --use_amp             Train using TF-AMP
+ --use_xla             Train using XLA
 ```
-
-### Getting the data
-
-The U-Net model was trained in the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). Test images provided by the organization were used to produce the resulting masks for submission.
-
-Training and test data is comprised of three 512x512x30 `TIF` volumes (`test-volume.tif`, `train-volume.tif` and `train-labels.tif`). Files `test-volume.tif` and `train-volume.tif` contain grayscale 2D slices to be segmented. Additionally, training masks are provided in `train-labels.tif` as a 512x512x30 `TIF` volume, where each pixel has one of two classes: 
-* 0 indicating the presence of cellular membrane, and 
+ 
+The U-Net model was trained on the [EM segmentation challenge dataset](http://brainiac2.mit.edu/isbi_challenge/home). Test images provided by the organization were used to produce the resulting masks for submission. Upon registration, the challenge's data is made available through the following links:
+ 
+* [train-volume.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/train-volume.tif)
+* [train-labels.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/train-labels.tif)
+* [test-volume.tif](http://brainiac2.mit.edu/isbi_challenge/sites/default/files/test-volume.tif)
+ 
+Training and test data consist of three 512x512x30 `TIF` volumes (`test-volume.tif`, `train-volume.tif` and `train-labels.tif`). Files `test-volume.tif` and `train-volume.tif` contain grayscale 2D slices to be segmented. Additionally, training masks are provided in `train-labels.tif` as a 512x512x30 `TIF` volume, where each pixel has one of two classes:
+* 0 indicating the presence of cellular membrane,
 * 1 corresponding to background.
-
-The objective is to produce a set of masks that segment the data as accurately as possible. The results are expected to be submitted as a 32-bit `TIF` 3D image, which values between `0` (100% membrane certainty) and `1` (100% non-membrane certainty). 
-
+ 
+The objective is to produce a set of masks that segment the data as accurately as possible. The results are expected to be submitted as a 32-bit `TIF` 3D image, with values between `0` (100% membrane certainty) and `1` (100% non-membrane certainty).
+ 
 #### Dataset guidelines
-
-The process of loading, normalizing and augmenting the data contained in the dataset can be found in the `data_loader.py` script. 
-
-Initially, data is loaded from a multi-page `TIF` file and converted to 512x512x30 NumPy arrays with the use of Pillow. These NumPy arrays are fed to the model through `tf.data.Dataset.from_tensor_slices()`, in order to achieve high performance.
-
-Intensities on the volumes are then normalized to an interval `[-1, 1]`, whereas labels are one-hot encoded for their later use in pixel wise cross entropy loss, becoming 512x512x30x2 tensors.
-
+ 
+The training and test datasets are given as stacks of 30 2D images provided as a multi-page `TIF` that can be read using the Pillow library and NumPy (both Python packages are installed by the `Dockerfile`).
+ 
+Initially, data is loaded from a multi-page `TIF` file and converted to 512x512x30 NumPy arrays with the use of Pillow. The process of loading, normalizing and augmenting the data contained in the dataset can be found in the `data_loader.py` script.
+ 
+These NumPy arrays are fed to the model through `tf.data.Dataset.from_tensor_slices()`, in order to achieve high performance.
+ 
+The voxel intensities are then normalized to the interval `[-1, 1]`, whereas labels are one-hot encoded for their later use in the dice or pixel-wise cross-entropy loss, becoming 512x512x30x2 tensors.
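
The normalization and one-hot encoding described above can be sketched in NumPy as follows (a minimal illustration; the function names are hypothetical and the actual implementation lives in `data_loader.py`):

```python
import numpy as np

def load_multipage_tif(path):
    """Load a multi-page TIF into a (pages, height, width) float32 array."""
    from PIL import Image  # Pillow, installed by the Dockerfile
    img = Image.open(path)
    frames = []
    for i in range(img.n_frames):
        img.seek(i)
        frames.append(np.array(img, dtype=np.float32))
    return np.stack(frames)

def preprocess(volume, labels):
    # Normalize intensities from [0, 255] to [-1, 1].
    volume = volume / 127.5 - 1.0
    # One-hot encode the binary masks: (30, 512, 512) -> (30, 512, 512, 2).
    labels = np.eye(2, dtype=np.float32)[(labels > 0).astype(np.int64)]
    return volume, labels
```

The resulting arrays would then be fed to the model through `tf.data.Dataset.from_tensor_slices((volume, labels))`.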
+ 
 If augmentation is enabled, the following set of augmentation techniques are applied:
 * Random horizontal flipping
 * Random vertical flipping
-* Elastic deformation through dense_image_warp
-* Random rotation
 * Crop to a random dimension and resize to input dimension
 * Random brightness shifting
-
-At the end, intensities are clipped to the `[-1, 1]` interval.
-
-
+ 
+In the end, images are reshaped to 388x388 and padded to 572x572 to fit the input of the network. Masks are only reshaped to 388x388 to fit the output of the network. Moreover, pixel intensities are clipped to the `[-1, 1]` interval.
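
A NumPy-only sketch of a subset of these steps (the real pipeline in `data_loader.py` uses TensorFlow ops; the brightness range below is illustrative, while the 92-pixel padding follows from (572 - 388) / 2):

```python
import numpy as np

def augment(image, mask, rng=None):
    """Randomly flip a (height, width) image/mask pair, shift brightness
    and clip intensities back to [-1, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:  # random horizontal flip
        image, mask = np.flip(image, axis=1), np.flip(mask, axis=1)
    if rng.random() < 0.5:  # random vertical flip
        image, mask = np.flip(image, axis=0), np.flip(mask, axis=0)
    image = image + rng.uniform(-0.2, 0.2)  # random brightness shift
    return np.clip(image, -1.0, 1.0), mask

def pad_to_network_input(image_388):
    # Pad a 388x388 image symmetrically by (572 - 388) / 2 = 92 pixels
    # on each side to match the 572x572 network input.
    return np.pad(image_388, 92, mode='reflect')
```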
+ 
+#### Multi-dataset
+ 
+This implementation is tuned for the EM segmentation challenge dataset. Using other datasets is possible, but might require changes to the code (data loader) and tuning some hyperparameters (e.g. learning rate, number of iterations).
+ 
+In the current implementation, the data loader works with NumPy arrays by loading them at initialization, and passing them for training in slices by `tf.data.Dataset.from_tensor_slices()`. If you’re able to fit your dataset into memory, then convert the data into three NumPy arrays - training images, training labels, and testing images (optional). If your dataset is large, you will have to adapt the data loader for lazy loading of the data. For a walk-through, check the [TensorFlow tf.data API guide](https://www.tensorflow.org/guide/data_performance).
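
For the lazy-loading case, one option is to replace the in-memory arrays with a Python generator. A minimal sketch (the `load_fn` callback is a placeholder for whatever reads a single sample from disk):

```python
import numpy as np

def lazy_slices(sample_paths, load_fn):
    """Yield (image, label) pairs one at a time instead of materializing
    the whole dataset in memory up front."""
    for image_path, label_path in sample_paths:
        yield load_fn(image_path), load_fn(label_path)

# A generator like this can back a pipeline via
# tf.data.Dataset.from_generator(...) instead of from_tensor_slices(...).
```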
+ 
+The performance of the model depends on the dataset size.
+Generally, the model should scale better for datasets containing more data. For a smaller dataset, you might experience lower performance.
+ 
 ### Training process
-
-#### Optimizer
-
-The model trains for 40,000 batches, with the default U-Net setup as specified in the [original paper](https://arxiv.org/abs/1505.04597):
-
-* SGD with momentum (0.99)
-* Learning rate = 0.01
-
-
-This default parametrization is employed when running scripts from the ./examples directory and when running main.py without explicitly overriding these fields.
-* Augmentation
-* During training, we perform the following augmentation techniques:
-* Random flip left and right
-* Random flip up and down
-* Elastic deformation
-* Random rotation
-* Random crop and resize
-* Random brightness changes
-
-To run a pre-parameterized configuration (1 or 8 GPUs, FP32 or AMP), run one of the scripts in the `./examples` directory, for example:
+ 
+The model trains for a total of 40,000 batches (40,000 / number of GPUs steps per GPU), with the default U-Net setup:
+* Adam optimizer with learning rate of 0.0001.
+ 
+This default parametrization is applied when running scripts from the `./examples` directory and when running `main.py` without explicitly overriding these parameters. By default, the training is in full precision. To enable AMP, pass the `--use_amp` flag. AMP can be enabled for every mode of execution.
+ 
+The default configuration minimizes a function _L = 1 - DICE + cross entropy_ during training.
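
A NumPy sketch of this objective, assuming softmax probabilities and one-hot labels (the smoothing constants below are illustrative; the exact values used by the implementation may differ):

```python
import numpy as np

def dice_coefficient(probs, labels, smooth=1.0):
    # Soft DICE over the foreground channel; smooth avoids division by zero.
    intersection = np.sum(probs * labels)
    return (2.0 * intersection + smooth) / (np.sum(probs) + np.sum(labels) + smooth)

def cross_entropy(probs, labels, eps=1e-7):
    # Pixel-wise cross-entropy between predicted probabilities and one-hot labels.
    return -np.mean(np.sum(labels * np.log(probs + eps), axis=-1))

def total_loss(probs, labels):
    # L = 1 - DICE + cross entropy, minimized during training.
    return 1.0 - dice_coefficient(probs[..., 1], labels[..., 1]) + cross_entropy(probs, labels)
```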
+ 
+The training can be run directly without using the predefined scripts. The name of the training script is `main.py`. Because of the multi-GPU support, training should always be run with the Horovod distributed launcher like this:
+```bash
+horovodrun -np <number/of/gpus> python main.py --data_dir /data [other parameters]
 ```
-./examples/unet_{FP32, TF-AMP}_{1, 8}GPU.sh <path/to/main.py> <path/to/dataset> <path/to/checkpoints> <batch size>
-```
-Use `-h` or `--help` to obtain a list of available options in the `main.py` script.
-
-**Note:** When calling the `main.py` script manually, data augmentation is disabled. In order to enable data augmentation, use the `--augment` flag at the end of your invocation.
-
-Use the `--model_dir` flag to select the location where to store the artifacts of the training.
-
+ 
+*Note:* When calling the `main.py` script manually, data augmentation is disabled. In order to enable data augmentation, use the `--augment` flag in your invocation.
+ 
+The main results of the training are checkpoints stored by default in `./results/` on the host machine, and in `/results` inside the container. This location can be controlled
+by the `--model_dir` command-line argument if a different location was mounted while starting the container. When the training is run in `train_and_predict` mode, the inference will take place after the training is finished, and the inference results will be stored in the `/results` directory.
+ 
+If the `--exec_mode train_and_evaluate` parameter was used, and if the `--crossvalidation_idx` parameter is set to an integer value in {0, 1, 2, 3, 4}, the evaluation of the validation set takes place after the training is completed. The results of the evaluation will be printed to the console.
+ 
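
The fold selection can be sketched as follows (an illustration of how `--crossvalidation_idx` would pick one of five folds; the actual split logic lives in `data_loader.py` and may differ in details such as shuffling):

```python
import numpy as np

def fold_indices(num_samples, fold, num_folds=5, seed=0):
    """Split sample indices into (train, validation) for the given fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_samples)
    folds = np.array_split(indices, num_folds)
    validation = folds[fold]
    train = np.concatenate([f for i, f in enumerate(folds) if i != fold])
    return train, validation
```
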
 ### Inference process
-
-To run inference on a checkpointed model, run the script below, although, it requires a pre-trained model checkpoint and tokenized input.
+ 
+Inference can be launched with the same script used for training by passing the `--exec_mode predict` flag:
+```bash
+python main.py --exec_mode predict --data_dir /data --model_dir <path/to/checkpoint> [other parameters]
 ```
-python main.py --data_dir /data --model_dir <path to checkpoint> --exec_mode predict
-```
-This script should produce the prediction results over a set of masks which will be located in `<path to checkpoint>/eval`.
-
+ 
+The script will then:
+* Load the checkpoint from the `<path/to/checkpoint>` directory (by default `/results`),
+* Run inference on the test dataset,
+* Save the resulting binary masks in a `TIF` format.
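
The final step can be sketched as follows (the function names are illustrative; `to_binary_mask` follows the submission convention from the dataset section, where values range from 0, membrane, to 1, background):

```python
import numpy as np

def to_binary_mask(probs):
    """Turn per-pixel class probabilities of shape (height, width, 2) into a
    float32 mask in [0, 1] (0 = membrane, 1 = background)."""
    return probs[..., 1].astype(np.float32)

def save_tif_volume(masks, path):
    # Stack the per-slice masks into a multi-page 32-bit TIF via Pillow.
    from PIL import Image
    pages = [Image.fromarray(m, mode='F') for m in masks]
    pages[0].save(path, save_all=True, append_images=pages[1:])
```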
+ 
 ## Performance
-
+ 
 ### Benchmarking
-
+ 
 The following section shows how to run benchmarks measuring the model performance in training and inference modes.
-
+ 
 #### Training performance benchmark
-
-To benchmark training, run one of the scripts in `./examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_{1, 8}GPU.sh  <path/to/main.py> <path/to/dataset> <path/to/checkpoints> <batch size>`.
-
-Each of these scripts will by default run 200 warm-up iterations and benchmark the performance during training in the next 100 iterations. To control warmup and benchmark length, use `--warmup_steps`, and `--max_steps` flags.
-
+ 
+To benchmark training, run one of the `TRAIN_BENCHMARK` scripts in `./examples/`:
+```bash
+bash examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_{1, 8}GPU.sh <path/to/dataset> <path/to/checkpoints> <batch/size>
+```
+For example, to benchmark training using mixed-precision on 8 GPUs use:
+```bash
+bash examples/unet_TRAIN_BENCHMARK_TF-AMP_8GPU.sh <path/to/dataset> <path/to/checkpoints> <batch/size>
+```
+ 
+Each of these scripts will by default run 200 warm-up iterations and benchmark the performance during training in the next 800 iterations.
+ 
+To have more control, you can run the script by directly providing all relevant run parameters. For example:
+```bash
+horovodrun -np <num/of/gpus> python main.py --exec_mode train --benchmark --augment --data_dir <path/to/dataset> --model_dir <optional, path/to/checkpoint> --batch_size <batch/size> --warmup_steps <warm-up/steps> --max_steps <max/steps>
+```
+ 
+At the end of the script, a line reporting the best train throughput will be printed.
+ 
 #### Inference performance benchmark
-
-To benchmark inference, run one of the scripts in `./examples/unet_INFER_BENCHMARK_{FP32, TF-AMP}.sh <path/to/main.py> <path/to/dataset> <path/to/checkpoints> <batch size>`.
-
-Each of these scripts will by default run 200 warmup iterations and benchmark the performance during inference in the next 100 iterations. To control warmup and benchmark length, use `--warmup_steps`, and `--max_steps` flags.
-
+ 
+To benchmark inference, run one of the scripts in `./examples/`:
+```bash
+bash examples/unet_INFER_BENCHMARK_{FP32, TF-AMP}.sh <path/to/dataset> <path/to/checkpoints> <batch/size>
+```
+ 
+For example, to benchmark inference using mixed-precision:
+```bash
+bash examples/unet_INFER_BENCHMARK_TF-AMP.sh <path/to/dataset> <path/to/checkpoints> <batch/size>
+```
+ 
+Each of these scripts will by default run 200 warm-up iterations and benchmark the performance during inference in the next 400 iterations.
+ 
+To have more control, you can run the script by directly providing all relevant run parameters. For example:
+```bash
+python main.py --exec_mode predict --benchmark --data_dir <path/to/dataset> --model_dir <optional, path/to/checkpoint> --batch_size <batch/size> --warmup_steps <warm-up/steps> --max_steps <max/steps>
+```
+ 
+At the end of the script, a line reporting the best inference throughput will be printed.
+ 
 ### Results
-
-The following sections provide details on how we achieved our performance and accuracy in training and inference. 
-
+ 
+The following sections provide details on how we achieved our performance and accuracy in training and inference.
+ 
 #### Training accuracy results
+ 
+##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
+ 
+The following table lists the average DICE score across 5-fold cross-validation. Our results were obtained by running the `examples/unet_TRAIN_{FP32, TF-AMP}_{1, 8}GPU.sh` training script in the tensorflow:20.02-tf1-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs.
+ 
+| GPUs | Batch size / GPU | Accuracy - FP32 | Accuracy - mixed precision | Time to train - FP32 [hours] | Time to train - mixed precision [hours] | Time to train speedup (FP32 to mixed precision) |
+|------|------------------|-----------------|----------------------------|------------------------------|----------------------------|--------------------------------|
+| 1 | 8 | 0.8884 | 0.8906 | 7.08 | 2.54 | 2.79 |
+| 8 | 8 | 0.8962 | 0.8972 | 0.97 | 0.37 | 2.64 |
+ 
+To reproduce this result, start the Docker container interactively and run one of the TRAIN scripts:
+```bash
+bash examples/unet_TRAIN_{FP32, TF-AMP}_{1, 8}GPU.sh <path/to/dataset> <path/to/checkpoint> <batch/size>
+```
+For example:
+```bash
+bash examples/unet_TRAIN_TF-AMP_8GPU.sh /data /results 8
+```
 
-##### NVIDIA DGX-1 (8x V100 16G)
-
-Our results were obtained by running the `./examples/unet_{FP32, TF-AMP}_{1, 8}GPU.sh` scripts in the tensorflow:19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
-
-Metrics employed by the organization are explained in detail [here](http://brainiac2.mit.edu/isbi_challenge/evaluation).
-
-The results described below were obtained after the submission of our evaluations to the [ISBI Challenge](http://brainiac2.mit.edu/isbi_challenge) organizers. 
-
-| **Number og GPUs** | **FP32 Rand Score Thin** | **FP32 Information Score Thin** | **TF-AMP Rand Score Thin** | **TF-AMP Information Score Thin** | **Total time to train with FP16 (Hrs)** | **Total time to train with FP32 (Hrs)** |
-|:---:|:--------:|:-------:|:--------:|:-------:|:--------:|:-------:|
-|1 | 0.938508265 | 0.970255682 | 0.939619101 | 0.970120138 | 7.1 | 11.28 |
-|8 | 0.932395087 | 0.9786346 | 0.941360867 | 0.976235311 | 0.9 | 1.41 |
-
+This command will launch a script which runs 5-fold cross-validation training for 40,000 iterations and prints the validation DICE score and cross-entropy loss. The reported time is for a single fold, which means that training for 5 folds takes 5 times longer. The default batch size is 8; however, if you have a GPU with less than 16 GB of memory and you encounter GPU memory issues, you should decrease the batch size. The logs of the runs can be found in the `/results` directory once the script is finished.
+ 
 #### Training performance results
-
-##### NVIDIA DGX-1 (1x V100 16G)
-
-Our results were obtained by running the `./examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_1GPU.sh` scripts in
-the tensorflow:19.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU while data augmentation is enabled.
-
-
-| **Batch size** | **FP32 max img/s** | **TF-AMP max img/s** | **Speedup factor** | 
-|:---:|:--------:|:-------:|:-------:|
-| 1 | 12.37 | 21.91 | 1.77 |
-| 8 | 13.81  | 29.58 | 2.14 |
-| 16 | Out of memory | 30.77 | - |
-
-To achieve these same results, follow the [Quick start guide](#3-quick-start-guide) outlined above.
-
-##### NVIDIA DGX-1 (8x V100 16G)
-
-Our results were obtained by running the `./examples/unet_TRAIN_BENCHMARK_{FP32, TF-AMP}_8GPU.sh` scripts in
-the tensorflow:19.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPU while data augmentation is enabled.
-
-| **Batch size per GPU** | **FP32 max img/s** | **TF-AMP max img/s** | **Speedup factor** | 
-|:---:|:--------:|:-------:|:-------:|
-| 1 | 89.93 | 126.66  | 1.41 |
-| 8 | 105.35 | 130.66 | 1.24 |
-| 16 | Out of memory | 132.78  | - |
-
-To achieve these same results, follow the [Quick start guide](#3-quick-start-guide) outlined above.
-
-#### Inference performance results
-
-#### NVIDIA DGX-1 (1x V100 16G)
-
-Our results were obtained by running the `./examples/unet_INFER_BENCHMARK_{FP32, TF-AMP}.sh` scripts in
-the tensorflow:19.06-py3 NGC container on NVIDIA DGX-1 with 1x V100 16G GPU while data augmentation is enabled.
-
-| **Batch size** | **FP32 img/s** | **TF-AMP img/s** | **Speedup factor** | 
-|:---:|:--------:|:-------:|:-------:|
-| 1 | 34.27 | 62.81  | 1.83 |
-| 8 | 37.09 | 79.62 | 2.14 |
-| 16 | Out of memory | 83.33  | - |
-
-To achieve these same results, follow the [Quick start guide](#3-quick-start-guide) outlined above.
-
+ 
+##### Training performance: NVIDIA DGX-1 (8x V100 16G)
+ 
+Our results were obtained by running the `examples/unet_TRAIN_BENCHMARK_{TF-AMP, FP32}_{1, 8}GPU.sh` training script in the tensorflow:20.02-tf1-py3 NGC container on NVIDIA DGX-1 with (8x V100 16G) GPUs. Performance numbers (in items/images per second) were averaged over 1000 iterations, excluding the first 200 warm-up steps.
+ 
+| GPUs | Batch size / GPU | Throughput - FP32 [img/s] | Throughput - mixed precision [img/s] | Throughput speedup (FP32 - mixed precision) | Weak scaling - FP32 | Weak scaling - mixed precision |       
+|------|------------------|-------------------|--------------------------------|---------------------------------------------|---------------------------|--------------------------------|
+| 1 | 8 |  18.57 |  52.27 | 2.81 |  N/A |  N/A |
+| 8 | 8 | 138.50 | 366.88 | 2.65 | 7.02 | 7.46 |
+ 
+ 
+To achieve these same results, follow the steps in the [Training performance benchmark](#training-performance-benchmark) section.
+ 
+Throughput is reported in images per second.
+ 
+#### Inference performance results
+ 
+##### Inference performance: NVIDIA DGX-1 (1x V100 16G)
+ 
+Our results were obtained by running the `examples/unet_INFER_BENCHMARK_{TF-AMP, FP32}.sh` inferencing benchmarking script in the tensorflow:20.02-tf1-py3 NGC container on NVIDIA DGX-1 with (1x V100 16G) GPU.
+ 
+FP16
+ 
+| Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
+|------|-----------|--------|---------|--------|--------|--------|
+|   1  | 572x572x1 | 133.21 |  7.507  | 7.515  | 7.517  | 7.519  |
+|   2  | 572x572x1 | 153.45 |  13.033 | 13.046 | 13.048 | 13.052 |
+|   4  | 572x572x1 | 173.67 |  23.032 | 23.054 | 23.058 | 23.066 |
+|   8  | 572x572x1 | 181.62 |  44.047 | 49.051 | 49.067 | 50.880 |
+|  16  | 572x572x1 | 184.21 |  89.377 | 94.116 | 95.024 | 96.798 |
+ 
+FP32
+ 
+| Batch size | Resolution | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
+|------|-----------|--------|---------|---------|---------|---------|
+|   1  | 572x572x1 |  49.97 | 20.018  | 20.044  | 20.048  | 20.058  |
+|   2  | 572x572x1 |  54.30 | 36.837  | 36.865  | 36.871  | 36.881  |
+|   4  | 572x572x1 |  56.27 | 71.085  | 71.150  | 71.163  | 71.187  |
+|   8  | 572x572x1 |  58.41 | 143.347 | 154.845 | 157.047 | 161.353 |
+|  16  | 572x572x1 |  74.57 | 222.532 | 237.184 | 239.990 | 245.477 |
+ 
+To achieve these same results, follow the steps in the [Inference performance benchmark](#inference-performance-benchmark) section.
+ 
+Throughput is reported in images per second. Latency is reported in milliseconds per batch.
+ 
 ## Release notes
-
+ 
 ### Changelog
-
+ 
+February 2020
+* Updated README template
+* Added cross-validation for accuracy measurements
+* Changed optimizer to Adam and updated accuracy table
+* Updated performance values
+ 
 July 2019
+* Added inference benchmark for T4
 * Added inference example scripts
 * Added inference benchmark measuring latency
 * Added TRT/TF-TRT support
 * Updated Pre-trained model on NGC registry
-
+ 
 June 2019
 * Updated README template
-
-May 2019
+ 
+April 2019
 * Initial release
-
+ 
+ 
 ### Known issues
-
+ 
 There are no known issues in this release.
+ 
+ 
+ 
+

+ 0 - 19
TensorFlow/Segmentation/UNet_Medical/dllogger/__init__.py

@@ -1,19 +0,0 @@
-# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-
-from .logger import LOGGER, StdOutBackend, MLPerfBackend, JsonBackend, CompactBackend, Scope, AverageMeter, StandardMeter
-from . import tags
-
-__all__ = [LOGGER, StdOutBackend, MLPerfBackend, JsonBackend, CompactBackend, Scope, AverageMeter, StandardMeter, tags]

+ 0 - 60
TensorFlow/Segmentation/UNet_Medical/dllogger/autologging.py

@@ -1,60 +0,0 @@
-# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# Common values reported
-
-
-import subprocess
-import xml.etree.ElementTree as ET
-
-#TODO: print CUDA version, container version etc
-
-def log_hardware(logger):
-    # TODO: asserts - what if you cannot launch those commands?
-    # number of CPU threads
-    cpu_info_command = 'cat /proc/cpuinfo'
-    cpu_info = subprocess.run(cpu_info_command.split(), stdout=subprocess.PIPE).stdout.split()
-    cpu_num_index = len(cpu_info) - cpu_info[::-1].index(b'processor') + 1
-    cpu_num = int(cpu_info[cpu_num_index]) + 1
-
-    # CPU name
-    cpu_name_begin_index = cpu_info.index(b'name')
-    cpu_name_end_index = cpu_info.index(b'stepping')
-    cpu_name = b' '.join(cpu_info[cpu_name_begin_index + 2:cpu_name_end_index]).decode('utf-8')
-
-    logger.log(key='cpu_info', value={"num": cpu_num, "name": cpu_name})
-
-    # RAM memory
-    ram_info_command = 'free -m -h'
-    ram_info = subprocess.run(ram_info_command.split(), stdout=subprocess.PIPE).stdout.split()
-    ram_index = ram_info.index(b'Mem:') + 1
-    ram = ram_info[ram_index].decode('utf-8')
-
-    logger.log(key='mem_info', value={"ram": ram})
-
-    # GPU
-    nvidia_smi_command = 'nvidia-smi -q -x'
-    nvidia_smi_output = subprocess.run(nvidia_smi_command.split(), stdout=subprocess.PIPE).stdout
-    nvidia_smi = ET.fromstring(nvidia_smi_output)
-    gpus = nvidia_smi.findall('gpu')
-    ver = nvidia_smi.findall('driver_version')
-
-    logger.log(key="gpu_info",
-                 value={
-                      "driver_version": ver[0].text,
-                      "num": len(gpus),
-                      "name": [g.find('product_name').text for g in gpus],
-                      "mem": [g.find('fb_memory_usage').find('total').text for g in gpus]})
-
-def log_args(logger, args):
-    logger.log(key='args', value=vars(args))

+ 128 - 484
TensorFlow/Segmentation/UNet_Medical/dllogger/logger.py

@@ -12,508 +12,152 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-
-import time
+from abc import ABC, abstractmethod
+from collections import defaultdict
+from datetime import datetime
 import json
-import logging
-import inspect
-import sys
-from contextlib import contextmanager
-import functools
-from collections import OrderedDict
-import datetime
-
-from . import autologging
-
-NVLOGGER_NAME = 'nv_dl_logger'
-NVLOGGER_VERSION = '0.3.1'
-NVLOGGER_TOKEN = ':::NVLOG'
-
-MLPERF_NAME = 'mlperf_logger'
-MLPERF_VERSION = '0.5.0'
-MLPERF_TOKEN = ':::MLP'
-
-COMPACT_NAME = 'compact_logger'
-
-DEFAULT_JSON_FILENAME = 'nvlog.json'
-
-class Scope:
-    RUN = 0
-    EPOCH = 1
-    TRAIN_ITER = 2
-
-
-class Level:
-    CRITICAL = 5
-    ERROR = 4
-    WARNING = 3
-    INFO = 2
-    DEBUG = 1
-
-
-_data = OrderedDict([
-    ('model', None),
-    ('epoch', -1),
-    ('iteration', -1),
-    ('total_iteration', -1),
-    ('metrics', OrderedDict()),
-    ('timed_blocks', OrderedDict()),
-    ('current_scope', Scope.RUN)
-    ])
-
-def get_caller(root_dir=None):
-    stack_files = [s.filename.split('/')[-1] for s in inspect.stack()]
-    stack_index = 0
-    while stack_index < len(stack_files) and stack_files[stack_index] != 'logger.py':
-        stack_index += 1
-    while (stack_index < len(stack_files) and 
-            stack_files[stack_index] in ['logger.py', 'autologging.py', 'contextlib.py']):
-        stack_index += 1
-
-    caller = inspect.stack()[stack_index]
-
-    return "%s:%d" % (stack_files[stack_index], caller.lineno)
-
-class StandardMeter(object):
-
-    def __init__(self):
-        self.reset()
-
-    def reset(self):
-        self.value = None
-
-    def record(self, value):
-        self.value = value
-
-    def get_value(self):
-        return self.value
-
-    def get_last(self):
-        return self.value
-
-class AverageMeter(object):
-
-    def __init__(self):
-        self.reset()
-
-    def reset(self):
-        self.count = 0
-        self.value = 0
-        self.last = 0
-
-    def record(self, value, n = 1):
-        self.last = value
-        self.count += n
-        self.value += value * n
-
-    def get_value(self):
-        return self.value / self.count
-
-    def get_last(self):
-        return self.last
+import atexit
 
-class JsonBackend(object):
 
-    def __init__(self, log_file=DEFAULT_JSON_FILENAME, logging_scope=Scope.TRAIN_ITER,
-            iteration_interval=1):
-        self.log_file = log_file
-        self.logging_scope = logging_scope
-        self.iteration_interval = iteration_interval
+class Backend(ABC):
+    def __init__(self, verbosity):
+        self._verbosity = verbosity
 
-        self.json_log = OrderedDict([
-            ('run', OrderedDict()),
-            ('epoch', OrderedDict()),
-            ('iter', OrderedDict()),
-            ('event', OrderedDict()),
-            ])
-        
-        self.json_log['epoch']['x'] = []
-        if self.logging_scope == Scope.TRAIN_ITER:
-            self.json_log['iter']['x'] = [[]]
+    @property
+    def verbosity(self):
+        return self._verbosity
 
-    def register_metric(self, key, metric_scope):
-        if (metric_scope == Scope.TRAIN_ITER and
-                self.logging_scope == Scope.TRAIN_ITER):
-            if not key in self.json_log['iter'].keys():
-                self.json_log['iter'][key] = [[]]
-        if metric_scope == Scope.EPOCH:
-            if not key in self.json_log['epoch'].keys():
-                self.json_log['epoch'][key] = []
-
-    def log(self, key, value):
-        if _data['current_scope'] == Scope.RUN:
-            self.json_log['run'][key] = value
-        elif _data['current_scope'] == Scope.EPOCH: 
-            pass
-        elif _data['current_scope'] == Scope.TRAIN_ITER:
-            pass
-        else:
-            raise ValueError('log function for scope "', _data['current_scope'], 
-                    '" not implemented')
-
-    def log_event(self, key, value):
-        if not key in self.json_log['event'].keys():
-            self.json_log['event'][key] = []
-        entry = OrderedDict()
-        entry['epoch'] = _data['epoch']
-        entry['iter'] = _data['iteration']
-        entry['timestamp'] = time.time()
-        if value:
-            entry['value'] = value
-        self.json_log['event'][key].append(str(entry))
-
-    def log_iteration_summary(self):
-        if (self.logging_scope == Scope.TRAIN_ITER and 
-                _data['total_iteration'] % self.iteration_interval == 0):
-            for key, m in _data['metrics'].items():
-                if m.metric_scope == Scope.TRAIN_ITER:
-                    self.json_log['iter'][key][-1].append(str(m.get_last()))
-
-            # log x for iteration number
-            self.json_log['iter']['x'][-1].append(_data['iteration'])
-
-
-    def dump_json(self):
-        if self.log_file is None:
-            print(json.dumps(self.json_log, indent=4))
-        else:
-            with open(self.log_file, 'w') as f:
-                json.dump(self.json_log, fp=f, indent=4)
-
-    def log_epoch_summary(self):
-        for key, m in _data['metrics'].items():
-            if m.metric_scope == Scope.EPOCH:
-                self.json_log['epoch'][key].append(str(m.get_value()))
-            elif (m.metric_scope == Scope.TRAIN_ITER and 
-                    self.logging_scope == Scope.TRAIN_ITER):
-                # create new sublists for each iter metric in the next epoch
-                self.json_log['iter'][key].append([])
-        
-        # log x for epoch number
-        self.json_log['epoch']['x'].append(_data['epoch'])
-
-        # create new sublist for iter's x in the next epoch
-        if self.logging_scope == Scope.TRAIN_ITER:
-            self.json_log['iter']['x'].append([])
-
-        self.dump_json()
-
-    def timed_block_start(self, name):
-        pass
-
-    def timed_block_stop(self, name):
-        pass
-
-    def finish(self):
-        self.dump_json()
-
-class _ParentStdOutBackend(object):
-
-    def __init__(self, name, token, version, log_file, logging_scope, iteration_interval):
-
-        self.root_dir = None
-        self.worker = [0]
-        self.prefix = ''
-
-        self.name = name
-        self.token = token
-        self.version = version
-        self.log_file = log_file
-        self.logging_scope = logging_scope
-        self.iteration_interval = iteration_interval
-
-        self.logger = logging.getLogger(self.name)
-        self.logger.setLevel(logging.DEBUG)
-        self.logger.handlers = []
-
-        if (self.log_file is None):
-            self.stream_handler = logging.StreamHandler(stream=sys.stdout)
-            self.stream_handler.setLevel(logging.DEBUG)
-            self.logger.addHandler(self.stream_handler)
-        else:
-            self.file_handler = logging.FileHandler(self.log_file, mode='w')
-            self.file_handler.setLevel(logging.DEBUG)
-            self.logger.addHandler(self.file_handler)
-
-    def register_metric(self, key, meter=None, metric_scope=Scope.EPOCH):
-        pass
-
-    def log_epoch_summary(self):
+    @abstractmethod
+    def log(self, timestamp, elapsedtime, step, data):
         pass
 
-    def log_iteration_summary(self):
+    @abstractmethod
+    def metadata(self, timestamp, elapsedtime, metric, metadata):
         pass
 
-    def log(self, key, value):
-        if _data['current_scope'] > self.logging_scope:
-            pass
-        elif (_data['current_scope'] == Scope.TRAIN_ITER and 
-                _data['total_iteration'] % self.iteration_interval != 0):
-            pass
-        else:
-            self.log_stdout(key, value)
-
-    def log_event(self, key, value):
-        self.log_stdout(key, value)
-        
-    def log_stdout(self, key, value=None, forced=False):
-        # TODO: worker 0 
-        # only the 0-worker will log
-        #if not forced and self.worker != 0:
-        #    pass
-
-        if value is None:
-            msg = key
-        else:
-            str_json = json.dumps(str(value))
-            msg = '{key}: {value}'.format(key=key, value=str_json)
-
-        call_site = get_caller(root_dir=self.root_dir)
-        now = time.time()
 
-        message = '{prefix}{token}v{ver} {model} {secs:.9f} ({call_site}) {msg}'.format(
-            prefix=self.prefix, token=self.token, ver=self.version, secs=now, 
-            model=_data['model'],
-            call_site=call_site, msg=msg)
+class Verbosity:
+    OFF = -1
+    DEFAULT = 0
+    VERBOSE = 1
 
-        self.logger.debug(message)
 
-    def timed_block_start(self, name):
-        self.log_stdout(key=name + "_start")
-
-    def timed_block_stop(self, name):
-        self.log_stdout(key=name + "_stop")
-
-    def finish(self):
-        pass
-
-class StdOutBackend(_ParentStdOutBackend):
-
-    def __init__(self, log_file=None, logging_scope=Scope.TRAIN_ITER, iteration_interval=1):
-        _ParentStdOutBackend.__init__(self, name=NVLOGGER_NAME, token=NVLOGGER_TOKEN, 
-                version=NVLOGGER_VERSION, log_file=log_file, logging_scope=logging_scope, 
-                iteration_interval=iteration_interval)
-        
-class MLPerfBackend(_ParentStdOutBackend):
-
-    def __init__(self, log_file=None, logging_scope=Scope.TRAIN_ITER, iteration_interval=1):
-        _ParentStdOutBackend.__init__(self, name=MLPERF_NAME, token=MLPERF_TOKEN, 
-                version=MLPERF_VERSION, log_file=log_file, logging_scope=logging_scope, 
-                iteration_interval=iteration_interval)
-
-class CompactBackend(object):
-
-    def __init__(self, log_file=None, logging_scope=Scope.TRAIN_ITER, iteration_interval=1):
-        self.log_file = log_file
-        self.logging_scope = logging_scope
-        self.iteration_interval = iteration_interval
-
-        self.logger = logging.getLogger(COMPACT_NAME)
-        self.logger.setLevel(logging.DEBUG)
-        self.logger.handlers = []
-
-        if (self.log_file is None):
-            self.stream_handler = logging.StreamHandler(stream=sys.stdout)
-            self.stream_handler.setLevel(logging.DEBUG)
-            self.logger.addHandler(self.stream_handler)
-        else:
-            self.file_handler = logging.FileHandler(self.log_file, mode='w')
-            self.file_handler.setLevel(logging.DEBUG)
-            self.logger.addHandler(self.file_handler)
-    
-    def register_metric(self, key, meter=None, metric_scope=Scope.EPOCH):
-        pass
-    
-    def timestamp_prefix(self):
-        return datetime.datetime.now().strftime('[%Y-%m-%d %H:%M:%S]')
-
-    def log(self, key, value):
-        if _data['current_scope'] == Scope.RUN:
-            self.log_event(key, value)
-    
-    def log_event(self, key, value):
-        msg = self.timestamp_prefix() + ' ' + str(key)
-        if value is not None:
-            msg += ": " + str(value)
-        self.logger.debug(msg)
-    
-    def log_epoch_summary(self):
-        if self.logging_scope >= Scope.EPOCH:
-            summary = self.timestamp_prefix() + ' Epoch {:<4} '.format(str(_data['epoch']) + ':')
-            for key, m in _data['metrics'].items():
-                if m.metric_scope >= Scope.EPOCH:
-                    summary += str(key) + ": " + str(m.get_value()) + ", "
-            self.logger.debug(summary)
-
-    def log_iteration_summary(self):
-        if self.logging_scope >= Scope.TRAIN_ITER and _data['total_iteration'] % self.iteration_interval == 0:
-            summary = self.timestamp_prefix() + ' Iter {:<5} '.format(str(_data['iteration']) + ':')
-            for key, m in _data['metrics'].items():
-                if m.metric_scope == Scope.TRAIN_ITER:
-                    summary += str(key) + ": " + str(m.get_last()) + ", "
-            self.logger.debug(summary)
- 
-    def timed_block_start(self, name):
-        pass
-
-    def timed_block_stop(self, name):
-        pass
-
-    def finish(self):
-        pass
-
-class _Logger(object):
-    def __init__(self):
-
-        self.backends = [
-                CompactBackend(),
-                JsonBackend()
-                ]
-
-        self.level = Level.INFO
-   
-    def set_model_name(self, name):
-        _data['model'] = name
-
-
-    def set_backends(self, backends):
+class Logger:
+    def __init__(self, backends):
         self.backends = backends
-        
-    def register_metric(self, key, meter=None, metric_scope=Scope.EPOCH):
-        if meter is None:
-            meter = StandardMeter()
-        #TODO: move to argument of Meter?
-        meter.metric_scope = metric_scope
-        _data['metrics'][key] = meter
-        for b in self.backends:
-            b.register_metric(key, metric_scope)
-
-    def log(self, key, value=None, forced=False, level=Level.INFO):
-        if level < self.level:
-            return
+        atexit.register(self.flush)
+        self.starttime = datetime.now()
 
-        if _data['current_scope'] == Scope.TRAIN_ITER or _data['current_scope'] == Scope.EPOCH:
-            if key in _data['metrics'].keys():
-                if _data['metrics'][key].metric_scope == _data['current_scope']:
-                    _data['metrics'][key].record(value)
+    def metadata(self, metric, metadata):
+        timestamp = datetime.now()
+        elapsedtime = (timestamp - self.starttime).total_seconds()
         for b in self.backends:
-            b.log(key, value)
-
-    def debug(self, *args, **kwargs):
-        self.log(*args, level=Level.DEBUG, **kwargs)
+            b.metadata(timestamp, elapsedtime, metric, metadata)
 
-    def info(self, *args, **kwargs):
-        self.log(*args, level=Level.INFO, **kwargs)
-
-    def warning(self, *args, **kwargs):
-        self.log(*args, level=Level.WARNING, **kwargs)
-
-    def error(self, *args, **kwargs):
-        self.log(*args, level=Level.ERROR, **kwargs)
-
-    def critical(self, *args, **kwargs):
-        self.log(*args, level=Level.CRITICAL, **kwargs)
-
-    def log_event(self, key, value=None):
+    def log(self, step, data, verbosity=1):
+        timestamp = datetime.now()
+        elapsedtime = (timestamp - self.starttime).total_seconds()
         for b in self.backends:
-            b.log_event(key, value)
-    
-    def timed_block_start(self, name):
-        if not name in _data['timed_blocks']:
-            _data['timed_blocks'][name] = OrderedDict()
-        _data['timed_blocks'][name]['start'] = time.time()
-        for b in self.backends:
-            b.timed_block_start(name)
-    
-    def timed_block_stop(self, name):
-        if not name in _data['timed_blocks']:
-            raise ValueError('timed_block_stop called before timed_block_start for ' + name)
-        _data['timed_blocks'][name]['stop'] = time.time()
-        delta = _data['timed_blocks'][name]['stop'] - _data['timed_blocks'][name]['start']
-        self.log(name + '_time', delta)
-        for b in self.backends:
-            b.timed_block_stop(name)
-
-    def iteration_start(self):
-        _data['current_scope'] = Scope.TRAIN_ITER
-        _data['iteration'] += 1
-        _data['total_iteration'] += 1
+            if b.verbosity >= verbosity:
+                b.log(timestamp, elapsedtime, step, data)
 
-
-    def iteration_stop(self):
-        for b in self.backends:
-            b.log_iteration_summary()
-        _data['current_scope'] = Scope.EPOCH
-
-    def epoch_start(self):
-        _data['current_scope'] = Scope.EPOCH 
-        _data['epoch'] += 1
-        _data['iteration'] = -1
-
-        for n, m in _data['metrics'].items():
-            if m.metric_scope == Scope.TRAIN_ITER:
-                m.reset()
-
-    def epoch_stop(self):
+    def flush(self):
         for b in self.backends:
-            b.log_epoch_summary()
-        _data['current_scope'] = Scope.RUN
-
-    def finish(self):
-        for b in self.backends:
-            b.finish()
-
-    def iteration_generator_wrapper(self, gen):
-        for g in gen:
-            self.iteration_start()
-            yield g
-            self.iteration_stop()
-
-    def epoch_generator_wrapper(self, gen):
-        for g in gen:
-            self.epoch_start()
-            yield g
-            self.epoch_stop()
-
-    @contextmanager
-    def timed_block(self, prefix, value=None, forced=False):
-        """ This function helps with timed blocks
-            ----
-            Parameters:
-            prefix - one of items from TIMED_BLOCKS; the action to be timed
-            logger - NVLogger object
-            forced - if True then the events are always logged (even if it should be skipped)
-        """
-        self.timed_block_start(prefix)
-        yield self
-        self.timed_block_stop(prefix)
-
-    def log_hardware(self):
-        autologging.log_hardware(self)
-
-    def log_args(self, args):
-        autologging.log_args(self, args)
-
-    def timed_function(self, prefix, variable=None, forced=False):
-        """ This decorator helps with timed functions
-            ----
-            Parameters:
-            prefix - one of items from TIME_BLOCK; the action to be timed
-            logger - NVLogger object
-            forced - if True then the events are always logged (even if it should be skipped)
-        """
-
-        def timed_function_decorator(func):
-            @functools.wraps(func)
-            def wrapper(*args, **kwargs):
-                value = kwargs.get(variable, next(iter(args), None))
-                with self.timed_block(prefix=prefix, value=value, forced=forced):
-                    func(*args, **kwargs)
-
-            return wrapper
-
-        return timed_function_decorator
-
+            b.flush()
+
+
+def default_step_format(step):
+    return str(step)
+
+
+def default_metric_format(metric, metadata, value):
+    unit = metadata["unit"] if "unit" in metadata.keys() else ""
+    format = "{" + metadata["format"] + "}" if "format" in metadata.keys() else "{}"
+    return "{}:{} {}".format(
+        metric, format.format(value) if value is not None else value, unit
+    )
+
+
+def default_prefix_format(timestamp):
+    return "DLL {} - ".format(timestamp)
+
+
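The three `default_*` formatters above define how `StdOutBackend` renders one metric token: `metadata` may carry an optional `unit` and an optional `format` spec fragment (e.g. `:.2f`) that gets wrapped into a `str.format` placeholder. A standalone sketch of `default_metric_format` (function body copied from the hunk above; the `throughput` metric and its metadata are illustrative, not from the repo):

```python
def default_metric_format(metric, metadata, value):
    # "unit" and "format" are both optional metadata keys; `format` shadows
    # the builtin here, mirroring the repo code verbatim
    unit = metadata["unit"] if "unit" in metadata.keys() else ""
    format = "{" + metadata["format"] + "}" if "format" in metadata.keys() else "{}"
    return "{}:{} {}".format(
        metric, format.format(value) if value is not None else value, unit
    )

# ":.2f" becomes the placeholder "{:.2f}", so the value is rendered to 2 decimals
token = default_metric_format("throughput", {"unit": "img/s", "format": ":.2f"}, 1234.5)
print(token)  # → throughput:1234.50 img/s
```

Note that with no `unit` in the metadata the token still ends in a space, which is why the surrounding `log()` call joins tokens with a single `" "`.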
+class StdOutBackend(Backend):
+    def __init__(
+        self,
+        verbosity,
+        step_format=default_step_format,
+        metric_format=default_metric_format,
+        prefix_format=default_prefix_format,
+    ):
+        super().__init__(verbosity=verbosity)
+
+        self._metadata = defaultdict(dict)
+        self.step_format = step_format
+        self.metric_format = metric_format
+        self.prefix_format = prefix_format
+        self.elapsed = 0.0
+
+    def metadata(self, timestamp, elapsedtime, metric, metadata):
+        self._metadata[metric].update(metadata)
+
+    def log(self, timestamp, elapsedtime, step, data):
+        print(
+            "{}{} {} {}".format(
+                self.prefix_format(timestamp),
+                self.step_format(step),
+                " ".join(
+                    [
+                        self.metric_format(m, self._metadata[m], v)
+                        for m, v in data.items()
+                    ]
+                ),
+                "elapsed:" + str(elapsedtime)
+            )
+        )
+
+    def flush(self):
+        pass
 
-LOGGER = _Logger()
 
+class JSONStreamBackend(Backend):
+    def __init__(self, verbosity, filename):
+        super().__init__(verbosity=verbosity)
+        self._filename = filename
+        self.file = open(filename, "w")
+        atexit.register(self.file.close)
+
+    def metadata(self, timestamp, elapsedtime, metric, metadata):
+        self.file.write(
+            "DLLL {}\n".format(
+                json.dumps(
+                    dict(
+                        timestamp=str(timestamp.timestamp()),
+                        elapsedtime=str(elapsedtime),
+                        datetime=str(timestamp),
+                        type="METADATA",
+                        metric=metric,
+                        metadata=metadata,
+                    )
+                )
+            )
+        )
+
+    def log(self, timestamp, elapsedtime, step, data):
+        self.file.write(
+            "DLLL {}\n".format(
+                json.dumps(
+                    dict(
+                        timestamp=str(timestamp.timestamp()),
+                        datetime=str(timestamp),
+                        elapsedtime=str(elapsedtime),
+                        type="LOG",
+                        step=step,
+                        data=data,
+                    )
+                )
+            )
+        )
+
+    def flush(self):
+        self.file.flush()
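The new `Logger` fans each `log()` call out to every registered backend whose `verbosity` is at least the call's verbosity, stamping the record with a timestamp and the elapsed time since construction. A condensed, self-contained sketch of that routing (classes trimmed from the hunk above; `ListBackend` is a made-up in-memory stand-in for `StdOutBackend`, and the `atexit` flush hook is omitted for brevity):

```python
from abc import ABC, abstractmethod
from datetime import datetime

class Verbosity:
    OFF = -1
    DEFAULT = 0
    VERBOSE = 1

class Backend(ABC):
    def __init__(self, verbosity):
        self._verbosity = verbosity

    @property
    def verbosity(self):
        return self._verbosity

    @abstractmethod
    def log(self, timestamp, elapsedtime, step, data):
        pass

class ListBackend(Backend):
    """Hypothetical backend that collects records in memory."""
    def __init__(self, verbosity):
        super().__init__(verbosity)
        self.records = []

    def log(self, timestamp, elapsedtime, step, data):
        self.records.append((step, data))

class Logger:
    def __init__(self, backends):
        self.backends = backends
        self.starttime = datetime.now()

    def log(self, step, data, verbosity=1):
        timestamp = datetime.now()
        elapsedtime = (timestamp - self.starttime).total_seconds()
        for b in self.backends:
            # a backend only sees records at or below its own verbosity ceiling
            if b.verbosity >= verbosity:
                b.log(timestamp, elapsedtime, step, data)

verbose = ListBackend(Verbosity.VERBOSE)
quiet = ListBackend(Verbosity.DEFAULT)
logger = Logger([verbose, quiet])
logger.log(step=(0, 10), data={"loss": 0.25}, verbosity=Verbosity.VERBOSE)
logger.log(step=(0, 20), data={"loss": 0.20}, verbosity=Verbosity.DEFAULT)
# verbose receives both records; quiet drops the VERBOSE one
```

In the real commit the same mechanism lets a `JSONStreamBackend(Verbosity.VERBOSE, ...)` capture everything while a `StdOutBackend(Verbosity.DEFAULT)` stays quiet on per-step records.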

+ 0 - 255
TensorFlow/Segmentation/UNet_Medical/dllogger/tags.py

@@ -1,255 +0,0 @@
-# Copyright 2018 MLBenchmark Group. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# Common values reported
-
-VALUE_EPOCH = "epoch"
-VALUE_ITERATION = "iteration"
-VALUE_ACCURACY = "accuracy"
-VALUE_BLEU = "bleu"
-VALUE_TOP1 = "top1"
-VALUE_TOP5 = "top5"
-VALUE_BBOX_MAP = "bbox_map"
-VALUE_MASK_MAP = "mask_map"
-VALUE_BCE = "binary_cross_entropy"
-
-
-# Timed blocks (used with timed_function & timed_block
-# For each there should be *_start and *_stop tags defined
-
-RUN_BLOCK = "run"
-SETUP_BLOCK = "setup"
-PREPROC_BLOCK = "preproc"
-
-TRAIN_BLOCK = "train"
-TRAIN_PREPROC_BLOCK = "train_preproc"
-TRAIN_EPOCH_BLOCK = "train_epoch"
-TRAIN_EPOCH_PREPROC_BLOCK = "train_epoch_preproc"
-TRAIN_CHECKPOINT_BLOCK = "train_checkpoint"
-TRAIN_ITER_BLOCK = "train_iteration"
-
-EVAL_BLOCK = "eval"
-EVAL_ITER_BLOCK = "eval_iteration"
-
-#TODO: to remove?
-TIMED_BLOCKS = {
-    RUN_BLOCK,
-    SETUP_BLOCK,
-    PREPROC_BLOCK,
-    TRAIN_BLOCK,
-    TRAIN_PREPROC_BLOCK,
-    TRAIN_EPOCH_BLOCK,
-    TRAIN_EPOCH_PREPROC_BLOCK,
-    TRAIN_CHECKPOINT_BLOCK,
-    TRAIN_ITER_BLOCK,
-    EVAL_BLOCK,
-    EVAL_ITER_BLOCK,
-}
-
-
-# Events
-
-RUN_INIT = "run_init"
-
-SETUP_START = "setup_start"
-SETUP_STOP = "setup_stop"
-
-PREPROC_START = "preproc_start"
-PREPROC_STOP = "preproc_stop"
-
-RUN_START = "run_start"
-RUN_STOP = "run_stop"
-RUN_FINAL = "run_final"
-
-TRAIN_CHECKPOINT_START = "train_checkpoint_start"
-TRAIN_CHECKPOINT_STOP = "train_checkpoint_stop"
-
-TRAIN_PREPROC_START = "train_preproc_start"
-TRAIN_PREPROC_STOP = "train_preproc_stop"
-
-TRAIN_EPOCH_PREPROC_START = "train_epoch_preproc_start"
-TRAIN_EPOCH_PREPROC_STOP = "train_epoch_preproc_stop"
-
-TRAIN_ITER_START = "train_iter_start"
-TRAIN_ITER_STOP = "train_iter_stop"
-
-TRAIN_EPOCH_START = "train_epoch_start"
-TRAIN_EPOCH_STOP = "train_epoch_stop"
-
-
-# MLPerf specific tags
-
-RUN_CLEAR_CACHES = "run_clear_caches"
-
-PREPROC_NUM_TRAIN_EXAMPLES = "preproc_num_train_examples"
-PREPROC_NUM_EVAL_EXAMPLES = "preproc_num_eval_examples"
-PREPROC_TOKENIZE_TRAINING = "preproc_tokenize_training"
-PREPROC_TOKENIZE_EVAL = "preproc_tokenize_eval"
-PREPROC_VOCAB_SIZE = "preproc_vocab_size"
-
-RUN_SET_RANDOM_SEED = "run_set_random_seed"
-
-INPUT_SIZE = "input_size"
-INPUT_BATCH_SIZE = "input_batch_size"
-INPUT_ORDER = "input_order"
-INPUT_SHARD = "input_shard"
-INPUT_BN_SPAN = "input_bn_span"
-
-INPUT_CENTRAL_CROP = "input_central_crop"
-INPUT_CROP_USES_BBOXES = "input_crop_uses_bboxes"
-INPUT_DISTORTED_CROP_MIN_OBJ_COV = "input_distorted_crop_min_object_covered"
-INPUT_DISTORTED_CROP_RATIO_RANGE = "input_distorted_crop_aspect_ratio_range"
-INPUT_DISTORTED_CROP_AREA_RANGE = "input_distorted_crop_area_range"
-INPUT_DISTORTED_CROP_MAX_ATTEMPTS = "input_distorted_crop_max_attempts"
-INPUT_MEAN_SUBTRACTION = "input_mean_subtraction"
-INPUT_RANDOM_FLIP = "input_random_flip"
-
-INPUT_RESIZE = "input_resize"
-INPUT_RESIZE_ASPECT_PRESERVING = "input_resize_aspect_preserving"
-
-
-# Opt
-
-OPT_NAME = "opt_name"
-
-OPT_LR = "opt_learning_rate"
-OPT_MOMENTUM = "opt_momentum"
-
-OPT_WEIGHT_DECAY = "opt_weight_decay"
-
-OPT_HP_ADAM_BETA1 = "opt_hp_Adam_beta1"
-OPT_HP_ADAM_BETA2 = "opt_hp_Adam_beta2"
-OPT_HP_ADAM_EPSILON = "opt_hp_Adam_epsilon"
-
-OPT_LR_WARMUP_STEPS = "opt_learning_rate_warmup_steps"
-
-
-#  Train
-
-TRAIN_LOOP = "train_loop"
-TRAIN_EPOCH = "train_epoch"
-TRAIN_CHECKPOINT = "train_checkpoint"
-TRAIN_LOSS = "train_loss"
-TRAIN_ITERATION_LOSS = "train_iteration_loss"
-
-
-# Eval
-
-EVAL_START = "eval_start"
-EVAL_SIZE = "eval_size"
-EVAL_TARGET = "eval_target"
-EVAL_ACCURACY = "eval_accuracy"
-EVAL_STOP = "eval_stop"
-
-
-# Perf
-
-PERF_IT_PER_SEC = "perf_it_per_sec"
-PERF_TIME_TO_TRAIN = "time_to_train"
-
-EVAL_ITERATION_ACCURACY = "eval_iteration_accuracy"
-
-
-# Model
-
-MODEL_HP_LOSS_FN = "model_hp_loss_fn"
-
-MODEL_HP_INITIAL_SHAPE = "model_hp_initial_shape"
-MODEL_HP_FINAL_SHAPE = "model_hp_final_shape"
-
-MODEL_L2_REGULARIZATION = "model_l2_regularization"
-MODEL_EXCLUDE_BN_FROM_L2 = "model_exclude_bn_from_l2"
-
-MODEL_HP_RELU = "model_hp_relu"
-MODEL_HP_CONV2D_FIXED_PADDING = "model_hp_conv2d_fixed_padding"
-MODEL_HP_BATCH_NORM = "model_hp_batch_norm"
-MODEL_HP_DENSE = "model_hp_dense"
-
-
-# GNMT specific
-
-MODEL_HP_LOSS_SMOOTHING = "model_hp_loss_smoothing"
-MODEL_HP_NUM_LAYERS = "model_hp_num_layers"
-MODEL_HP_HIDDEN_SIZE = "model_hp_hidden_size"
-MODEL_HP_DROPOUT = "model_hp_dropout"
-
-EVAL_HP_BEAM_SIZE = "eval_hp_beam_size"
-TRAIN_HP_MAX_SEQ_LEN = "train_hp_max_sequence_length"
-EVAL_HP_MAX_SEQ_LEN = "eval_hp_max_sequence_length"
-EVAL_HP_LEN_NORM_CONST = "eval_hp_length_normalization_constant"
-EVAL_HP_LEN_NORM_FACTOR = "eval_hp_length_normalization_factor"
-EVAL_HP_COV_PENALTY_FACTOR = "eval_hp_coverage_penalty_factor"
-
-
-# NCF specific
-
-PREPROC_HP_MIN_RATINGS = "preproc_hp_min_ratings"
-PREPROC_HP_NUM_EVAL = "preproc_hp_num_eval"
-PREPROC_HP_SAMPLE_EVAL_REPLACEMENT = "preproc_hp_sample_eval_replacement"
-
-INPUT_HP_NUM_NEG = "input_hp_num_neg"
-INPUT_HP_SAMPLE_TRAIN_REPLACEMENT = "input_hp_sample_train_replacement"
-INPUT_STEP_TRAIN_NEG_GEN = "input_step_train_neg_gen"
-INPUT_STEP_EVAL_NEG_GEN = "input_step_eval_neg_gen"
-
-EVAL_HP_NUM_USERS = "eval_hp_num_users"
-EVAL_HP_NUM_NEG = "eval_hp_num_neg"
-
-MODEL_HP_MF_DIM = "model_hp_mf_dim"
-MODEL_HP_MLP_LAYER_SIZES = "model_hp_mlp_layer_sizes"
-
-
-# RESNET specific
-
-EVAL_EPOCH_OFFSET = "eval_offset"
-
-MODEL_HP_INITIAL_MAX_POOL = "model_hp_initial_max_pool"
-MODEL_HP_BEGIN_BLOCK = "model_hp_begin_block"
-MODEL_HP_END_BLOCK = "model_hp_end_block"
-MODEL_HP_BLOCK_TYPE = "model_hp_block_type"
-MODEL_HP_PROJECTION_SHORTCUT = "model_hp_projection_shortcut"
-MODEL_HP_SHORTCUT_ADD = "model_hp_shorcut_add"
-MODEL_HP_RESNET_TOPOLOGY = "model_hp_resnet_topology"
-
-
-# Transformer specific
-
-INPUT_MAX_LENGTH = "input_max_length"
-
-MODEL_HP_INITIALIZER_GAIN = "model_hp_initializer_gain"
-MODEL_HP_VOCAB_SIZE = "model_hp_vocab_size"
-MODEL_HP_NUM_HIDDEN_LAYERS = "model_hp_hidden_layers"
-MODEL_HP_EMBEDDING_SHARED_WEIGHTS = "model_hp_embedding_shared_weights"
-MODEL_HP_ATTENTION_DENSE = "model_hp_attention_dense"
-MODEL_HP_ATTENTION_DROPOUT = "model_hp_attention_dropout"
-MODEL_HP_FFN_OUTPUT_DENSE = "model_hp_ffn_output_dense"
-MODEL_HP_FFN_FILTER_DENSE = "model_hp_ffn_filter_dense"
-MODEL_HP_RELU_DROPOUT = "model_hp_relu_dropout"
-MODEL_HP_LAYER_POSTPROCESS_DROPOUT = "model_hp_layer_postprocess_dropout"
-MODEL_HP_NORM = "model_hp_norm"
-MODEL_HP_SEQ_BEAM_SEARCH = "model_hp_sequence_beam_search"
-

+ 1 - 1
TensorFlow/Segmentation/UNet_Medical/download_dataset.py

@@ -19,7 +19,7 @@ PARSER = argparse.ArgumentParser(description="U-Net medical")
 
 PARSER.add_argument('--data_dir',
                     type=str,
-                    default=1,
+                    default='./data',
                     help="""Directory where to download the dataset""")
 
 def main():
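The one-line fix above matters because `argparse` uses the `default` value verbatim when the flag is omitted: the old `default=1` handed an `int` to downstream path handling, since `type=str` only converts values supplied on the command line. A minimal sketch of the corrected parser:

```python
import argparse

parser = argparse.ArgumentParser(description="U-Net medical")
parser.add_argument('--data_dir',
                    type=str,
                    default='./data',  # was `default=1` (an int) before this commit
                    help="Directory where to download the dataset")

# With no arguments, the string default is used as-is
args = parser.parse_args([])
print(args.data_dir)  # → ./data
```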

+ 3 - 12
TensorFlow/Segmentation/UNet_Medical/examples/unet_FP32_1GPU.sh

@@ -12,16 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This script launches U-Net training in FP32 on 1 GPUs using 2 batch size
-# Usage ./unet_TRAIN_BENCHMARK_FP32_1GPU.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
+# This script launches U-Net run in FP32 on 1 GPU and trains for 40000 iterations with batch_size 1. Usage:
+# bash unet_FP32_1GPU.sh <path to dataset> <path to results directory>
 
- python $1/main.py \
-     --data_dir $2 \
-     --model_dir $3 \
-     --warmup_steps 200 \
-     --log_every 100 \
-     --max_steps 320000 \
-     --batch_size 2 \
-     --benchmark \
-     --exec_mode train_and_predict \
-     --augment
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --log_dir $2

+ 3 - 22
TensorFlow/Segmentation/UNet_Medical/examples/unet_FP32_8GPU.sh

@@ -12,26 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This script launches U-Net training in FP32 on 1 GPUs using 2 batch size
-# Usage ./unet_TRAIN_BENCHMARK_FP32_1GPU.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
+# This script launches U-Net run in FP32 on 8 GPUs and trains for 40000 iterations with batch_size 1. Usage:
+# bash unet_FP32_8GPU.sh <path to dataset> <path to results directory>
 
-mpirun \
-    -np 8 \
-    -H localhost:8 \
-    -bind-to none \
-    -map-by slot \
-    -x NCCL_DEBUG=INFO \
-    -x LD_LIBRARY_PATH \
-    -x PATH \
-    -mca pml ob1 -mca btl ^openib \
-    --allow-run-as-root \
-     python $1/main.py \
-     --data_dir $2 \
-     --model_dir $3 \
-     --warmup_steps 200 \
-     --log_every 100 \
-     --max_steps 40000 \
-     --batch_size 2 \
-     --benchmark \
-     --exec_mode train_and_predict \
-     --augment
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --log_dir $2

+ 3 - 3
TensorFlow/Segmentation/UNet_Medical/examples/unet_INFER_BENCHMARK_FP32.sh

@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This script launches U-Net training in FP32 on 1 GPUs using 2 batch size
-# Usage ./unet_INFER_BENCHMARK_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
+# This script launches U-Net run in FP32 on 1 GPU for inference benchmarking. Usage:
+# bash unet_INFER_BENCHMARK_FP32.sh <path to dataset> <path to results directory> <batch size>
 
-python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --exec_mode predict --augment --warmup_steps 200 --log_every 100 --max_steps 300
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --use_xla

+ 3 - 3
TensorFlow/Segmentation/UNet_Medical/examples/unet_INFER_BENCHMARK_TF-AMP.sh

@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This script launches U-Net inference in TF-AMP on 1 GPUs using 2 batch size
-# Usage ./unet_INFER_BENCHMARK_TF-AMP.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
+# This script launches U-Net run in FP16 on 1 GPU for inference benchmarking. Usage:
+# bash unet_INFER_BENCHMARK_TF-AMP.sh <path to dataset> <path to results directory> <batch size>
 
-python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --use_amp --exec_mode predict --augment --warmup_steps 200 --log_every 100 --max_steps 300
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --use_xla --use_amp

+ 1 - 1
TensorFlow/Segmentation/UNet_Medical/examples/unet_INFER_BENCHMARK_TF-TRT.sh

@@ -15,4 +15,4 @@
 # This script launches U-Net training in FP32 on 1 GPUs using 2 batch size
 # Usage ./unet_INFER_BENCHMARK_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
 
-python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --exec_mode predict --augment --warmup_steps 200 --log_every 100 --max_steps 300
+python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --benchmark --exec_mode predict --augment --warmup_steps 200 --log_every 100 --max_steps 300 --use_xla

+ 3 - 3
TensorFlow/Segmentation/UNet_Medical/examples/unet_INFER_FP32.sh

@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This script launches U-Net inference in FP32 on 1 GPUs
-# Usage ./unet_INFER_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
+# This script launches U-Net run in FP32 on 1 GPU for inference with batch_size 1. Usage:
+# bash unet_INFER_FP32.sh <path to dataset> <path to results directory>
 
-python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --exec_mode predict
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --use_xla

+ 3 - 3
TensorFlow/Segmentation/UNet_Medical/examples/unet_INFER_TF-AMP.sh

@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This script launches U-Net inference in TF-AMP on 1 GPUs
-# Usage ./unet_INFER_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
+# This script launches U-Net run in FP16 on 1 GPU for inference with batch size 1. Usage:
+# bash unet_INFER_TF-AMP.sh <path to dataset> <path to results directory>
 
-python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --exec_mode predict --use_amp
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --use_xla --use_amp

+ 1 - 1
TensorFlow/Segmentation/UNet_Medical/examples/unet_INFER_TF-TRT.sh

@@ -15,4 +15,4 @@
 # This script launches U-Net inference in TF-AMP on 1 GPUs
 # Usage ./unet_INFER_FP32.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
 
-python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --exec_mode predict --use_trt
+python $1/main.py --data_dir $2 --model_dir $3 --batch_size $4 --exec_mode predict --use_trt --use_xla

+ 4 - 14
TensorFlow/Segmentation/UNet_Medical/examples/unet_TF-AMP_1GPU.sh

@@ -12,17 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This script launches U-Net training in FP32 on 1 GPUs using 2 batch size
-# Usage ./unet_TF-AMP_1GPU.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
- 
-python $1/main.py \
-     --data_dir $2 \
-     --model_dir $3 \
-     --warmup_steps 200 \
-     --log_every 100 \
-     --max_steps 320000 \
-     --batch_size 2 \
-     --benchmark \
-     --use_amp \
-     --exec_mode train_and_predict \
-     --augment
+# This script launches U-Net run in FP16 on 1 GPU and trains for 40000 iterations with batch size 1. Usage:
+# bash unet_TF-AMP_1GPU.sh <path to dataset> <path to results directory>
+
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp --log_dir $2

+ 3 - 23
TensorFlow/Segmentation/UNet_Medical/examples/unet_TF-AMP_8GPU.sh

@@ -12,27 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This script launches U-Net training in FP32 on 1 GPUs using 2 batch size
-# Usage ./unet_TF-AMP_8GPU.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
+# This script launches U-Net run in FP16 on 8 GPUs and trains for 40000 iterations with batch size 1. Usage:
+# bash unet_TF-AMP_8GPU.sh <path to dataset> <path to results directory>
 
-mpirun \
-    -np 8 \
-    -H localhost:8 \
-    -bind-to none \
-    -map-by slot \
-    -x NCCL_DEBUG=INFO \
-    -x LD_LIBRARY_PATH \
-    -x PATH \
-    -mca pml ob1 -mca btl ^openib \
-    --allow-run-as-root \
-     python $1/main.py \
-     --data_dir $2 \
-     --model_dir $3 \
-     --warmup_steps 200 \
-     --log_every 100 \
-     --max_steps 40000 \
-     --batch_size 2 \
-     --benchmark \
-     --use_amp \
-     --exec_mode train_and_predict \
-     --augment
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size 1 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp --log_dir $2

+ 3 - 3
TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_BENCHMARK_FP32_1GPU.sh

@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This script launches U-Net training in FP32 on 1 GPUs using 2 batch size
-# Usage ./unet_TRAIN_BENCHMARK_FP32_1GPU.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
+# This script launches U-Net run in FP32 on 1 GPU for training benchmarking. Usage:
+# bash unet_TRAIN_BENCHMARK_FP32_1GPU.sh <path to dataset> <path to results directory> <batch size>
 
-python $1/main.py --data_dir $2 --model_dir $3 --warmup_steps 200 --log_every 100 --max_steps 300 --batch_size $4 --benchmark --exec_mode train --augment
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla

+ 3 - 13
TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_BENCHMARK_FP32_8GPU.sh

@@ -12,17 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This script launches U-Net training in FP32 on 8 GPUs
-# Usage ./unet_TRAIN_BENCHMARK_FP32_8GPU.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
+# This script launches U-Net run in FP32 on 8 GPUs for training benchmarking. Usage:
+# bash unet_TRAIN_BENCHMARK_FP32_8GPU.sh <path to dataset> <path to results directory> <batch size>
 
-mpirun \
-    -np 8 \
-    -H localhost:8 \
-    -bind-to none \
-    -map-by slot \
-    -x NCCL_DEBUG=INFO \
-    -x LD_LIBRARY_PATH \
-    -x PATH \
-    -mca pml ob1 -mca btl ^openib \
-    --allow-run-as-root \
-    python $1/main.py --data_dir $2 --model_dir $3 --warmup_steps 200 --log_every 100 --max_steps 300 --batch_size $4 --benchmark --exec_mode train --augment
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla

+ 3 - 3
TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_BENCHMARK_TF-AMP_1GPU.sh

@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This script launches U-Net training in TF-AMP on 1 GPUs using 2 batch size
-# Usage ./unet_TRAIN_BENCHMARK_TF-AMP_1GPU.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
+# This script launches U-Net run in FP16 on 1 GPU for training benchmarking. Usage:
+# bash unet_TRAIN_BENCHMARK_TF-AMP_1GPU.sh <path to dataset> <path to results directory> <batch size>
 
-python $1/main.py --data_dir $2 --model_dir $3 --warmup_steps 200 --log_every 100 --max_steps 300 --batch_size $4 --benchmark --use_amp --exec_mode train --augment
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla --use_amp

+ 3 - 13
TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_BENCHMARK_TF-AMP_8GPU.sh

@@ -12,17 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-# This script launches U-Net training in TF-AMP on 1 GPUs using 2 batch size
-# Usage ./unet_TRAIN_BENCHMARK_TF-AMP_8GPU.sh <path to this repository> <path to dataset> <path to results directory> <batch size>
+# This script launches U-Net run in FP16 on 8 GPUs for training benchmarking. Usage:
+# bash unet_TRAIN_BENCHMARK_TF-AMP_8GPU.sh <path to dataset> <path to results directory> <batch size>
 
-mpirun \
-    -np 8 \
-    -H localhost:8 \
-    -bind-to none \
-    -map-by slot \
-    -x NCCL_DEBUG=INFO \
-    -x LD_LIBRARY_PATH \
-    -x PATH \
-    -mca pml ob1 -mca btl ^openib \
-    --allow-run-as-root \
-    python $1/main.py --data_dir $2 --model_dir $3 --warmup_steps 200 --log_every 100 --max_steps 300 --batch_size $4 --benchmark --use_amp --exec_mode train --augment
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode train --augment --benchmark --warmup_steps 200 --max_steps 1000 --use_xla --use_amp
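
The benchmark scripts above warm up for 200 steps and then measure up to 1000 steps. A minimal sketch of how throughput can be reduced from per-step timings while discarding warmup — the variable names and timings below are illustrative placeholders, not the repository's actual `ProfilingHook`:

```python
# Illustrative reduction of benchmark step timings into images/sec.
# `step_times` stands in for timings a profiling hook might record.
batch_size = 8
warmup_steps = 200
step_times = [0.05] * 1000            # seconds per training step (fake data)

measured = step_times[warmup_steps:]  # drop warmup before averaging
mean_step = sum(measured) / len(measured)
throughput = batch_size / mean_step   # images processed per second
print(round(throughput, 1))           # 160.0
```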

+ 24 - 0
TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_FP32_1GPU.sh

@@ -0,0 +1,24 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This script launches U-Net run in FP32 on 1 GPU and runs 5-fold cross-validation training for 40000 iterations.
+# Usage:
+# bash unet_TRAIN_FP32_1GPU.sh <path to dataset> <path to results directory> <batch size>
+
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla > $2/log_FP32_1GPU_fold0.txt
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla > $2/log_FP32_1GPU_fold1.txt
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla > $2/log_FP32_1GPU_fold2.txt
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla > $2/log_FP32_1GPU_fold3.txt
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla > $2/log_FP32_1GPU_fold4.txt
+python utils/parse_results.py --model_dir $2 --exec_mode convergence --env FP32_1GPU
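
The five per-fold commands above differ only in `--crossvalidation_idx` and the log file name. A sketch of the same loop expressed in Python — `DATA_DIR`, `RESULTS_DIR`, and `BATCH_SIZE` are hypothetical placeholders for the script's positional arguments:

```python
# Build the five cross-validation commands from the script above.
# DATA_DIR, RESULTS_DIR and BATCH_SIZE are hypothetical placeholders.
DATA_DIR, RESULTS_DIR, BATCH_SIZE = "/data", "/results", 8

cmds = [
    f"horovodrun -np 1 python main.py --data_dir {DATA_DIR} "
    f"--model_dir {RESULTS_DIR} --log_every 100 --max_steps 40000 "
    f"--batch_size {BATCH_SIZE} --exec_mode train_and_evaluate "
    f"--crossvalidation_idx {fold} --augment --use_xla "
    f"> {RESULTS_DIR}/log_FP32_1GPU_fold{fold}.txt"
    for fold in range(5)          # one run per cross-validation fold
]
print(len(cmds))                  # 5
```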

+ 24 - 0
TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_FP32_8GPU.sh

@@ -0,0 +1,24 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This script launches U-Net run in FP32 on 8 GPUs and runs 5-fold cross-validation training for 40000 iterations.
+# Usage:
+# bash unet_TRAIN_FP32_8GPU.sh <path to dataset> <path to results directory> <batch size>
+
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla > $2/log_FP32_8GPU_fold0.txt
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla > $2/log_FP32_8GPU_fold1.txt
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla > $2/log_FP32_8GPU_fold2.txt
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla > $2/log_FP32_8GPU_fold3.txt
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla > $2/log_FP32_8GPU_fold4.txt
+python utils/parse_results.py --model_dir $2 --exec_mode convergence --env FP32_8GPU

+ 24 - 0
TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_TF-AMP_1GPU.sh

@@ -0,0 +1,24 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This script launches U-Net run in TF-AMP on 1 GPU and runs 5-fold cross-validation training for 40000 iterations.
+# Usage:
+# bash unet_TRAIN_TF-AMP_1GPU.sh <path to dataset> <path to results directory> <batch size>
+
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold0.txt
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold1.txt
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold2.txt
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold3.txt
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla --use_amp > $2/log_TF-AMP_1GPU_fold4.txt
+python utils/parse_results.py --model_dir $2 --exec_mode convergence --env TF-AMP_1GPU

+ 24 - 0
TensorFlow/Segmentation/UNet_Medical/examples/unet_TRAIN_TF-AMP_8GPU.sh

@@ -0,0 +1,24 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# This script launches U-Net run in TF-AMP on 8 GPUs and runs 5-fold cross-validation training for 40000 iterations.
+# Usage:
+# bash unet_TRAIN_TF-AMP_8GPU.sh <path to dataset> <path to results directory> <batch size>
+
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold0.txt
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 1 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold1.txt
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 2 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold2.txt
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 3 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold3.txt
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 40000 --batch_size $3 --exec_mode train_and_evaluate --crossvalidation_idx 4 --augment --use_xla --use_amp > $2/log_TF-AMP_8GPU_fold4.txt
+python utils/parse_results.py --model_dir $2 --exec_mode convergence --env TF-AMP_8GPU

+ 86 - 0
TensorFlow/Segmentation/UNet_Medical/export.py

@@ -0,0 +1,86 @@
+import argparse
+
+import tensorflow as tf
+
+from tf_exports.tf_export import to_savedmodel, to_tf_trt, to_onnx
+from utils.data_loader import Dataset
+from utils.model_fn import unet_fn
+
+PARSER = argparse.ArgumentParser(description="U-Net medical")
+
+PARSER.add_argument('--to', dest='to', choices=['savedmodel', 'tftrt', 'onnx'], required=True)
+
+PARSER.add_argument('--use_amp', dest='use_amp', action='store_true', default=False)
+PARSER.add_argument('--use_xla', dest='use_xla', action='store_true', default=False)
+PARSER.add_argument('--compress', dest='compress', action='store_true', default=False)
+
+PARSER.add_argument('--input_shape',
+                    nargs='+',
+                    type=int,
+                    help="""Model input shape: [batch, height, width, channels]""")
+
+PARSER.add_argument('--data_dir',
+                    type=str,
+                    help="""Directory where the dataset is stored""")
+
+PARSER.add_argument('--checkpoint_dir',
+                    type=str,
+                    help="""Directory where checkpoints are stored""")
+
+PARSER.add_argument('--savedmodel_dir',
+                    type=str,
+                    help="""Directory of the savedModel to convert""")
+
+PARSER.add_argument('--precision',
+                    type=str,
+                    choices=['FP32', 'FP16', 'INT8'],
+                    help="""Precision used for the TF-TRT conversion""")
+
+
+def main():
+    """
+    Starting point of the application
+    """
+    flags = PARSER.parse_args()
+
+    if flags.to == 'savedmodel':
+        to_savedmodel(input_shape=flags.input_shape,
+                      model_fn=unet_fn,
+                      checkpoint_dir=flags.checkpoint_dir,
+                      output_dir='./saved_model',
+                      input_names=['IteratorGetNext'],
+                      output_names=['total_loss_ref'],
+                      use_amp=flags.use_amp,
+                      use_xla=flags.use_xla,
+                      compress=flags.compress)
+    if flags.to == 'tftrt':
+        ds = Dataset(data_dir=flags.data_dir,
+                     batch_size=1,
+                     augment=False,
+                     gpu_id=0,
+                     num_gpus=1,
+                     seed=42)
+        iterator = ds.test_fn(count=1).make_one_shot_iterator()
+        features = iterator.get_next()
+
+        sess = tf.Session()
+
+        def input_data():
+            return {'input_tensor:0': sess.run(features)}
+
+        to_tf_trt(savedmodel_dir=flags.savedmodel_dir,
+                  output_dir='./tf_trt_model',
+                  precision=flags.precision,
+                  feed_dict_fn=input_data,
+                  num_runs=1,
+                  output_tensor_names=['Softmax:0'],
+                  compress=flags.compress)
+    if flags.to == 'onnx':
+        to_onnx(input_dir=flags.savedmodel_dir,
+                output_dir='./onnx_model',
+                compress=flags.compress)
+
+
+if __name__ == '__main__':
+    main()
+

+ 66 - 56
TensorFlow/Segmentation/UNet_Medical/main.py

@@ -24,8 +24,6 @@ Example:
 """
 
 import os
-import pickle
-import time
 
 import horovod.tensorflow as hvd
 import math
@@ -33,13 +31,12 @@ import numpy as np
 import tensorflow as tf
 from PIL import Image
 
-from dllogger import tags
-from dllogger.logger import LOGGER
 from utils.cmd_util import PARSER, _cmd_params
 from utils.data_loader import Dataset
 from utils.hooks.profiling_hook import ProfilingHook
 from utils.hooks.training_hook import TrainingHook
 from utils.model_fn import unet_fn
+from dllogger.logger import Logger, StdOutBackend, JSONStreamBackend, Verbosity
 
 
 def main(_):
@@ -48,10 +45,15 @@ def main(_):
     """
 
     flags = PARSER.parse_args()
-
     params = _cmd_params(flags)
+    np.random.seed(params.seed)
+    tf.compat.v1.random.set_random_seed(params.seed)
+    tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
 
-    tf.logging.set_verbosity(tf.logging.ERROR)
+    backends = [StdOutBackend(Verbosity.VERBOSE)]
+    if params.log_dir is not None:
+        backends.append(JSONStreamBackend(Verbosity.VERBOSE, params.log_dir))
+    logger = Logger(backends)
 
     # Optimization flags
     os.environ['CUDA_CACHE_DISABLE'] = '0'
@@ -71,95 +73,103 @@ def main(_):
     os.environ['TF_SYNC_ON_FINISH'] = '0'
     os.environ['TF_AUTOTUNE_THRESHOLD'] = '2'
 
-    if params['use_amp']:
-        os.environ['TF_ENABLE_AUTO_MIXED_PRECISION']='1'
-
+    if params.use_amp:
+        os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'
+    else:
+        os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '0'
     hvd.init()
 
     # Build run config
-    gpu_options = tf.GPUOptions()
-    config = tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True)
+    gpu_options = tf.compat.v1.GPUOptions()
+    config = tf.compat.v1.ConfigProto(gpu_options=gpu_options, allow_soft_placement=True)
+
+    if params.use_xla:
+        config.graph_options.optimizer_options.global_jit_level = tf.compat.v1.OptimizerOptions.ON_1
+
     config.gpu_options.allow_growth = True
     config.gpu_options.visible_device_list = str(hvd.local_rank())
-    config.gpu_options.force_gpu_compatible = True
-    config.intra_op_parallelism_threads = 1
-    config.inter_op_parallelism_threads = max(2, 40 // hvd.size() - 2)
 
     run_config = tf.estimator.RunConfig(
         save_summary_steps=1,
         tf_random_seed=None,
         session_config=config,
-        save_checkpoints_steps=params['max_steps'],
+        save_checkpoints_steps=params.max_steps // hvd.size(),
         keep_checkpoint_max=1)
 
     # Build the estimator model
     estimator = tf.estimator.Estimator(
         model_fn=unet_fn,
-        model_dir=params['model_dir'],
+        model_dir=params.model_dir,
         config=run_config,
         params=params)
 
-    dataset = Dataset(data_dir=params['data_dir'],
-                      batch_size=params['batch_size'],
-                      augment=params['augment'],
+    dataset = Dataset(data_dir=params.data_dir,
+                      batch_size=params.batch_size,
+                      fold=params.crossvalidation_idx,
+                      augment=params.augment,
                       gpu_id=hvd.rank(),
                       num_gpus=hvd.size(),
-                      seed=params['seed'])
+                      seed=params.seed)
 
-    if 'train' in params['exec_mode']:
+    if 'train' in params.exec_mode:
+        max_steps = params.max_steps // (1 if params.benchmark else hvd.size())
         hooks = [hvd.BroadcastGlobalVariablesHook(0),
-                 TrainingHook(params['log_every'])]
-
-        if params['benchmark']:
-            hooks.append(ProfilingHook(params['batch_size'],
-                                       params['log_every'],
-                                       params['warmup_steps']))
+                 TrainingHook(logger,
+                              max_steps=max_steps,
+                              log_every=params.log_every)]
 
-        LOGGER.log('Begin Training...')
+        if params.benchmark and hvd.rank() == 0:
+            hooks.append(ProfilingHook(logger,
+                                       batch_size=params.batch_size,
+                                       log_every=params.log_every,
+                                       warmup_steps=params.warmup_steps,
+                                       mode='train'))
 
-        LOGGER.log(tags.RUN_START)
         estimator.train(
             input_fn=dataset.train_fn,
-            steps=params['max_steps'],
+            steps=max_steps,
             hooks=hooks)
-        LOGGER.log(tags.RUN_STOP)
 
-    if 'predict' in params['exec_mode']:
+    if 'evaluate' in params.exec_mode:
+        if hvd.rank() == 0:
+            results = estimator.evaluate(input_fn=dataset.eval_fn, steps=dataset.eval_size)
+            logger.log(step=(),
+                       data={"eval_ce_loss": float(results["eval_ce_loss"]),
+                             "eval_dice_loss": float(results["eval_dice_loss"]),
+                             "eval_total_loss": float(results["eval_total_loss"]),
+                             "eval_dice_score": float(results["eval_dice_score"])})
+
+    if 'predict' in params.exec_mode:
         if hvd.rank() == 0:
             predict_steps = dataset.test_size
             hooks = None
-            if params['benchmark']:
-                hooks = [ProfilingHook(params['batch_size'],
-                                       params['log_every'],
-                                       params['warmup_steps'])]
-                predict_steps = params['warmup_steps'] * 2 * params['batch_size']
-
-            LOGGER.log('Begin Predict...')
-            LOGGER.log(tags.RUN_START)
+            if params.benchmark:
+                hooks = [ProfilingHook(logger,
+                                       batch_size=params.batch_size,
+                                       log_every=params.log_every,
+                                       warmup_steps=params.warmup_steps,
+                                       mode="test")]
+                predict_steps = params.warmup_steps * 2 * params.batch_size
 
             predictions = estimator.predict(
-                input_fn=lambda: dataset.test_fn(count=math.ceil(predict_steps/dataset.test_size)),
+                input_fn=lambda: dataset.test_fn(count=math.ceil(predict_steps / dataset.test_size)),
                 hooks=hooks)
-
             binary_masks = [np.argmax(p['logits'], axis=-1).astype(np.uint8) * 255 for p in predictions]
-            LOGGER.log(tags.RUN_STOP)
-
-            multipage_tif = [Image.fromarray(mask).resize(size=(512, 512), resample=Image.BILINEAR)
-                             for mask in binary_masks]
 
-            output_dir = os.path.join(params['model_dir'], 'pred')
+            if not params.benchmark:
+                multipage_tif = [Image.fromarray(mask).resize(size=(512, 512), resample=Image.BILINEAR)
+                                 for mask in binary_masks]
 
-            if not os.path.exists(output_dir):
-                os.makedirs(output_dir)
+                output_dir = os.path.join(params.model_dir, 'pred')
 
-            multipage_tif[0].save(os.path.join(output_dir, 'test-masks.tif'),
-                                  compression="tiff_deflate",
-                                  save_all=True,
-                                  append_images=multipage_tif[1:])
+                if not os.path.exists(output_dir):
+                    os.makedirs(output_dir)
 
-            LOGGER.log("Predict finished")
-            LOGGER.log("Results available in: {}".format(output_dir))
+                multipage_tif[0].save(os.path.join(output_dir, 'test-masks.tif'),
+                                      compression="tiff_deflate",
+                                      save_all=True,
+                                      append_images=multipage_tif[1:])
 
 
 if __name__ == '__main__':
-    tf.app.run()
+    tf.compat.v1.app.run()
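
The prediction branch above converts per-pixel class logits into a binary 0/255 mask with `np.argmax`. A small self-contained illustration of that post-processing step, using fake logits:

```python
import numpy as np

# Fake logits for a 2x2 image with 2 classes (background, foreground).
logits = np.array([[[0.2, 0.8],    # pixel (0, 0): class 1 wins
                    [0.9, 0.1]],   # pixel (0, 1): class 0 wins
                   [[0.4, 0.6],    # pixel (1, 0): class 1 wins
                    [0.7, 0.3]]])  # pixel (1, 1): class 0 wins

# Same expression as in main.py: argmax over the class axis, scaled to 0/255.
mask = np.argmax(logits, axis=-1).astype(np.uint8) * 255
print(mask.tolist())  # [[255, 0], [255, 0]]
```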

+ 2 - 2
TensorFlow/Segmentation/UNet_Medical/model/layers.py

@@ -91,7 +91,7 @@ def upsample_block(inputs, residual_input, filters, idx):
                                kernel_size=(3, 3),
                                activation=tf.nn.relu)
         return tf.layers.conv2d_transpose(inputs=out,
-                                          filters=int(filters),
+                                          filters=int(filters // 2),
                                           kernel_size=(3, 3),
                                           strides=(2, 2),
                                           padding='same',
@@ -129,7 +129,7 @@ def bottleneck(inputs, filters, mode):
         out = tf.layers.dropout(out, rate=0.5, training=training)
 
         return tf.layers.conv2d_transpose(inputs=out,
-                                          filters=filters,
+                                          filters=filters // 2,
                                           kernel_size=(3, 3),
                                           strides=(2, 2),
                                           padding='same',
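
The `filters // 2` change makes each transposed convolution halve the channel count, so the upsampled tensor matches the width of the encoder skip it is concatenated with at the next level, as in the original U-Net design. A hypothetical channel-count walkthrough (illustrative numbers, assuming the skip at the next level carries `level_filters // 2` channels):

```python
# Channel bookkeeping for one decoder step (illustrative numbers only).
level_filters = 256                 # filters at the current decoder level

up_channels = level_filters // 2    # transposed conv now emits half
skip_channels = level_filters // 2  # encoder skip at the next level up
concat_channels = up_channels + skip_channels  # input to the next convs

print(up_channels, skip_channels, concat_channels)  # 128 128 256
```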

+ 3 - 3
TensorFlow/Segmentation/UNet_Medical/model/unet.py

@@ -18,10 +18,11 @@ This module provides a convenient way to create different topologies
 based around UNet.
 
 """
+import tensorflow as tf
 from model.layers import output_block, upsample_block, bottleneck, downsample_block, input_block
 
 
-def unet_v1(inputs, mode):
+def unet_v1(features, mode):
     """ U-Net: Convolutional Networks for Biomedical Image Segmentation
 
     Source:
@@ -31,7 +32,7 @@ def unet_v1(inputs, mode):
 
     skip_connections = []
 
-    out, skip = input_block(inputs, filters=64)
+    out, skip = input_block(features, filters=64)
 
     skip_connections.append(skip)
 
@@ -46,5 +47,4 @@ def unet_v1(inputs, mode):
                              residual_input=skip_connections.pop(),
                              filters=filters,
                              idx=idx)
-
     return output_block(out, residual_input=skip_connections.pop(), filters=64, n_classes=2)
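
In `unet_v1` the `skip_connections` list is used as a stack: the encoder appends on the way down and the decoder `pop`s on the way up, so the deepest skip is consumed first. A minimal sketch of that pairing, tracking filter counts only:

```python
# LIFO pairing of encoder skips with decoder blocks (filter counts only).
skip_connections = []
for filters in [64, 128, 256, 512]:   # downsample path pushes
    skip_connections.append(filters)

decode_order = [skip_connections.pop() for _ in range(4)]  # upsample pops
print(decode_order)  # [512, 256, 128, 64]
```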

+ 3 - 2
TensorFlow/Segmentation/UNet_Medical/requirements.txt

@@ -1,2 +1,3 @@
-numpy==1.14.5
-Pillow==6.2.0
+Pillow==6.2.0
+tf2onnx
+munch

+ 270 - 0
TensorFlow/Segmentation/UNet_Medical/tf_exports/tf_export.py

@@ -0,0 +1,270 @@
+import glob
+import inspect
+import os
+import shutil
+import subprocess
+from typing import List, Callable
+
+import tensorflow as tf
+from google.protobuf import text_format
+from tensorflow.core.framework import graph_pb2
+from tensorflow.python.compiler.tensorrt import trt_convert as trt
+from tensorflow.python.framework import dtypes
+from tensorflow.python.framework import graph_io
+from tensorflow.python.platform import gfile
+from tensorflow.python.tools import optimize_for_inference_lib
+
+os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
+
+
+def _compress(src_path: str, dst_path: str):
+    """
+    Compress source path into destination path
+
+    :param src_path: (str) Source path
+    :param dst_path: (str) Destination path
+    """
+    print('[*] Compressing...')
+    shutil.make_archive(dst_path, 'zip', src_path)
+    print('[*] Compressed the contents in: {}.zip'.format(dst_path))
+
+
+def _print_input(func: Callable):
+    """
+    Decorator printing function name and args
+    :param func: (Callable) Decorated function
+    :return: Wrapped call
+    """
+
+    def wrapper(*args, **kwargs):
+        """
+        Print the name and arguments of a function
+
+        :param args: Named arguments
+        :param kwargs: Keyword arguments
+        :return: Original function call
+        """
+        tf.logging.set_verbosity(tf.logging.ERROR)
+        func_args = inspect.signature(func).bind(*args, **kwargs).arguments
+        func_args_str = ''.join('\t{} = {!r}\n'.format(*item) for item in func_args.items())
+
+        print('[*] Running \'{}\' with arguments:'.format(func.__qualname__))
+        print(func_args_str[:-1])
+
+        return func(*args, **kwargs)
+
+    return wrapper
+
+
+def _parse_placeholder_types(values: str):
+    """
+    Extracts placeholder types from a comma separate list.
+
+    :param values: (str) Placeholder types
+    :return: (List) Placeholder types
+    """
+    values = [int(value) for value in values.split(",")]
+    return values if len(values) > 1 else values[0]
+
+
+def _optimize_checkpoint_for_inference(graph_path: str,
+                                       input_names: List[str],
+                                       output_names: List[str]):
+    """
+    Removes Horovod and training related information from the graph
+
+    :param graph_path: (str) Path to the graph.pbtxt file
+    :param input_names: (str) Input node names
+    :param output_names: (str) Output node names
+    """
+
+    print('[*] Optimizing graph for inference ...')
+
+    input_graph_def = graph_pb2.GraphDef()
+    with gfile.Open(graph_path, "rb") as f:
+        data = f.read()
+        text_format.Merge(data.decode("utf-8"), input_graph_def)
+
+    output_graph_def = optimize_for_inference_lib.optimize_for_inference(
+        input_graph_def,
+        input_names,
+        output_names,
+        _parse_placeholder_types(str(dtypes.float32.as_datatype_enum)),
+        False)
+
+    print('[*] Saving original graph in: {}'.format(graph_path + '.old'))
+    shutil.move(graph_path, graph_path + '.old')
+
+    print('[*] Writing down optimized graph ...')
+    graph_io.write_graph(output_graph_def,
+                         os.path.dirname(graph_path),
+                         os.path.basename(graph_path))
+
+
+@_print_input
+def to_savedmodel(input_shape: str,
+                  model_fn: Callable,
+                  checkpoint_dir: str,
+                  output_dir: str,
+                  input_names: List[str],
+                  output_names: List[str],
+                  use_amp: bool,
+                  use_xla: bool,
+                  compress: bool):
+    """
+    Export checkpoint to Tensorflow savedModel
+
+    :param input_shape: (str) Input shape to the model in format [batch, height, width, channels]
+    :param model_fn: (Callable) Estimator's model_fn
+    :param checkpoint_dir: (str) Directory where checkpoints are stored
+    :param output_dir: (str) Output directory for storage of the generated savedModel
+    :param input_names: (List[str]) Input node names
+    :param output_names: (List[str]) Output node names
+    :param use_amp: (bool) Enable TF-AMP
+    :param use_xla: (bool) Enable XLA
+    :param compress: (bool) Compress output
+    """
+    assert os.path.exists(checkpoint_dir), 'Path not found: {}'.format(checkpoint_dir)
+    assert input_shape is not None, 'Input shape must be provided'
+
+    _optimize_checkpoint_for_inference(os.path.join(checkpoint_dir, 'graph.pbtxt'), input_names, output_names)
+
+    try:
+        ckpt_path = os.path.splitext([p for p in glob.iglob(os.path.join(checkpoint_dir, '*.index'))][0])[0]
+    except IndexError:
+        raise ValueError('Could not find checkpoint in directory: {}'.format(checkpoint_dir))
+
+    config_proto = tf.compat.v1.ConfigProto()
+
+    config_proto.allow_soft_placement = True
+    config_proto.log_device_placement = False
+    config_proto.gpu_options.allow_growth = True
+    config_proto.gpu_options.force_gpu_compatible = True
+
+    if use_amp:
+        os.environ["TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE"] = "1"
+    if use_xla:
+        config_proto.graph_options.optimizer_options.global_jit_level = tf.compat.v1.OptimizerOptions.ON_1
+
+    run_config = tf.estimator.RunConfig(
+        model_dir=None,
+        tf_random_seed=None,
+        save_summary_steps=1e9,  # disabled
+        save_checkpoints_steps=None,
+        save_checkpoints_secs=None,
+        session_config=config_proto,
+        keep_checkpoint_max=None,
+        keep_checkpoint_every_n_hours=1e9,  # disabled
+        log_step_count_steps=1e9,
+        train_distribute=None,
+        device_fn=None,
+        protocol=None,
+        eval_distribute=None,
+        experimental_distribute=None
+    )
+
+    estimator = tf.estimator.Estimator(
+        model_fn=model_fn,
+        model_dir=ckpt_path,
+        config=run_config,
+        params={'dtype': tf.float16 if use_amp else tf.float32}
+    )
+
+    print('[*] Exporting the model ...')
+
+    input_type = tf.float16 if use_amp else tf.float32
+
+    def get_serving_input_receiver_fn():
+
+        def serving_input_receiver_fn():
+            features = tf.placeholder(dtype=input_type, shape=input_shape, name='input_tensor')
+
+            return tf.estimator.export.TensorServingInputReceiver(features=features, receiver_tensors=features)
+
+        return serving_input_receiver_fn
+
+    export_path = estimator.export_saved_model(
+        export_dir_base=output_dir,
+        serving_input_receiver_fn=get_serving_input_receiver_fn(),
+        checkpoint_path=ckpt_path
+    )
+
+    print('[*] Done! path: `%s`' % export_path.decode())
+
+    if compress:
+        _compress(export_path.decode(), os.path.join(output_dir, 'saved_model'))
+
+
+@_print_input
+def to_tf_trt(savedmodel_dir: str,
+              output_dir: str,
+              precision: str,
+              feed_dict_fn: Callable,
+              num_runs: int,
+              output_tensor_names: List[str],
+              compress: bool):
+    """
+    Export Tensorflow savedModel to TF-TRT
+
+    :param savedmodel_dir: (str) Input directory containing a Tensorflow savedModel
+    :param output_dir: (str) Output directory for storage of the generated TF-TRT exported model
+    :param precision: (str) Desired precision of the network (FP32, FP16 or INT8)
+    :param feed_dict_fn: (Callable) Returns a feed_dict of input tensors for INT8 calibration. Model specific.
+    :param num_runs: (int) Number of calibration runs.
+    :param output_tensor_names: (List) Name of the output tensor for graph conversion. Model specific.
+    :param compress: (bool) Compress output
+    """
+    if savedmodel_dir is None or not os.path.exists(savedmodel_dir):
+        raise FileNotFoundError('savedmodel_dir not found: {}'.format(savedmodel_dir))
+
+    if os.path.exists(output_dir):
+        print('[*] Output dir \'{}\' is not empty. Cleaning up ...'.format(output_dir))
+        shutil.rmtree(output_dir)
+
+    print('[*] Converting model...')
+
+    converter = trt.TrtGraphConverter(input_saved_model_dir=savedmodel_dir,
+                                      precision_mode=precision)
+    converter.convert()
+
+    if precision == 'INT8':
+        print('[*] Running INT8 calibration ...')
+
+        converter.calibrate(fetch_names=output_tensor_names, num_runs=num_runs, feed_dict_fn=feed_dict_fn)
+
+    converter.save(output_dir)
+
+    print('[*] Done! TF-TRT saved_model stored in: `%s`' % output_dir)
+
+    if compress:
+        _compress(output_dir, 'tftrt_saved_model')
+
+
+@_print_input
+def to_onnx(input_dir: str, output_dir: str, compress: bool):
+    """
+    Convert Tensorflow savedModel to ONNX with tf2onnx
+
+    :param input_dir: (str) Input directory with a Tensorflow savedModel
+    :param output_dir: (str) Output directory where to store the ONNX version of the model
+    :param compress: (bool) Compress output
+    """
+    if not os.path.exists(output_dir):
+        os.makedirs(output_dir)
+
+    file_name = os.path.join(output_dir, 'model.onnx')
+    print('[*] Converting model...')
+
+    ret = subprocess.call(['python', '-m', 'tf2onnx.convert',
+                           '--saved-model', input_dir,
+                           '--output', file_name],
+                          stdout=open(os.devnull, 'w'),
+                          stderr=subprocess.STDOUT)
+    if ret > 0:
+        raise RuntimeError('tf2onnx.convert has failed with error: {}'.format(ret))
+
+    print('[*] Done! ONNX file stored in: %s' % file_name)
+
+    if compress:
+        _compress(output_dir, 'onnx_model')
+

+ 55 - 45
TensorFlow/Segmentation/UNet_Medical/utils/cmd_util.py

@@ -1,42 +1,64 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Command line argument parsing"""
 import argparse
-import tensorflow as tf
+from munch import Munch
 
 PARSER = argparse.ArgumentParser(description="UNet-medical")
 
 PARSER.add_argument('--exec_mode',
-                    choices=['train', 'train_and_predict', 'predict'],
+                    choices=['train', 'train_and_predict', 'predict', 'evaluate', 'train_and_evaluate'],
                     type=str,
-                    default='train_and_predict',
-                    help="""Which execution mode to run the model into"""
-                    )
+                    default='train_and_evaluate',
+                    help="""Execution mode in which to run the model""")
 
 PARSER.add_argument('--model_dir',
                     type=str,
                     default='./results',
-                    help="""Output directory for information related to the model"""
-                    )
+                    help="""Output directory for information related to the model""")
 
 PARSER.add_argument('--data_dir',
                     type=str,
                     required=True,
-                    help="""Input directory containing the dataset for training the model"""
-                    )
+                    help="""Input directory containing the dataset for training the model""")
+
+PARSER.add_argument('--log_dir',
+                    type=str,
+                    default=None,
+                    help="""Output directory for training logs""")
 
 PARSER.add_argument('--batch_size',
                     type=int,
                     default=1,
                     help="""Size of each minibatch per GPU""")
 
+PARSER.add_argument('--learning_rate',
+                    type=float,
+                    default=0.0001,
+                    help="""Learning rate coefficient for AdamOptimizer""")
+
+PARSER.add_argument('--crossvalidation_idx',
+                    type=int,
+                    default=None,
+                    help="""Chosen fold for cross-validation. Use None to disable cross-validation""")
+
 PARSER.add_argument('--max_steps',
                     type=int,
                     default=1000,
                     help="""Maximum number of steps (batches) used for training""")
 
-PARSER.add_argument('--seed',
-                    type=int,
-                    default=0,
-                    help="""Random seed""")
-
 PARSER.add_argument('--weight_decay',
                     type=float,
                     default=0.0005,
@@ -52,25 +74,10 @@ PARSER.add_argument('--warmup_steps',
                     default=200,
                     help="""Number of warmup steps""")
 
-PARSER.add_argument('--learning_rate',
-                    type=float,
-                    default=0.01,
-                    help="""Learning rate coefficient for SGD""")
-
-PARSER.add_argument('--momentum',
-                    type=float,
-                    default=0.99,
-                    help="""Momentum coefficient for SGD""")
-
-PARSER.add_argument('--decay_steps',
-                    type=float,
-                    default=5000,
-                    help="""Decay steps for inverse learning rate decay""")
-
-PARSER.add_argument('--decay_rate',
-                    type=float,
-                    default=0.95,
-                    help="""Decay rate for learning rate decay""")
+PARSER.add_argument('--seed',
+                    type=int,
+                    default=0,
+                    help="""Random seed""")
 
 PARSER.add_argument('--augment', dest='augment', action='store_true',
                     help="""Perform data augmentation during training""")
@@ -86,29 +93,32 @@ PARSER.add_argument('--use_amp', dest='use_amp', action='store_true',
                     help="""Train using TF-AMP""")
 PARSER.set_defaults(use_amp=False)
 
+PARSER.add_argument('--use_xla', dest='use_xla', action='store_true',
+                    help="""Train using XLA""")
+PARSER.set_defaults(use_xla=False)
+
 PARSER.add_argument('--use_trt', dest='use_trt', action='store_true',
                     help="""Use TF-TRT""")
 PARSER.set_defaults(use_trt=False)
 
 
 def _cmd_params(flags):
-    return {
+    return Munch({
+        'exec_mode': flags.exec_mode,
         'model_dir': flags.model_dir,
-        'batch_size': flags.batch_size,
         'data_dir': flags.data_dir,
+        'log_dir': flags.log_dir,
+        'batch_size': flags.batch_size,
+        'learning_rate': flags.learning_rate,
+        'crossvalidation_idx': flags.crossvalidation_idx,
         'max_steps': flags.max_steps,
         'weight_decay': flags.weight_decay,
-        'dtype': tf.float32,
-        'learning_rate': flags.learning_rate,
-        'momentum': flags.momentum,
-        'benchmark': flags.benchmark,
+        'log_every': flags.log_every,
+        'warmup_steps': flags.warmup_steps,
         'augment': flags.augment,
-        'exec_mode': flags.exec_mode,
+        'benchmark': flags.benchmark,
         'seed': flags.seed,
         'use_amp': flags.use_amp,
         'use_trt': flags.use_trt,
-        'log_every': flags.log_every,
-        'warmup_steps': flags.warmup_steps,
-        'decay_steps': flags.decay_steps,
-        'decay_rate': flags.decay_rate,
-    }
+        'use_xla': flags.use_xla,
+    })
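
Returning a `Munch` means downstream code can read `params.learning_rate` as well as `params['learning_rate']`. A minimal stand-in showing the behavior (illustrative only; the real class comes from the third-party `munch` package):

```python
class AttrDict(dict):
    """Tiny Munch-like dict: attribute access mirrors key access."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


# Hypothetical parameter values, mirroring the defaults above.
params = AttrDict({'learning_rate': 0.0001, 'batch_size': 1})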

+ 53 - 40
TensorFlow/Segmentation/UNet_Medical/utils/data_loader.py

@@ -13,32 +13,39 @@
 # limitations under the License.
 
 """ Dataset class encapsulates the data loading"""
-import math
-import os
 import multiprocessing
+import os
+from collections import deque
 
-import tensorflow as tf
 import numpy as np
+import tensorflow as tf
 from PIL import Image, ImageSequence
 
 
 class Dataset():
     """Load, separate and prepare the data for training and prediction"""
 
-    def __init__(self, data_dir, batch_size, augment=False, gpu_id=0, num_gpus=1, seed=0):
+    def __init__(self, data_dir, batch_size, fold=1, augment=False, gpu_id=0, num_gpus=1, seed=0):
+        if not os.path.exists(data_dir):
+            raise FileNotFoundError('Cannot find data dir: {}'.format(data_dir))
+
         self._data_dir = data_dir
         self._batch_size = batch_size
         self._augment = augment
 
         self._seed = seed
 
-        self._train_images = \
-            self._load_multipage_tiff(os.path.join(self._data_dir, 'train-volume.tif'))
-        self._train_masks = \
-            self._load_multipage_tiff(os.path.join(self._data_dir, 'train-labels.tif'))
+        images = self._load_multipage_tiff(os.path.join(self._data_dir, 'train-volume.tif'))
+        masks = self._load_multipage_tiff(os.path.join(self._data_dir, 'train-labels.tif'))
         self._test_images = \
             self._load_multipage_tiff(os.path.join(self._data_dir, 'test-volume.tif'))
 
+        train_indices, val_indices = self._get_val_train_indices(len(images), fold)
+        self._train_images = images[train_indices]
+        self._train_masks = masks[train_indices]
+        self._val_images = images[val_indices]
+        self._val_masks = masks[val_indices]
+
         self._num_gpus = num_gpus
         self._gpu_id = gpu_id
 
@@ -46,6 +53,10 @@ class Dataset():
     def train_size(self):
         return len(self._train_images)
 
+    @property
+    def eval_size(self):
+        return len(self._val_images)
+
     @property
     def test_size(self):
         return len(self._test_images)
@@ -54,6 +65,22 @@ class Dataset():
         """Load tiff images containing many images in the channel dimension"""
         return np.array([np.array(p) for p in ImageSequence.Iterator(Image.open(path))])
 
+    def _get_val_train_indices(self, length, fold, ratio=0.8):
+        assert 0 < ratio <= 1, "Train/total data ratio must be in range (0.0, 1.0]"
+        np.random.seed(self._seed)
+        indices = np.arange(0, length, 1, dtype=np.int)
+        np.random.shuffle(indices)
+        if fold is not None:
+            indices = deque(indices)
+            indices.rotate(fold * int((1.0 - ratio) * length))
+            indices = np.array(indices)
+            train_indices = indices[:int(ratio * len(indices))]
+            val_indices = indices[int(ratio * len(indices)):]
+        else:
+            train_indices = indices
+            val_indices = []
+        return train_indices, val_indices
+
     def _normalize_inputs(self, inputs):
         """Normalize inputs"""
         inputs = tf.expand_dims(tf.cast(inputs, tf.float32), -1)
@@ -61,7 +88,7 @@ class Dataset():
         # Center around zero
         inputs = tf.divide(inputs, 127.5) - 1
 
-        inputs = tf.image.resize_images(inputs, (392, 392))
+        inputs = tf.image.resize_images(inputs, (388, 388))
 
         return tf.image.resize_image_with_crop_or_pad(inputs, 572, 572)
 
@@ -98,30 +125,6 @@ class Dataset():
             inputs = tf.expand_dims(inputs, 0)
             labels = tf.expand_dims(labels, 0)
 
-            # Elastic deformation
-
-            alpha = tf.random.uniform([], minval=0, maxval=34)
-
-            # Create random vector flows
-            delta_x = tf.random.uniform([1, 4, 4, 1], minval=-1, maxval=1)
-            delta_y = tf.random.uniform([1, 4, 4, 1], minval=-1, maxval=1)
-
-            # Build 2D flow and apply
-            flow = tf.concat([delta_x, delta_y], axis=-1) * alpha
-            inputs = tf.contrib.image.dense_image_warp(inputs,
-                                                       tf.image.resize_images(flow, (572, 572)))
-            labels = tf.contrib.image.dense_image_warp(labels,
-                                                       tf.image.resize_images(flow, (572, 572)))
-
-            # Rotation invariance
-
-            # Rotate by random angle\
-            radian = tf.random_uniform([], maxval=360) * math.pi / 180
-            inputs = tf.contrib.image.rotate(inputs, radian)
-            labels = tf.contrib.image.rotate(labels, radian)
-
-            # Shift invariance
-
             # Random crop and resize
             left = tf.random_uniform([]) * 0.3
             right = 1 - tf.random_uniform([]) * 0.3
@@ -147,17 +150,27 @@ class Dataset():
 
         return (inputs, labels)
 
-    def train_fn(self):
+    def train_fn(self, drop_remainder=False):
         """Input function for training"""
         dataset = tf.data.Dataset.from_tensor_slices(
             (self._train_images, self._train_masks))
-        dataset = dataset.shuffle(self._batch_size * 3)
-        dataset = dataset.repeat()
         dataset = dataset.shard(self._num_gpus, self._gpu_id)
-        dataset = dataset.apply(
-            tf.data.experimental.map_and_batch(map_func=self._preproc_samples,
-                                               batch_size=self._batch_size,
-                                               num_parallel_calls=multiprocessing.cpu_count()))
+        dataset = dataset.repeat()
+        dataset = dataset.shuffle(self._batch_size * 3)
+        dataset = dataset.map(self._preproc_samples,
+                              num_parallel_calls=multiprocessing.cpu_count() // self._num_gpus)
+        dataset = dataset.batch(self._batch_size, drop_remainder=drop_remainder)
+        dataset = dataset.prefetch(self._batch_size)
+
+        return dataset
+
+    def eval_fn(self, count=1):
+        """Input function for validation"""
+        dataset = tf.data.Dataset.from_tensor_slices(
+            (self._val_images, self._val_masks))
+        dataset = dataset.repeat(count=count)
+        dataset = dataset.map(self._preproc_samples)
+        dataset = dataset.batch(self._batch_size)
         dataset = dataset.prefetch(self._batch_size)
 
         return dataset
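
The deque-based fold rotation in `_get_val_train_indices` above can be illustrated in isolation. This is a minimal NumPy sketch of the same split: each fold rotates the shuffled indices by `fold * (1 - ratio) * length` before the 80/20 cut (function name here is illustrative):

```python
from collections import deque

import numpy as np


def get_val_train_indices(length, fold, seed=0, ratio=0.8):
    """Shuffle indices once, rotate by the fold offset, then split
    the first `ratio` fraction into train and the rest into val."""
    np.random.seed(seed)
    indices = np.arange(length, dtype=np.int64)
    np.random.shuffle(indices)
    if fold is None:
        return indices, np.array([], dtype=np.int64)
    rotated = deque(indices)
    rotated.rotate(fold * int((1.0 - ratio) * length))
    rotated = np.array(rotated)
    cut = int(ratio * len(rotated))
    return rotated[:cut], rotated[cut:]


train_idx, val_idx = get_val_train_indices(30, fold=1)
```

Every fold sees the same shuffled ordering, so rotating by the validation-chunk size walks a different 20% slice into the validation position each time.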

+ 14 - 12
TensorFlow/Segmentation/UNet_Medical/utils/hooks/profiling_hook.py

@@ -14,26 +14,26 @@
 
 import time
 
+import numpy as np
 import tensorflow as tf
 import horovod.tensorflow as hvd
 
-from dllogger import LOGGER, tags, AverageMeter
+from utils.parse_results import process_performance_stats
 
 
-class ProfilingHook(tf.train.SessionRunHook):
+class ProfilingHook(tf.estimator.SessionRunHook):
 
-    def __init__(self, batch_size, log_every, warmup_steps):
+    def __init__(self, logger, batch_size, log_every, warmup_steps, mode):
         self._log_every = log_every
         self._warmup_steps = warmup_steps
         self._current_step = 0
         self._global_batch_size = batch_size * hvd.size()
-        self._meter = AverageMeter()
         self._t0 = 0
+        self._timestamps = []
+        self.logger = logger
+        self.mode = mode
 
     def before_run(self, run_context):
-        if self._current_step % self._log_every == 0:
-            LOGGER.log('iter_start', self._current_step)
-
         if self._current_step > self._warmup_steps:
             self._t0 = time.time()
 
@@ -41,14 +41,16 @@ class ProfilingHook(tf.train.SessionRunHook):
                   run_context,
                   run_values):
         if self._current_step > self._warmup_steps:
-            batch_time = time.time() - self._t0
-            ips = self._global_batch_size / batch_time
-            self._meter.record(ips)
-
+            self._timestamps.append(time.time() - self._t0)
         self._current_step += 1
 
     def begin(self):
         pass
 
     def end(self, session):
-        LOGGER.log('average_images_per_second', self._meter.get_value())
+        if hvd.rank() == 0:
+            throughput_imgps, latency_ms = process_performance_stats(np.array(self._timestamps),
+                                                                     self._global_batch_size)
+            self.logger.log(step=(),
+                            data={'throughput_{}'.format(self.mode): throughput_imgps,
+                                  'latency_{}'.format(self.mode): latency_ms})

+ 11 - 9
TensorFlow/Segmentation/UNet_Medical/utils/hooks/training_hook.py

@@ -13,18 +13,19 @@
 # limitations under the License.
 
 import tensorflow as tf
+import horovod.tensorflow as hvd
 
-from dllogger import LOGGER, tags
 
+class TrainingHook(tf.estimator.SessionRunHook):
 
-class TrainingHook(tf.train.SessionRunHook):
-
-    def __init__(self, log_every=1):
+    def __init__(self, logger, max_steps, log_every=1):
         self._log_every = log_every
         self._iter_idx = 0
+        self.logger = logger
+        self.max_steps = max_steps
 
     def before_run(self, run_context):
-        run_args = tf.train.SessionRunArgs(
+        run_args = tf.estimator.SessionRunArgs(
             fetches=[
                 'cross_loss_ref:0',
                 'dice_loss_ref:0',
@@ -39,8 +40,9 @@ class TrainingHook(tf.train.SessionRunHook):
                   run_values):
         cross_loss, dice_loss, total_loss = run_values.results
 
-        if self._iter_idx % self._log_every == 0:
-            LOGGER.log('cross_loss', cross_loss)
-            LOGGER.log('dice_loss', dice_loss)
-            LOGGER.log('total_loss', total_loss)
+        if (self._iter_idx % self._log_every == 0) and (hvd.rank() == 0):
+            self.logger.log(step=(self._iter_idx, self.max_steps),
+                            data={'train_ce_loss': float(cross_loss),
+                                  'train_dice_loss': float(dice_loss),
+                                  'train_total_loss': float(total_loss)})
         self._iter_idx += 1

+ 24 - 21
TensorFlow/Segmentation/UNet_Medical/utils/model_fn.py

@@ -29,10 +29,9 @@ Example:
 
 """
 import os
-import tensorflow as tf
-import horovod.tensorflow as hvd
 
-from dllogger.logger import LOGGER
+import horovod.tensorflow as hvd
+import tensorflow as tf
 
 from model.unet import unet_v1
 
@@ -92,22 +91,19 @@ def unet_fn(features, labels, mode, params):
         Appropriate tf.estimator.EstimatorSpec for the current mode
 
     """
-    dtype = params['dtype']
-    max_steps = params['max_steps']
-    lr_init = params['learning_rate']
-    momentum = params['momentum']
+    dtype = tf.float32
 
     device = '/gpu:0'
 
-    global_step = tf.train.get_global_step()
-    learning_rate = tf.train.exponential_decay(lr_init, global_step,
-                                               decay_steps=max_steps,
-                                               decay_rate=0.96)
+    global_step = tf.compat.v1.train.get_global_step()
+
+    if mode == tf.estimator.ModeKeys.TRAIN:
+        lr_init = params.learning_rate
 
     with tf.device(device):
         features = tf.cast(features, dtype)
 
-        output_map = unet_v1(features, mode)
+        output_map = unet_v1(features=features, mode=mode)
 
         if mode == tf.estimator.ModeKeys.PREDICT:
             predictions = {'logits': tf.nn.softmax(output_map, axis=-1)}
@@ -120,24 +116,31 @@ def unet_fn(features, labels, mode, params):
         flat_labels = tf.reshape(labels,
                                  [tf.shape(output_map)[0], -1, n_classes])
 
-        crossentropy_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=flat_logits,
-                                                                                      labels=flat_labels),
-                                           name='cross_loss_ref')
-        dice_loss = tf.reduce_mean(1 - dice_coef(flat_logits, flat_labels), name='dice_loss_ref')
-
+        crossentropy_loss = tf.reduce_mean(
+            tf.nn.softmax_cross_entropy_with_logits_v2(logits=flat_logits,
+                                                       labels=flat_labels), name='cross_loss_ref')
+        dice_loss = tf.reduce_mean(1 - dice_coef(tf.keras.activations.softmax(flat_logits, axis=-1),
+                                                 flat_labels), name='dice_loss_ref')
         total_loss = tf.add(crossentropy_loss, dice_loss, name="total_loss_ref")
 
-        opt = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=momentum)
+        if mode == tf.estimator.ModeKeys.EVAL:
+            eval_metric_ops = {"eval_ce_loss": tf.compat.v1.metrics.mean(crossentropy_loss),
+                               "eval_dice_loss": tf.compat.v1.metrics.mean(dice_loss),
+                               "eval_total_loss": tf.compat.v1.metrics.mean(total_loss),
+                               "eval_dice_score": tf.compat.v1.metrics.mean(1.0 - dice_loss)}
+            return tf.estimator.EstimatorSpec(mode=mode, loss=dice_loss, eval_metric_ops=eval_metric_ops)
+
+        opt = tf.compat.v1.train.AdamOptimizer(learning_rate=lr_init)
 
         if is_using_hvd():
             opt = hvd.DistributedOptimizer(opt, device_dense='/gpu:0')
 
-        with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
+        with tf.control_dependencies(tf.compat.v1.get_collection(tf.compat.v1.GraphKeys.UPDATE_OPS)):
             deterministic = True
             gate_gradients = (
-                tf.train.Optimizer.GATE_OP
+                tf.compat.v1.train.Optimizer.GATE_OP
                 if deterministic
-                else tf.train.Optimizer.GATE_NONE)
+                else tf.compat.v1.train.Optimizer.GATE_NONE)
 
             train_op = opt.minimize(total_loss, gate_gradients=gate_gradients, global_step=global_step)
 

+ 80 - 0
TensorFlow/Segmentation/UNet_Medical/utils/parse_results.py

@@ -0,0 +1,80 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+import numpy as np
+
+import argparse
+
+
+def process_performance_stats(timestamps, batch_size):
+    timestamps_ms = 1000 * timestamps
+    timestamps_ms = timestamps_ms[timestamps_ms > 0]
+    latency_ms = timestamps_ms.mean()
+    std = timestamps_ms.std()
+    n = np.sqrt(len(timestamps_ms))
+    throughput_imgps = (1000.0 * batch_size / timestamps_ms).mean()
+    print('Throughput Avg:', round(throughput_imgps, 3), 'img/s')
+    print('Latency Avg:', round(latency_ms, 3), 'ms')
+    for ci, lvl in zip(["90%:", "95%:", "99%:"],
+                       [1.645, 1.960, 2.576]):
+        print("Latency", ci, round(latency_ms + lvl * std / n, 3), "ms")
+    return float(throughput_imgps), float(latency_ms)
+
+
+def parse_convergence_results(path, environment):
+    dice_scores = []
+    ce_scores = []
+    logfiles = [f for f in os.listdir(path) if "log" in f and environment in f]
+    if not logfiles:
+        raise FileNotFoundError("No logfile found at {}".format(path))
+    for logfile in logfiles:
+        with open(os.path.join(path, logfile), "r") as f:
+            content = f.readlines()
+        if "eval_dice_score" not in content[-1]:
+            print("Evaluation score not found. The file", logfile, "might be corrupted.")
+            continue
+        dice_scores.append(float([val for val in content[-1].split()
+                                  if "eval_dice_score" in val][0].split(":")[1]))
+        ce_scores.append(float([val for val in content[-1].split()
+                                if "eval_ce_loss" in val][0].split(":")[1]))
+    if dice_scores:
+        print("Evaluation dice score:", sum(dice_scores) / len(dice_scores))
+        print("Evaluation cross-entropy loss:", sum(ce_scores) / len(ce_scores))
+    else:
+        print("All logfiles were corrupted, no loss was obtained.")
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description="UNet-medical-utils")
+
+    parser.add_argument('--exec_mode',
+                        choices=['convergence', 'benchmark'],
+                        type=str,
+                        help="""Which execution mode to run the model into""")
+
+    parser.add_argument('--model_dir',
+                        type=str,
+                        required=True)
+
+    parser.add_argument('--env',
+                        choices=['FP32_1GPU', 'FP32_8GPU', 'TF-AMP_1GPU', 'TF-AMP_8GPU'],
+                        type=str,
+                        required=True)
+
+    args = parser.parse_args()
+    if args.exec_mode == 'convergence':
+        parse_convergence_results(path=args.model_dir, environment=args.env)
+    elif args.exec_mode == 'benchmark':
+        pass
+    print()
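
As a worked check of `process_performance_stats` above: latency is the mean per-batch time in ms, throughput is the mean of per-batch `batch_size / time`, and the upper bounds add `z * std / sqrt(n)` for the usual z-scores. A small self-contained sketch under the same formulas:

```python
import numpy as np


def perf_stats(timestamps_s, batch_size):
    """Mirror process_performance_stats: mean latency in ms, mean
    per-batch throughput in img/s, and z-based latency upper bounds."""
    ms = 1000.0 * np.asarray(timestamps_s, dtype=np.float64)
    ms = ms[ms > 0]  # drop warmup/zero entries
    latency = ms.mean()
    throughput = (1000.0 * batch_size / ms).mean()
    n = np.sqrt(len(ms))
    bounds = {ci: latency + z * ms.std() / n
              for ci, z in [("90%", 1.645), ("95%", 1.960), ("99%", 2.576)]}
    return throughput, latency, bounds


# Four identical 100 ms batches of 8 images: 80 img/s, 100 ms latency.
tp, lat, bounds = perf_stats([0.1, 0.1, 0.1, 0.1], batch_size=8)
```

With zero variance the confidence bounds collapse onto the mean, which is a handy sanity check when validating the benchmark parser.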

+ 0 - 106
TensorFlow/Segmentation/UNet_Medical/utils/var_storage.py

@@ -1,106 +0,0 @@
-#!/usr/bin/env python
-# -*- coding: utf-8 -*-
-
-# ==============================================================================
-# Copyright 2018 The TensorFlow Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ==============================================================================
-
-import tensorflow as tf
-
-from dllogger.logger import LOGGER
-
-__all__ = ['model_variable_scope']
-
-
-def model_variable_scope(name, reuse=False, dtype=tf.float32, debug_mode=False, *args, **kwargs):
-    """Returns a variable scope that the model should be created under.
-    If self.dtype is a castable type, model variable will be created in fp32
-    then cast to self.dtype before being used.
-    Returns:
-      A variable scope for the model.
-    """
-
-    def _custom_dtype_getter(getter, name, shape=None, dtype=None, trainable=True, regularizer=None, *args, **kwargs):
-        """Creates variables in fp32, then casts to fp16 if necessary.
-        This function is a custom getter. A custom getter is a function with the
-        same signature as tf.get_variable, except it has an additional getter
-        parameter. Custom getters can be passed as the `custom_getter` parameter of
-        tf.variable_scope. Then, tf.get_variable will call the custom getter,
-        instead of directly getting a variable itself. This can be used to change
-        the types of variables that are retrieved with tf.get_variable.
-        The `getter` parameter is the underlying variable getter, that would have
-        been called if no custom getter was used. Custom getters typically get a
-        variable with `getter`, then modify it in some way.
-        This custom getter will create an fp32 variable. If a low precision
-        (e.g. float16) variable was requested it will then cast the variable to the
-        requested dtype. The reason we do not directly create variables in low
-        precision dtypes is that applying small gradients to such variables may
-        cause the variable not to change.
-        Args:
-          getter: The underlying variable getter, that has the same signature as
-            tf.get_variable and returns a variable.
-          name: The name of the variable to get.
-          shape: The shape of the variable to get.
-          *args: Additional arguments to pass unmodified to getter.
-          **kwargs: Additional keyword arguments to pass unmodified to getter.
-        Returns:
-          A variable which is cast to fp16 if necessary.
-        """
-
-        storage_dtype = tf.float32 if dtype in [tf.float32, tf.float16] else dtype
-
-        variable = getter(
-            name,
-            shape,
-            dtype=storage_dtype,
-            trainable=trainable,
-            regularizer=(
-                regularizer if
-                (trainable and not any(l_name.lower() in name.lower()
-                                       for l_name in ['batchnorm', 'batch_norm'])) else None
-            ),
-            *args,
-            **kwargs
-        )
-
-        if dtype != tf.float32:
-            cast_name = name + '/fp16_cast'
-
-            try:
-                cast_variable = tf.get_default_graph().get_tensor_by_name(cast_name + ':0')
-
-            except KeyError:
-                cast_variable = tf.cast(variable, dtype, name=cast_name)
-
-            cast_variable._ref = variable._ref
-            variable = cast_variable
-
-        if debug_mode:
-
-            LOGGER.log(
-                "Var Name: `%s`\n\t"
-                "[*] dtype before cast: %s\n\t"
-                "[*] dtype after cast: %s\n\t"
-                "[*] target dtype: %s\n\t"
-                "[*] trainable: %s\n\t"
-                "[*] shape: %s\n" % (
-                    variable.name, str(storage_dtype), str(variable.dtype), dtype, trainable,
-                    str([int(x) for x in shape])
-                )
-            )
-
-        return variable
-
-    return tf.variable_scope(name, reuse=reuse, dtype=dtype, custom_getter=_custom_dtype_getter, *args, **kwargs)
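The deleted `model_variable_scope` helper implements the "fp32 master weights" pattern the docstring describes: variables are stored in float32 and only cast down for compute, because applying many small gradient updates directly to a low-precision variable can round every update away. The effect can be sketched without TensorFlow; `to_low_precision` below is a hypothetical stand-in for a float16 cast, not part of the original code:

```python
# Framework-agnostic sketch of why the deleted getter kept an fp32 master copy.
# A tiny update applied to a low-precision value rounds back to the old value,
# while the same updates accumulate correctly in the high-precision copy.

def to_low_precision(x, digits=3):
    # Stand-in for a float16 cast: keep only a few significant digits.
    return float(f"{x:.{digits}g}")

master = 1.0    # high-precision "fp32" storage, updated exactly
low_only = 1.0  # what happens if the variable itself is low precision
grad = 1e-4     # a small gradient step

for _ in range(100):
    master = master - grad                        # accumulates in fp32
    low_only = to_low_precision(low_only - grad)  # rounds away each step

print(master)    # ~0.99: the 100 small updates accumulated
print(low_only)  # 1.0: every individual update was lost to rounding
```

This is the same reason the getter casts the fp32 variable to fp16 for the forward pass (`tf.cast(variable, dtype, ...)`) while the optimizer still updates the fp32 storage.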