
Merge: [nnUNet/TF2] Update container to 22.11, fix XLA+channel last conv, multi-gpu binding script

Krzysztof Kudrynski 3 years ago
parent
commit
eb3571096d

+ 2 - 1
TensorFlow2/Segmentation/nnUNet/Dockerfile

@@ -1,4 +1,4 @@
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:22.04-tf2-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:22.11-tf2-py3
 FROM ${FROM_IMAGE_NAME}
 
 RUN pip install nvidia-pyindex
@@ -13,6 +13,7 @@ RUN unzip -qq awscliv2.zip
 RUN ./aws/install
 RUN rm -rf awscliv2.zip aws
 
+ENV OMP_NUM_THREADS=2
 ENV TF_CPP_MIN_LOG_LEVEL 3
 ENV OMPI_MCA_coll_hcoll_enable 0
 ENV HCOLL_ENABLE_MCAST 0 

+ 163 - 162
TensorFlow2/Segmentation/nnUNet/README.md

@@ -58,15 +58,15 @@ This model is trained with mixed precision using Tensor Cores on Volta, Turing,
 
 The nnU-Net allows training two types of networks: 2D U-Net and 3D U-Net to perform semantic segmentation of 2D or 3D images, with high accuracy and performance.
 
-The following figure shows the architecture of the 3D U-Net model and its different components. U-Net is composed of a contractive and an expanding path, that aims at building a bottleneck in its centermost part through a combination of convolution, instance norm and leaky relu operations. After this bottleneck, the image is reconstructed through a combination of convolutions and upsampling. Skip connections are added with the goal of helping the backward flow of gradients in order to improve the training.
+The following figure shows the architecture of the 3D U-Net model and its different components. U-Net is composed of a contractive and an expanding path, that aims at building a bottleneck in its centermost part through a combination of convolution, instance norm, and leaky ReLU operations. After this bottleneck, the image is reconstructed through a combination of convolutions and upsampling. Skip connections are added with the goal of helping the backward flow of gradients to improve the training.
 
 <img src="images/unet3d.png" width="900"/>
-    
+ 
 *Figure 1: The 3D U-Net architecture*
 
 ### Default configuration
 
-All convolution blocks in U-Net in both encoder and decoder are using two convolution layers followed by instance normalization and a leaky ReLU nonlinearity. For downsampling we are using stride convolution whereas transposed convolution for upsampling.
+All convolution blocks in U-Net, in both the encoder and decoder, use two convolution layers followed by instance normalization and a leaky ReLU nonlinearity. For downsampling, we use strided convolution, whereas transposed convolution is used for upsampling.
 
 All models were trained with the Adam optimizer. For loss function we use the average of [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy) and [dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient).
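The loss described above (the average of cross-entropy and the Dice coefficient) can be sketched in NumPy. This is an illustration of the formula only, not the repository's TensorFlow implementation; the function name and arguments are hypothetical:

```python
import numpy as np

def dice_ce_loss(probs, target, eps=1e-6):
    """Average of soft Dice loss and cross-entropy.

    probs:  (N, C) predicted class probabilities (rows sum to 1)
    target: (N,)   integer class labels
    """
    n, c = probs.shape
    onehot = np.eye(c)[target]  # (N, C) one-hot labels

    # Soft Dice loss, averaged over classes.
    intersection = (probs * onehot).sum(axis=0)
    dice = (2 * intersection + eps) / (probs.sum(axis=0) + onehot.sum(axis=0) + eps)
    dice_loss = 1.0 - dice.mean()

    # Cross-entropy of the true class.
    ce_loss = -np.log(probs[np.arange(n), target] + eps).mean()

    return 0.5 * (dice_loss + ce_loss)
```

A perfect prediction drives both terms, and hence the combined loss, toward zero.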
 
@@ -83,30 +83,26 @@ The following features are supported by this model:
 |Horovod Multi-GPU (NCCL) | Yes        
 |[XLA](https://www.tensorflow.org/xla) | Yes
 
-         
+ 
 #### Features
 
 **DALI**
-
-NVIDIA Data Loading Library (DALI) is a collection of highly optimized building blocks, and an execution engine, to accelerate the pre-processing of the input data for deep learning applications. DALI provides both the performance and the flexibility for accelerating different data pipelines as a single library. This single library can then be easily integrated into different deep learning training and inference applications. For details, refer to example sources in this repository or refer to the [DALI documentation](https://docs.nvidia.com/deeplearning/dali/index.html).
+NVIDIA Data Loading Library (DALI) is a collection of optimized building blocks, and an execution engine, to speed up the pre-processing of the input data for deep learning applications. DALI provides both the performance and the flexibility for accelerating different data pipelines as a single library. This single library can then be integrated into different deep learning training and inference applications. For details, refer to example sources in this repository or refer to the [DALI documentation](https://docs.nvidia.com/deeplearning/dali/index.html).
 
 **Automatic Mixed Precision (AMP)**
-
-Computation graphs can be modified by TensorFlow during runtime to support mixed precision training, which allows to use FP16 training with FP32 master weights. A detailed explanation of mixed precision can be found in the next section.
-    
+Computation graphs can be modified by TensorFlow during runtime to support mixed precision training, which allows using FP16 training with FP32 master weights. A detailed explanation of mixed precision can be found in the next section.
+ 
 **Multi-GPU training with Horovod**
 Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use. For more information about how to get started with Horovod, refer to the [Horovod: Official repository](https://github.com/horovod/horovod).
 Our model uses Horovod to implement efficient multi-GPU training with NCCL. For details, refer to example scripts in this repository or refer to the [TensorFlow tutorial](https://github.com/horovod/horovod/#usage).
 
 **XLA**
-
-XLA (Accelerated Linear Algebra) is a compiler which can accelerate TensorFlow networks by model-specific optimizations i.e. fusing multiple GPU operations together.
-Operations fused into a single GPU kernel do not have to use additional memory to store intermediate values by keeping them entirely in GPU registers, therefore reducing memory operations and improving performance. For details refer to the [TensorFlow documentation](https://www.tensorflow.org/xla).
-
+XLA (Accelerated Linear Algebra) is a compiler that can speed up TensorFlow networks through model-specific optimizations, such as fusing multiple GPU operations together.
+Operations fused into a single GPU kernel do not have to use extra memory to store intermediate values by keeping them in GPU registers, thus reducing memory operations and improving performance. For details, refer to the [TensorFlow documentation](https://www.tensorflow.org/xla).
 
 ### Mixed precision training
 
-Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
+Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in the half-precision format while storing minimal information in single-precision to keep as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
 
 1. Porting the model to use the FP16 data type where appropriate.
 2. Adding loss scaling to preserve small gradient values.
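Step 2 can be illustrated with plain NumPy arithmetic: a gradient too small to represent in FP16 survives if the loss is scaled up before the half-precision backward pass and the gradient is unscaled in FP32 afterwards. The scale factor below is an assumed value for illustration:

```python
import numpy as np

SCALE = 1024.0  # an assumed static loss-scale factor

def scale_and_unscale(grad):
    """Mimic loss scaling: the FP16 backward pass sees the scaled value,
    and the gradient is unscaled in FP32 before the weight update."""
    fp16_grad = np.float16(grad * SCALE)  # survives FP16 quantization
    return np.float32(fp16_grad) / SCALE
```

Without scaling, a value like `1e-8` underflows to zero in FP16; with scaling, it is recovered to within a fraction of a percent.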
@@ -121,9 +117,9 @@ For information about:
 
 #### Enabling mixed precision
 
-Mixed precision is enabled in TensorFlow by using the Automatic Mixed Precision (TF-AMP) extension which casts variables to half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients. In TensorFlow, loss scaling can be applied statically by using simple multiplication of loss by a constant value or automatically, by TF-AMP. Automatic mixed precision makes all the adjustments internally in TensorFlow, providing two benefits over manual operations. First, programmers need not modify network model code, reducing development and maintenance effort. Second, using AMP maintains forward and backward compatibility with all the APIs for defining and running TensorFlow models.
+Mixed precision is enabled in TensorFlow by using the Automatic Mixed Precision (TF-AMP) extension which casts variables to half-precision upon retrieval, while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients. In TensorFlow, loss scaling can be applied statically by using simple multiplication of loss by a constant value or automatically, by TF-AMP. Automatic mixed precision makes all the adjustments internally in TensorFlow, providing two benefits over manual operations. First, programmers need not modify network model code, reducing development and maintenance efforts. Second, using AMP maintains forward and backward compatibility with all the APIs for defining and running TensorFlow models.
 
-Example nnU-Net scripts for training, inference and benchmarking from the `scripts/` directory enable mixed precision if `--amp` command line flag is used.
+Example nnU-Net scripts for training, inference, and benchmarking from the `scripts/` directory enable mixed precision if the `--amp` command line flag is used.
 
 Internally, mixed precision is enabled by setting `keras.mixed_precision` policy to `mixed_float16`. Additionally, our custom training loop uses a `LossScaleOptimizer` wrapper for the optimizer. For more information see the [Mixed precision guide](#mixed-precision-training).
 
@@ -131,7 +127,7 @@ Internally, mixed precision is enabled by setting `keras.mixed_precision` policy
 
 TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling the matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on NVIDIA Volta GPUs. 
 
-TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
+TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require a high dynamic range for weights or activations.
 
 For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
 
@@ -140,17 +136,14 @@ TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by defaul
 ### Glossary
 
 **Deep supervision**
-
-Deep supervision is a technique which adds auxiliary loss outputs to the U-Net decoder layers. For nnU-Net, we add auxiliary losses to three latest decoder levels. Final loss is a weighted average of the obtained loss values. Deep supervision can be enabled by adding the `--deep-supervision` flag.
+Deep supervision is a technique that adds auxiliary loss outputs to the U-Net decoder layers. For nnU-Net, we add auxiliary losses to the last three decoder levels. The final loss is a weighted average of the obtained loss values. Deep supervision can be enabled by adding the `--deep-supervision` flag.
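As a sketch of how such a weighted average could be formed (the halving weights below are an assumption for illustration, not necessarily the exact weighting used here):

```python
def deep_supervision_loss(losses, decay=0.5):
    """Weighted average of per-level losses.

    losses[0] is the full-resolution output loss; each auxiliary
    (lower-resolution) head gets `decay` times the previous weight.
    The halving is an assumed choice for illustration.
    """
    weights = [decay ** i for i in range(len(losses))]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

With this weighting, errors at the full-resolution output dominate the final loss.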
 
 **Test time augmentation**
-
-Test time augmentation is an inference technique which averages predictions from augmented images with its prediction. As a result, predictions are more accurate, but with the cost of a slower inference process. For nnU-Net, we use all possible flip combinations for image augmenting. Test time augmentation can be enabled by adding the `--tta` flag to the training or inference script invocation.
+Test time augmentation is an inference technique that averages the prediction for the original image with predictions for its augmented versions. As a result, predictions are more accurate, but at the cost of a slower inference process. For nnU-Net, we use all possible flip combinations for image augmentation. Test time augmentation can be enabled by adding the `--tta` flag to the training or inference script invocation.
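The flip-and-average idea can be sketched for a 2D image (4 flip combinations; the 3D analogue uses all 8). The `predict` callable stands in for the network and is an assumption of this sketch:

```python
import numpy as np

def tta_predict(predict, image):
    """Average predictions over all flip combinations of a 2D image."""
    preds = []
    for flip_h in (False, True):
        for flip_w in (False, True):
            axes = tuple(ax for ax, f in ((0, flip_h), (1, flip_w)) if f)
            aug = np.flip(image, axis=axes) if axes else image
            pred = predict(aug)
            # Undo the flip so predictions align with the original image.
            preds.append(np.flip(pred, axis=axes) if axes else pred)
    return np.mean(preds, axis=0)
```

For an equivariant model (here, the identity), the averaged prediction equals the plain prediction, which is a quick sanity check of the flip bookkeeping.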
 
 **Sliding window inference**
-
-During inference this method replaces an input image with arbitrary resolution with a batch of overlapping windows, which cover the whole input. After passing this batch through the network a prediction with the original resolution is reassembled. Predicted values inside overlapped regions are obtained from a weighted average.
-Overlap ratio and weights for the average (i.e. blending mode) can be adjusted with the `--overlap` and `--blend-mode` options respectively.
+During inference, this method replaces an arbitrary-resolution input image with a batch of overlapping windows that cover the whole input. After passing this batch through the network, a prediction with the original resolution is reassembled. Predicted values inside overlapped regions are obtained from a weighted average.
+The overlap ratio and the weights used for the average (i.e., the blending mode) can be adjusted with the `--overlap` and `--blend-mode` options.
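A 1D sketch of the idea, using uniform ("constant") blending weights; the function and variable names are illustrative, not the repository's API:

```python
import numpy as np

def sliding_window_infer(predict, volume, window, overlap=0.5):
    """Cover a 1D volume with overlapping windows and average predictions."""
    step = max(1, int(window * (1 - overlap)))
    acc = np.zeros_like(volume, dtype=np.float64)
    norm = np.zeros_like(volume, dtype=np.float64)
    starts = list(range(0, max(len(volume) - window, 0) + 1, step))
    if starts[-1] + window < len(volume):  # make sure the tail is covered
        starts.append(len(volume) - window)
    for s in starts:
        acc[s:s + window] += predict(volume[s:s + window])
        norm[s:s + window] += 1.0  # uniform blending weight per window
    return acc / norm  # weighted average inside overlapped regions
```

Gaussian blending would replace the `+= 1.0` weight with a bell-shaped profile that downweights window borders.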
 
 ## Setup
 
@@ -159,23 +152,23 @@ The following section lists the requirements that you need to meet in order to s
 ### Requirements
 
 This repository contains Dockerfile which extends the TensorFlow 2 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
--   [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
--   TensorFlow2 22.04-py3+ NGC container
+- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+- TensorFlow2 22.11-py3+ NGC container
 -   Supported GPUs:
-    - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
-    - [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
-    - [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
+    - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
+    - [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
+    - [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
 
 For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
--   [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
--   [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
+- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
+- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
 -   Running [TensorFlow](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/running.html#running)
-  
+ 
 For those unable to use the TensorFlow NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
 
 ## Quick Start Guide
 
-To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the nnUNet model on the [Medical Segmentation Decathlon](http://medicaldecathlon.com/) dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section.
+To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the nnUNet model on the [Medical Segmentation Decathlon](http://medicaldecathlon.com/) dataset. For the specifics on training and inference, see the [Advanced](#advanced) section.
 
 1. Clone the repository.
 
@@ -184,27 +177,27 @@ Executing this command will create your local repository with all the code to ru
 git clone https://github.com/NVIDIA/DeepLearningExamples
 cd DeepLearningExamples/TensorFlow2/Segmentation/nnUNet
 ```
-    
+ 
 2. Build the nnU-Net TensorFlow2 NGC container.
-    
-This command will use the Dockerfile to create a Docker image named `nnunet`, downloading all the required components automatically.
+ 
+This command will use the Dockerfile to create a Docker image named `nnunet`, downloading all the required components.
 
 ```
 docker build -t nnunet .
 ```
-    
+ 
 The NGC container contains all the components optimized for usage on NVIDIA hardware.
-    
+ 
 3. Start an interactive session in the NGC container to run preprocessing/training/inference.
-    
+ 
 The following command will launch the container and mount the `./data` directory as a volume to the `/data` directory inside the container, and `./results` directory to the `/results` directory in the container.
-    
+ 
 ```
 mkdir data results
-docker run -it --runtime=nvidia --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --rm -v ${PWD}/data:/data -v ${PWD}/results:/results nnunet:latest /bin/bash
+docker run -it --privileged --runtime=nvidia --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 --rm -v ${PWD}/data:/data -v ${PWD}/results:/results nnunet:latest /bin/bash
 ```
 
-4. Prepare BraTS (MSD 01 task) dataset.
+4. Prepare the BraTS (MSD 01 task) dataset.
 
 To download and preprocess the data run:
 ```
@@ -218,10 +211,10 @@ Then `ls /data` should print:
 01_3d_tf2 01_2d_tf2 Task01_BrainTumour
 ```
 
-For the specifics concerning data preprocessing, see the [Getting the data](#getting-the-data) section.
-    
+For the specifics on data preprocessing, see the [Getting the data](#getting-the-data) section.
+ 
 5. Start training.
-   
+ 
 Training can be started with:
 ```
 python scripts/train.py --gpus <gpus> --fold <fold> --dim <dim> [--amp]
@@ -240,21 +233,22 @@ python scripts/benchmark.py --mode {train,predict} --gpus <ngpus> --dim {2,3} --
 To see descriptions of the benchmark script arguments run `python scripts/benchmark.py --help`.
 
 
+
 7. Start inference/predictions.
-   
+ 
 Inference can be started with:
 ```
 python scripts/inference.py --data <path/to/data> --dim <dim> --fold <fold> --ckpt-dir <path/to/checkpoint> [--amp] [--tta] [--save-preds]
 ```
 
-Note: You have to prepare either validation or test dataset to run this script by running `python preprocess.py --task 01 --dim {2,3} --exec_mode {val,test}`. After preprocessing inside a given task directory (e.g. `/data/01_3d/` for task 01 and dim 3) it will create a `val` or `test` directory with preprocessed data ready for inference. Possible workflow:
+Note: You have to prepare either a validation or a test dataset to run this script by running `python preprocess.py --task 01 --dim {2,3} --exec_mode {val,test}`. After preprocessing inside a given task directory (e.g. `/data/01_3d` for task 01 and dim 3), it will create a `val` or `test` directory with preprocessed data ready for inference. A possible workflow:
 
 ```
 python preprocess.py --task 01 --dim 3 --exec_mode val
 python scripts/inference.py --data /data/01_3d/val --dim 3 --fold 0 --ckpt-dir <path/to/checkpoint> --amp --tta --save-preds
 ```
 
-Then if you have labels for predicted images you can evaluate it with `evaluate.py` script. For example:
+Then, if you have labels for the predicted images, you can evaluate them with the `evaluate.py` script. For example:
 
 ```
 python evaluate.py --preds /results/preds_task_01_dim_3_fold_0_amp_tta --lbls /data/Task01_BrainTumour/labelsTr
@@ -263,7 +257,7 @@ python evaluate.py --preds /results/preds_task_01_dim_3_fold_0_amp_tta --lbls /d
 To see descriptions of the inference script arguments run `python scripts/inference.py --help`. You can customize the inference process. For details, see the [Inference process](#inference-process) section.
 
 Now that you have your model trained and evaluated, you can choose to compare your training results with our [Training accuracy results](#training-accuracy-results). You can also choose to benchmark your performance against the [Training performance benchmark](#training-performance-results) or the [Inference performance benchmark](#inference-performance-results). Following the steps in these sections will ensure that you achieve the same accuracy and performance results as stated in the [Results](#results) section.
-    
+ 
 ## Advanced
 
 The following sections provide greater details of the dataset, running training and inference, and the training results.
@@ -278,27 +272,27 @@ In the root directory, the most important files are:
 * `Dockerfile`: Container with the basic set of dependencies to run nnU-Net.
 * `requirements.txt:` Set of extra requirements for running nnU-Net.
 * `evaluate.py`: Compare predictions with ground truth and get the final score.
-    
+ 
 The `data_preprocessing` folder contains information about the data preprocessing used by nnU-Net. Its contents are:
-    
+ 
 * `configs.py`: Defines dataset configuration like patch size or spacing.
 * `preprocessor.py`: Implements data preprocessing pipeline.
 
-The `data_loading` folder contains information about the data loading pipeline used by nnU-Net. Its contents are:
-    
+The `data_loading` folder contains information about the data-loading pipeline used by nnU-Net. Its contents are:
+ 
 * `data_module.py`: Defines a data module managing datasets and splits (similar to PyTorch Lightning `DataModule`)
 * `dali_loader.py`: Implements DALI data loading pipelines.
 * `utils.py`: Defines auxiliary functions used for data loading.
 
 The `models` folder contains information about the building blocks of nnU-Net and the way they are assembled. Its contents are:
-    
-* `layers.py`: Implements convolution blocks used by U-Net template.
+ 
+* `layers.py`: Implements convolution blocks used by the U-Net template.
 * `nn_unet.py`: Implements training/validation/test logic and dynamic creation of U-Net architecture used by nnU-Net.
 * `sliding_window.py`: Implements sliding window inference used by evaluation and prediction loops.
 * `unet.py`: Implements the U-Net template.
 
-The `runtime` folder contains information about training, inference and evaluation logic. Its contents are:
-    
+The `runtime` folder contains information about training, inference, and evaluation logic. Its contents are:
+ 
 * `args.py`: Defines command line arguments.
 * `checkpoint.py`: Implements checkpoint saving.
 * `logging.py`: Defines logging utilities along with wandb.io integration.
@@ -310,7 +304,7 @@ The `runtime` folder contains information about training, inference and evaluati
 Other folders included in the root directory are:
 
 * `images/`: Contains a model diagram.
-* `scripts/`: Provides scripts for training, benchmarking and inference of nnU-Net.
+* `scripts/`: Provides scripts for training, benchmarking, and inference of nnU-Net.
 
 ### Command-line options
 
@@ -321,24 +315,26 @@ To see the full list of available options and their descriptions, use the `-h` o
 The following example output is printed when running the model:
 
 ```
-usage: main.py [-h] [--exec-mode {train,evaluate,predict,export}] [--data DATA] [--task TASK] [--dim {2,3}] [--seed SEED] [--benchmark]
-               [--tta [BOOLEAN]] [--save-preds [BOOLEAN]] [--sw-benchmark [BOOLEAN]] [--results RESULTS] [--logname LOGNAME] [--quiet]
-               [--use-dllogger [BOOLEAN]] [--amp [BOOLEAN]] [--xla [BOOLEAN]] [--read-roi [BOOLEAN]] [--batch-size BATCH_SIZE]
-               [--learning-rate LEARNING_RATE] [--momentum MOMENTUM] [--scheduler {none,poly,cosine,cosine_annealing}]
-               [--end-learning-rate END_LEARNING_RATE] [--cosine-annealing-first-cycle-steps COSINE_ANNEALING_FIRST_CYCLE_STEPS]
-               [--cosine-annealing-peak-decay COSINE_ANNEALING_PEAK_DECAY] [--optimizer {sgd,adam,radam}] [--deep-supervision [BOOLEAN]]
-               [--lookahead [BOOLEAN]] [--weight-decay WEIGHT_DECAY] [--loss-batch-reduction [BOOLEAN]]
-               [--loss-include-background [BOOLEAN]] [--negative-slope NEGATIVE_SLOPE] [--norm {instance,batch,group,none}]
-               [--ckpt-strategy {last_and_best,last_only,none}] [--ckpt-dir CKPT_DIR] [--saved-model-dir SAVED_MODEL_DIR]
-               [--resume-training] [--nvol NVOL] [--data2d-dim {2,3}] [--oversampling OVERSAMPLING] [--num-workers NUM_WORKERS]
-               [--dali-use-cpu [BOOLEAN]] [--sw-batch-size SW_BATCH_SIZE] [--overlap OVERLAP] [--blend {gaussian,constant}]
-               [--nfolds NFOLDS] [--fold FOLD] [--epochs EPOCHS] [--skip-eval SKIP_EVAL] [--steps-per-epoch STEPS_PER_EPOCH]
-               [--bench-steps BENCH_STEPS] [--warmup-steps WARMUP_STEPS]
+usage: main.py [-h] [--exec-mode {train,evaluate,predict,export,nav}] [--gpus GPUS] [--data DATA] [--task TASK] [--dim {2,3}]
+               [--seed SEED] [--benchmark] [--tta [BOOLEAN]] [--save-preds [BOOLEAN]] [--sw-benchmark [BOOLEAN]]
+               [--results RESULTS] [--logname LOGNAME] [--quiet] [--use-dllogger [BOOLEAN]] [--amp [BOOLEAN]] [--xla [BOOLEAN]]
+               [--read-roi [BOOLEAN]] [--batch-size BATCH_SIZE] [--learning-rate LEARNING_RATE] [--momentum MOMENTUM]
+               [--scheduler {none,poly,cosine,cosine_annealing}] [--end-learning-rate END_LEARNING_RATE]
+               [--cosine-annealing-first-cycle-steps COSINE_ANNEALING_FIRST_CYCLE_STEPS]
+               [--cosine-annealing-peak-decay COSINE_ANNEALING_PEAK_DECAY] [--optimizer {sgd,adam,radam}]
+               [--deep-supervision [BOOLEAN]] [--lookahead [BOOLEAN]] [--weight-decay WEIGHT_DECAY]
+               [--loss-batch-reduction [BOOLEAN]] [--loss-include-background [BOOLEAN]] [--negative-slope NEGATIVE_SLOPE]
+               [--norm {instance,batch,group,none}] [--ckpt-strategy {last_and_best,last_only,none}] [--ckpt-dir CKPT_DIR]
+               [--saved-model-dir SAVED_MODEL_DIR] [--resume-training] [--load_sm [BOOLEAN]] [--validate [BOOLEAN]] [--nvol NVOL]
+               [--oversampling OVERSAMPLING] [--num-workers NUM_WORKERS] [--sw-batch-size SW_BATCH_SIZE] [--overlap OVERLAP]
+               [--blend {gaussian,constant}] [--nfolds NFOLDS] [--fold FOLD] [--epochs EPOCHS] [--skip-eval SKIP_EVAL]
+               [--steps-per-epoch STEPS_PER_EPOCH] [--bench-steps BENCH_STEPS] [--warmup-steps WARMUP_STEPS]
 
 optional arguments:
   -h, --help            show this help message and exit
-  --exec-mode {train,evaluate,predict,export}, --exec_mode {train,evaluate,predict,export}
+  --exec-mode {train,evaluate,predict,export,nav}, --exec_mode {train,evaluate,predict,export,nav}
                         Execution mode to run the model (default: train)
+  --gpus GPUS
   --data DATA           Path to data directory (default: /data)
   --task TASK           Task number, MSD uses numbers 01-10 (default: 01)
   --dim {2,3}           UNet dimension (default: 3)
@@ -365,7 +361,7 @@ optional arguments:
   --scheduler {none,poly,cosine,cosine_annealing}
                         Learning rate scheduler (default: none)
   --end-learning-rate END_LEARNING_RATE
-                        End learning rate for poly scheduler (default: 0.0001)
+                        End learning rate for poly scheduler (default: 5e-05)
   --cosine-annealing-first-cycle-steps COSINE_ANNEALING_FIRST_CYCLE_STEPS
                         Length of a cosine decay cycle in steps, only with 'cosine_annealing' scheduler (default: 512)
   --cosine-annealing-peak-decay COSINE_ANNEALING_PEAK_DECAY
@@ -393,14 +389,13 @@ optional arguments:
                         Path to saved model directory (for evaluation and prediction) (default: None)
   --resume-training, --resume_training
                         Resume training from the last checkpoint (default: False)
-  --nvol NVOL           Number of volumes which come into single batch size for 2D model (default: 4)
-  --data2d-dim {2,3}    Input data dimension for 2d model (default: 3)
+  --load_sm [BOOLEAN]   Load exported savedmodel (default: False)
+  --validate [BOOLEAN]  Validate exported savedmodel (default: False)
+  --nvol NVOL           Number of volumes which come into single batch size for 2D model (default: 2)
   --oversampling OVERSAMPLING
                         Probability of crop to have some region with positive label (default: 0.33)
   --num-workers NUM_WORKERS
                         Number of subprocesses to use for data loading (default: 8)
-  --dali-use-cpu [BOOLEAN]
-                        Use CPU for data augmentation instead of GPU (default: False)
   --sw-batch-size SW_BATCH_SIZE
                         Sliding window inference batch size (default: 2)
   --overlap OVERLAP     Amount of overlap between scans during sliding window inference (default: 0.5)
@@ -425,11 +420,11 @@ The nnU-Net model was trained on the [Medical Segmentation Decathlon](http://med
 
 #### Dataset guidelines
 
-To train nnU-Net you will need to preprocess your dataset as a first step with `preprocess.py` script. Run `python scripts/preprocess.py --help` to see descriptions of the preprocess script arguments.
+To train nnU-Net, you will need to preprocess your dataset as the first step with the `preprocess.py` script. Run `python scripts/preprocess.py --help` to see descriptions of the preprocess script arguments.
 
 For example to preprocess data for 3D U-Net run: `python preprocess.py --task 01 --dim 3`.
 
-In `data_preprocessing/configs.py` for each [Medical Segmentation Decathlon](http://medicaldecathlon.com/) task there are defined: patch size, precomputed spacings and statistics for CT datasets.
+In `data_preprocessing/configs.py`, the patch size, precomputed spacings, and statistics for CT datasets are defined for each [Medical Segmentation Decathlon](http://medicaldecathlon.com/) task.
 
 The preprocessing pipeline consists of the following steps:
 
@@ -437,50 +432,49 @@ The preprocessing pipeline consists of the following steps:
 2. Resampling to the median voxel spacing of their respective dataset (exception for anisotropic datasets where the lowest resolution axis is selected to be the 10th percentile of the spacings).
 3. Padding volumes so that dimensions are at least as large as the patch size.
 4. Normalizing:
-    * For CT modalities the voxel values are clipped to 0.5 and 99.5 percentiles of the foreground voxels and then data is normalized with mean and standard deviation collected from foreground voxels.
-    * For MRI modalities z-score normalization is applied.
+ * For CT modalities the voxel values are clipped to 0.5 and 99.5 percentiles of the foreground voxels and then data is normalized with mean and standard deviation collected from foreground voxels.
+ * For MRI modalities z-score normalization is applied.
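The two normalization branches can be sketched as follows. This is a simplified illustration (the actual preprocessor computes foreground masks per modality); the function name and signature are assumptions:

```python
import numpy as np

def normalize(volume, modality, foreground=None):
    """CT: clip to the 0.5/99.5 percentiles of foreground voxels, then
    standardize with foreground mean/std.  MRI: plain z-score."""
    if modality == "CT":
        fg = volume[foreground] if foreground is not None else volume
        lo, hi = np.percentile(fg, [0.5, 99.5])
        volume = np.clip(volume, lo, hi)
        return (volume - fg.mean()) / (fg.std() + 1e-8)
    # MRI: z-score normalization over the whole volume
    return (volume - volume.mean()) / (volume.std() + 1e-8)
```

After the MRI branch, the volume has zero mean and unit standard deviation.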
 
 #### Multi-dataset
 
-It is possible to run nnUNet on a custom dataset. If your dataset correspond to [Medical Segmentation Decathlon](http://medicaldecathlon.com/) (i.e. data should be `NIfTi` format and there should be `dataset.json` file where you need to provide fields: modality, labels and at least one of training, test) you need to perform the following:
+It is possible to run nnUNet on a custom dataset. If your dataset follows the [Medical Segmentation Decathlon](http://medicaldecathlon.com/) convention (i.e., data in `NIfTi` format and a `dataset.json` file providing the fields modality, labels, and at least one of training and test), you need to perform the following:
 
 1. Mount your dataset to `/data` directory.
  
 2. In `data_preprocessing/config.py`:
-    - Add to the `task_dir` dictionary your dataset directory name. For example, for the Brain Tumour dataset, it corresponds to `"01": "Task01_BrainTumour"`.
-    - Add the patch size that you want to use for training to the `patch_size` dictionary. For example, for the Brain Tumour dataset it corresponds to `"01_3d": [128, 128, 128]` for 3D U-Net and `"01_2d": [192, 160]` for 2D U-Net. There are three types of suffixes `_3d, _2d` corresponding to 3D UNet and 2D U-Net.
+ - Add to the `task_dir` dictionary your dataset directory name. For example, for the Brain Tumour dataset, it corresponds to `"01": "Task01_BrainTumour"`.
+ - Add the patch size that you want to use for training to the `patch_size` dictionary. For example, for the Brain Tumour dataset it corresponds to `"01_3d": [128, 128, 128]` for 3D U-Net and `"01_2d": [192, 160]` for 2D U-Net. There are two suffixes, `_3d` and `_2d`, corresponding to 3D U-Net and 2D U-Net.
 
-3. Preprocess your data with `preprocess.py` scripts. For example, to preprocess Brain Tumour dataset for 2D U-Net you should run `python preprocess.py --task 01 --dim 2`.
+3. Preprocess your data with `preprocess.py` scripts. For example, to preprocess the Brain Tumour dataset for 2D U-Net you should run `python preprocess.py --task 01 --dim 2`.
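
The two dictionary entries from step 2 can be sketched as follows (the task `12` entries are hypothetical placeholders for a custom dataset):

```python
# Illustrative entries matching the Brain Tumour example above;
# in the repository these dictionaries live in data_preprocessing/config.py.
task_dir = {
    "01": "Task01_BrainTumour",
    # "12": "Task12_MyDataset",  # your custom dataset directory
}

patch_size = {
    "01_3d": [128, 128, 128],  # 3D U-Net patch size
    "01_2d": [192, 160],       # 2D U-Net patch size
    # "12_3d": [96, 96, 96],   # hypothetical patch size for your dataset
}
```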
 
 ### Training process
 
 The model trains for at most `--epochs` epochs. After each epoch, evaluation on the validation set is performed and validation metrics are monitored for checkpoint updating
 (see `--ckpt-strategy` flag). Default training settings are:
-* The Adam optimizer with learning rate of 0.0003 and weight decay 0.0001.
+* The Adam optimizer with a learning rate of 0.0003 and weight decay of 0.0001.
 * Training batch size is set to 2.
 
-This default parametrization is applied when running scripts from the `scripts/` directory and when running `main.py` without explicitly overriding these parameters. By default, using scripts from `scripts/` directory will not use AMP. To enable AMP, pass the `--amp` flag. AMP can be enabled for every mode of execution. However, a custom invocation of `main.py` script will turn on AMP by default. To turn it off use `main.py --amp false`.
+This default parametrization is applied when running scripts from the `scripts/` directory and when running `main.py` without explicitly overriding these parameters. By default, using scripts from the `scripts/` directory will not use AMP. To enable AMP, pass the `--amp` flag. AMP can be enabled for every mode of execution. However, a custom invocation of the `main.py` script will turn on AMP by default. To turn it off, use `main.py --amp false`.
 
-The default configuration minimizes a function `L = (1 - dice_coefficient) + cross_entropy` during training and reports achieved convergence as [dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) per class. The training, with a combination of dice and cross entropy has been proven to achieve better convergence than a training using only dice.
+The default configuration minimizes the function `L = (1 - dice_coefficient) + cross_entropy` during training and reports achieved convergence as the [dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) per class. Training with a combination of dice and cross-entropy has been shown to achieve better convergence than training with dice alone.
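
The combined objective can be sketched as follows (a minimal NumPy illustration for the binary case; `dice_ce_loss` is an illustrative name, not the repository's implementation, which handles multi-class outputs and batching):

```python
import numpy as np

def dice_ce_loss(probs, target, eps=1e-6):
    """L = (1 - dice_coefficient) + cross_entropy for binary segmentation,
    with probs and target as same-shaped arrays of values in [0, 1]."""
    intersection = np.sum(probs * target)
    dice = (2.0 * intersection + eps) / (np.sum(probs) + np.sum(target) + eps)
    ce = -np.mean(target * np.log(probs + eps)
                  + (1 - target) * np.log(1 - probs + eps))
    return (1.0 - dice) + ce
```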
 
-The training can be run directly without using the predefined scripts. The name of the training script is `main.py`. For example:
+The training can be run without using the predefined scripts. The name of the training script is `main.py`. For example:
 
 ```
 python main.py --exec-mode train --task 01 --fold 0
 ```
-  
+ 
 Training artifacts will be saved to `/results` in the container. Some important artifacts are:
-* `/results/dllogger.json`: Collected dice scores and loss values evaluated after each epoch during training on validation set.
-* `/results/ckpt/`: Saved checkpoints. By default, two checkpoints are saved - one after each epoch and one with the highest validation dice (saved in the `/results/ckpt/best/` subdirectory). You can modify this behaviour by modifying `--ckpt-strategy` parameter.
+* `/results/dllogger.json`: Dice scores and loss values collected after each training epoch, evaluated on the validation set.
+* `/results/ckpt/`: Saved checkpoints. By default, two checkpoints are saved - one after each epoch and one with the highest validation dice (saved in the `/results/ckpt/best/` subdirectory). You can change this behavior by modifying the `--ckpt-strategy` parameter.
 
 To load the pretrained model, provide `--ckpt-dir <path/to/checkpoint/directory>` and use `--resume-training` if you intend to continue training.
 
-To use multi-gpu training with `main.py` script prepend the command with `horovodrun -np <ngpus>`, for example with 8 GPUs use:
+To use multi-GPU training with the `main.py` script prepend the command with `horovodrun -np <ngpus>`, for example with 8 GPUs use:
 
 ```
 horovodrun -np 8 python main.py --exec-mode train --task 01 --fold 0
 ```
-
 ### Inference process
 
 Inference can be launched by passing the `--exec-mode predict` flag. For example:
@@ -494,7 +488,7 @@ The script will then:
 * Load the checkpoint from the directory specified by the `<path/to/checkpoint/dir>` directory
 * Run inference on the preprocessed validation dataset corresponding to fold 0
 * If `--save-preds` is provided then resulting masks in the NumPy format will be saved in the `/results` directory
-                       
+ 
 ## Performance
 
 ### Benchmarking
@@ -503,7 +497,7 @@ The following section shows how to run benchmarks to measure the model performan
 
 #### Training performance benchmark
 
-To benchmark training, run `scripts/benchmark.py` script with `--mode train`:
+To benchmark training, run the `scripts/benchmark.py` script with `--mode train`:
 
 ```
 python scripts/benchmark.py --xla --mode train --gpus <ngpus> --dim {2,3} --batch-size <bsize> [--amp]
@@ -515,32 +509,32 @@ For example, to benchmark 3D U-Net training using mixed-precision on 8 GPUs with
 python scripts/benchmark.py --xla --mode train --gpus 8 --dim 3 --batch-size 2 --amp
 ```
 
-Each of these scripts will by default run warm-up for 100 iterations and then start benchmarking for another 100 steps.
+Each of these scripts will by default run a warm-up for 100 iterations and then start benchmarking for another 100 steps.
 You can adjust these settings with `--warmup-steps` and `--bench-steps` parameters.
 
-At the end of the script, a line reporting the train throughput and latency will be printed.
+At the end of the script, a line reporting the training throughput and latency will be printed.
 
 #### Inference performance benchmark
 
-To benchmark inference, run `scripts/benchmark.py` script with `--mode predict`:
+To benchmark inference, run the `scripts/benchmark.py` script with `--mode predict`:
 
 ```
 python scripts/benchmark.py --xla --mode predict --gpus <ngpus> --dim {2,3} --batch-size <bsize> [--amp]
 ```
 
-For example, to benchmark inference using mixed-precision for 3D U-Net on 1 GPU, with batch size of 4, run:
+For example, to benchmark inference using mixed-precision for 3D U-Net on 1 GPU, with a batch size of 4, run:
 
 ```
 python scripts/benchmark.py --xla --mode predict --gpus 1 --dim 3 --batch-size 4 --amp 
 ```
 
-Each of these scripts will by default run warm-up for 100 iterations and then start benchmarking for another 100 steps.
+Each of these scripts will by default run a warm-up for 100 iterations and then start benchmarking for another 100 steps.
 You can adjust these settings with `--warmup-steps` and `--bench-steps` parameters.
 
 At the end of the script, a line reporting the inference throughput and latency will be printed.
 
-*Note that this benchmark reports performance numbers for iterations over samples with a fixed patch size.
-Real inference process uses [sliding window](#glossary) for input images with arbitrary resolution and performance may vary for images with different resolutions.*
+*Note that this benchmark reports performance numbers for iterations over samples with fixed patch sizes.
+The real inference process uses a [sliding window](#glossary) over input images with arbitrary resolution, so performance may vary for images with different resolutions.*
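
The sliding-window idea can be sketched as follows (a simplified illustration that averages overlapping patches and assumes the volume is at least as large as the patch; `sliding_window_predict` is illustrative, while the repository's implementation additionally applies Gaussian importance weighting):

```python
import numpy as np

def sliding_window_predict(volume, patch, predict_fn, overlap=0.5):
    """Run predict_fn on overlapping 3D patches and average the overlaps."""
    step = [max(1, int(p * (1 - overlap))) for p in patch]
    out = np.zeros(volume.shape, dtype=np.float32)
    count = np.zeros(volume.shape, dtype=np.float32)
    starts = []
    for d, p, s in zip(volume.shape, patch, step):
        idx = list(range(0, d - p + 1, s))
        if idx[-1] != d - p:  # make the last window touch the border
            idx.append(d - p)
        starts.append(idx)
    for z in starts[0]:
        for y in starts[1]:
            for x in starts[2]:
                sl = (slice(z, z + patch[0]),
                      slice(y, y + patch[1]),
                      slice(x, x + patch[2]))
                out[sl] += predict_fn(volume[sl])
                count[sl] += 1
    return out / count
```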
 
 ### Results
 
@@ -550,29 +544,29 @@ The following sections provide details on how to achieve the same performance an
 
 ##### Training accuracy: NVIDIA DGX A100 (8xA100 80G)
 
-Our results were obtained by running the `python scripts/train.py --xla --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} --learning_rate lr [--amp]` training scripts and averaging results in the TensorFlow 22.04 NGC container on NVIDIA DGX with (8x A100 80G) GPUs.
+Our results were obtained by running the `python scripts/train.py --xla --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} --learning_rate lr [--amp] --seed n` training script and averaging results in the TensorFlow 22.11 NGC container on NVIDIA DGX A100 with (8x A100 80G) GPUs.
 
-| Dimension | GPUs | Batch size / GPU  | Accuracy - mixed precision | Accuracy - FP32 | Time to train - mixed precision | Time to train - TF32 |  Time to train speedup (TF32 to mixed precision)        
+| Dimension | GPUs | Batch size / GPU  | Dice - mixed precision | Dice - TF32 | Time to train - mixed precision | Time to train - TF32 |  Time to train speedup (TF32 to mixed precision)
 |:-:|:-:|:--:|:-----:|:-----:|:--------:|:---------:|:----:|
 | 2 | 1 | 64 | 0.7312 | 0.7302 | 29 min | 40 min | 1.38 |
 | 2 | 8 | 64 | 0.7322 | 0.7310 | 8 min | 10 min | 1.22 |
 | 3 | 1 | 2  | 0.7435 | 0.7441 | 85 min | 153 min | 1.79 |
 | 3 | 8 | 2  | 0.7440 | 0.7438 | 19 min | 33 min | 1.69 |
 
-Reported accuracy is the average Dice metric over 5 cross-validation folds.
+The reported dice score is the average over 5 folds from the best run of a grid search over learning rates {1e-4, 2e-4, ..., 9e-4} and seeds {1, 3, 5}.
 
 ##### Training accuracy: NVIDIA DGX-1 (8xV100 32G)
 
-Our results were obtained by running the `python scripts/train.py --xla --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} [--amp]` training scripts and averaging results in the TensorFlow 22.04 NGC container on NVIDIA DGX-1 with (8x V100 32G) GPUs.
+Our results were obtained by running the `python scripts/train.py --xla --gpus {1,8} --fold {0,1,2,3,4} --dim {2,3} [--amp] --seed n` training script and averaging results in the TensorFlow 22.11 NGC container on NVIDIA DGX-1 with (8x V100 32G) GPUs.
 
-| Dimension | GPUs | Batch size / GPU | Accuracy - mixed precision |  Accuracy - FP32 |  Time to train - mixed precision | Time to train - FP32 | Time to train speedup (FP32 to mixed precision)        
+| Dimension | GPUs | Batch size / GPU | Dice - mixed precision |  Dice - FP32 |  Time to train - mixed precision | Time to train - FP32 | Time to train speedup (FP32 to mixed precision)
 |:-:|:-:|:--:|:-----:|:-----:|:---------:|:---------:|:----:|
 | 2 | 1 | 64 | 0.7315 | 0.7311 | 52 min | 102 min | 1.96 |
 | 2 | 8 | 64 | 0.7312 | 0.7316 | 12 min | 17 min | 1.41 |
 | 3 | 1 | 2  | 0.7435 | 0.7441 | 181 min | 580 min | 3.20 |
 | 3 | 8 | 2  | 0.7434 | 0.7440 | 35 min | 131 min | 3.74 |
 
-Reported accuracy is the average Dice metric over 5 cross-validation folds.
+The reported dice score is the average over 5 folds from the best run of a grid search over learning rates {1e-4, 2e-4, ..., 9e-4} and seeds {1, 3, 5}.
 
 #### Training performance results
 
@@ -580,41 +574,43 @@ Reported accuracy is the average Dice metric over 5 cross-validation folds.
 
 Our results were obtained by running the `python scripts/benchmark.py --xla --mode train --gpus {1,8} --dim {2,3} --batch-size <bsize> [--amp]` training script in the NGC container on NVIDIA DGX A100 (8x A100 80G) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.
 
+Note: We recommend using the `--bind` flag for multi-GPU settings to increase the throughput. To launch multi-GPU training with `--bind` you will have to add `--horovod`, e.g., `python scripts/benchmark.py --xla --mode train --gpus 8 --dim 3 --amp --batch-size 2 --bind --horovod` for an interactive session, or use the regular command when launching with SLURM's sbatch.
+
 | Dimension | GPUs | Batch size / GPU  | Throughput - mixed precision [img/s] | Throughput - TF32 [img/s] | Throughput speedup (TF32 - mixed precision) | Weak scaling - mixed precision | Weak scaling - TF32 |
 |:-:|:-:|:--:|:------:|:------:|:-----:|:-----:|:-----:|
-| 2 | 1 | 32 | 1209.24 | 742.87 | 1.62 | N/A | N/A |
-| 2 | 1 | 64 | 1663.73 | 840.18 | 1.98 | N/A | N/A |
-| 2 | 1 | 128 | 1835.35 | 897.52 | 2.04 | N/A | N/A |
-| 2 | 8 | 32 | 6873.68 | 4868.65 | 1.41 | 5.68 | 6.55 |
-| 2 | 8 | 64 | 11182.54 | 5824.52 | 1.91 | 6.72 | 6.93 |
-| 2 | 8 | 128 | 12918.97 | 6648.57 | 1.94 | 7.03 | 7.40 |
-| 3 | 1 | 1 | 21.63 | 11.80 | 1.83 | N/A | N/A |
-| 3 | 1 | 2 | 23.63 | 12.40 | 1.90 | N/A | N/A |
-| 3 | 1 | 4 | 25.02 | 12.63 | 1.98 | N/A | N/A |
-| 3 | 8 | 1 | 130.39| 88.19 | 1.47 | 6.02 | 7.47 |
-| 3 | 8 | 2 | 170.12| 92.35 | 1.84 | 7.19 | 7.44 |
-| 3 | 8 | 4 | 186.54| 95.98 | 1.94 | 7.45 | 7.59 |
+| 2 | 1 | 32 | 1347.19 | 748.56 | 1.80 | - | - |
+| 2 | 1 | 64 | 1662.8 | 804.23 | 2.07 | - | - |
+| 2 | 1 | 128 | 1844.7 | 881.87 | 2.09 | - | - |
+| 2 | 8 | 32 | 9056.45 | 5420.51 | 1.67 | 6.72 | 6.91 | 
+| 2 | 8 | 64 | 11687.11 | 6250.52 | 1.87 | 7.03 | 7.49 |
+| 2 | 8 | 128 | 13679.76 | 6841.78 | 2.00 | 7.42 | 7.66 |
+| 3 | 1 | 1 | 27.02 | 11.63 | 2.32 | - | - |
+| 3 | 1 | 2 | 29.3 | 11.81 | 2.48 | - | - |
+| 3 | 1 | 4 | 31.87 | 12.17 | 2.62 | - | - |
+| 3 | 8 | 1 | 186.84 | 91.11 | 2.05 | 7.24 | 7.83 | 
+| 3 | 8 | 2 | 219.34 | 92.91 | 2.36 | 7.77 | 7.87 | 
+| 3 | 8 | 4 | 244.01 | 96.52 | 2.53 | 7.76 | 7.93 | 
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
 ##### Training performance: NVIDIA DGX-1 (8xV100 32G)
 
-Our results were obtained by running the `python scripts/benchmark.py --xla --mode train --gpus {1,8} --dim {2,3} --batch-size <bsize> [--amp]` training script in the TensorFlow 22.04 NGC container on NVIDIA DGX-1 with (8x V100 32G) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.
+Our results were obtained by running the `python scripts/benchmark.py --xla --mode train --gpus {1,8} --dim {2,3} --batch-size <bsize> [--amp]` training script in the TensorFlow 22.11 NGC container on NVIDIA DGX-1 with (8x V100 32G) GPUs. Performance numbers (in volumes per second) were averaged over an entire training epoch.
+
+Note: We recommend using the `--bind` flag for multi-GPU settings to increase the throughput. To launch multi-GPU training with `--bind` you will have to add `--horovod`, e.g., `python scripts/benchmark.py --xla --mode train --gpus 8 --dim 3 --amp --batch-size 2 --bind --horovod` for an interactive session, or use the regular command when launching with SLURM's sbatch.
 
 | Dimension | GPUs | Batch size / GPU | Throughput - mixed precision [img/s] | Throughput - FP32 [img/s] | Throughput speedup (FP32 - mixed precision) | Weak scaling - mixed precision | Weak scaling - FP32 |
 |:-:|:-:|:---:|:---------:|:-----------:|:--------:|:---------:|:-------------:|
-| 2 | 1 | 32 | 694.92 | 306.74 | 2.65 | N/A | N/A |
-| 2 | 1 | 64 | 817.92 | 328.62 | 2.48 | N/A | N/A |
-| 2 | 1 | 128 | 894.17 | 342.20 | 2.61 | N/A | N/A |
-| 2 | 8 | 32 | 4147.77 | 2225.46 | 1.86 | 5.96 | 7.25 |
-| 2 | 8 | 64 | 5444.31 | 2514.72 | 2.16 | 6.65 | 7.65 |
-| 2 | 8 | 128 | 6420.67 | 2692.59 | 2.38 | 7.18 | 7.86 |
-| 3 | 1 | 1 | 9.92 | 2.04 | 4.86 | N/A | N/A |
-| 3 | 1 | 2 | 10.78 | 2.10 | 5.13 | N/A | N/A |
-| 3 | 1 | 4 | 11.26 | 2.29 | 4.91 | N/A | N/A |
-| 3 | 8 | 1 | 69.28 | 15.72 | 4.40 | 6.98 | 7.70 |
-| 3 | 8 | 2 | 78.37 | 16.17 | 4.84 | 7.26 | 7.70 |
-| 3 | 8 | 4 | 85.34 | 17.48 | 4.88 | 7.57 | 7.63 |
+|	2	|	1	|	32	|	697.36	|	312.51	|	2.23	|	-	|	-	|
+|	2	|	1	|	64	|	819.15	|	337.42	|	2.43	|	-	|	-	|
+|	2	|	1	|	128	|	894.94	|	352.32	|	2.54	|	-	|	-	|
+|	2	|	8	|	32	|	4355.65	|	2260.37	|	1.93	|	6.25	|	7.23	|
+|	2	|	8	|	64	|	5696.41	|	2585.65	|	2.20	|	6.95	|	7.66	|
+|	2	|	8	|	128	|	6714.96	|	2779.25	|	2.42	|	7.50	|	7.89	|
+|	3	|	1	|	1	|	12.15	|	2.08	|	5.84	|	-	|	-	|
+|	3	|	1	|	2	|	13.13	|	2.5	|	5.25	|	-	|	-	|
+|	3	|	8	|	1	|	82.62	|	16.59	|	4.98	|	6.80	|	7.98	|
+|	3	|	8	|	2	|	97.68	|	19.91	|	4.91	|	7.44	|	7.96	|
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
@@ -622,58 +618,58 @@ To achieve these same results, follow the steps in the [Quick Start Guide](#quic
 
 ##### Inference performance: NVIDIA DGX A100 (1xA100 80G)
 
-Our results were obtained by running the `python scripts/benchmark.py --xla --mode predict --dim {2,3} --batch-size <bsize> [--amp]` inferencing benchmarking script in the TensorFlow 22.04 NGC container on NVIDIA DGX A100 (1x A100 80G) GPU.
+Our results were obtained by running the `python scripts/benchmark.py --xla --mode predict --dim {2,3} --batch-size <bsize> [--amp]` inferencing benchmarking script in the TensorFlow 22.11 NGC container on NVIDIA DGX A100 (1x A100 80G) GPU.
 
 FP16
 
-| Dimension | Batch size |  Resolution  | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
+| Dimension | Batch size |Resolution| Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
 |:----------:|:---------:|:-------------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
-| 2 | 32 | 192x160 | 1839.95 | 17.49 | 19.05 | 19.62 | 21.42
-| 2 | 64 | 192x160 | 2936.84 | 22.14 | 25.14 | 26.28 | 35.78
-| 2 | 128 | 192x160 | 3746.55 | 34.54 | 38.18 | 38.44 | 39.23
-| 3 | 1 | 128x128x128 | 56.93 | 17.56 | 17.79 | 17.88 | 17.95
-| 3 | 2 | 128x128x128 | 50.99 | 39.28 | 39.77 | 41.20 | 46.16
-| 3 | 4 | 128x128x128 | 54.44 | 73.51 | 73.78 | 74.00 | 78.65
+|	2	|	32	|	192x160	|	1728.03	|	18.52	| 22.55 | 23.18 | 24.82 |
+|	2	|	64	|	192x160	|	4160.91	|	15.38	| 17.49 | 18.53 | 19.88 |
+|	2	|	128	|	192x160	|	4672.52	|	27.39	| 27.68 | 27.79 | 27.87 |
+|	3	|	1	|	128x128x128	|	78.2	|	12.79	| 14.29 | 14.87 | 15.25 |
+|	3	|	2	|	128x128x128	|	63.76	|	31.37	| 36.07 | 40.02 | 42.44 |
+|	3	|	4	|	128x128x128	|	83.17	|	48.1	| 50.96 | 52.08 | 52.56 |
 
 TF32
 
-| Dimension | Batch size |  Resolution  | Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
+| Dimension | Batch size |Resolution| Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
 |:----------:|:---------:|:-------------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
-| 2 | 32 | 192x160 | 1731.85 | 18.65 | 21.27 | 21.87 | 22.42
-| 2 | 64 | 192x160 | 2260.40 | 28.31 | 28.73 | 28.85 | 29.01
-| 2 | 128 | 192x160 | 1875.44 | 68.29 | 68.60 | 69.21 | 76.29
-| 3 | 1 | 128x128x128 | 25.82 | 38.72 | 39.13 | 39.22 | 39.40
-| 3 | 2 | 128x128x128 | 28.42 | 70.37 | 70.95 | 71.08 | 71.30
-| 3 | 4 | 128x128x128 | 28.22 | 141.80 | 142.61 | 143.45 | 147.79
+|	2	|	32	|	192x160	|	2067.63	|	15.48	| 17.97 | 19.12 | 19.77 |
+|	2	|	64	|	192x160	|	2447	|	26.15	| 26.43 | 26.48 | 26.62 |
+|	2	|	128	|	192x160	|	2514.75	|	50.9	| 51.15 | 51.23 | 51.28 |
+|	3	|	1	|	128x128x128	|	38.85	|	25.74	| 26.04 | 26.19 | 27.41 |
+|	3	|	2	|	128x128x128	|	40.1	|	49.87	| 50.31 | 50.44 | 50.57 |
+|	3	|	4	|	128x128x128	|	41.69	|	95.95	| 97.09 | 97.41 | 98.03 |
 
 Throughput is reported in images per second. Latency is reported in milliseconds per batch.
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
 ##### Inference performance: NVIDIA DGX-1 (1xV100 32G)
 
-Our results were obtained by running the `python scripts/benchmark.py --mode predict --dim {2,3} --batch-size <bsize> [--amp]` inferencing benchmarking script in the TensorFlow 22.04 NGC container on NVIDIA DGX-1 with (1x V100 32G) GPU.
+Our results were obtained by running the `python scripts/benchmark.py --mode predict --dim {2,3} --batch-size <bsize> [--amp]` inferencing benchmarking script in the TensorFlow 22.11 NGC container on NVIDIA DGX-1 with (1x V100 32G) GPU.
 
 FP16
  
 | Dimension | Batch size |Resolution| Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
 |:----------:|:---------:|:-------------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
-| 2 | 32 | 192x160 | 1011.67 | 51.33 | 33.38 | 46.81 | 876.90 |
-| 2 | 64 | 192x160 | 1922.81 | 33.36 | 35.28 | 36.07 | 38.10 |
-| 2 | 128 | 192x160 | 2208.47 | 57.96 | 58.39 | 58.49 | 58.63 |
-| 3 | 1 | 128x128x128 | 27.20 | 36.77 | 37.44 | 37.57 | 37.92 |
-| 3 | 2 | 128x128x128 | 23.72 | 84.31 | 85.35 | 85.93 | 88.71 |
-| 3 | 4 | 128x128x128 | 24.47 | 163.47 | 166.34 | 167.72 | 171.77 |
+|	2	|	32	|	192x160	|	1166.83	|	27.42	| 28.76 | 28.91 | 29.16 |
+|	2	|	64	|	192x160	|	2263.21	|	28.28	| 30.63 | 31.83 | 32.5 |
+|	2	|	128	|	192x160	|	2387.06	|	53.62	| 53.97 | 54.07 | 54.3 |
+|	3	|	1	|	128x128x128	|	36.87	|	27.12	| 27.32 | 27.37 | 27.42 |
+|	3	|	2	|	128x128x128	|	37.65	|	53.12	| 53.49 | 53.59 | 53.71 |
+|	3	|	4	|	128x128x128	|	38.8	|	103.11	| 104.16 | 104.3 | 104.75 |
 
 FP32
 
 | Dimension | Batch size |Resolution| Throughput Avg [img/s] | Latency Avg [ms] | Latency 90% [ms] | Latency 95% [ms] | Latency 99% [ms] |
 |:----------:|:---------:|:-------------:|:----------------------:|:----------------:|:----------------:|:----------------:|:----------------:|
-| 2 | 32 | 192x160 | 851.19 | 37.60 | 38.33 | 38.50 | 38.89 |
-| 2 | 64 | 192x160 | 945.85 | 67.66 | 67.98 | 68.12 | 68.59 |
-| 2 | 128 | 192x160 | 784.73 | 163.12 | 164.78 | 165.75 | 169.12 |
-| 3 | 1 | 128x128x128 | 8.04 | 124.34 | 126.36 | 127.25 | 129.30 |
-| 3 | 2 | 128x128x128 | 8.25 | 242.38 | 245.23 | 246.25 | 249.75 |
-| 3 | 4 | 128x128x128 | 8.38 | 476.83 | 479.18 | 481.76 | 489.71 |
+|	2	|	32	|	192x160	|	990.61	|	32.3	| 32.46 | 32.51 | 32.78 |
+|	2	|	64	|	192x160	|	1034.22	|	61.88	| 62.19 | 62.32 | 62.56 |
+|	2	|	128	|	192x160	|	1084.21	|	118.06	| 118.45 | 118.6 | 118.95 |
+|	3	|	1	|	128x128x128	|	9.65	|	103.62	| 104.46 | 104.52 | 104.63 |
+|	3	|	2	|	128x128x128	|	9.96	|	200.75	| 202.51 | 202.74 | 202.86 |
+|	3	|	4	|	128x128x128	|	10.13	|	394.74	| 396.74 | 397.0 | 397.82 |
 
 Throughput is reported in images per second. Latency is reported in milliseconds per batch.
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
@@ -686,5 +682,10 @@ There are no known issues in this release.
 
 ### Changelog
 
+November 2022
+- Container update to 22.11
+- Use channel last layout for convolution with XLA
+- Add support for GPU binding
+
 May 2022
 - Initial release

+ 18 - 74
TensorFlow2/Segmentation/nnUNet/data_loading/dali_loader.py

@@ -17,7 +17,6 @@ import itertools
 import horovod.tensorflow as hvd
 import numpy as np
 import nvidia.dali.fn as fn
-import nvidia.dali.math as math
 import nvidia.dali.ops as ops
 import nvidia.dali.plugin.tf as dali_tf
 import nvidia.dali.types as types
@@ -57,7 +56,6 @@ class GenericPipeline(Pipeline):
         shuffle_input=True,
         input_x_files=None,
         input_y_files=None,
-        use_cpu=False,
     ):
         super().__init__(
             batch_size=batch_size,
@@ -85,19 +83,12 @@ class GenericPipeline(Pipeline):
 
         self.dim = dim
         self.internal_seed = seed
-        self.use_cpu = use_cpu
-
-    def mark_pipeline_start(self, x, y):
-        if not self.use_cpu:
-            x, y = x.gpu(), y.gpu()
-        return x, y
 
 
 class TrainPipeline(GenericPipeline):
-    def __init__(self, imgs, lbls, oversampling, patch_size, read_roi=False, batch_size_2d=None, **kwargs):
+    def __init__(self, imgs, lbls, oversampling, patch_size, batch_size_2d=None, **kwargs):
         super().__init__(input_x_files=imgs, input_y_files=lbls, shuffle_input=True, **kwargs)
         self.oversampling = oversampling
-        self.read_roi = read_roi
         self.patch_size = patch_size
         if self.dim == 2 and batch_size_2d is not None:
             self.patch_size = [batch_size_2d] + self.patch_size
@@ -129,7 +120,7 @@ class TrainPipeline(GenericPipeline):
             roi_end=roi_end,
             crop_shape=[*self.patch_size, 1],
         )
-        anchor = fn.slice(anchor, 0, 3, axes=[0])  # drop channel from anchor
+        anchor = fn.slice(anchor, 0, 3, axes=[0])
         img, lbl = fn.slice(
             [img, lbl],
             anchor,
@@ -138,40 +129,7 @@ class TrainPipeline(GenericPipeline):
             out_of_bounds_policy="pad",
             device="cpu",
         )
-
-        return img.gpu(), lbl.gpu()
-
-    def load_roi(self):
-        lbl = self.input_y(name="ReaderY")
-        lbl = fn.reshape(lbl, layout="DHWC")
-        roi_start, roi_end = fn.segmentation.random_object_bbox(
-            lbl,
-            format="start_end",
-            foreground_prob=self.oversampling,
-            k_largest=2,
-            device="cpu",
-            cache_objects=True,
-        )
-        anchor = fn.roi_random_crop(lbl, roi_start=roi_start, roi_end=roi_end, crop_shape=[1, *self.patch_size])
-        anchor = fn.slice(anchor, 1, 3, axes=[0])  # drop channel from anchor
-        lbl = fn.slice(
-            lbl,
-            anchor,
-            self.crop_shape,
-            axis_names="DHW",
-            out_of_bounds_policy="pad",
-            device="cpu",
-        )
-
-        img = self.input_x(
-            name="ReaderX",
-            roi_start=fn.cast(anchor, dtype=types.INT32),
-            roi_axes=[1, 2, 3],
-            roi_shape=self.patch_size,
-            out_of_bounds_policy="pad",
-        )
-        img = fn.reshape(img, layout="DHWC")
-
+        img, lbl = img.gpu(), lbl.gpu()
         return img, lbl
 
     def zoom_fn(self, img, lbl):
@@ -189,22 +147,18 @@ class TrainPipeline(GenericPipeline):
         return img, lbl
 
     def noise_fn(self, img):
-        img_noised = img + fn.random.normal(img, stddev=fn.random.uniform(range=(0.0, 0.33)))
+        img_noised = fn.noise.gaussian(img, stddev=fn.random.uniform(range=(0.0, 0.3)))
         return random_augmentation(0.15, img_noised, img)
 
     def blur_fn(self, img):
         img_blurred = fn.gaussian_blur(img, sigma=fn.random.uniform(range=(0.5, 1.5)))
         return random_augmentation(0.15, img_blurred, img)
 
-    def brightness_fn(self, img):
-        brightness_scale = random_augmentation(0.15, fn.random.uniform(range=(0.7, 1.3)), 1.0)
-        return img * brightness_scale
-
-    def contrast_fn(self, img):
-        min_, max_ = fn.reductions.min(img), fn.reductions.max(img)
-        scale = random_augmentation(0.15, fn.random.uniform(range=(0.65, 1.5)), 1.0)
-        img = math.clamp(img * scale, min_, max_)
-        return img
+    def brightness_contrast_fn(self, img):
+        img_transformed = fn.brightness_contrast(
+            img, brightness=fn.random.uniform(range=(0.7, 1.3)), contrast=fn.random.uniform(range=(0.65, 1.5))
+        )
+        return random_augmentation(0.15, img_transformed, img)
 
     def flips_fn(self, img, lbl):
         kwargs = {
@@ -216,16 +170,13 @@ class TrainPipeline(GenericPipeline):
         return fn.flip(img, **kwargs), fn.flip(lbl, **kwargs)
 
     def define_graph(self):
-        if self.read_roi:
-            img, lbl = self.load_roi()
-        else:
-            img, lbl = self.load_data()
-            img, lbl = self.biased_crop_fn(img, lbl)
-        img, lbl = img.gpu(), lbl.gpu()
+        img, lbl = self.load_data()
+        img, lbl = self.biased_crop_fn(img, lbl)
         img, lbl = self.zoom_fn(img, lbl)
         img, lbl = self.flips_fn(img, lbl)
-        img = self.brightness_fn(img)
-        img = self.contrast_fn(img)
+        img = self.noise_fn(img)
+        img = self.blur_fn(img)
+        img = self.brightness_contrast_fn(img)
         return img, lbl
 
 
@@ -251,12 +202,11 @@ class TestPipeline(GenericPipeline):
 
 
 class BenchmarkPipeline(GenericPipeline):
-    def __init__(self, imgs, lbls, patch_size, batch_size_2d=None, sw_benchmark=False, **kwargs):
+    def __init__(self, imgs, lbls, patch_size, batch_size_2d=None, **kwargs):
         super().__init__(input_x_files=imgs, input_y_files=lbls, shuffle_input=False, **kwargs)
         self.patch_size = patch_size
         if self.dim == 2 and batch_size_2d is not None:
             self.patch_size = [batch_size_2d] + self.patch_size
-        self.crop = not sw_benchmark
 
     def crop_fn(self, img, lbl):
         img = fn.crop(img, crop=self.patch_size, out_of_bounds_policy="pad")
@@ -265,9 +215,8 @@ class BenchmarkPipeline(GenericPipeline):
 
     def define_graph(self):
         img, lbl = self.input_x(name="ReaderX").gpu(), self.input_y(name="ReaderY").gpu()
+        img, lbl = self.crop_fn(img, lbl)
         img, lbl = fn.reshape(img, layout="DHWC"), fn.reshape(lbl, layout="DHWC")
-        if self.crop:
-            img, lbl = self.crop_fn(img, lbl)
         return img, lbl
 
 
@@ -293,7 +242,6 @@ def fetch_dali_loader(imgs, lbls, batch_size, mode, **kwargs):
         "batch_size": batch_size,
         "num_threads": kwargs["num_workers"],
         "shard_id": device_id,
-        "use_cpu": kwargs["use_cpu"],
     }
     if kwargs["dim"] == 2:
         if kwargs["benchmark"]:
@@ -308,13 +256,9 @@ def fetch_dali_loader(imgs, lbls, batch_size, mode, **kwargs):
 
     output_dtypes = (tf.float32, tf.uint8)
     if kwargs["benchmark"]:
-        pipeline = BenchmarkPipeline(
-            imgs, lbls, kwargs["patch_size"], sw_benchmark=kwargs["sw_benchmark"], **pipe_kwargs
-        )
+        pipeline = BenchmarkPipeline(imgs, lbls, kwargs["patch_size"], **pipe_kwargs)
     elif mode == "train":
-        pipeline = TrainPipeline(
-            imgs, lbls, kwargs["oversampling"], kwargs["patch_size"], kwargs["read_roi"], **pipe_kwargs
-        )
+        pipeline = TrainPipeline(imgs, lbls, kwargs["oversampling"], kwargs["patch_size"], **pipe_kwargs)
     elif mode == "eval":
         pipeline = EvalPipeline(imgs, lbls, kwargs["patch_size"], **pipe_kwargs)
     else:

+ 0 - 3
TensorFlow2/Segmentation/nnUNet/data_loading/data_module.py

@@ -44,9 +44,6 @@ class DataModule:
             "nvol": self.args.nvol,
             "bench_steps": self.args.bench_steps,
             "meta": load_data(self.data_path, "*_meta.npy"),
-            "read_roi": self.args.read_roi,
-            "use_cpu": self.args.dali_use_cpu,
-            "sw_benchmark": self.args.sw_benchmark,
         }
 
     def setup(self, stage=None):

+ 0 - 14
TensorFlow2/Segmentation/nnUNet/main.py

@@ -12,9 +12,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import ctypes
-import os
-
 from data_loading.data_module import DataModule
 from models.nn_unet import NNUnet
 from runtime.args import get_main_args
@@ -25,17 +22,6 @@ from runtime.utils import hvd_init, set_seed, set_tf_flags
 
 
 def main(args):
-    os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
-    os.environ["TF_GPU_THREAD_COUNT"] = "1"
-
-    _libcudart = ctypes.CDLL("libcudart.so")
-    # Set device limit on the current device
-    # cudaLimitMaxL2FetchGranularity = 0x05
-    pValue = ctypes.cast((ctypes.c_int * 1)(), ctypes.POINTER(ctypes.c_int))
-    _libcudart.cudaDeviceSetLimit(ctypes.c_int(0x05), ctypes.c_int(128))
-    _libcudart.cudaDeviceGetLimit(pValue, ctypes.c_int(0x05))
-    assert pValue.contents.value == 128
-
     hvd_init()
     if args.seed is not None:
         set_seed(args.seed)

+ 4 - 1
TensorFlow2/Segmentation/nnUNet/models/layers.py

@@ -12,6 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import nv_norms
 import tensorflow as tf
 import tensorflow_addons as tfa
 
@@ -26,7 +27,7 @@ convolutions = {
 class KaimingNormal(tf.keras.initializers.VarianceScaling):
     def __init__(self, negative_slope, seed=None):
         super().__init__(
-            scale=2.0 / (1 + negative_slope ** 2), mode="fan_in", distribution="untruncated_normal", seed=seed
+            scale=2.0 / (1 + negative_slope**2), mode="fan_in", distribution="untruncated_normal", seed=seed
         )
 
     def get_config(self):
@@ -38,6 +39,8 @@ def get_norm(name):
         return tfa.layers.GroupNormalization(32, axis=-1, center=True, scale=True)
     elif "batch" in name:
         return tf.keras.layers.BatchNormalization(axis=-1, center=True, scale=True)
+    elif "atex_instance" in name:
+        return nv_norms.InstanceNormalization(axis=-1)
     elif "instance" in name:
         return tfa.layers.InstanceNormalization(axis=-1, center=True, scale=True)
     elif "none" in name:

+ 24 - 7
TensorFlow2/Segmentation/nnUNet/models/nn_unet.py

@@ -19,7 +19,7 @@ import tensorflow as tf
 from runtime.utils import get_config_file, get_tta_flips, is_main_process
 from skimage.transform import resize
 
-from models.sliding_window import sliding_window_inference
+from models.sliding_window import get_importance_kernel, sliding_window_inference
 from models.unet import UNet
 
 
@@ -41,6 +41,8 @@ class NNUnet(tf.keras.Model):
 
             self.model = wrapped_model
         else:
+            if not self.args.xla and self.args.norm == "instance":
+                self.args.norm = "atex_instance"
             self.model = UNet(
                 input_shape=input_shape,
                 n_class=n_class,
@@ -54,11 +56,28 @@ class NNUnet(tf.keras.Model):
             if is_main_process():
                 print(f"Filters: {self.model.filters},\nKernels: {kernels}\nStrides: {strides}")
         self.tta_flips = get_tta_flips(self.args.dim)
+        if self.args.dim == 3:
+            self.predictor = self.sw_inference
+        elif self.args.benchmark:
+            self.predictor = self.call
+        else:
+            self.predictor = self.call_2d
+
+        if args.dim == 3:
+            importance_kernel = get_importance_kernel(self.patch_size, args.blend_mode, 0.125)
+            self.importance_map = tf.tile(
+                tf.reshape(importance_kernel, shape=[1, *self.patch_size, 1]),
+                multiples=[1, 1, 1, 1, n_class],
+            )
 
-    @tf.function(experimental_relax_shapes=True)
+    @tf.function
     def call(self, *args, **kwargs):
         return self.model(*args, **kwargs)
 
+    @tf.function(reduce_retracing=True)
+    def call_2d(self, *args, **kwargs):
+        return self.model(*args, **kwargs)
+
     @tf.function
     def compute_loss(self, loss_fn, label, preds):
         if self.args.deep_supervision:
@@ -77,21 +96,19 @@ class NNUnet(tf.keras.Model):
         return sliding_window_inference(
             inputs=img,
             roi_size=self.patch_size,
-            sw_batch_size=self.args.sw_batch_size,
             model=self.model,
             overlap=self.args.overlap,
             n_class=self.n_class,
-            blend_mode=self.args.blend_mode,
+            importance_map=self.importance_map,
             **kwargs,
         )
 
     def inference(self, img):
-        predictor = self.call if self.args.dim == 2 else self.sw_inference
-        pred = predictor(img, training=False)
+        pred = self.predictor(img, training=False)
         if self.args.tta:
             for flip_axes in self.tta_flips:
                 flipped_img = tf.reverse(img, axis=flip_axes)
-                flipped_pred = predictor(flipped_img, training=False)
+                flipped_pred = self.predictor(flipped_img, training=False)
                 pred = pred + tf.reverse(flipped_pred, axis=flip_axes)
             pred = pred / (len(self.tta_flips) + 1)
         return pred
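The refactored `inference` above selects `self.predictor` once in `__init__` and reuses it for every test-time-augmentation flip. The flip-and-average scheme itself is easy to illustrate standalone; this is a minimal NumPy sketch (the helper name `tta_average` and its signature are illustrative, not the repository's TF implementation):

```python
import numpy as np

def tta_average(img, predict, flips):
    """Average a prediction over the identity view and each flipped view.

    `flips` lists spatial axes, e.g. [[1], [2], [1, 2]] for an NHWC tensor.
    Each flipped prediction is flipped back before accumulation, mirroring
    the tf.reverse calls in NNUnet.inference.
    """
    pred = predict(img)
    for axes in flips:
        flipped_pred = predict(np.flip(img, axis=axes))
        pred = pred + np.flip(flipped_pred, axis=axes)
    return pred / (len(flips) + 1)
```

With a flip-equivariant model (e.g. the identity), the averaged output equals the plain prediction, which makes the scheme straightforward to sanity-check.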

+ 28 - 127
TensorFlow2/Segmentation/nnUNet/models/sliding_window.py

@@ -5,26 +5,16 @@ import tensorflow as tf
 from scipy import signal
 
 
-def get_window_slices(image_size, roi_size, intervals, strategy):
+def get_window_slices(image_size, roi_size, overlap, strategy):
     dim_starts = []
-    for image_x, roi_x, interval in zip(image_size, roi_size, intervals):
+    for image_x, roi_x in zip(image_size, roi_size):
+        interval = roi_x if roi_x == image_x else int(roi_x * (1 - overlap))
         starts = list(range(0, image_x - roi_x + 1, interval))
         if strategy == "overlap_inside" and starts[-1] + roi_x < image_x:
             starts.append(image_x - roi_x)
         dim_starts.append(starts)
     slices = [(starts + (0,), roi_size + (-1,)) for starts in itertools.product(*dim_starts)]
-    return slices
-
-
-def batch_window_slices(slices, image_batch_size, batch_size):
-    batched_window_slices = []
-    for batch_start in range(0, image_batch_size, batch_size):
-        batched_window_slices.extend(
-            [
-                ((batch_start,) + start, (min(batch_size, image_batch_size - batch_start),) + roi_size)
-                for start, roi_size in slices
-            ]
-        )
+    batched_window_slices = [((0,) + start, (1,) + roi_size) for start, roi_size in slices]
     return batched_window_slices
 
 
@@ -50,134 +40,45 @@ def get_importance_kernel(roi_size, blend_mode, sigma):
         raise ValueError(f'Invalid blend mode: {blend_mode}. Use either "constant" or "gaussian".')
 
 
-@tf.function(experimental_relax_shapes=True)
-def run_model(model, windows, importance_map, sw_batch_size, **kwargs):
-    windows_merged = tf.reshape(windows, shape=(-1, *windows.shape[2:]))
-    preds = tf.cast(model(windows_merged, **kwargs), dtype=tf.float32) * importance_map
-    return tf.reshape(preds, shape=(sw_batch_size, -1, *preds.shape[1:]))
+@tf.function
+def run_model(x, model, importance_map, **kwargs):
+    return tf.cast(model(x, **kwargs), dtype=tf.float32) * importance_map
 
 
 def sliding_window_inference(
     inputs,
     roi_size,
-    sw_batch_size,
     model,
     overlap,
     n_class,
-    blend_mode="gaussian",
-    sigma=0.125,
+    importance_map,
     strategy="overlap_inside",
     **kwargs,
 ):
-    """
-    Sliding window inference based on implementation by monai library:
-    https://docs.monai.io/en/latest/_modules/monai/inferers/utils.html#sliding_window_inference
-
-    Args:
-        inputs: tf.Tensor to process; should have batch dimension and be
-            in channels-last format, therefore assuming NHWDC or NHWC format.
-            Currently batch dimension MUST have size equal to 1 for NHWDC layout.
-        roi_size: region-of-interest size i.e. the sliding window shape
-        sw_batch_size: batch size for the stacked windows
-        overlap: [0.0, 1.0] float, ratio of overlapping windows in one dimension.
-            Can be equal to 1, then a stride 1 will be used.
-        blend_mode: how to blend overlapping windows. Possible values {"constant", "gaussian"}.
-        n_class: number of output channels.
-        sigma: standard deviation for the gaussian blending kernel.
-        strategy: strategy for dealing with unaligned edge window. Possible values:
-            "pad" for padding the input image with zeroes to match the size or
-            "overlap_inside" for reducing the length of the last stride.
-        kwargs: additional parameters for the model call.
-
-    Returns: Inferred tf.Tensor.
-    """
-
-    dim = int(tf.rank(inputs)) - 2
-    batch_size = inputs.shape[0]
-    assert dim in [2, 3], "Only 2D and 3D data are supported"
-    assert dim != 3 or batch_size == 1, "Batch size of the 3D input has to be equal to one"
-    assert len(roi_size) == dim, "Dimensionality of ROI size used by sliding window does not match the input dim"
-
-    input_spatial_shape = list(inputs.shape[1:-1])
-
+    image_size = tuple(inputs.shape[1:-1])
     roi_size = tuple(roi_size)
-    image_size = tuple(max(input_spatial_shape[i], roi_size[i]) for i in range(dim))
-    padding_size = [image_x - input_x for image_x, input_x in zip(image_size, input_spatial_shape)]
-
-    intervals = []
-    for i in range(dim):
-        if roi_size[i] == image_size[i]:
-            intervals.append(int(roi_size[i]))
-        else:
-            interval = int(roi_size[i] * (1 - overlap))
-            intervals.append(interval if interval > 0 else 1)
-
-    if strategy == "pad":
-        for i, (image_x, roi_x, interval) in enumerate(zip(image_size, roi_size, intervals)):
-            if image_x % interval != roi_x % interval:
-                padding_size[i] += interval - (image_x - roi_x) % interval
+    # Padding to make sure that the image size is at least roi size
+    padded_image_size = tuple(max(image_size[i], roi_size[i]) for i in range(3))
+    padding_size = [image_x - input_x for image_x, input_x in zip(image_size, padded_image_size)]
     paddings = [[0, 0]] + [[x // 2, x - x // 2] for x in padding_size] + [[0, 0]]
     input_padded = tf.pad(inputs, paddings)
-    image_size = list(input_padded.shape[1:-1])
-
-    importance_kernel = get_importance_kernel(roi_size, blend_mode, sigma=sigma)
-    output_shape = (batch_size,) + tuple(image_size) + (n_class,)
-    importance_map = tf.tile(
-        tf.reshape(importance_kernel, shape=[1, *roi_size, 1]),
-        multiples=[sw_batch_size] + [1] * dim + [output_shape[-1]],
-    )
-    output_sum = tf.zeros(output_shape, dtype=tf.float32)
-    output_weight_sum = tf.zeros(output_shape, dtype=tf.float32)
 
-    window_slices = get_window_slices(image_size, roi_size, intervals, strategy)
-    if dim == 3:
-        window_slices = batch_window_slices(window_slices, batch_size, 1)
-    else:
-        window_slices = batch_window_slices(window_slices, batch_size, sw_batch_size)
-        sw_batch_size = 1
-
-    for window_group_start in range(0, len(window_slices), sw_batch_size):
-        slice_group = window_slices[window_group_start : window_group_start + sw_batch_size]
-        windows = tf.stack([tf.slice(input_padded, begin=begin, size=size) for begin, size in slice_group])
-        importance_map_part = importance_map[: windows.shape[0] * windows.shape[1]]
-        preds = run_model(model, windows, importance_map_part, windows.shape[0], **kwargs)
-        preds = tf.unstack(preds, axis=0)
-        for s, pred in zip(slice_group, preds):
-            padding = [[start, output_size - (start + size)] for start, size, output_size in zip(*s, output_shape)]
-            padding = padding[:-1] + [[0, 0]]
-            output_sum = output_sum + tf.pad(pred, padding)
-            output_weight_sum = output_weight_sum + tf.pad(importance_map_part, padding)
+    output_shape = (1, *padded_image_size, n_class)
+    output_sum = tf.zeros(output_shape, dtype=tf.float32)
+    output_weight_sum = tf.ones(output_shape, dtype=tf.float32)
+    window_slices = get_window_slices(padded_image_size, roi_size, overlap, strategy)
+
+    for window_slice in window_slices:
+        window = tf.slice(input_padded, begin=window_slice[0], size=window_slice[1])
+        pred = run_model(window, model, importance_map, **kwargs)
+        padding = [
+            [start, output_size - (start + size)] for start, size, output_size in zip(*window_slice, output_shape)
+        ]
+        padding = padding[:-1] + [[0, 0]]
+        output_sum = output_sum + tf.pad(pred, padding)
+        output_weight_sum = output_weight_sum + tf.pad(importance_map, padding)
 
     output = output_sum / output_weight_sum
     crop_slice = [slice(pad[0], pad[0] + input_x) for pad, input_x in zip(paddings, inputs.shape[:-1])]
-    output_cropped = output[crop_slice]
-
-    return output_cropped
-
-
-if __name__ == "__main__":
-    image_size = [7, 6]
-    roi_size = [3, 2]
-    intervals = [2, 2]
-    assert get_window_slices(image_size, roi_size, intervals) == [
-        (slice(0, 3), slice(0, 2)),
-        (slice(0, 3), slice(2, 4)),
-        (slice(0, 3), slice(4, 6)),
-        (slice(2, 5), slice(0, 2)),
-        (slice(2, 5), slice(2, 4)),
-        (slice(2, 5), slice(4, 6)),
-        (slice(4, 7), slice(0, 2)),
-        (slice(4, 7), slice(2, 4)),
-        (slice(4, 7), slice(4, 6)),
-    ]
-
-    # print(gaussian_kernel([4, 5, 6], sigma=0.125))
-
-    import matplotlib.pyplot as plt
-    import PIL
-
-    img = np.asarray(PIL.Image.open("images/unet3d.png"))
-    inputs = tf.expand_dims(tf.convert_to_tensor(img, dtype=tf.float32), axis=0)
-    model = tf.identity
-    result = sliding_window_inference(inputs, roi_size=(128, 128), sw_batch_size=1, overlap=0.5, model=model)
-    plt.imsave("images/sw_unet3d.png", np.squeeze(result.numpy()))
+
+    return output[crop_slice]
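The new `get_window_slices` folds the per-axis interval computation into the slicing loop and emits `tf.slice`-ready `(begin, size)` pairs for a batch of one. Reproduced here standalone (pure Python, no TF) so the `overlap_inside` behavior can be exercised outside the model:

```python
import itertools

def get_window_slices(image_size, roi_size, overlap, strategy):
    # Per-axis start positions; the stride shrinks as overlap grows.
    dim_starts = []
    for image_x, roi_x in zip(image_size, roi_size):
        interval = roi_x if roi_x == image_x else int(roi_x * (1 - overlap))
        starts = list(range(0, image_x - roi_x + 1, interval))
        if strategy == "overlap_inside" and starts[-1] + roi_x < image_x:
            starts.append(image_x - roi_x)  # shortened final stride instead of padding
        dim_starts.append(starts)
    slices = [(start + (0,), roi_size + (-1,)) for start in itertools.product(*dim_starts)]
    # Prepend the batch coordinate expected by tf.slice: start 0, size 1.
    return [((0,) + start, (1,) + roi) for start, roi in slices]
```

For `image_size=(7, 6)`, `roi_size=(3, 2)` and zero overlap, the first axis does not tile evenly (0, 3 leaves one row uncovered), so `overlap_inside` adds a final window starting at 4, yielding a 3x3 grid of windows.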

+ 1 - 0
TensorFlow2/Segmentation/nnUNet/requirements.txt

@@ -1,4 +1,5 @@
 git+https://github.com/NVIDIA/dllogger
+git+https://github.com/NVIDIA/mlperf-common.git
 nibabel==3.1.1
 joblib==0.16.0
 scikit-learn==0.23.2

+ 11 - 15
TensorFlow2/Segmentation/nnUNet/runtime/args.py

@@ -54,7 +54,7 @@ def get_main_args():
         "--exec-mode",
         "--exec_mode",
         type=str,
-        choices=["train", "evaluate", "predict", "export", "nav"],
+        choices=["train", "evaluate", "predict", "export"],
         default="train",
         help="Execution mode to run the model",
     )
@@ -66,7 +66,6 @@ def get_main_args():
     p.flag("--benchmark", help="Run model benchmarking")
     p.boolean_flag("--tta", default=False, help="Enable test time augmentation")
     p.boolean_flag("--save-preds", "--save_preds", default=False, help="Save predictions")
-    p.boolean_flag("--sw-benchmark", "--sw_benchmark", default=False)
 
     # Logging
     p.arg("--results", type=Path, default=Path("/results"), help="Path to results directory")
@@ -74,10 +73,9 @@ def get_main_args():
     p.flag("--quiet", help="Minimalize stdout/stderr output")
     p.boolean_flag("--use-dllogger", "--use_dllogger", default=True, help="Use DLLogger logging")
 
-    # Optimalization
+    # Performance optimization
     p.boolean_flag("--amp", default=False, help="Enable automatic mixed precision")
     p.boolean_flag("--xla", default=False, help="Enable XLA compiling")
-    p.boolean_flag("--read-roi", "--read_roi", default=False, help="Use DALI direct ROI loading feature")
 
     # Training hyperparameters and loss fn customization
     p.arg("--batch-size", "--batch_size", type=positive_int, default=2, help="Batch size")
@@ -86,15 +84,15 @@ def get_main_args():
     p.arg(
         "--scheduler",
         type=str,
-        default="none",
+        default="cosine_annealing",
         choices=["none", "poly", "cosine", "cosine_annealing"],
         help="Learning rate scheduler",
     )
-    p.arg("--end-learning-rate", type=float, default=0.00008, help="End learning rate for poly scheduler")
+    p.arg("--end-learning-rate", type=float, default=0.00001, help="End learning rate for poly scheduler")
     p.arg(
         "--cosine-annealing-first-cycle-steps",
         type=positive_int,
-        default=512,
+        default=4096,
         help="Length of a cosine decay cycle in steps, only with 'cosine_annealing' scheduler",
     )
     p.arg(
@@ -145,7 +143,7 @@ def get_main_args():
     p.arg(
         "--nvol",
         type=positive_int,
-        default=4,
+        default=2,
         help="Number of volumes which come into single batch size for 2D model",
     )
     p.arg(
@@ -160,14 +158,12 @@ def get_main_args():
         default=8,
         help="Number of subprocesses to use for data loading",
     )
-    p.boolean_flag("--dali-use-cpu", default=False, help="Use CPU for data augmentation instead of GPU")
 
     # Sliding window inference
-    p.arg("--sw-batch-size", type=positive_int, default=2, help="Sliding window inference batch size")
     p.arg(
         "--overlap",
         type=float_0_1,
-        default=0.5,
+        default=0.25,
         help="Amount of overlap between scans during sliding window inference",
     )
     p.arg(
@@ -176,7 +172,7 @@ def get_main_args():
         dest="blend_mode",
         type=str,
         choices=["gaussian", "constant"],
-        default="gaussian",
+        default="constant",
         help="How to blend output of overlapping windows",
     )
 
@@ -184,7 +180,7 @@ def get_main_args():
     p.arg("--nfolds", type=positive_int, default=5, help="Number of cross-validation folds")
     p.arg("--fold", type=non_negative_int, default=0, help="Fold number")
     p.arg("--epochs", type=positive_int, default=1000, help="Number of epochs")
-    p.arg("--skip-eval", type=positive_int, default=0, help="Skip evaluation for the first N epochs.")
+    p.arg("--skip-eval", type=non_negative_int, default=0, help="Skip evaluation for the first N epochs.")
     p.arg(
         "--steps-per-epoch",
         type=positive_int,
@@ -195,13 +191,13 @@ def get_main_args():
     p.arg(
         "--bench-steps",
         type=non_negative_int,
-        default=100,
+        default=200,
         help="Number of benchmarked steps in total",
     )
     p.arg(
         "--warmup-steps",
         type=non_negative_int,
-        default=25,
+        default=100,
         help="Number of warmup steps before collecting benchmarking statistics",
     )
 

+ 6 - 11
TensorFlow2/Segmentation/nnUNet/runtime/logging.py

@@ -12,11 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-import dllogger
 import pathlib
 from abc import ABC, abstractmethod
+from typing import Callable
+
+import dllogger
 from dllogger import Verbosity
-from typing import Dict, Any, Callable, Optional
 
 from runtime.utils import rank_zero_only
 
@@ -89,7 +90,7 @@ class LoggerCollection(Logger):
 
 
 class DLLogger(Logger):
-    def __init__(self, save_dir: pathlib.Path, filename: str, append: bool, quiet: bool):
+    def __init__(self, save_dir, filename, append, quiet):
         super().__init__()
         self._initialize_dllogger(save_dir, filename, append, quiet)
 
@@ -97,16 +98,10 @@ class DLLogger(Logger):
     def _initialize_dllogger(self, save_dir, filename, append, quiet):
         save_dir.mkdir(parents=True, exist_ok=True)
         backends = [
-            dllogger.JSONStreamBackend(
-                Verbosity.DEFAULT, str(save_dir / filename), append=append
-            ),
+            dllogger.JSONStreamBackend(Verbosity.DEFAULT, str(save_dir / filename), append=append),
         ]
         if not quiet:
-            backends.append(
-                dllogger.StdOutBackend(
-                    Verbosity.VERBOSE, step_format=lambda step: f"Step: {step} "
-                )
-            )
+            backends.append(dllogger.StdOutBackend(Verbosity.VERBOSE, step_format=lambda step: f"Step: {step} "))
         dllogger.init(backends=backends)
 
     @rank_zero_only

+ 14 - 14
TensorFlow2/Segmentation/nnUNet/runtime/losses.py

@@ -22,7 +22,7 @@ class DiceLoss(tf.keras.losses.Loss):
         self.reduce_batch = reduce_batch
         self.eps = eps
         self.include_background = include_background
-    
+
     def dice_coef(self, y_true, y_pred):
         intersection = tf.reduce_sum(y_true * y_pred, axis=1)
         pred_sum = tf.reduce_sum(y_pred, axis=1)
@@ -30,6 +30,7 @@ class DiceLoss(tf.keras.losses.Loss):
         dice = (2.0 * intersection + self.eps) / (pred_sum + true_sum + self.eps)
         return tf.reduce_mean(dice, axis=0)
 
+    @tf.function
     def call(self, y_true, y_pred):
         n_class = y_pred.shape[-1]
         if self.reduce_batch:
@@ -38,7 +39,7 @@ class DiceLoss(tf.keras.losses.Loss):
             flat_shape = (y_pred.shape[0], -1, n_class)
         if self.y_one_hot:
             y_true = tf.one_hot(y_true, n_class)
-        
+
         flat_pred = tf.reshape(tf.cast(y_pred, tf.float32), flat_shape)
         flat_true = tf.reshape(y_true, flat_shape)
 
@@ -55,29 +56,28 @@ class DiceCELoss(tf.keras.losses.Loss):
         super().__init__()
         self.y_one_hot = y_one_hot
         self.dice_loss = DiceLoss(y_one_hot=False, **dice_kwargs)
-    
+
+    @tf.function
     def call(self, y_true, y_pred):
         y_pred = tf.cast(y_pred, tf.float32)
         n_class = y_pred.shape[-1]
         if self.y_one_hot:
             y_true = tf.one_hot(y_true, n_class)
         dice_loss = self.dice_loss(y_true, y_pred)
-        ce_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
-            labels=y_true, logits=y_pred, 
-        ))
+        ce_loss = tf.reduce_mean(
+            tf.nn.softmax_cross_entropy_with_logits(
+                labels=y_true,
+                logits=y_pred,
+            )
+        )
         return dice_loss + ce_loss
 
 
 class WeightDecay:
     def __init__(self, factor):
         self.factor = factor
-    
+
+    @tf.function
     def __call__(self, model):
         # TODO: add_n -> accumulate_n ?
-        return self.factor * tf.add_n( 
-            [
-                tf.nn.l2_loss(v)
-                for v in model.trainable_variables
-                if "norm" not in v.name
-            ]
-        )
+        return self.factor * tf.add_n([tf.nn.l2_loss(v) for v in model.trainable_variables if "norm" not in v.name])
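The `dice_coef` being wrapped in `@tf.function` above computes a smoothed soft Dice over flattened spatial dimensions. A minimal NumPy sketch of the same formula, simplified to a single class (`soft_dice` is an illustrative name, not a repository function):

```python
import numpy as np

def soft_dice(y_true, y_pred, eps=1e-6):
    # 2*|A intersect B| / (|A| + |B|), smoothed by eps so empty masks stay defined
    intersection = np.sum(y_true * y_pred)
    return (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)
```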

+ 16 - 21
TensorFlow2/Segmentation/nnUNet/runtime/run.py

@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-from time import time
+import time
 
 import horovod.tensorflow as hvd
 import numpy as np
@@ -30,7 +30,7 @@ def update_best_metrics(old, new, start_time, iteration, watch_metric=None):
     did_change = False
     for metric, value in new.items():
         if metric not in old or old[metric]["value"] < value:
-            old[metric] = {"value": value, "timestamp": time() - start_time, "iter": int(iteration)}
+            old[metric] = {"value": value, "timestamp": time.time() - start_time, "iter": int(iteration)}
             if watch_metric == metric:
                 did_change = True
     return did_change
@@ -50,7 +50,7 @@ def get_scheduler(args, total_steps):
         "cosine_annealing": tf.keras.optimizers.schedules.CosineDecayRestarts(
             initial_learning_rate=args.learning_rate,
             first_decay_steps=args.cosine_annealing_first_cycle_steps,
-            m_mul=args.cosine_annealing_peak_decay,
+            alpha=0.1,
         ),
         "none": args.learning_rate,
     }[args.scheduler.lower()]
@@ -77,10 +77,9 @@ def get_epoch_size(args, batch_size, dataset_size):
     return (dataset_size + div - 1) // div
 
 
-def process_performance_stats(timestamps, batch_size, mode):
-    deltas = np.diff(timestamps)
-    deltas_ms = 1000 * deltas
-    throughput_imgps = (1000.0 * batch_size / deltas_ms).mean()
+def process_performance_stats(deltas, batch_size, mode):
+    deltas_ms = 1000 * np.array(deltas)
+    throughput_imgps = 1000.0 * batch_size / deltas_ms.mean()
     stats = {f"throughput_{mode}": throughput_imgps, f"latency_{mode}_mean": deltas_ms.mean()}
     for level in [90, 95, 99]:
         stats.update({f"latency_{mode}_{level}": np.percentile(deltas_ms, level)})
@@ -90,7 +89,7 @@ def process_performance_stats(timestamps, batch_size, mode):
 
 def benchmark(args, step_fn, data, steps, warmup_steps, logger, mode="train"):
     assert steps > warmup_steps, "Number of benchmarked steps has to be greater then number of warmup steps"
-    timestamps = []
+    deltas = []
     wrapped_data = progress_bar(
         enumerate(data),
         quiet=args.quiet,
@@ -99,19 +98,18 @@ def benchmark(args, step_fn, data, steps, warmup_steps, logger, mode="train"):
         postfix={"phase": "warmup"},
         total=steps,
     )
+    start = time.perf_counter()
     for step, (images, labels) in wrapped_data:
         output_map = step_fn(images, labels, warmup_batch=step == 0)
-        if mode == "predict":
-            with tf.device("/device:CPU:0"):
-                output_map = tf.experimental.numpy.copy(output_map)
         if step >= warmup_steps:
-            timestamps.append(time())
+            deltas.append(time.perf_counter() - start)
             if step == warmup_steps and is_main_process() and not args.quiet:
                 wrapped_data.set_postfix(phase="benchmark")
+        start = time.perf_counter()
         if step >= steps:
             break
 
-    stats = process_performance_stats(timestamps, args.gpus * args.batch_size, mode=mode)
+    stats = process_performance_stats(deltas, args.gpus * args.batch_size, mode=mode)
     logger.log_metrics(stats)
 
 
@@ -174,7 +172,7 @@ def train(args, model, dataset, logger):
             unit="step",
             total=total_steps - int(tstep),
         )
-        start_time = time()
+        start_time = time.time()
         total_train_loss, dice_score = 0.0, 0.0
         for images, labels in wrapped_data:
             if tstep >= total_steps:
@@ -193,7 +191,7 @@ def train(args, model, dataset, logger):
                     metrics = dice_metrics.logger_metrics()
                     metrics.update(make_class_logger_metrics(dice))
                     if did_improve:
-                        metrics["time_to_train"] = time() - start_time
+                        metrics["time_to_train"] = time.time() - start_time
                     logger.log_metrics(metrics=metrics, step=int(tstep))
                     checkpoint.update(float(dice_score))
                     logger.flush()
@@ -245,20 +243,17 @@ def evaluate(args, model, dataset, logger):
 def predict(args, model, dataset, logger):
     if args.benchmark:
 
+        @tf.function
         def predict_bench_fn(features, labels, warmup_batch):
             if args.dim == 2:
                 features = features[0]
-
-            if args.sw_benchmark:
-                output_map = model.inference(features)
-            else:
-                output_map = model(features, training=False)
+            output_map = model(features, training=False)
             return output_map
 
         benchmark(
             args,
             predict_bench_fn,
-            dataset.train_dataset(),
+            dataset.test_dataset(),
             args.bench_steps,
             args.warmup_steps,
             logger,
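The benchmark loop now records per-step `perf_counter` deltas directly instead of diffing absolute timestamps, so the updated `process_performance_stats` receives durations it can aggregate without warmup contamination. It is simple enough to check in isolation (reproduced from the diff, NumPy only; the return statement past the hunk boundary is assumed to return `stats` unchanged):

```python
import numpy as np

def process_performance_stats(deltas, batch_size, mode):
    # deltas: per-step durations in seconds -> latency in ms and images/sec
    deltas_ms = 1000 * np.array(deltas)
    throughput_imgps = 1000.0 * batch_size / deltas_ms.mean()
    stats = {f"throughput_{mode}": throughput_imgps, f"latency_{mode}_mean": deltas_ms.mean()}
    for level in [90, 95, 99]:
        stats.update({f"latency_{mode}_{level}": np.percentile(deltas_ms, level)})
    return stats
```

For four steps of 100 ms each at global batch size 8, this reports a mean latency of 100 ms and a throughput of 80 images/sec.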

+ 13 - 10
TensorFlow2/Segmentation/nnUNet/runtime/utils.py

@@ -15,7 +15,6 @@
 import multiprocessing
 import os
 import pickle
-import random
 import shutil
 import sys
 from functools import wraps
@@ -29,7 +28,6 @@ from tqdm import tqdm
 
 def hvd_init():
     hvd.init()
-
     gpus = tf.config.experimental.list_physical_devices("GPU")
     for gpu in gpus:
         tf.config.experimental.set_memory_growth(gpu, True)
@@ -38,24 +36,31 @@ def hvd_init():
 
 
 def set_tf_flags(args):
-    os.environ["CUDA_CACHE_DISABLE"] = "1"
+    os.environ["CUDA_CACHE_DISABLE"] = "0"
     os.environ["HOROVOD_GPU_ALLREDUCE"] = "NCCL"
     os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
     os.environ["TF_GPU_THREAD_MODE"] = "gpu_private"
-    os.environ["TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT"] = "0"
+    os.environ["TF_GPU_THREAD_COUNT"] = str(hvd.size())
+    os.environ["TF_USE_CUDNN_BATCHNORM_SPATIAL_PERSISTENT"] = "1"
     os.environ["TF_ADJUST_HUE_FUSED"] = "1"
     os.environ["TF_ADJUST_SATURATION_FUSED"] = "1"
     os.environ["TF_ENABLE_WINOGRAD_NONFUSED"] = "1"
     os.environ["TF_SYNC_ON_FINISH"] = "0"
     os.environ["TF_AUTOTUNE_THRESHOLD"] = "2"
     os.environ["TF_ENABLE_AUTO_MIXED_PRECISION"] = "0"
+    os.environ["TF_ENABLE_LAYOUT_NHWC"] = "1"
+    os.environ["TF_CPP_VMODULE"] = "4"
 
     if args.xla:
+        os.environ["TF_XLA_ENABLE_GPU_GRAPH_CAPTURE"] = "1"
+        if args.amp:
+            os.environ["XLA_FLAGS"] = "--xla_gpu_force_conv_nhwc"
         tf.config.optimizer.set_jit(True)
 
-    tf.config.optimizer.set_experimental_options({"remapping": False})
-    tf.config.threading.set_intra_op_parallelism_threads(1)
-    tf.config.threading.set_inter_op_parallelism_threads(max(2, (multiprocessing.cpu_count() // hvd.size()) - 2))
+    if hvd.size() > 1:
+        tf.config.threading.set_inter_op_parallelism_threads(max(2, (multiprocessing.cpu_count() // hvd.size()) - 2))
+    else:
+        tf.config.threading.set_inter_op_parallelism_threads(8)
 
     if args.amp:
         tf.keras.mixed_precision.set_global_policy("mixed_float16")
@@ -81,7 +86,6 @@ def rank_zero_only(fn):
 
 
 def set_seed(seed):
-    seed = seed or random.randrange(2 ** 31)
     np.random.seed(seed)
     tf.random.set_seed(seed)
 
@@ -101,8 +105,7 @@ def get_config_file(args):
 def get_tta_flips(dim):
     if dim == 2:
         return [[1], [2], [1, 2]]
-    else:
-        return [[1], [2], [3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]
+    return [[1], [2], [3], [1, 2], [1, 3], [2, 3], [1, 2, 3]]
 
 
 def make_empty_dir(path, force=False):

+ 10 - 7
TensorFlow2/Segmentation/nnUNet/scripts/benchmark.py

@@ -12,9 +12,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import subprocess
 from argparse import ArgumentParser
 from pathlib import Path
-from subprocess import call
 
 parser = ArgumentParser()
 parser.add_argument("--mode", type=str, required=True, choices=["train", "predict"], help="Benchmarking mode")
@@ -23,9 +23,11 @@ parser.add_argument("--dim", type=int, required=True, choices=[2, 3], help="Dime
 parser.add_argument("--gpus", type=int, default=1, help="Number of gpus")
 parser.add_argument("--batch-size", "--batch_size", type=int, required=True)
 parser.add_argument("--amp", action="store_true", help="Enable automatic mixed precision")
+parser.add_argument("--bind", action="store_true", help="Bind CPUs for each GPU. Improves throughput for multi-GPU.")
+parser.add_argument("--horovod", action="store_true")
 parser.add_argument("--xla", action="store_true", help="Enable XLA compiling")
 parser.add_argument(
-    "--bench-steps", "--bench_steps", type=int, default=300, help="Number of benchmarked steps in total"
+    "--bench-steps", "--bench_steps", type=int, default=200, help="Number of benchmarked steps in total"
 )
 parser.add_argument(
     "--warmup-steps", "--warmup_steps", type=int, default=100, help="Warmup iterations before collecting statistics"
@@ -37,8 +39,10 @@ parser.add_argument("--logname", type=str, default="perf.json", help="Name of th
 if __name__ == "__main__":
     args = parser.parse_args()
     path_to_main = Path(__file__).resolve().parent.parent / "main.py"
-    cmd = f"horovodrun -np {args.gpus} "
-    cmd += f"python {path_to_main} --benchmark "
+    cmd = f"horovodrun -np {args.gpus} " if args.horovod else ""
+    if args.bind:
+        cmd += "bindpcie --cpu=exclusive,nosmt "
+    cmd += f"python {path_to_main} --benchmark --ckpt-strategy none --seed 0 "
     cmd += f"--exec-mode {args.mode} "
     cmd += f"--task {args.task} "
     cmd += f"--dim {args.dim} "
@@ -49,6 +53,5 @@ if __name__ == "__main__":
     cmd += f"--warmup-steps {args.warmup_steps} "
     cmd += f"--results {args.results} "
     cmd += f"--logname {args.logname} "
-    cmd += f"--ckpt-strategy none "
-    cmd += f"--gpus {args.gpus}"
-    call(cmd, shell=True)
+    cmd += f"--gpus {args.gpus} "
+    subprocess.run(cmd, shell=True)

+ 11 - 13
TensorFlow2/Segmentation/nnUNet/scripts/train.py

@@ -20,36 +20,34 @@ parser = ArgumentParser()
 parser.add_argument("--task", type=str, default="01", help="Task code")
 parser.add_argument("--dim", type=int, required=True, choices=[2, 3], help="Dimension of UNet")
 parser.add_argument("--gpus", type=int, default=1, help="Number of gpus")
+parser.add_argument("--seed", type=int, default=1, help="Random seed")
 parser.add_argument("--learning_rate", type=float, default=3e-4)
 parser.add_argument("--fold", type=int, required=True, choices=[0, 1, 2, 3, 4], help="Fold number")
 parser.add_argument("--amp", action="store_true", help="Enable automatic mixed precision")
-parser.add_argument("--xla", action="store_true", help="Enable xla")
 parser.add_argument("--tta", action="store_true", help="Enable test time augmentation")
+parser.add_argument("--horovod", action="store_true", help="Launch horovod within script")
+parser.add_argument("--bind", action="store_true", help="Bind CPUs for each GPU. Improves throughput for multi-GPU.")
 parser.add_argument("--results", type=Path, default=Path("/results"), help="Path to results directory")
 parser.add_argument("--logname", type=str, default="train_log.json", help="Name of the dlloger output")
 
 if __name__ == "__main__":
     args = parser.parse_args()
-    bs = 2 if args.dim == 3 else 64
-    epochs = 300 if args.gpus == 1 else 600
-    if args.gpus == 1:
-        skip = 100 if args.dim == 2 else 180
-    else:
-        skip = 150 if args.dim == 2 else 260
+    skip = 100 if args.gpus == 1 else 150
     path_to_main = Path(__file__).resolve().parent.parent / "main.py"
-    cmd = f"horovodrun -np {args.gpus} "
-    cmd += f"python {path_to_main} --exec-mode train --skip-eval {skip} "
+    cmd = f"horovodrun -np {args.gpus} " if args.horovod else ""
+    if args.bind:
+        cmd += "bindpcie --cpu=exclusive,nosmt "
+    cmd += f"python {path_to_main} --exec-mode train --deep_supervision --xla --skip-eval {skip} "
     cmd += f"--task {args.task} "
     cmd += f"--dim {args.dim} "
-    cmd += f"--epochs {epochs} "
-    cmd += f"--batch-size {bs} "
+    cmd += f"--epochs {300 if args.gpus == 1 else 600} "
+    cmd += f"--batch-size {2 if args.dim == 3 else 64} "
     cmd += f"--learning_rate {args.learning_rate} "
     cmd += f"--fold {args.fold} "
     cmd += f"--amp {args.amp} "
-    cmd += f"--xla {args.xla} "
     cmd += f"--tta {args.tta} "
     cmd += f"--results {args.results} "
     cmd += f"--logname {args.logname} "
     cmd += f"--gpus {args.gpus} "
-    cmd += f"--deep_supervision"
+    cmd += f"--seed {args.seed} "
     call(cmd, shell=True)
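The train.py change collapses the old per-dimension `skip` table into a single GPU-count rule and inlines the epoch and batch-size defaults. A small sketch of the resulting selection logic (the `train_defaults` helper name is illustrative; the values are the ones the diff hard-codes):

```python
def train_defaults(gpus, dim):
    """Mirror the simplified defaults in the updated train.py."""
    # Evaluation is now skipped for the same number of epochs for 2D and 3D;
    # only the GPU count matters.
    skip = 100 if gpus == 1 else 150
    epochs = 300 if gpus == 1 else 600
    batch_size = 2 if dim == 3 else 64
    return skip, epochs, batch_size

print(train_defaults(8, 3))  # -> (150, 600, 2)
```

Note that multi-GPU runs double the epoch budget but also skip evaluation for 50 more epochs, so the number of evaluated epochs grows accordingly.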