
[FastPitch] Updating for Ampere

Przemek Strzelczyk, 5 years ago
commit 77dad060a2
30 files changed, with 569 additions and 615 deletions
  1. +1 -1
      PyTorch/SpeechSynthesis/FastPitch/Dockerfile
  2. +189 -211
      PyTorch/SpeechSynthesis/FastPitch/README.md
  3. BIN
      PyTorch/SpeechSynthesis/FastPitch/audio/sample_fp32.wav
  4. +15 -0
      PyTorch/SpeechSynthesis/FastPitch/common/log_helper.py
  5. +6 -5
      PyTorch/SpeechSynthesis/FastPitch/common/stft.py
  6. +2 -2
      PyTorch/SpeechSynthesis/FastPitch/export_torchscript.py
  7. +1 -1
      PyTorch/SpeechSynthesis/FastPitch/extract_mels.py
  8. BIN
      PyTorch/SpeechSynthesis/FastPitch/img/loss.png
  9. +37 -29
      PyTorch/SpeechSynthesis/FastPitch/inference.py
  10. +0 -107
      PyTorch/SpeechSynthesis/FastPitch/inference_perf.py
  11. +0 -91
      PyTorch/SpeechSynthesis/FastPitch/multiproc.py
  12. +1 -1
      PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_AMP_1GPU.sh
  13. +2 -3
      PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_AMP_4GPU.sh
  14. +24 -0
      PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_AMP_8GPU.sh
  15. +0 -0
      PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_FP32_1GPU.sh
  16. +1 -2
      PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_FP32_4GPU.sh
  17. +1 -2
      PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_FP32_8GPU.sh
  18. +24 -0
      PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_AMP_1GPU.sh
  19. +24 -0
      PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_AMP_4GPU.sh
  20. +24 -0
      PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_AMP_8GPU.sh
  21. +23 -0
      PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_TF32_1GPU.sh
  22. +23 -0
      PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_TF32_4GPU.sh
  23. +1 -1
      PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_TF32_8GPU.sh
  24. +2 -2
      PyTorch/SpeechSynthesis/FastPitch/scripts/download_dataset.sh
  25. +21 -21
      PyTorch/SpeechSynthesis/FastPitch/scripts/inference_benchmark.sh
  26. +12 -10
      PyTorch/SpeechSynthesis/FastPitch/scripts/inference_example.sh
  27. +22 -22
      PyTorch/SpeechSynthesis/FastPitch/scripts/train.sh
  28. +36 -35
      PyTorch/SpeechSynthesis/FastPitch/train.py
  29. +5 -10
      PyTorch/SpeechSynthesis/FastPitch/waveglow/denoiser.py
  30. +72 -59
      PyTorch/SpeechSynthesis/FastPitch/waveglow/model.py

+ 1 - 1
PyTorch/SpeechSynthesis/FastPitch/Dockerfile

@@ -1,4 +1,4 @@
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.03-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.06-py3
 FROM ${FROM_IMAGE_NAME}
 
 ADD requirements.txt .

+ 189 - 211
PyTorch/SpeechSynthesis/FastPitch/README.md

@@ -8,9 +8,10 @@ This repository provides a script and recipe to train the FastPitch model to ach
     * [Model architecture](#model-architecture)
     * [Default configuration](#default-configuration)
     * [Feature support matrix](#feature-support-matrix)
-	    * [Features](#features)
+        * [Features](#features)
     * [Mixed precision training](#mixed-precision-training)
-	    * [Enabling mixed precision](#enabling-mixed-precision)
+        * [Enabling mixed precision](#enabling-mixed-precision)
+        * [Enabling TF32](#enabling-tf32)
     * [Glossary](#glossary)
 - [Setup](#setup)
     * [Requirements](#requirements)
@@ -31,12 +32,15 @@ This repository provides a script and recipe to train the FastPitch model to ach
         * [Inference performance benchmark](#inference-performance-benchmark)
     * [Results](#results)
         * [Training accuracy results](#training-accuracy-results)
-            * [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-8x-v100-16g)
+            * [Training accuracy: NVIDIA DGX A100 (8x A100 40GB)](#training-accuracy-nvidia-dgx-a100-8x-a100-40gb)
+            * [Training accuracy: NVIDIA DGX-1 (8x V100 16GB)](#training-accuracy-nvidia-dgx-1-8x-v100-16gb)
         * [Training performance results](#training-performance-results)
-            * [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16g)
+            * [Training performance: NVIDIA DGX A100 (8x A100 40GB)](#training-performance-nvidia-dgx-a100-8x-a100-40gb)
+            * [Training performance: NVIDIA DGX-1 (8x V100 16GB)](#training-performance-nvidia-dgx-1-8x-v100-16gb)
             * [Expected training time](#expected-training-time)
         * [Inference performance results](#inference-performance-results)
-            * [Inference performance: NVIDIA DGX-1 (1x V100 16G)](#inference-performance-nvidia-dgx-1-1x-v100-16g)
+            * [Inference performance: NVIDIA DGX A100 (1x A100 40GB)](#inference-performance-nvidia-dgx-a100-gpu-1x-a100-40gb)
+            * [Inference performance: NVIDIA DGX-1 (1x V100 16GB)](#inference-performance-nvidia-dgx-1-1x-v100-16gb)
             * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
 - [Release notes](#release-notes)
     * [Changelog](#changelog)
@@ -44,9 +48,9 @@ This repository provides a script and recipe to train the FastPitch model to ach
 
 ## Model overview
 
-FastPitch is one of two major components in a neural, text-to-speech (TTS) system:
+[FastPitch](https://arxiv.org/abs/2006.06873) is one of two major components in a neural, text-to-speech (TTS) system:
 
-* a mel-spectrogram generator such as FastPitch or [Tacotron 2](https://arxiv.org/abs/1712.05884), and
+* a mel-spectrogram generator such as [FastPitch](https://arxiv.org/abs/2006.06873) or [Tacotron 2](https://arxiv.org/abs/1712.05884), and
 * a waveform synthesizer such as [WaveGlow](https://arxiv.org/abs/1811.00002) (see [NVIDIA example code](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2)).
 
 Such two-component TTS system is able to synthesize natural sounding speech from raw transcripts.
@@ -55,16 +59,28 @@ The FastPitch model generates mel-spectrograms and predicts a pitch contour from
 * modify the pitch contour to control the prosody,
 * increase or decrease the fundamental frequency in a naturally sounding way, that preserves the perceived identity of the speaker,
 * alter the pace of speech.
+Some of the capabilities of FastPitch are presented on the website with [samples](https://fastpitch.github.io/).
+
+Speech synthesized with FastPitch has state-of-the-art quality, and does not suffer from missing or repeated phrases as Tacotron 2 does.
+This is reflected in Mean Opinion Scores ([details](https://arxiv.org/abs/2006.06873)).
+
+| Model     | Mean Opinion Score (MOS) |
+|:----------|:-------------------------|
+| Tacotron2 | 3.946 ± 0.134            |
+| FastPitch | 4.080 ± 0.133            |
+
 The FastPitch model is based on the [FastSpeech](https://arxiv.org/abs/1905.09263) model. The main differences between FastPitch and FastSpeech are that FastPitch:
 * explicitly learns to predict the pitch contour,
-*  pitch conditioning removes harsh sounding artifacts and provides faster convergence,
+* uses pitch conditioning, which removes harsh-sounding artifacts and provides faster convergence,
 * no need for distilling mel-spectrograms with a teacher model,
 * [character durations](#glossary) are extracted with a pre-trained Tacotron 2 model.
 
+The FastPitch model is similar to [FastSpeech2](https://arxiv.org/abs/2006.04558), which was developed concurrently. FastPitch averages pitch values over input tokens and does not use additional conditioning such as energy.
+
 FastPitch is trained on a publicly
 available [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/).
 
-This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results from 2.0x to 2.7x2.2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
+This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results from 2.0x to 2.7x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
 
 ### Model architecture
 
@@ -75,7 +91,7 @@ from raw text (Figure 1). The entire process is parallel, which means that all i
   <img src="./img/fastpitch_model.png" alt="FastPitch model architecture" />
 </p>
 <p align="center">
-  <em>Figure 1. Architecture of FastPitch. The model is composed of a bidirectional Transformer backbone (also known as a Transformer encoder), a pitch predictor, and a duration predictor. After passing through the first *N* Transformer blocks, encoding, the signal is augmented with pitch information and discretely upsampled. Then it goes through another set of *N* Transformer blocks, with the goal of
+  <em>Figure 1. Architecture of FastPitch (<a href="https://arxiv.org/abs/2006.06873">source</a>). The model is composed of a bidirectional Transformer backbone (also known as a Transformer encoder), a pitch predictor, and a duration predictor. After passing through the first *N* Transformer blocks (encoding), the signal is augmented with pitch information and discretely upsampled. Then it goes through another set of *N* Transformer blocks, with the goal of
 smoothing out the upsampled signal, and constructing a mel-spectrogram.
   </em>
 </p>
@@ -108,7 +124,7 @@ The following features are supported by this model.
 | :------------------------------------------------------------------|------------:|
 |[AMP](https://nvidia.github.io/apex/amp.html)                               | Yes |
 |[Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html) | Yes |
-         
+
 #### Features
 
 AMP - a tool that enables Tensor Core-accelerated training. For more information,
@@ -123,8 +139,8 @@ required.
 
 ### Mixed precision training
 
-Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
-1.  Porting the model to use the FP16 data type where appropriate.    
+Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in Volta, and following with both the Turing and Ampere architectures, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps:
+1.  Porting the model to use the FP16 data type where appropriate.
 2.  Adding loss scaling to preserve small gradient values.
 
 The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK.
@@ -145,9 +161,8 @@ step must be included when applying gradients. In PyTorch, loss scaling can be
 easily applied by using the `scale_loss()` method provided by AMP. The scaling value
 to be used can be [dynamic](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler) or fixed.
 
-By default, the `train_fastpitch.sh` script will
-launch mixed precision training with Tensor Cores. You can change this
-behaviour by removing the `--amp-run` flag from the `train.py` script.
+By default, the `scripts/train.sh` script will run in full precision. To launch mixed precision training with Tensor Cores, either set the env variable `AMP=true`
+when using `scripts/train.sh`, or add the `--amp` flag when directly executing `train.py` without the helper script.
 
 To enable mixed precision, the following steps were performed:
 * Import AMP from APEX:
@@ -157,14 +172,14 @@ To enable mixed precision, the following steps were performed:
 
 * Initialize AMP:
     ```bash
-	model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
+    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
     ```
 
 * If running on multi-GPU, wrap the model with `DistributedDataParallel`:
     ```bash
     from apex.parallel import DistributedDataParallel as DDP
     model = DDP(model)
-	```
+    ```
 
 * Scale loss before backpropagation (assuming loss is stored in a variable
 called `losses`):
@@ -179,7 +194,15 @@ called `losses`):
         with optimizer.scale_loss(losses) as scaled_losses:
             scaled_losses.backward()
         ```
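Apex handles this dynamic scaling internally. As an illustration of the idea only (not Apex's actual code — the class name, method names, and defaults below are invented for this sketch), a minimal dynamic loss scaler reduces the scale on gradient overflow and cautiously grows it after a run of clean steps:

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler; illustrative, not APEX's implementation."""

    def __init__(self, init_scale=2.0 ** 15, factor=2.0, growth_interval=2000):
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf):
        if found_inf:
            # Overflow: the optimizer step would be skipped and the scale backed off.
            self.scale /= self.factor
            self._good_steps = 0
        else:
            self._good_steps += 1
            if self._good_steps % self.growth_interval == 0:
                # A long run of clean steps: try a larger scale again.
                self.scale *= self.factor
```

In a training loop, `found_inf` would come from checking gradients for inf/NaN after unscaling.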
+#### Enabling TF32
+
+TensorFloat-32 (TF32) is the new math mode in [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs.
+
+TF32 Tensor Cores can speed up networks using FP32, typically with no loss of accuracy. It is more robust than FP16 for models which require high dynamic range for weights or activations.
+
+For more information, refer to the [TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x](https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/) blog post.
 
+TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
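To build intuition for why TF32 typically matches FP32 accuracy, one can simulate its reduced storage precision in pure Python. This is a rough sketch under stated assumptions: TF32 keeps FP32's 8-bit exponent but only 10 explicit mantissa bits, and we truncate here for simplicity where real hardware rounds:

```python
import struct

def to_tf32(x):
    """Simulate TF32 input precision: keep FP32's exponent, drop the low
    13 of its 23 mantissa bits (truncation; real TF32 rounds)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits &= 0xFFFFE000  # clear the 13 least-significant mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

Exactly representable values pass through unchanged, and any value is preserved to about 3 decimal digits, which is why weights and activations rarely notice the difference while the dynamic range stays identical to FP32.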
 
 ### Glossary
 
@@ -195,7 +218,7 @@ The lowest vibration frequency of a periodic soundwave, for example, produced by
 **Pitch**
 A perceived frequency of vibration of music or sound.
 
-**Transformer**  
+**Transformer**
 The paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762) introduces a novel architecture called Transformer, which repeatedly applies the attention mechanism. It transforms one sequence into another.
 
 ## Setup
@@ -206,20 +229,24 @@ The following section lists the requirements that you need to meet in order to s
 
 This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
 -   [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
--   [PyTorch 20.03-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
+-   [PyTorch 20.06-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
 or newer
--   [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
+- supported GPUs:
+    - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
+    - [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
+    - [NVIDIA Ampere architecture](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)
+
 
 For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
 -   [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html)
 -   [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry)
 -   [Running PyTorch](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html#running)
-  
+
 For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
 
 ## Quick Start Guide
 
-To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the FastPitch model on the LJSpeech 1.1 dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section.
+To train your model using mixed or TF32 precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the FastPitch model on the LJSpeech 1.1 dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section.
 
 1. Clone the repository.
    ```bash
@@ -229,7 +256,7 @@ To train your model using mixed precision with Tensor Cores or using FP32, perfo
 
 2. Build and run the FastPitch PyTorch NGC container.
 
-   By default the container will use the first available GPU. Modify the script to include other available devices.
+   By default the container will use all available GPUs.
    ```bash
    bash scripts/docker/build.sh
    bash scripts/docker/interactive.sh
@@ -263,7 +290,7 @@ To train your model using mixed precision with Tensor Cores or using FP32, perfo
    ```
    The training will produce a FastPitch model capable of generating mel-spectrograms from raw text.
    It will be serialized as a single `.pt` checkpoint file, along with a series of intermediate checkpoints.
-   The script is configured for 8x GPU with 16GB of memory. Consult [Training process](#training-process) and [example configs](#-training-performance-benchmark) to adjust to a different configuration.
+   The script is configured for 8x GPU with at least 16GB of memory. Consult [Training process](#training-process) and [example configs](#training-performance-benchmark) to adjust to a different configuration or to enable Automatic Mixed Precision.
 
 5. Start validation/evaluation.
 
@@ -281,9 +308,10 @@ To train your model using mixed precision with Tensor Cores or using FP32, perfo
    You can perform inference using the respective `.pt` checkpoints that are passed as `--fastpitch`
    and `--waveglow` arguments:
    ```bash
-   python inference.py --cuda --wn-channels 256 --amp-run \
+   python inference.py --cuda \
                        --fastpitch output/<FastPitch checkpoint> \
-                       --waveglow pretrained_models/waveglow/<waveglow checkpoint> \
+                       --waveglow pretrained_models/waveglow/<WaveGlow checkpoint> \
+                       --wn-channels 256 \
                        -i phrases/devset10.tsv \
                        -o output/wavs_devset10
    ```
@@ -293,7 +321,7 @@ To train your model using mixed precision with Tensor Cores or using FP32, perfo
    `<output wav file name>|<utterance>`
    ```
 To run
-   inference in mixed precision, use the `--amp-run` flag. The output audio will
+   inference in mixed precision, use the `--amp` flag. The output audio will
    be stored in the path specified by the `-o` argument. Consult the `inference.py` to learn more options, such as setting the batch size.
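The `<output wav file name>|<utterance>` phrase-file format above is simple enough to sketch a parser for (the helper name `parse_phrases` is ours, not part of the repository):

```python
def parse_phrases(lines):
    """Split each non-empty line of a phrases file into
    (output wav file name, utterance); '|' is the field separator."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, utterance = line.split("|", 1)
        pairs.append((name, utterance))
    return pairs
```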
 
 ## Advanced
@@ -314,7 +342,7 @@ given model
 
 The common scripts contain layer definitions common to both models
 (`common/layers.py`), some utility scripts (`common/utils.py`) and scripts
-for audio processing (`common/audio_processing.py` and `common/stft.py`). 
+for audio processing (`common/audio_processing.py` and `common/stft.py`).
 
 In the root directory `./` of this repository, the `./train.py` script is used for
 training while inference can be executed with the `./inference.py` script. The
@@ -332,7 +360,7 @@ together with their default values that are used to train FastPitch.
 * `--epochs` - number of epochs (default: 1500)
 * `--learning-rate` - learning rate (default: 0.1)
 * `--batch-size` - batch size (default: 32)
-* `--amp-run` - use mixed precision training
+* `--amp` - use mixed precision training (default: disabled)
 
 * `--pitch-predictor-loss-scale` - rescale the loss of the pitch predictor module to dampen
 its influence on the shared feedforward transformer blocks
@@ -404,7 +432,7 @@ Follow these steps to use datasets different from the default LJSpeech dataset.
    ```
 
 2. Prepare filelists with transcripts and paths to .wav files. They define training/validation split of the data (test is currently unused):
-   ```
+   ```bash
    ./filelists
    ├── my_dataset_mel_ali_pitch_text_train_filelist.txt
    └── my_dataset_mel_ali_pitch_text_val_filelist.txt
@@ -441,7 +469,7 @@ In order to use the prepared dataset, pass the following to the `train.py` scrip
    --training-files ./filelists/my_dataset_mel_ali_pitch_text_train_filelist.txt \
   --validation-files ./filelists/my_dataset_mel_ali_pitch_text_val_filelist.txt
    ```
-   
+
 ### Training process
 
 FastPitch is trained to generate mel-spectrograms from raw text input. It uses short time Fourier transform (STFT)
@@ -453,14 +481,19 @@ reported in total output mel-spectrogram frames per second and recorded as `trai
 The result is averaged over an entire training epoch and summed over all GPUs that were
 included in the training.
 
-The `scripts/train.sh` script is configured for 8x GPU with 16GB of memory:
-    ```
+The `scripts/train.sh` script is configured for 8x GPU with at least 16GB of memory:
+    ```bash
     --batch-size 32
     --gradient-accumulation-steps 1
     ```
-In a single accumulated step, there are `batch_size x gradient_accumulation_steps x #GPUs = 256` examples being processed in parallel. With a smaller number of GPUs, increase `--gradient_accumulation_steps` to keep this relation satisfied.
-
-Even though the training script uses all available GPUs, you can select the devices with the `CUDA_VISIBLE_DEVICES` environmental variable (see this [CUDA Pro Tip](https://devblogs.nvidia.com/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/) for more details).
+In a single accumulated step, there are `batch_size x gradient_accumulation_steps x GPUs = 256` examples being processed in parallel. With a smaller number of GPUs, increase `--gradient_accumulation_steps` to keep this relation satisfied, e.g., through env variables
+    ```bash
+    NGPU=4 GRAD_ACC=2 bash scripts/train.sh
+    ```
+With automatic mixed precision (AMP), a larger batch size fits in 16GB of memory:
+    ```bash
+    NGPU=4 GRAD_ACC=1 BS=64 AMP=true bash scripts/train.sh
+    ```
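The `batch_size x gradient_accumulation_steps x GPUs = 256` relation can be captured in a small helper that derives `GRAD_ACC` for a given hardware setup (illustrative only; the training scripts do not ship such a function):

```python
def grad_accumulation_steps(target_batch, per_gpu_batch, num_gpus):
    """Return the GRAD_ACC value that keeps
    per_gpu_batch * GRAD_ACC * num_gpus == target_batch."""
    per_step = per_gpu_batch * num_gpus
    if target_batch % per_step != 0:
        raise ValueError("global batch must divide evenly across GPUs and steps")
    return target_batch // per_step
```

For the README's configurations: 8 GPUs at batch 32 need no accumulation, 4 GPUs at batch 32 need 2 steps, and a single GPU at AMP batch 64 needs 4.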
 
 ### Inference process
 
@@ -476,9 +509,6 @@ bash scripts/inference_example.sh
 Examine the `inference_example.sh` script to adjust paths to pre-trained models,
 and call `python inference.py --help` to learn all available options.
 By default, synthesized audio samples are saved in `./output/audio_*` folders.
-The audio files <a href=”./audio/sample_fp16.wav>sample_fp16.wav</a> and <a href=”./audio/sample_fp32.wav>sample_fp32.wav</a>
- were generated using checkpoints from
-mixed precision and FP32 training, respectively.
 
 FastPitch allows us to linearly adjust the pace of synthesized speech like [FastSpeech](https://arxiv.org/abs/1905.09263).
 For instance, pass `--pace 0.5` for a twofold decrease in speed.
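Pace control of this kind amounts to rescaling the predicted per-token durations before upsampling; a toy sketch of the arithmetic (not the repository's implementation):

```python
def scale_durations(durations, pace):
    """Divide each predicted per-token frame count by `pace`:
    pace 0.5 doubles every duration (slower speech), pace 2.0 halves it.
    Every token keeps at least one frame."""
    return [max(1, round(d / pace)) for d in durations]
```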
@@ -488,6 +518,7 @@ Pitch can be adjusted by transforming those pitch cues. A few simple examples ar
 
 | Transformation                              | Flag                          | Samples                                 |
 | :-------------------------------------------|:------------------------------|:---------------------------------------:|
+| -                                           | -                             | [link](./audio/sample_fp16.wav)         |
 | Amplify pitch wrt. to the mean pitch        |`--pitch-transform-amplify`    | [link](./audio/sample_fp16_amplify.wav) |
 | Invert pitch wrt. to the mean pitch         |`--pitch-transform-invert`     | [link](./audio/sample_fp16_invert.wav)  |
 | Raise/lower pitch by <hz>                   |`--pitch-transform-shift <hz>` | [link](./audio/sample_fp16_shift.wav)   |
@@ -497,6 +528,7 @@ Pitch can be adjusted by transforming those pitch cues. A few simple examples ar
 The flags can be combined. Modify these functions directly in the `inference.py` script to gain more control over the final result.
 
 You can find all the available options by calling `python inference.py --help`.
+More examples are presented on the website with [samples](https://fastpitch.github.io/).
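The pitch-transform flags in the table boil down to simple arithmetic on the predicted pitch contour relative to its mean. The actual functions live in `inference.py`; toy equivalents (function names ours) look like:

```python
def _mean(pitch_hz):
    return sum(pitch_hz) / len(pitch_hz)

def shift_pitch(pitch_hz, hz):
    # --pitch-transform-shift <hz>: raise/lower every value by a constant.
    return [p + hz for p in pitch_hz]

def invert_pitch(pitch_hz):
    # --pitch-transform-invert: mirror the contour around the mean pitch.
    m = _mean(pitch_hz)
    return [2 * m - p for p in pitch_hz]

def amplify_pitch(pitch_hz, factor=2.0):
    # --pitch-transform-amplify: exaggerate deviations from the mean pitch.
    m = _mean(pitch_hz)
    return [m + factor * (p - m) for p in pitch_hz]
```

Because each transform is a pure function of the contour, they compose, which is why the README notes that the flags can be combined.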
 
 ## Performance
 
@@ -509,103 +541,21 @@ performance in training and inference mode.
 
 To benchmark the training performance on a specific batch size, run:
 
-**FastPitch**
-
-* For 1 GPU
-    * FP16
-        ```bash
-        python train.py \
-            --amp-run \
-            --batch-size 64 \
-            --gradient-accumulation-steps 4 \
-            --cuda \
-            --cudnn-enabled \
-            -o <output-dir> \
-            --log-file <output-dir>/nvlog.json \
-            --dataset-path <dataset-path> \
-            --training-files <train-filelist-path> \
-            --validation-files <val-filelist-path> \
-            --pitch-mean-std-file <pitch-stats-path> \
-            --epochs 10 \
-            --warmup-steps 1000 \
-            -lr 0.1 \
-            --optimizer lamb \
-            --gad-clip-thresh 1000.0 \
-            --dur-predictor-loss-scale 0.1 \
-            --pitch-predictor-loss-scale 0.1 \
-            --weight-decay 1e-6
-        ```
-
-    * FP32
-        ```bash
-        python train.py \
-            --batch-size 32 \
-            --gradient-accumulation-steps 8 \
-            --cuda \
-            --cudnn-enabled \
-            -o <output-dir> \
-            --log-file <output-dir>/nvlog.json \
-            --dataset-path <dataset-path> \
-            --training-files <train-filelist-path> \
-            --validation-files <val-filelist-path> \
-            --pitch-mean-std-file <pitch-stats-path> \
-            --epochs 10 \
-            --warmup-steps 1000 \
-            -lr 0.1 \
-            --optimizer lamb \
-            --grad-clip-thresh 1000.0 \
-            --dur-predictor-loss-scale 0.1 \
-            --pitch-predictor-loss-scale 0.1 \
-            --weight-decay 1e-6
-        ```
-
-* For multiple GPUs
-    * FP16
-        ```bash
-        python -m multiproc train.py \
-            --amp-run \
-            --batch-size 32 \
-            --gradient-accumulation-steps 1 \
-            --cuda \
-            --cudnn-enabled \
-            -o <output-dir> \
-            --log-file <output-dir>/nvlog.json \
-            --dataset-path <dataset-path> \
-            --training-files <train-filelist-path> \
-            --validation-files <val-filelist-path> \
-            --pitch-mean-std-file <pitch-stats-path> \
-            --epochs 10 \
-            --warmup-steps 1000 \
-            -lr 0.1 \
-            --optimizer lamb \
-            --grad-clip-thresh 1000.0 \
-            --dur-predictor-loss-scale 0.1 \
-            --pitch-predictor-loss-scale 0.1 \
-            --weight-decay 1e-6
-        ```
+* NVIDIA DGX A100 (8x A100 40GB)
+    ```bash
+        AMP=true NGPU=1 BS=128 GRAD_ACC=2 EPOCHS=10 bash scripts/train.sh
+        AMP=true NGPU=8 BS=32 GRAD_ACC=1 EPOCHS=10 bash scripts/train.sh
+        NGPU=1 BS=128 GRAD_ACC=2 EPOCHS=10 bash scripts/train.sh
+        NGPU=8 BS=32 GRAD_ACC=1 EPOCHS=10 bash scripts/train.sh
+    ```
 
-    * FP32
-        ```bash
-        python -m multiproc train.py \
-            --batch-size 32 \
-            --gradient-accumulation-steps 1
-            --cuda \
-            --cudnn-enabled \
-            -o <output-dir> \
-            --log-file <output-dir>/nvlog.json \
-            --dataset-path <dataset-path> \
-            --training-files <train-filelist-path> \
-            --validation-files <val-filelist-path> \
-            --pitch-mean-std-file <pitch-stats-path> \
-            --epochs 10 \
-            --warmup-steps 1000 \
-            -lr 0.1 \
-            --optimizer lamb \
-            --grad-clip-thresh 1000.0 \
-            --dur-predictor-loss-scale 0.1 \
-            --pitch-predictor-loss-scale 0.1 \
-            --weight-decay 1e-6
-        ```
+* NVIDIA DGX-1 (8x V100 16GB)
+    ```bash
+        AMP=true NGPU=1 BS=64 GRAD_ACC=4 EPOCHS=10 bash scripts/train.sh
+        AMP=true NGPU=8 BS=32 GRAD_ACC=1 EPOCHS=10 bash scripts/train.sh
+        NGPU=1 BS=32 GRAD_ACC=8 EPOCHS=10 bash scripts/train.sh
+        NGPU=8 BS=32 GRAD_ACC=1 EPOCHS=10 bash scripts/train.sh
+    ```
 
 Each of these scripts runs for 10 epochs and for each epoch measures the
 average number of items per second. The performance results can be read from
@@ -616,36 +566,21 @@ the `nvlog.json` files produced by the commands.
 To benchmark the inference performance on a specific batch size, run:
 
 * For FP16
-    ```
-    python inference.py --cuda --amp-run \
-                         --fastpitch output/checkpoint_FastPitch_1500.pt \
-                         --waveglow pretrained_models/waveglow/waveglow_256channels_ljs_v3.pt \
-                         --wn-channels 256 \
-                         --include-warmup \
-                         --batch-size 1 \
-                         --repeats 1000 \
-                         --input phrases/benchmark_8_128.tsv \
-                         --log-file output/nvlog_inference.json
+    ```bash
+    AMP=true BS_SEQ="1 4 8" REPEATS=100 bash scripts/inference_benchmark.sh
     ```
 
-* For FP32
-    ```
-    python inference.py --cuda \
-                         --fastpitch output/checkpoint_FastPitch_1500.pt \
-                         --waveglow pretrained_models/waveglow/waveglow_256channels_ljs_v3.pt \
-                         --wn-channels 256 \
-                         --include-warmup \
-                         --batch-size 1 \
-                         --pitch \
-                         --repeats 1000 \
-                         --input phrases/benchmark_8_128.tsv \
-                         --log-file output/nvlog_inference.json
+* For FP32 or TF32
+    ```bash
+    BS_SEQ="1 4 8" REPEATS=100 bash scripts/inference_benchmark.sh
     ```
 
 The output log files will contain performance numbers for the FastPitch model
-(number of output mel-spectrogram frames per second, reported as `generator_frames/s`)
-and for WaveGlow (number of output samples per second, reported as ` waveglow_samples/s`).
-The `inference.py` script will run a few warm-up iterations before running the benchmark. Inference will be averaged over 100 runs, as set by the `--repeats` flag.
+(number of output mel-spectrogram frames per second, reported as `generator_frames/s`)
+and for WaveGlow (number of output samples per second, reported as `waveglow_samples/s`).
+The `inference.py` script will run a few warm-up iterations before running the benchmark. Inference will be averaged over 100 runs, as set by the `REPEATS` env variable.
 
 ### Results
 
@@ -654,101 +589,144 @@ and accuracy in training and inference.
 
 #### Training accuracy results
 
-##### Training accuracy: NVIDIA DGX-1 (8x V100 16G)
+##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
 
-Our results were obtained by running the `./platform/train_fastpitch_{AMP,FP32}_DGX1_16GB_8GPU.sh` training script in the PyTorch 20.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs.
+Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the 20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs.
+
+| Loss (Model/Epoch)   |    50 |   250 |   500 |   750 |  1000 |  1250 |  1500 |
+|:---------------------|------:|------:|------:|------:|------:|------:|------:|
+| FastPitch AMP        | 0.503 | 0.252 | 0.214 | 0.202 | 0.193 | 0.188 | 0.184 |
+| FastPitch TF32       | 0.500 | 0.252 | 0.215 | 0.201 | 0.193 | 0.187 | 0.183 |
+
+##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
+
+Our results were obtained by running the `./platform/DGX1_FastPitch_{AMP,FP32}_8GPU.sh` training script in the PyTorch 20.06-py3 NGC container on NVIDIA DGX-1 with 8x V100 16GB GPUs.
 
 All of the results were produced using the `train.py` script as described in the
 [Training process](#training-process) section of this document.
 
-| Loss (Model/Epoch)   |      0 |   250 |   500 |   750 |   1000 |   1250 |   1500 |
-|:---------------------|-------:|------:|------:|------:|-------:|-------:|-------:|
-| FastPitch AMP        | 35.094 | 0.254 | 0.216 | 0.201 |  0.193 |  0.187 |  0.184 |
-| FastPitch FP32       | 35.108 | 0.254 | 0.216 | 0.200 |  0.194 |  0.188 |  0.184 |
+| Loss (Model/Epoch)   |    50 |   250 |   500 |   750 |  1000 |  1250 |  1500 |
+|:---------------------|------:|------:|------:|------:|------:|------:|------:|
+| FastPitch AMP        | 0.499 | 0.250 | 0.211 | 0.198 | 0.190 | 0.184 | 0.180 |
+| FastPitch FP32       | 0.503 | 0.251 | 0.214 | 0.201 | 0.192 | 0.186 | 0.182 |
+
+
+<div style="text-align:center" align="center">
+  <img src="./img/loss.png" alt="Loss curves" />
+</div>
+
 
 
-<p align="center">
-  <img src="./img/loss_fp16.png" alt="AMP loss curve" />
-  <img src="./img/loss_fp32.png" alt="FP32 loss curve" />
-</p>
 
 #### Training performance results
 
-##### Training performance: NVIDIA DGX-1 (8x V100 16G)
+##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
+
+Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the 20.06-py3 NGC container on NVIDIA DGX A100 (8x A100 40GB) GPUs. Performance numbers, in output mel-scale spectrogram frames per second, were averaged over
+an entire training epoch.
+
+|Number of GPUs|Batch size per GPU|Frames/s with mixed precision|Frames/s with TF32|Speed-up with mixed precision|Multi-GPU strong scaling with mixed precision|Multi-GPU strong scaling with TF32|
+|---:|------------------:|--------:|-------:|-----:|-----:|-----:|
+|  1 | 128@AMP, 128@TF32 |  164955 | 113725 | 1.45 | 1.00 | 1.00 |
+|  4 |  64@AMP,  64@TF32 |  619527 | 435951 | 1.42 | 3.76 | 3.83 |
+|  8 |  32@AMP,  32@TF32 | 1040206 | 643569 | 1.62 | 6.31 | 5.66 |
+
+###### Expected training time
 
-Our results were obtained by running the `./platform/train_fastpitch_{AMP,FP32}_DGX1_16GB_8GPU.sh`
-training script in the PyTorch 20.03-py3 NGC container on NVIDIA DGX-1 with
-8x V100 16G GPUs. Performance numbers, in output mel-spectrograms per second, were averaged over
+The following table shows the expected training time for convergence for 1500 epochs:
+
+|Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with TF32 (Hrs)|Speed-up with mixed precision|
+|---:|-----------------:|-----:|-----:|-----:|
+|  1 |128@AMP, 128@TF32 | 18.5 | 26.6 | 1.44 |
+|  4 | 64@AMP,  64@TF32 |  5.5 |  7.5 | 1.36 |
+|  8 | 32@AMP,  32@TF32 |  3.6 |  5.3 | 1.47 |
+
+##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
+
+Our results were obtained by running the `./platform/DGX1_FastPitch_{AMP,FP32}_8GPU.sh`
+training script in the PyTorch 20.06-py3 NGC container on NVIDIA DGX-1 with
+8x V100 16GB GPUs. Performance numbers, in output mel-scale spectrogram frames per second, were averaged over
 an entire training epoch.
 
-|Number of GPUs|Batch size per GPU|Number of mels used with AMP|Number of mels used with FP32|Speed-up with AMP|Multi-GPU strong scaling with AMP|Multi-GPU strong scaling with FP32|
-|---:|---------------:|-------:|-------:|-----:|-----:|-----:|
-|  1 |64@AMP, 32@FP32 | 109769 |  40636 | 2.70 | 1.00 | 1.00 |
-|  4 |64@AMP, 32@FP32 | 361195 | 150921 | 2.39 | 3.29 | 3.71 |
-|  8 |32@AMP, 32@FP32 | 562136 | 278778 | 2.02 | 5.12 | 6.86 |
+|Number of GPUs|Batch size per GPU|Frames/s with mixed precision|Frames/s with FP32|Speed-up with mixed precision|Multi-GPU strong scaling with mixed precision|Multi-GPU strong scaling with FP32|
+|---:|----------------:|-------:|-------:|-----:|-----:|-----:|
+|  1 | 64@AMP, 32@FP32 | 110370 |  41066 | 2.69 | 1.00 | 1.00 |
+|  4 | 64@AMP, 32@FP32 | 402368 | 153853 | 2.62 | 3.65 | 3.75 |
+|  8 | 32@AMP, 32@FP32 | 570968 | 296767 | 1.92 | 5.17 | 7.23 |
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
-##### Expected training time
+###### Expected training time
 
 The following table shows the expected training time for convergence for 1500 epochs:
 
-|Number of GPUs|Batch size per GPU|Time to train with AMP (Hrs)|Time to train with FP32 (Hrs)|Speed-up with AMP|
-|---:|---------------:|-----:|-----:|-----:|
-|  1 |64@AMP, 32@FP32 | 27.0 | 73.7 | 2.73 |
-|  4 |64@AMP, 32@FP32 |  8.4 | 19.7 | 2.36 |
-|  8 |32@AMP, 32@FP32 |  5.5 | 10.8 | 1.97 |
-
+|Number of GPUs|Batch size per GPU|Time to train with mixed precision (Hrs)|Time to train with FP32 (Hrs)|Speed-up with mixed precision|
+|---:|-----------------:|-----:|-----:|-----:|
+|  1 | 64@AMP,  32@FP32 | 27.6 | 72.7 | 2.63 |
+|  4 | 64@AMP,  32@FP32 |  8.2 | 20.3 | 2.48 |
+|  8 | 32@AMP,  32@FP32 |  5.9 | 10.9 | 1.85 |
 
 Note that most of the quality is achieved after the initial 500 epochs.
 
 #### Inference performance results
 
 The following tables show inference statistics for the FastPitch and WaveGlow
-text-to-speech system, gathered from 1000 inference runs, on a single V100 and a single T4,
-respectively. Latency is measured from the start of FastPitch inference to
+text-to-speech system, gathered from 100 inference runs. Latency is measured from the start of FastPitch inference to
 the end of WaveGlow inference. Throughput is measured
 as the number of generated audio samples per second at 22 kHz. RTF is the real-time factor, which denotes the number of seconds of speech generated in one second of wall-clock time, per input utterance.
-The used WaveGlow model is a 256-channel model [published on NGC](https://ngc.nvidia.com/catalog/models/nvidia:waveglow_ljs_256channels).
+The WaveGlow model used is a 256-channel model.
 
-Our results were obtained by running the `./scripts/inference_benchmark.sh` script in
-the PyTorch 20.03-py3 NGC container. Note that to reproduce the results,
-you need to provide pre-trained checkpoins for FastPitch and WaveGlow. Edit the script to provide your checkpoint filenames.
+Note that performance numbers depend on the length of the input. The numbers reported below were taken with a moderate input length of 128 characters. Longer utterances yield a higher RTF, as the generator is fully parallel.
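
The RTF and throughput columns in the tables below follow directly from measured latency and audio length; a small sketch of the arithmetic, using the 8.05 s utterance from the single-GPU tables (function names are this example's own, and results are approximate since the tables report measured per-run latencies):

```python
SAMPLING_RATE = 22050  # Hz, the 22 kHz output rate used throughout

def real_time_factor(audio_seconds, latency_seconds):
    """Seconds of speech generated per second of wall-clock time, per utterance."""
    return audio_seconds / latency_seconds

def throughput(batch_size, audio_seconds, latency_seconds):
    """Generated audio samples per second across the whole batch."""
    return batch_size * audio_seconds * SAMPLING_RATE / latency_seconds

# DGX-1 V100, FP16, batch 1: 8.05 s of audio synthesized in 0.193 s
rtf = real_time_factor(8.05, 0.193)           # ~41.7
samples_per_sec = throughput(1, 8.05, 0.193)  # ~920,000
```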
+
+##### Inference performance: NVIDIA DGX A100 (1x A100 40GB)
 
-Note that performance numbers are related to the length of input. The numbers reported below were taken with a moderate length of 128 characters. For longer utterances even better numbers are expected, as the generator is fully parallel.
+Our results were obtained by running the `./scripts/inference_benchmark.sh` inferencing benchmarking script in the 20.06-py3 NGC container on NVIDIA DGX A100 (1x A100 40GB) GPU.
 
-##### Inference performance: NVIDIA DGX-1 (1x V100 16G)
+|Batch size|Precision|Avg latency (s)|Latency tolerance interval 90% (s)|Latency tolerance interval 95% (s)|Latency tolerance interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg RTF|
+|------:|------------:|--------------:|--------------:|--------------:|--------------:|----------------:|---------------:|----------:|
+|    1 | FP16   |     0.106 |   0.106 |   0.106 |   0.107 |      1,636,913 |      1.60 | 74.24 |
+|    4 | FP16   |     0.390 |   0.391 |   0.391 |   0.391 |      1,780,764 |      1.55 | 20.19 |
+|    8 | FP16   |     0.758 |   0.758 |   0.758 |   0.758 |      1,832,544 |      1.52 | 10.39 |
+|    1 | TF32   |     0.170 |   0.170 |   0.170 |   0.170 |      1,020,894 |         - | 46.30 |
+|    4 | TF32   |     0.603 |   0.603 |   0.603 |   0.603 |      1,150,598 |         - | 13.05 |
+|    8 | TF32   |     1.153 |   1.154 |   1.154 |   1.154 |      1,202,463 |         - |  6.82 |
+
+##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
+
+Our results were obtained by running the `./scripts/inference_benchmark.sh` script in
the PyTorch 20.06-py3 NGC container. The input utterance has 128 characters; the synthesized audio is 8.05 s long.
 
-The input utterance has 128 characters, synthesized audio has 8.05 s.
 
 |Batch size|Precision|Avg latency (s)|Latency tolerance interval 90% (s)|Latency tolerance interval 95% (s)|Latency tolerance interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg RTF|
 |------:|------------:|--------------:|--------------:|--------------:|--------------:|----------------:|---------------:|----------:|
-|     1 | FP16        |         0.253 |         0.254 |         0.255 |         0.255 | 702,735         | 1.51           |     31.87 |
-|     4 | FP16        |         0.572 |         0.575 |         0.575 |         0.576 | 1,243,094       | 2.55           |     14.09 |
-|     8 | FP16        |         1.118 |         1.121 |         1.121 |         1.123 | 1,269,479       | 2.70           |      7.20 |
-|     1 | FP32        |         0.382 |         0.384 |         0.384 |         0.385 | 464,920         | -              |     21.08 |
-|     4 | FP32        |         1.458 |         1.461 |         1.461 |         1.462 | 486,756         | -              |      5.52 |
-|     8 | FP32        |         3.015 |         3.023 |         3.024 |         3.027 | 470,741         | -              |      2.67 |
-
+|    1 | FP16   |     0.193 |   0.194 |   0.194 |   0.194 |       902,960 |      2.35 | 40.95 |
+|    4 | FP16   |     0.610 |   0.613 |   0.613 |   0.614 |     1,141,207 |      2.78 | 12.94 |
+|    8 | FP16   |     1.157 |   1.161 |   1.161 |   1.162 |     1,201,684 |      2.68 |  6.81 |
+|    1 | FP32   |     0.453 |   0.455 |   0.456 |   0.457 |       385,027 |         - | 17.46 |
+|    4 | FP32   |     1.696 |   1.703 |   1.705 |   1.707 |       411,124 |         - |  4.66 |
+|    8 | FP32   |     3.111 |   3.118 |   3.120 |   3.122 |       448,275 |         - |  2.54 |
 
 ##### Inference performance: NVIDIA T4
 
+Our results were obtained by running the `./scripts/inference_benchmark.sh` script in
+the PyTorch 20.06-py3 NGC container.
The input utterance has 128 characters; the synthesized audio is 8.05 s long.
 
 |Batch size|Precision|Avg latency (s)|Latency tolerance interval 90% (s)|Latency tolerance interval 95% (s)|Latency tolerance interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg RTF|
-|------:|------------:|--------------:|--------------:|--------------:|--------------:|----------------:|---------------:|----------:|
-|     1 | FP16        |         0.952 |         0.958 |         0.960 |         0.962 | 186,349         | 1.30           |      8.45 |
-|     4 | FP16        |         4.187 |         4.209 |         4.213 |         4.221 | 169,473         | 1.21           |      1.92 |
-|     8 | FP16        |         7.799 |         7.824 |         7.829 |         7.839 | 181,978         | 1.38           |      1.03 |
-|     1 | FP32        |         1.238 |         1.245 |         1.247 |         1.250 | 143,292         | -              |      6.50 |
-|     4 | FP32        |         5.083 |         5.109 |         5.114 |         5.124 | 139,613         | -              |      1.58 |
-|     8 | FP32        |        10.756 |        10.797 |        10.805 |        10.820 | 131,951         | -              |      0.75 |
-
+|-----:|-------:|----------:|--------:|--------:|--------:|-------------:|----------:|------:|
+|    1 | FP16   |     0.533 |   0.540 |   0.541 |   0.543 |      326,471 |      2.56 | 14.81 |
+|    4 | FP16   |     2.292 |   2.302 |   2.304 |   2.308 |      304,283 |      2.38 |  3.45 |
+|    8 | FP16   |     4.564 |   4.578 |   4.580 |   4.585 |      305,568 |      1.99 |  1.73 |
+|    1 | FP32   |     1.365 |   1.383 |   1.387 |   1.394 |      127,765 |         - |  5.79 |
+|    4 | FP32   |     5.192 |   5.214 |   5.218 |   5.226 |      134,309 |         - |  1.52 |
|    8 | FP32   |     9.090 |   9.110 |   9.114 |   9.122 |      153,434 |         - |  0.87 |
 
 ## Release notes
 
 ### Changelog
 
+June 2020
+- Updated performance tables to include A100 results
+
 May 2020
 - Initial release
 

BIN
PyTorch/SpeechSynthesis/FastPitch/audio/sample_fp32.wav


+ 15 - 0
PyTorch/SpeechSynthesis/FastPitch/common/log_helper.py

@@ -1,5 +1,7 @@
 import atexit
+import glob
 import os
+import re
 import numpy as np
 
 from tensorboardX import SummaryWriter
@@ -8,6 +10,19 @@ import dllogger as DLLogger
 from dllogger import StdOutBackend, JSONStreamBackend, Verbosity
 
 
+def unique_dllogger_fpath(log_fpath):
+
+    if not os.path.isfile(log_fpath):
+        return log_fpath
+
+    # Avoid overwriting old logs
+    saved = sorted([int(re.search(r'\.(\d+)', f).group(1))
+                    for f in glob.glob(f'{log_fpath}.*')])
+
+    log_num = (saved[-1] if saved else 0) + 1
+    return f'{log_fpath}.{log_num}'
+
+
 def stdout_step_format(step):
     if isinstance(step, str):
         return step
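
The `unique_dllogger_fpath` helper added above avoids clobbering an existing log by appending the next free numeric suffix. Restated standalone (with the regex as a raw string), its collision behavior can be exercised on throwaway temp files:

```python
import glob
import os
import re
import tempfile

def unique_dllogger_fpath(log_fpath):
    if not os.path.isfile(log_fpath):
        return log_fpath
    # Avoid overwriting old logs: find the highest existing numeric suffix
    saved = sorted(int(re.search(r'\.(\d+)', f).group(1))
                   for f in glob.glob(f'{log_fpath}.*'))
    log_num = (saved[-1] if saved else 0) + 1
    return f'{log_fpath}.{log_num}'

with tempfile.TemporaryDirectory() as d:
    base = os.path.join(d, 'nvlog.json')
    print(unique_dllogger_fpath(base))  # base itself: no file exists yet
    open(base, 'w').close()
    print(unique_dllogger_fpath(base))  # base + '.1'
    open(base + '.1', 'w').close()
    print(unique_dllogger_fpath(base))  # base + '.2': counts past old logs
```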

+ 6 - 5
PyTorch/SpeechSynthesis/FastPitch/common/stft.py

@@ -108,11 +108,12 @@ class STFT(torch.nn.Module):
         recombine_magnitude_phase = torch.cat(
             [magnitude*torch.cos(phase), magnitude*torch.sin(phase)], dim=1)
 
-        inverse_transform = F.conv_transpose1d(
-            recombine_magnitude_phase,
-            Variable(self.inverse_basis, requires_grad=False),
-            stride=self.hop_length,
-            padding=0)
+        with torch.no_grad():
+            inverse_transform = F.conv_transpose2d(
+                recombine_magnitude_phase.unsqueeze(-1),
+                self.inverse_basis.unsqueeze(-1),
+                stride=self.hop_length,
+                padding=0).squeeze(-1)
 
         if self.window is not None:
             window_sum = window_sumsquare(
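
Whether done with the old `conv_transpose1d` or the `conv_transpose2d`-with-a-unit-width form above, the transposed convolution in `inverse` performs an overlap-add of the reconstructed frames. A minimal pure-Python sketch of that operation (the function name is this example's own):

```python
def overlap_add(frames, hop_length):
    """Sum equal-length frames into a signal, each placed at t * hop_length.

    This is the computation the transposed convolution performs when
    reconstructing audio from STFT frames.
    """
    frame_len = len(frames[0])
    out = [0.0] * ((len(frames) - 1) * hop_length + frame_len)
    for t, frame in enumerate(frames):
        for k, sample in enumerate(frame):
            out[t * hop_length + k] += sample
    return out

# Two length-4 frames with hop 2 overlap on their middle two samples:
overlap_add([[1, 1, 1, 1], [1, 1, 1, 1]], hop_length=2)
# -> [1.0, 1.0, 2.0, 2.0, 1.0, 1.0]
```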

+ 2 - 2
PyTorch/SpeechSynthesis/FastPitch/export_torchscript.py

@@ -37,7 +37,7 @@ def parse_args(parser):
                         help='full path to the generator checkpoint file')
     parser.add_argument('-o', '--output', type=str, default="trtis_repo/tacotron/1/model.pt",
                         help='filename for the Tacotron 2 TorchScript model')
-    parser.add_argument('--amp-run', action='store_true',
+    parser.add_argument('--amp', action='store_true',
                         help='inference with AMP')
     return parser
 
@@ -49,7 +49,7 @@ def main():
 
     model = load_and_setup_model(
         args.generator_name, parser, args.generator_checkpoint,
-        args.amp_run, device='cpu', forward_is_infer=True, polyak=False,
+        args.amp, device='cpu', forward_is_infer=True, polyak=False,
         jitable=True)
     
     torch.jit.save(torch.jit.script(model), args.output)

+ 1 - 1
PyTorch/SpeechSynthesis/FastPitch/extract_mels.py

@@ -194,7 +194,7 @@ def main():
         DLLogger.log(step="PARAMETER", data={k:v})
 
     model = load_and_setup_model(
-        'Tacotron2', parser, args.tacotron2_checkpoint, amp_run=False,
+        'Tacotron2', parser, args.tacotron2_checkpoint, amp=False,
         device=torch.device('cuda' if args.cuda else 'cpu'),
         forward_is_infer=False, ema=False)
 

BIN
PyTorch/SpeechSynthesis/FastPitch/img/loss.png


+ 37 - 29
PyTorch/SpeechSynthesis/FastPitch/inference.py

@@ -40,10 +40,10 @@ from scipy.io.wavfile import write
 from torch.nn.utils.rnn import pad_sequence
 
 import dllogger as DLLogger
-from apex import amp
 from dllogger import StdOutBackend, JSONStreamBackend, Verbosity
 
 from common import utils
+from common.log_helper import unique_dllogger_fpath
 from common.text import text_to_sequence
 from waveglow import model as glow
 from waveglow.denoiser import Denoiser
@@ -59,8 +59,8 @@ def parse_args(parser):
                        help='Full path to the input text (phrases separated by newlines)')
     parser.add_argument('-o', '--output', default=None,
                         help='Output folder to save audio (file per phrase)')
-    parser.add_argument('--log-file', type=str, default='nvlog.json',
-                        help='Filename for logging')
+    parser.add_argument('--log-file', type=str, default=None,
+                        help='Path to a DLLogger log file')
     parser.add_argument('--cuda', action='store_true',
                         help='Run inference on a GPU using CUDA')
     parser.add_argument('--fastpitch', type=str,
@@ -75,7 +75,7 @@ def parse_args(parser):
                         help='Sampling rate')
     parser.add_argument('--stft-hop-length', type=int, default=256,
                         help='STFT hop length for estimating audio length from mel size')
-    parser.add_argument('--amp-run', action='store_true',
+    parser.add_argument('--amp', action='store_true',
                         help='Inference with AMP')
     parser.add_argument('--batch-size', type=int, default=64)
     parser.add_argument('--include-warmup', action='store_true',
@@ -105,7 +105,7 @@ def parse_args(parser):
     return parser
 
 
-def load_and_setup_model(model_name, parser, checkpoint, amp_run, device,
+def load_and_setup_model(model_name, parser, checkpoint, amp, device,
                          unk_args=[], forward_is_infer=False, ema=True,
                          jitable=False):
     model_parser = models.parse_model_args(model_name, parser, add_help=False)
@@ -139,7 +139,7 @@ def load_and_setup_model(model_name, parser, checkpoint, amp_run, device,
 
     if model_name == "WaveGlow":
         model = model.remove_weightnorm(model)
-    if amp_run:
+    if amp:
         model.half()
     model.eval()
     return model.to(device)
@@ -232,25 +232,28 @@ def main():
     Launches text to speech (inference).
     Inference is executed on a single GPU.
     """
+
+    torch.backends.cudnn.benchmark = True
+
     parser = argparse.ArgumentParser(description='PyTorch FastPitch Inference',
                                      allow_abbrev=False)
     parser = parse_args(parser)
     args, unk_args = parser.parse_known_args()
 
-    DLLogger.init(backends=[JSONStreamBackend(Verbosity.DEFAULT, args.log_file),
-                            StdOutBackend(Verbosity.VERBOSE)])
-    for k,v in vars(args).items():
-        DLLogger.log(step="PARAMETER", data={k:v})
-    DLLogger.log(step="PARAMETER", data={'model_name': 'FastPitch_PyT'})
-
     if args.output is not None:
         Path(args.output).mkdir(parents=False, exist_ok=True)
 
+    log_fpath = args.log_file or str(Path(args.output or '.', 'nvlog_infer.json'))
+    log_fpath = unique_dllogger_fpath(log_fpath)
+    DLLogger.init(backends=[JSONStreamBackend(Verbosity.DEFAULT, log_fpath),
+                            StdOutBackend(Verbosity.VERBOSE)])
+    for k, v in vars(args).items():
+        DLLogger.log(step="PARAMETER", data={k: v})
+
     device = torch.device('cuda' if args.cuda else 'cpu')
 
     if args.fastpitch is not None:
         generator = load_and_setup_model(
-            'FastPitch', parser, args.fastpitch, args.amp_run, device,
+            'FastPitch', parser, args.fastpitch, args.amp, device,
             unk_args=unk_args, forward_is_infer=True, ema=args.ema,
             jitable=args.torchscript)
 
@@ -263,7 +266,7 @@ def main():
         with warnings.catch_warnings():
             warnings.simplefilter("ignore")
             waveglow = load_and_setup_model(
-                'WaveGlow', parser, args.waveglow, args.amp_run, device,
+                'WaveGlow', parser, args.waveglow, args.amp, device,
                 unk_args=unk_args, forward_is_infer=True, ema=args.ema)
         denoiser = Denoiser(waveglow).to(device)
         waveglow = getattr(waveglow, 'infer', waveglow)
@@ -305,13 +308,14 @@ def main():
     all_frames = 0
 
     reps = args.repeats
-    log_enabled = reps == 1
+    log_enabled = True
     log = lambda s, d: DLLogger.log(step=s, data=d) if log_enabled else None
 
-    for repeat in (tqdm.tqdm(range(reps)) if reps > 1 else range(reps)):
+    for rep in range(reps):
         for b in batches:
             if generator is None:
-                log(0, {'Synthesizing from ground truth mels'})
+                log(rep, {'Synthesizing from ground truth mels'})
                 mel, mel_lens = b['mel'], b['mel_lens']
             else:
                 with torch.no_grad(), gen_measures:
@@ -321,8 +325,8 @@ def main():
                 gen_infer_perf = mel.size(0) * mel.size(2) / gen_measures[-1]
                 all_letters += b['text_lens'].sum().item()
                 all_frames += mel.size(0) * mel.size(2)
-                log(0, {"generator_frames_per_sec": gen_infer_perf})
-                log(0, {"generator_latency": gen_measures[-1]})
+                log(rep, {"fastpitch_frames_per_sec": gen_infer_perf})
+                log(rep, {"fastpitch_latency": gen_measures[-1]})
 
             if waveglow is not None:
                 with torch.no_grad(), waveglow_measures:
@@ -336,8 +340,8 @@ def main():
                 waveglow_infer_perf = (
                     audios.size(0) * audios.size(1) / waveglow_measures[-1])
 
-                log(0, {"waveglow_samples_per_sec": waveglow_infer_perf})
-                log(0, {"waveglow_latency": waveglow_measures[-1]})
+                log(rep, {"waveglow_samples_per_sec": waveglow_infer_perf})
+                log(rep, {"waveglow_latency": waveglow_measures[-1]})
 
                 if args.output is not None and reps == 1:
                     for i, audio in enumerate(audios):
@@ -354,27 +358,31 @@ def main():
                         write(audio_path, args.sampling_rate, audio.cpu().numpy())
 
             if generator is not None and waveglow is not None:
-                log(0, {"latency": (gen_measures[-1] + waveglow_measures[-1])})
+                log(rep, {"latency": (gen_measures[-1] + waveglow_measures[-1])})
 
     log_enabled = True
     if generator is not None:
         gm = np.sort(np.asarray(gen_measures))
-        log('avg', {"generator letters/s": all_letters / gm.sum()})
-        log('avg', {"generator_frames/s": all_frames / gm.sum()})
-        log('avg', {"generator_latency": gm.mean()})
-        log('90%', {"generator_latency": gm.mean() + norm.ppf((1.0 + 0.90) / 2) * gm.std()})
-        log('95%', {"generator_latency": gm.mean() + norm.ppf((1.0 + 0.95) / 2) * gm.std()})
-        log('99%', {"generator_latency": gm.mean() + norm.ppf((1.0 + 0.99) / 2) * gm.std()})
+        rtf = all_samples / (all_utterances * gm.mean() * args.sampling_rate)
+        log('avg', {"fastpitch letters/s": all_letters / gm.sum()})
+        log('avg', {"fastpitch_frames/s": all_frames / gm.sum()})
+        log('avg', {"fastpitch_latency": gm.mean()})
+        log('avg', {"fastpitch RTF": rtf})
+        log('90%', {"fastpitch_latency": gm.mean() + norm.ppf((1.0 + 0.90) / 2) * gm.std()})
+        log('95%', {"fastpitch_latency": gm.mean() + norm.ppf((1.0 + 0.95) / 2) * gm.std()})
+        log('99%', {"fastpitch_latency": gm.mean() + norm.ppf((1.0 + 0.99) / 2) * gm.std()})
     if waveglow is not None:
         wm = np.sort(np.asarray(waveglow_measures))
+        rtf = all_samples / (all_utterances * wm.mean() * args.sampling_rate)
         log('avg', {"waveglow_samples/s": all_samples / wm.sum()})
         log('avg', {"waveglow_latency": wm.mean()})
+        log('avg', {"waveglow RTF": rtf})
         log('90%', {"waveglow_latency": wm.mean() + norm.ppf((1.0 + 0.90) / 2) * wm.std()})
         log('95%', {"waveglow_latency": wm.mean() + norm.ppf((1.0 + 0.95) / 2) * wm.std()})
         log('99%', {"waveglow_latency": wm.mean() + norm.ppf((1.0 + 0.99) / 2) * wm.std()})
     if generator is not None and waveglow is not None:
         m = gm + wm
-        rtf = all_samples / (len(batches) * all_utterances * m.mean() * args.sampling_rate)
+        rtf = all_samples / (all_utterances * m.mean() * args.sampling_rate)
         log('avg', {"samples/s": all_samples / m.sum()})
         log('avg', {"letters/s": all_letters / m.sum()})
         log('avg', {"latency": m.mean()})
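
The 90/95/99% latency rows produced above are Gaussian tolerance bounds, mean + z·std with z = Φ⁻¹((1 + c) / 2), computed here via the script's `norm.ppf`. An equivalent stdlib-only sketch (the function name is this example's own):

```python
from statistics import NormalDist, fmean, pstdev

def latency_tolerance_bound(measures, confidence=0.95):
    """Upper bound mean + z * std, matching norm.ppf((1 + confidence) / 2).

    pstdev (population std) mirrors numpy's default ddof=0 std
    used on the measurement arrays in inference.py.
    """
    z = NormalDist().inv_cdf((1.0 + confidence) / 2)
    return fmean(measures) + z * pstdev(measures)

# With zero spread the bound collapses to the mean
latency_tolerance_bound([0.25, 0.25, 0.25, 0.25])  # -> 0.25
```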

+ 0 - 107
PyTorch/SpeechSynthesis/FastPitch/inference_perf.py

@@ -1,107 +0,0 @@
-# *****************************************************************************
-#  Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-#  Redistribution and use in source and binary forms, with or without
-#  modification, are permitted provided that the following conditions are met:
-#      * Redistributions of source code must retain the above copyright
-#        notice, this list of conditions and the following disclaimer.
-#      * Redistributions in binary form must reproduce the above copyright
-#        notice, this list of conditions and the following disclaimer in the
-#        documentation and/or other materials provided with the distribution.
-#      * Neither the name of the NVIDIA CORPORATION nor the
-#        names of its contributors may be used to endorse or promote products
-#        derived from this software without specific prior written permission.
-#
-#  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-#  ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-#  WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-#  DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
-#  DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-#  (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-#  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-#  ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-#  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#
-# *****************************************************************************
-
-import argparse
-import json
-import time
-
-import torch
-import numpy as np
-
-import dllogger as DLLogger
-from apex import amp
-from dllogger import StdOutBackend, JSONStreamBackend, Verbosity
-
-import models
-from inference import load_and_setup_model, MeasureTime
-
-
-def parse_args(parser):
-    """
-    Parse commandline arguments.
-    """
-    parser.add_argument('--amp-run', action='store_true',
-                        help='inference with AMP')
-    parser.add_argument('-bs', '--batch-size', type=int, default=1)
-    parser.add_argument('-o', '--output', type=str, required=True,
-                        help='Directory to save results')
-    parser.add_argument('--log-file', type=str, default='nvlog.json',
-                        help='Filename for logging')
-    return parser
-
-
-def main():
-    """
-    Launches inference benchmark.
-    Inference is executed on a single GPU.
-    """
-    parser = argparse.ArgumentParser(
-        description='PyTorch FastPitch Inference Benchmark')
-    parser = parse_args(parser)
-    args, _ = parser.parse_known_args()
-
-    log_file = args.log_file
-    DLLogger.init(backends=[JSONStreamBackend(Verbosity.DEFAULT,
-                                              args.log_file),
-                            StdOutBackend(Verbosity.VERBOSE)])
-    for k,v in vars(args).items():
-        DLLogger.log(step="PARAMETER", data={k:v})
-    DLLogger.log(step="PARAMETER", data={'model_name': 'FastPitch_PyT'})
-
-    model = load_and_setup_model('FastPitch', parser, None, args.amp_run,
-                                 'cuda', unk_args=[], forward_is_infer=True,
-                                 ema=False, jitable=True)
-
-    # FIXME Temporarily disabled due to nn.LayerNorm fp16 casting bug in pytorch:20.02-py3 and 20.03
-    # model = torch.jit.script(model)
-
-    warmup_iters = 3
-    iters = 1
-    gen_measures = MeasureTime()
-    all_frames = 0
-    for i in range(-warmup_iters, iters):
-        text_padded = torch.randint(low=0, high=148, size=(args.batch_size, 128),
-                                    dtype=torch.long).to('cuda')
-        input_lengths = torch.IntTensor([text_padded.size(1)] * args.batch_size).to('cuda')
-        durs = torch.ones_like(text_padded).mul_(4).to('cuda')
-
-        with torch.no_grad(), gen_measures:
-            mels, *_ = model(text_padded, input_lengths, dur_tgt=durs)
-        num_frames = mels.size(0) * mels.size(2)
-
-        if i >= 0:
-            all_frames += num_frames
-            DLLogger.log(step=(i,), data={"latency": gen_measures[-1]})
-            DLLogger.log(step=(i,), data={"frames/s": num_frames / gen_measures[-1]})
-
-    measures = gen_measures[warmup_iters:]
-    DLLogger.log(step=(), data={'avg latency': np.mean(measures)})
-    DLLogger.log(step=(), data={'avg frames/s': all_frames / np.sum(measures)})
-    DLLogger.flush()
-
-if __name__ == '__main__':
-    main()

+ 0 - 91
PyTorch/SpeechSynthesis/FastPitch/multiproc.py

@@ -1,91 +0,0 @@
-# *****************************************************************************
-#  Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
-#
-#  Redistribution and use in source and binary forms, with or without
-#  modification, are permitted provided that the following conditions are met:
-#      * Redistributions of source code must retain the above copyright
-#        notice, this list of conditions and the following disclaimer.
-#      * Redistributions in binary form must reproduce the above copyright
-#        notice, this list of conditions and the following disclaimer in the
-#        documentation and/or other materials provided with the distribution.
-#      * Neither the name of the NVIDIA CORPORATION nor the
-#        names of its contributors may be used to endorse or promote products
-#        derived from this software without specific prior written permission.
-#
-#  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-#  ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-#  WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-#  DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
-#  DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-#  (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-#  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
-#  ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-#  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-#  SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#
-# *****************************************************************************
-
-import sys
-import subprocess
-
-import torch
-
-
-def main():
-    argslist = list(sys.argv)[1:]
-    world_size = torch.cuda.device_count()
-
-    if '--set-world-size' in argslist:
-        idx = argslist.index('--set-world-size')
-        world_size = int(argslist[idx + 1])
-        del argslist[idx + 1]
-        del argslist[idx]
-
-    if '--world-size' in argslist:
-        argslist[argslist.index('--world-size') + 1] = str(world_size)
-    else:
-        argslist.append('--world-size')
-        argslist.append(str(world_size))
-
-    workers = []
-
-    for i in range(world_size):
-        if '--rank' in argslist:
-            argslist[argslist.index('--rank') + 1] = str(i)
-        else:
-            argslist.append('--rank')
-            argslist.append(str(i))
-        stdout = None if i == 0 else subprocess.DEVNULL
-        worker = subprocess.Popen(
-            [str(sys.executable)] + argslist, stdout=stdout)
-        workers.append(worker)
-
-    returncode = 0
-    try:
-        pending = len(workers)
-        while pending > 0:
-            for worker in workers:
-                try:
-                    worker_returncode = worker.wait(1)
-                except subprocess.TimeoutExpired:
-                    continue
-                pending -= 1
-                if worker_returncode != 0:
-                    if returncode != 1:
-                        for worker in workers:
-                            worker.terminate()
-                    returncode = 1
-
-    except KeyboardInterrupt:
-        print('Pressed CTRL-C, TERMINATING')
-        for worker in workers:
-            worker.terminate()
-        for worker in workers:
-            worker.wait()
-        raise
-
-    sys.exit(returncode)
-
-
-if __name__ == "__main__":
-    main()
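
The launcher removed above spawned one worker per GPU and patched `--rank`/`--world-size` into the argument list by hand. The rest of this commit switches to `torch.distributed.launch`, which instead passes `--local_rank` and exports `LOCAL_RANK`/`WORLD_SIZE` in the environment. A minimal, hypothetical sketch of how a training script can consume those variables while remaining runnable as a plain single-process script:

```python
import argparse
import os

# torch.distributed.launch sets LOCAL_RANK and WORLD_SIZE in the environment;
# falling back to single-process defaults keeps the script usable without it.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int,
                    default=int(os.getenv('LOCAL_RANK', 0)))
parser.add_argument('--world_size', type=int,
                    default=int(os.getenv('WORLD_SIZE', 1)))
args = parser.parse_args([])  # no CLI args here: env vars / defaults apply

distributed_run = args.world_size > 1
print(args.local_rank, args.world_size, distributed_run)
```

Run under `python -m torch.distributed.launch --nproc_per_node N`, each worker would see its own `LOCAL_RANK` and a shared `WORLD_SIZE=N`.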

+ 1 - 1
PyTorch/SpeechSynthesis/FastPitch/platform/train_fastpitch_AMP_DGX1_16GB_1GPU.sh → PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_AMP_1GPU.sh

@@ -2,7 +2,7 @@
 
 mkdir -p output
 python train.py \
-    --amp-run \
+    --amp \
     --cuda \
     --cudnn-enabled \
     -o ./output/ \

+ 2 - 3
PyTorch/SpeechSynthesis/FastPitch/platform/train_fastpitch_AMP_DGX1_16GB_4GPU.sh → PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_AMP_4GPU.sh

@@ -1,9 +1,8 @@
 #!/bin/bash
 
 mkdir -p output
-python -m multiproc train.py \
-    --amp-run \
-    --set-world-size 4 \
+python -m torch.distributed.launch --nproc_per_node 4 train.py \
+    --amp \
     --cuda \
     --cudnn-enabled \
     -o ./output/ \

+ 24 - 0
PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_AMP_8GPU.sh

@@ -0,0 +1,24 @@
+#!/bin/bash
+
+mkdir -p output
+python -m torch.distributed.launch --nproc_per_node 8 train.py \
+    --amp \
+    --cuda \
+    --cudnn-enabled \
+    -o ./output/ \
+    --log-file output/nvlog.json \
+    --dataset-path LJSpeech-1.1 \
+    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
+    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
+    --pitch-mean-std LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
+    --epochs 1500 \
+    --epochs-per-checkpoint 100 \
+    --warmup-steps 1000 \
+    -lr 0.1 \
+    -bs 32 \
+    --optimizer lamb \
+    --grad-clip-thresh 1000.0 \
+    --dur-predictor-loss-scale 0.1 \
+    --pitch-predictor-loss-scale 0.1 \
+    --weight-decay 1e-6 \
+    --gradient-accumulation-steps 1

+ 0 - 0
PyTorch/SpeechSynthesis/FastPitch/platform/train_fastpitch_FP32_DGX1_16GB_1GPU.sh → PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_FP32_1GPU.sh


+ 1 - 2
PyTorch/SpeechSynthesis/FastPitch/platform/train_fastpitch_FP32_DGX1_16GB_4GPU.sh → PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_FP32_4GPU.sh

@@ -1,8 +1,7 @@
 #!/bin/bash
 
 mkdir -p output
-python -m multiproc train.py \
-    --set-world-size 4 \
+python -m torch.distributed.launch --nproc_per_node 4 train.py \
     --cuda \
     --cudnn-enabled \
     -o ./output/ \

+ 1 - 2
PyTorch/SpeechSynthesis/FastPitch/platform/train_fastpitch_AMP_DGX1_16GB_8GPU.sh → PyTorch/SpeechSynthesis/FastPitch/platform/DGX1_FastPitch_FP32_8GPU.sh

@@ -1,8 +1,7 @@
 #!/bin/bash
 
 mkdir -p output
-python -m multiproc train.py \
-    --amp-run \
+python -m torch.distributed.launch --nproc_per_node 8 train.py \
     --cuda \
     --cudnn-enabled \
     -o ./output/ \

+ 24 - 0
PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_AMP_1GPU.sh

@@ -0,0 +1,24 @@
+#!/bin/bash
+
+mkdir -p output
+python train.py \
+    --amp \
+    --cuda \
+    --cudnn-enabled \
+    -o ./output/ \
+    --log-file output/nvlog.json \
+    --dataset-path LJSpeech-1.1 \
+    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
+    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
+    --pitch-mean-std LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
+    --epochs 1500 \
+    --epochs-per-checkpoint 100 \
+    --warmup-steps 1000 \
+    -lr 0.1 \
+    -bs 128 \
+    --optimizer lamb \
+    --grad-clip-thresh 1000.0 \
+    --dur-predictor-loss-scale 0.1 \
+    --pitch-predictor-loss-scale 0.1 \
+    --weight-decay 1e-6 \
+    --gradient-accumulation-steps 2

+ 24 - 0
PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_AMP_4GPU.sh

@@ -0,0 +1,24 @@
+#!/bin/bash
+
+mkdir -p output
+python -m torch.distributed.launch --nproc_per_node 4 train.py \
+    --amp \
+    --cuda \
+    --cudnn-enabled \
+    -o ./output/ \
+    --log-file output/nvlog.json \
+    --dataset-path LJSpeech-1.1 \
+    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
+    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
+    --pitch-mean-std LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
+    --epochs 1500 \
+    --epochs-per-checkpoint 100 \
+    --warmup-steps 1000 \
+    -lr 0.1 \
+    -bs 64 \
+    --optimizer lamb \
+    --grad-clip-thresh 1000.0 \
+    --dur-predictor-loss-scale 0.1 \
+    --pitch-predictor-loss-scale 0.1 \
+    --weight-decay 1e-6 \
+    --gradient-accumulation-steps 1

+ 24 - 0
PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_AMP_8GPU.sh

@@ -0,0 +1,24 @@
+#!/bin/bash
+
+mkdir -p output
+python -m torch.distributed.launch --nproc_per_node 8 train.py \
+    --amp \
+    --cuda \
+    --cudnn-enabled \
+    -o ./output/ \
+    --log-file output/nvlog.json \
+    --dataset-path LJSpeech-1.1 \
+    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
+    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
+    --pitch-mean-std LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
+    --epochs 1500 \
+    --epochs-per-checkpoint 100 \
+    --warmup-steps 1000 \
+    -lr 0.1 \
+    -bs 32 \
+    --optimizer lamb \
+    --grad-clip-thresh 1000.0 \
+    --dur-predictor-loss-scale 0.1 \
+    --pitch-predictor-loss-scale 0.1 \
+    --weight-decay 1e-6 \
+    --gradient-accumulation-steps 1

+ 23 - 0
PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_TF32_1GPU.sh

@@ -0,0 +1,23 @@
+#!/bin/bash
+
+mkdir -p output
+python train.py \
+    --cuda \
+    --cudnn-enabled \
+    -o ./output/ \
+    --log-file output/nvlog.json \
+    --dataset-path LJSpeech-1.1 \
+    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
+    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
+    --pitch-mean-std LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
+    --epochs 1500 \
+    --epochs-per-checkpoint 100 \
+    --warmup-steps 1000 \
+    -lr 0.1 \
+    -bs 32 \
+    --optimizer lamb \
+    --grad-clip-thresh 1000.0 \
+    --dur-predictor-loss-scale 0.1 \
+    --pitch-predictor-loss-scale 0.1 \
+    --weight-decay 1e-6 \
+    --gradient-accumulation-steps 8

+ 23 - 0
PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_TF32_4GPU.sh

@@ -0,0 +1,23 @@
+#!/bin/bash
+
+mkdir -p output
+python -m torch.distributed.launch --nproc_per_node 4 train.py \
+    --cuda \
+    --cudnn-enabled \
+    -o ./output/ \
+    --log-file output/nvlog.json \
+    --dataset-path LJSpeech-1.1 \
+    --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
+    --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
+    --pitch-mean-std LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
+    --epochs 1500 \
+    --epochs-per-checkpoint 100 \
+    --warmup-steps 1000 \
+    -lr 0.1 \
+    -bs 32 \
+    --optimizer lamb \
+    --grad-clip-thresh 1000.0 \
+    --dur-predictor-loss-scale 0.1 \
+    --pitch-predictor-loss-scale 0.1 \
+    --weight-decay 1e-6 \
+    --gradient-accumulation-steps 2

+ 1 - 1
PyTorch/SpeechSynthesis/FastPitch/platform/train_fastpitch_FP32_DGX1_16GB_8GPU.sh → PyTorch/SpeechSynthesis/FastPitch/platform/DGXA100_FastPitch_TF32_8GPU.sh

@@ -1,7 +1,7 @@
 #!/bin/bash
 
 mkdir -p output
-python -m multiproc train.py \
+python -m torch.distributed.launch --nproc_per_node 8 train.py \
     --cuda \
     --cudnn-enabled \
     -o ./output/ \

+ 2 - 2
PyTorch/SpeechSynthesis/FastPitch/scripts/download_dataset.sh

@@ -11,8 +11,8 @@ LJS_URL="http://data.keithito.com/data/speech/${LJS_ARCH}"
 TACO_CH="nvidia_tacotron2pyt_fp32_20190427.pt"
 TACO_URL="https://api.ngc.nvidia.com/v2/models/nvidia/tacotron2pyt_fp32/versions/2/files/nvidia_tacotron2pyt_fp32_20190427"
 
-WAVEG_CH="waveglow_256channels_ljs_v3.pt"
-WAVEG_URL="https://api.ngc.nvidia.com/v2/models/nvidia/waveglow_ljs_256channels/versions/3/files/waveglow_256channels_ljs_v3.pt"
+WAVEG_CH="waveglow_1076430_14000_amp.pt"
+WAVEG_URL="https://api.ngc.nvidia.com/v2/models/nvidia/waveglow256pyt_fp16/versions/2/files/waveglow_1076430_14000_amp"
 
 
 if [ ! -f ${LJS_ARCH} ]; then

+ 21 - 21
PyTorch/SpeechSynthesis/FastPitch/scripts/inference_benchmark.sh

@@ -1,27 +1,27 @@
 #!/bin/bash
 
-MODEL_DIR="pretrained_models"
-EXP_DIR="output"
+[ ! -n "$WAVEG_CH" ] && WAVEG_CH="pretrained_models/waveglow/waveglow_1076430_14000_amp.pt"
+[ ! -n "$FASTPITCH_CH" ] && FASTPITCH_CH="output/FastPitch_checkpoint_1500.pt"
+[ ! -n "$REPEATS" ] && REPEATS=1000
+[ ! -n "$BS_SEQ" ] && BS_SEQ="1 4 8"
+[ ! -n "$PHRASES" ] && PHRASES="phrases/benchmark_8_128.tsv"
+[ ! -n "$OUTPUT_DIR" ] && OUTPUT_DIR="./output/audio_$(basename ${PHRASES} .tsv)"
+[ "$AMP" == "true" ] && AMP_FLAG="--amp" || AMP=false
+[ "$SET_AFFINITY" == "true" ] && SET_AFFINITY_FLAG="--set-affinity"
 
-WAVEG_CH="waveglow_256channels_ljs_v3.pt"
+for BS in $BS_SEQ ; do
 
-BSZ=${1:-4}
-PRECISION=${2:-fp16}
+  echo -e "\nAMP: ${AMP}, batch size: ${BS}\n"
 
-for PRECISION in fp16 fp32; do
-  for BSZ in 1 4 8 ; do
-
-    echo -e "\nprecision=${PRECISION} batch size=${BSZ}\n"
-
-    [ "$PRECISION" == "fp16" ] && AMP_FLAG="--amp-run" || AMP_FLAG=""
-
-    python inference.py --cuda --wn-channels 256 ${AMP_FLAG} \
-                        --fastpitch ${EXP_DIR}/checkpoint_FastPitch_1500.pt \
-                        --waveglow ${MODEL_DIR}/waveglow/${WAVEG_CH} \
-                        --include-warmup \
-                        --batch-size ${BSZ} \
-                        --repeats 1000 \
-                        --torchscript \
-                        -i phrases/benchmark_8_128.tsv
-  done
+  python inference.py --cuda \
+                      -i ${PHRASES} \
+                      -o ${OUTPUT_DIR} \
+                      --fastpitch ${FASTPITCH_CH} \
+                      --waveglow ${WAVEG_CH} \
+                      --wn-channels 256 \
+                      --include-warmup \
+                      --batch-size ${BS} \
+                      --repeats ${REPEATS} \
+                      --torchscript \
+                      ${AMP_FLAG} ${SET_AFFINITY_FLAG}
 done
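
The benchmark script is now parameterized through environment variables using the `[ ! -n "$VAR" ] && VAR=default` idiom (equivalent to testing `[ -z "$VAR" ]`), so runs can be customized without editing the file, e.g. `AMP=true BS_SEQ="1 8" REPEATS=100 bash scripts/inference_benchmark.sh`. A minimal standalone demonstration of the idiom:

```shell
#!/bin/bash
# Default-if-unset: assign only when the variable is empty or unset.
unset REPEATS
[ ! -n "$REPEATS" ] && REPEATS=1000
echo "$REPEATS"     # prints 1000: the default applies

REPEATS=100
[ ! -n "$REPEATS" ] && REPEATS=1000
echo "$REPEATS"     # prints 100: the preset value wins
```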

+ 12 - 10
PyTorch/SpeechSynthesis/FastPitch/scripts/inference_example.sh

@@ -1,18 +1,20 @@
 #!/usr/bin/env bash
 
 DATA_DIR="LJSpeech-1.1"
-EXP_DIR="output"
-WAVEG_CH="pretrained_models/waveglow/waveglow_256channels_ljs_v3.pt"
 
-CHECKPOINT=${1:-1500}
+[ ! -n "$WAVEG_CH" ] && WAVEG_CH="pretrained_models/waveglow/waveglow_1076430_14000_amp.pt"
+[ ! -n "$FASTPITCH_CH" ] && FASTPITCH_CH="output/FastPitch_checkpoint_1500.pt"
+[ ! -n "$BS" ] && BS=32
+[ ! -n "$PHRASES" ] && PHRASES="phrases/devset10.tsv"
+[ ! -n "$OUTPUT_DIR" ] && OUTPUT_DIR="./output/audio_$(basename ${PHRASES} .tsv)"
+[ "$AMP" == "true" ] && AMP_FLAG="--amp"
 
-python inference.py -i phrases/devset10.tsv \
-                    -o ${EXP_DIR}/audio_devset10_checkpoint${CHECKPOINT} \
-                    --log-file ${EXP_DIR}/nvlog_inference.json \
+python inference.py --cuda \
+                    -i ${PHRASES} \
+                    -o ${OUTPUT_DIR} \
                     --dataset-path ${DATA_DIR} \
-                    --fastpitch ${EXP_DIR}/checkpoint_FastPitch_${CHECKPOINT}.pt \
+                    --fastpitch ${FASTPITCH_CH} \
                     --waveglow ${WAVEG_CH} \
 		    --wn-channels 256 \
-                    --batch-size 32 \
-                    --amp-run \
-                    --cuda
+                    --batch-size ${BS} \
+                    ${AMP_FLAG}

+ 22 - 22
PyTorch/SpeechSynthesis/FastPitch/scripts/train.sh

@@ -1,40 +1,40 @@
 #!/bin/bash
 
-# Default recipe for 8x GPU 16GB with TensorCores (fp16/AMP).
-# For other configurations, adjust
+# Adjust env variables to maintain the global batch size
 #
-#     batch-size x graient-accumulation-steps
-#
-# to maintain a total of 64x4=256 samples per step.
-#
-#   | Prec. | #GPU | -bs | --gradient-accumulation-steps |
-#   |-------|------|-----|-------------------------------|
-#   | AMP   |    1 |  64 |                             4 |
-#   | AMP   |    4 |  64 |                             1 |
-#   | AMP   |    8 |  32 |                             1 |
-#   | FP32  |    1 |  32 |                             8 |
-#   | FP32  |    4 |  32 |                             2 |
-#   | FP32  |    8 |  32 |                             1 |
+#    NGPU x BS x GRAD_ACC = 256.
+
+[ ! -n "$OUTPUT_DIR" ] && OUTPUT_DIR="./output"
+[ ! -n "$NGPU" ] && NGPU=8
+[ ! -n "$BS" ] && BS=32
+[ ! -n "$GRAD_ACC" ] && GRAD_ACC=1
+[ ! -n "$EPOCHS" ] && EPOCHS=1500
+[ "$AMP" == "true" ] && AMP_FLAG="--amp"
+
+GBS=$(($NGPU * $BS * $GRAD_ACC))
+[ $GBS -ne 256 ] && echo -e "\nWARNING: Global batch size changed from 256 to ${GBS}.\n"
+
+echo -e "\nSetup: ${NGPU}x${BS}x${GRAD_ACC} - global batch size ${GBS}\n"
 
-mkdir -p output
-python -m multiproc train.py \
+mkdir -p "$OUTPUT_DIR"
+python -m torch.distributed.launch --nproc_per_node ${NGPU} train.py \
     --cuda \
     --cudnn-enabled \
-    -o ./output/ \
-    --log-file ./output/nvlog.json \
+    -o "$OUTPUT_DIR/" \
+    --log-file "$OUTPUT_DIR/nvlog.json" \
     --dataset-path LJSpeech-1.1 \
     --training-files filelists/ljs_mel_dur_pitch_text_train_filelist.txt \
     --validation-files filelists/ljs_mel_dur_pitch_text_test_filelist.txt \
     --pitch-mean-std-file LJSpeech-1.1/pitch_char_stats__ljs_audio_text_train_filelist.json \
-    --epochs 1500 \
+    --epochs ${EPOCHS} \
     --epochs-per-checkpoint 100 \
     --warmup-steps 1000 \
     -lr 0.1 \
-    -bs 32 \
+    -bs ${BS} \
     --optimizer lamb \
     --grad-clip-thresh 1000.0 \
     --dur-predictor-loss-scale 0.1 \
     --pitch-predictor-loss-scale 0.1 \
     --weight-decay 1e-6 \
-    --gradient-accumulation-steps 1 \
-    --amp-run
+    --gradient-accumulation-steps ${GRAD_ACC} \
+    ${AMP_FLAG}
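
The recipe keeps the global batch size at `NGPU x BS x GRAD_ACC = 256`, and the per-platform scripts above pick `-bs` and `--gradient-accumulation-steps` accordingly. A quick sanity check of those combinations (a hypothetical helper mirroring the `GBS` warning in the script):

```python
def global_batch(ngpu, bs, grad_acc):
    """Effective samples per optimizer step."""
    return ngpu * bs * grad_acc

# (NGPU, BS, GRAD_ACC) triples taken from the platform scripts above
recipes = {
    'DGX1_AMP_8GPU':     (8, 32, 1),
    'DGXA100_AMP_1GPU':  (1, 128, 2),
    'DGXA100_AMP_4GPU':  (4, 64, 1),
    'DGXA100_TF32_1GPU': (1, 32, 8),
    'DGXA100_TF32_4GPU': (4, 32, 2),
}
for name, cfg in recipes.items():
    assert global_batch(*cfg) == 256, name  # all hold the 256 target
```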

+ 36 - 35
PyTorch/SpeechSynthesis/FastPitch/train.py

@@ -40,6 +40,7 @@ import numpy as np
 import torch.distributed as dist
 from scipy.io.wavfile import write as write_wav
 from torch.autograd import Variable
+from torch.nn.parallel import DistributedDataParallel
 from torch.nn.parameter import Parameter
 from torch.utils.data import DataLoader
 from torch.utils.data.distributed import DistributedSampler
@@ -47,13 +48,12 @@ from torch.utils.data.distributed import DistributedSampler
 import dllogger as DLLogger
 from apex import amp
 from apex.optimizers import FusedAdam, FusedLAMB
-from apex.parallel import DistributedDataParallel as DDP
 
 import common
 import data_functions
 import loss_functions
 import models
-from common.log_helper import init_dllogger, TBLogger
+from common.log_helper import init_dllogger, TBLogger, unique_dllogger_fpath
 
 
 def parse_args(parser):
@@ -64,8 +64,8 @@ def parse_args(parser):
                         help='Directory to save checkpoints')
     parser.add_argument('-d', '--dataset-path', type=str, default='./',
                         help='Path to dataset')
-    parser.add_argument('--log-file', type=str, default='nvlog.json',
-                        help='Filename for logging')
+    parser.add_argument('--log-file', type=str, default=None,
+                        help='Path to a DLLogger log file')
 
     training = parser.add_argument_group('training setup')
     training.add_argument('--epochs', type=int, required=True,
@@ -74,11 +74,11 @@ def parse_args(parser):
                           help='Number of epochs per checkpoint')
     training.add_argument('--checkpoint-path', type=str, default=None,
                           help='Checkpoint path to resume training')
-    training.add_argument('--checkpoint-resume', action='store_true',
+    training.add_argument('--resume', action='store_true',
                           help='Resume training from the last available checkpoint')
     training.add_argument('--seed', type=int, default=1234,
                           help='Seed for PyTorch random number generators')
-    training.add_argument('--amp-run', action='store_true',
+    training.add_argument('--amp', action='store_true',
                           help='Enable AMP')
     training.add_argument('--cuda', action='store_true',
                           help='Run on GPU using CUDA')
@@ -121,16 +121,10 @@ def parse_args(parser):
                          help='Type of text cleaners for input text')
 
     distributed = parser.add_argument_group('distributed setup')
-    distributed.add_argument('--rank', default=0, type=int,
+    distributed.add_argument('--local_rank', type=int, default=os.getenv('LOCAL_RANK', 0),
                              help='Rank of the process for multiproc. Do not set manually.')
-    distributed.add_argument('--world-size', default=1, type=int,
+    distributed.add_argument('--world_size', type=int, default=os.getenv('WORLD_SIZE', 1),
                              help='Number of processes for multiproc. Do not set manually.')
-    distributed.add_argument('--dist-url', type=str, default='tcp://localhost:23456',
-                             help='Url used to set up distributed training')
-    distributed.add_argument('--group-name', type=str, default='group_name',
-                             required=False, help='Distributed group name')
-    distributed.add_argument('--dist-backend', default='nccl', type=str, choices={'nccl'},
-                             help='Distributed run backend')
     return parser
 
 
@@ -141,7 +135,7 @@ def reduce_tensor(tensor, num_gpus):
     return rt
 
 
-def init_distributed(args, world_size, rank, group_name):
+def init_distributed(args, world_size, rank):
     assert torch.cuda.is_available(), "Distributed mode requires CUDA."
     print("Initializing distributed training")
 
@@ -149,9 +143,8 @@ def init_distributed(args, world_size, rank, group_name):
     torch.cuda.set_device(rank % torch.cuda.device_count())
 
     # Initialize distributed communication
-    dist.init_process_group(
-        backend=args.dist_backend, init_method=args.dist_url,
-        world_size=world_size, rank=rank, group_name=group_name)
+    dist.init_process_group(backend=('nccl' if args.cuda else 'gloo'),
+                            init_method='env://')
     print("Done initializing distributed training")
 
 
@@ -177,13 +170,14 @@ def last_checkpoint(output):
         return None
 
 
-def save_checkpoint(local_rank, model, ema_model, optimizer, epoch, config,
-                    amp_run, filepath):
+def save_checkpoint(local_rank, model, ema_model, optimizer, epoch, total_iter,
+                    config, amp_run, filepath):
     if local_rank != 0:
         return
     print(f"Saving model and optimizer state at epoch {epoch} to {filepath}")
     ema_dict = None if ema_model is None else ema_model.state_dict()
     checkpoint = {'epoch': epoch,
+                  'iteration': total_iter,
                   'config': config,
                   'state_dict': model.state_dict(),
                   'ema_state_dict': ema_dict,
@@ -193,12 +187,13 @@ def save_checkpoint(local_rank, model, ema_model, optimizer, epoch, config,
     torch.save(checkpoint, filepath)
 
 
-def load_checkpoint(local_rank, model, ema_model, optimizer, epoch, config,
-                    amp_run, filepath, world_size):
+def load_checkpoint(local_rank, model, ema_model, optimizer, epoch, total_iter,
+                    config, amp_run, filepath, world_size):
     if local_rank == 0:
         print(f'Loading model and optimizer state from {filepath}')
     checkpoint = torch.load(filepath, map_location='cpu')
     epoch[0] = checkpoint['epoch'] + 1
+    total_iter[0] = checkpoint['iteration']
     config = checkpoint['config']
 
     sd = {k.replace('module.', ''): v
@@ -289,12 +284,14 @@ def main():
     if local_rank == 0:
         if not os.path.exists(args.output):
             os.makedirs(args.output)
-        init_dllogger(args.log_file)
+
+        log_fpath = args.log_file or os.path.join(args.output, 'nvlog.json')
+        log_fpath = unique_dllogger_fpath(log_fpath)
+        init_dllogger(log_fpath)
     else:
         init_dllogger(dummy=True)
 
-    for k,v in vars(args).items():
-        DLLogger.log(step="PARAMETER", data={k:v})
+    [DLLogger.log("PARAMETER", {k:v}) for k,v in vars(args).items()]
 
     parser = models.parse_model_args('FastPitch', parser)
     args, unk_args = parser.parse_known_args()
@@ -305,7 +302,7 @@ def main():
     torch.backends.cudnn.benchmark = args.cudnn_benchmark
 
     if distributed_run:
-        init_distributed(args, world_size, local_rank, args.group_name)
+        init_distributed(args, world_size, local_rank)
 
     device = torch.device('cuda' if args.cuda else 'cpu')
     model_config = models.get_model_config('FastPitch', args)
@@ -328,7 +325,7 @@ def main():
     else:
         raise ValueError
 
-    if args.amp_run:
+    if args.amp:
         model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
 
     if args.ema_decay > 0:
@@ -337,24 +334,29 @@ def main():
         ema_model = None
 
     if distributed_run:
-        model = DDP(model)
+        model = DistributedDataParallel(
+            model, device_ids=[args.local_rank], output_device=args.local_rank,
+            find_unused_parameters=True)
 
     start_epoch = [1]
+    start_iter = [0]
 
-    assert args.checkpoint_path is None or args.checkpoint_resume is False, (
+    assert args.checkpoint_path is None or args.resume is False, (
         "Specify a single checkpoint source")
     if args.checkpoint_path is not None:
         ch_fpath = args.checkpoint_path
-    elif args.checkpoint_resume:
+    elif args.resume:
         ch_fpath = last_checkpoint(args.output)
     else:
         ch_fpath = None
 
     if ch_fpath is not None:
         load_checkpoint(local_rank, model, ema_model, optimizer, start_epoch,
-                        model_config, args.amp_run, ch_fpath, world_size)
+                        start_iter, model_config, args.amp, ch_fpath,
+                        world_size)
 
     start_epoch = start_epoch[0]
+    total_iter = start_iter[0]
 
     criterion = loss_functions.get_loss_function('FastPitch',
         dur_predictor_loss_scale=args.dur_predictor_loss_scale,
@@ -385,7 +387,6 @@ def main():
         val_ema_tblogger = TBLogger(local_rank, args.output, 'val_ema')
 
     val_loss = 0.0
-    total_iter = 0
     torch.cuda.synchronize()
     for epoch in range(start_epoch, args.epochs + 1):
         epoch_start_time = time.time()
@@ -435,7 +436,7 @@ def main():
             meta = {k: v / args.gradient_accumulation_steps
                     for k, v in meta.items()}
 
-            if args.amp_run:
+            if args.amp:
                 with amp.scale_loss(loss, optimizer) as scaled_loss:
                     scaled_loss.backward()
             else:
@@ -459,7 +460,7 @@ def main():
             if accumulated_steps % args.gradient_accumulation_steps == 0:
 
                 train_tblogger.log_grads(total_iter, model)
-                if args.amp_run:
+                if args.amp:
                     torch.nn.utils.clip_grad_norm_(
                         amp.master_params(optimizer), args.grad_clip_thresh)
                 else:
@@ -537,7 +538,7 @@ def main():
             checkpoint_path = os.path.join(
                 args.output, f"FastPitch_checkpoint_{epoch}.pt")
             save_checkpoint(local_rank, model, ema_model, optimizer, epoch,
-                            model_config, args.amp_run, checkpoint_path)
+                            total_iter, model_config, args.amp, checkpoint_path)
         if local_rank == 0:
             DLLogger.flush()
 

+ 5 - 10
PyTorch/SpeechSynthesis/FastPitch/waveglow/denoiser.py

@@ -37,20 +37,15 @@ class Denoiser(torch.nn.Module):
     def __init__(self, waveglow, filter_length=1024, n_overlap=4,
                  win_length=1024, mode='zeros'):
         super(Denoiser, self).__init__()
-        device = next(waveglow.parameters()).device
+        device = waveglow.upsample.weight.device
+        dtype = waveglow.upsample.weight.dtype
         self.stft = STFT(filter_length=filter_length,
                          hop_length=int(filter_length/n_overlap),
                          win_length=win_length).to(device)
         if mode == 'zeros':
-            mel_input = torch.zeros(
-                (1, 80, 88),
-                dtype=waveglow.upsample.weight.dtype,
-                device=waveglow.upsample.weight.device)
+            mel_input = torch.zeros((1, 80, 88), dtype=dtype, device=device)
         elif mode == 'normal':
-            mel_input = torch.randn(
-                (1, 80, 88),
-                dtype=waveglow.upsample.weight.dtype,
-                device=waveglow.upsample.weight.device)
+            mel_input = torch.randn((1, 80, 88), dtype=dtype, device=device)
         else:
             raise Exception("Mode {} if not supported".format(mode))
 
@@ -61,7 +56,7 @@ class Denoiser(torch.nn.Module):
         self.register_buffer('bias_spec', bias_spec[:, :, 0][:, :, None])
 
     def forward(self, audio, strength=0.1):
-        audio_spec, audio_angles = self.stft.transform(audio.float())
+        audio_spec, audio_angles = self.stft.transform(audio)
         audio_spec_denoised = audio_spec - self.bias_spec * strength
         audio_spec_denoised = torch.clamp(audio_spec_denoised, 0.0)
         audio_denoised = self.stft.inverse(audio_spec_denoised, audio_angles)

+ 72 - 59
PyTorch/SpeechSynthesis/FastPitch/waveglow/model.py

@@ -45,6 +45,7 @@ class Invertible1x1Conv(torch.nn.Module):
     of its weight matrix.  If reverse=True it does convolution with
     inverse
     """
+
     def __init__(self, c):
         super(Invertible1x1Conv, self).__init__()
         self.conv = torch.nn.Conv1d(c, c, kernel_size=1, stride=1, padding=0,
@@ -59,35 +60,42 @@ class Invertible1x1Conv(torch.nn.Module):
         W = W.view(c, c, 1)
         self.conv.weight.data = W
 
-    def forward(self, z, reverse=False):
+    def forward(self, z):
+        # shape
+        batch_size, group_size, n_of_groups = z.size()
+
+        W = self.conv.weight.squeeze()
+
+        # Forward computation
+        log_det_W = batch_size * n_of_groups * torch.logdet(W.unsqueeze(0).float()).squeeze()
+        z = self.conv(z)
+        return z, log_det_W
+
+
+    def infer(self, z):
         # shape
         batch_size, group_size, n_of_groups = z.size()
 
         W = self.conv.weight.squeeze()
 
-        if reverse:
-            if not hasattr(self, 'W_inverse'):
-                # Reverse computation
-                W_inverse = W.float().inverse()
-                W_inverse = Variable(W_inverse[..., None])
-                if z.type() == 'torch.cuda.HalfTensor' or z.type() == 'torch.HalfTensor':
-                    W_inverse = W_inverse.half()
-                self.W_inverse = W_inverse
-            z = F.conv1d(z, self.W_inverse, bias=None, stride=1, padding=0)
-            return z
-        else:
-            # Forward computation
-            log_det_W = batch_size * n_of_groups * torch.logdet(W.unsqueeze(0).float()).squeeze()
-            z = self.conv(z)
-            return z, log_det_W
+        if not hasattr(self, 'W_inverse'):
+            # Reverse computation
+            W_inverse = W.float().inverse()
+            W_inverse = Variable(W_inverse[..., None])
+            if z.type() == 'torch.cuda.HalfTensor' or z.type() == 'torch.HalfTensor':
+                W_inverse = W_inverse.half()
+            self.W_inverse = W_inverse
+        z = F.conv1d(z, self.W_inverse, bias=None, stride=1, padding=0)
+        return z
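
The split above separates the training path (`forward`, which also returns the log-determinant term for the loss) from the inference path (`infer`, which caches `W^{-1}` once and reuses it), which makes each path a plain single-purpose method. The underlying algebra, in a toy NumPy sketch with a hypothetical stand-in for the learned 1x1 weight:

```python
import numpy as np

rng = np.random.default_rng(0)
c = 4
# QR of a random matrix gives an orthogonal W: guaranteed invertible,
# standing in for the learned 1x1 convolution weight.
W, _ = np.linalg.qr(rng.normal(size=(c, c)))

z = rng.normal(size=(c, 16))          # (channels, time)
forward = W @ z                        # forward pass: z -> Wz
W_inverse = np.linalg.inv(W)           # cached once, as in infer()
restored = W_inverse @ forward         # inverse pass recovers z

assert np.allclose(restored, z)
log_det_W = np.linalg.slogdet(W)[1]    # the term forward() adds to the loss
```

For an orthogonal `W` the log-determinant is zero; in training it is generally nonzero and scaled by `batch_size * n_of_groups` as in the code above.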
 
 
 class WN(torch.nn.Module):
     """
-    This is the WaveNet like layer for the affine coupling.  The primary difference
-    from WaveNet is the convolutions need not be causal.  There is also no dilation
-    size reset.  The dilation only doubles on each layer
+    This is the WaveNet like layer for the affine coupling.  The primary
+    difference from WaveNet is the convolutions need not be causal.  There is
+    also no dilation size reset.  The dilation only doubles on each layer
     """
+
     def __init__(self, n_in_channels, n_mel_channels, n_layers, n_channels,
                  kernel_size):
         super(WN, self).__init__()
@@ -97,6 +105,7 @@ class WN(torch.nn.Module):
         self.n_channels = n_channels
         self.in_layers = torch.nn.ModuleList()
         self.res_skip_layers = torch.nn.ModuleList()
+        self.cond_layers = torch.nn.ModuleList()
 
         start = torch.nn.Conv1d(n_in_channels, n_channels, 1)
         start = torch.nn.utils.weight_norm(start, name='weight')
@@ -104,53 +113,54 @@ class WN(torch.nn.Module):
 
         # Initializing last layer to 0 makes the affine coupling layers
         # do nothing at first.  This helps with training stability
-        end = torch.nn.Conv1d(n_channels, 2*n_in_channels, 1)
+        end = torch.nn.Conv1d(n_channels, 2 * n_in_channels, 1)
         end.weight.data.zero_()
         end.bias.data.zero_()
         self.end = end
 
-        cond_layer = torch.nn.Conv1d(n_mel_channels, 2*n_channels*n_layers, 1)
-        self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight')
-
         for i in range(n_layers):
             dilation = 2 ** i
-            padding = int((kernel_size*dilation - dilation)/2)
-            in_layer = torch.nn.Conv1d(n_channels, 2*n_channels, kernel_size,
+            padding = int((kernel_size * dilation - dilation) / 2)
+            in_layer = torch.nn.Conv1d(n_channels, 2 * n_channels, kernel_size,
                                        dilation=dilation, padding=padding)
             in_layer = torch.nn.utils.weight_norm(in_layer, name='weight')
             self.in_layers.append(in_layer)
 
+            cond_layer = torch.nn.Conv1d(n_mel_channels, 2 * n_channels, 1)
+            cond_layer = torch.nn.utils.weight_norm(cond_layer, name='weight')
+            self.cond_layers.append(cond_layer)
 
             # last one is not necessary
             if i < n_layers - 1:
-                res_skip_channels = 2*n_channels
+                res_skip_channels = 2 * n_channels
             else:
                 res_skip_channels = n_channels
             res_skip_layer = torch.nn.Conv1d(n_channels, res_skip_channels, 1)
-            res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name='weight')
+            res_skip_layer = torch.nn.utils.weight_norm(
+                res_skip_layer, name='weight')
             self.res_skip_layers.append(res_skip_layer)
 
     def forward(self, forward_input):
         audio, spect = forward_input
         audio = self.start(audio)
-        output = torch.zeros_like(audio)
-        n_channels_tensor = torch.IntTensor([self.n_channels])
-
-        spect = self.cond_layer(spect)
-        spect_list = torch.chunk(spect, self.n_layers, 1)
 
         for i in range(self.n_layers):
-            spect_offset = i*2*self.n_channels
             acts = fused_add_tanh_sigmoid_multiply(
-                self.in_layers[i](audio), spect_list[i], n_channels_tensor)
+                self.in_layers[i](audio),
+                self.cond_layers[i](spect),
+                torch.IntTensor([self.n_channels]))
 
             res_skip_acts = self.res_skip_layers[i](acts)
             if i < self.n_layers - 1:
-                audio = audio + res_skip_acts[:,:self.n_channels,:]
-                output = output + res_skip_acts[:,self.n_channels:,:]
+                audio = res_skip_acts[:, :self.n_channels, :] + audio
+                skip_acts = res_skip_acts[:, self.n_channels:, :]
             else:
-                output = output + res_skip_acts
+                skip_acts = res_skip_acts
 
+            if i == 0:
+                output = skip_acts
+            else:
+                output = skip_acts + output
         return self.end(output)
 
 
@@ -170,18 +180,18 @@ class WaveGlow(torch.nn.Module):
         self.WN = torch.nn.ModuleList()
         self.convinv = torch.nn.ModuleList()
 
-        n_half = int(n_group/2)
+        n_half = int(n_group / 2)
 
         # Set up layers with the right sizes based on how many dimensions
         # have been output already
         n_remaining_channels = n_group
         for k in range(n_flows):
             if k % self.n_early_every == 0 and k > 0:
-                n_half = n_half - int(self.n_early_size/2)
+                n_half = n_half - int(self.n_early_size / 2)
                 n_remaining_channels = n_remaining_channels - self.n_early_size
             self.convinv.append(Invertible1x1Conv(n_remaining_channels))
-            self.WN.append(WN(n_half, n_mel_channels*n_group, **WN_config))
-        self.n_remaining_channels = n_remaining_channels  # Useful during inference
+            self.WN.append(WN(n_half, n_mel_channels * n_group, **WN_config))
+        self.n_remaining_channels = n_remaining_channels
 
     def forward(self, forward_input):
         """
@@ -207,20 +217,20 @@ class WaveGlow(torch.nn.Module):
 
         for k in range(self.n_flows):
             if k % self.n_early_every == 0 and k > 0:
-                output_audio.append(audio[:,:self.n_early_size,:])
-                audio = audio[:,self.n_early_size:,:]
+                output_audio.append(audio[:, :self.n_early_size, :])
+                audio = audio[:, self.n_early_size:, :]
 
             audio, log_det_W = self.convinv[k](audio)
             log_det_W_list.append(log_det_W)
 
-            n_half = int(audio.size(1)/2)
-            audio_0 = audio[:,:n_half,:]
-            audio_1 = audio[:,n_half:,:]
+            n_half = int(audio.size(1) / 2)
+            audio_0 = audio[:, :n_half, :]
+            audio_1 = audio[:, n_half:, :]
 
             output = self.WN[k]((audio_0, spect))
             log_s = output[:, n_half:, :]
             b = output[:, :n_half, :]
-            audio_1 = torch.exp(log_s)*audio_1 + b
+            audio_1 = torch.exp(log_s) * audio_1 + b
             log_s_list.append(log_s)
 
             audio = torch.cat([audio_0, audio_1], 1)
@@ -229,6 +239,7 @@ class WaveGlow(torch.nn.Module):
         return torch.cat(output_audio, 1), log_s_list, log_det_W_list
 
     def infer(self, spect, sigma=1.0):
+
         spect = self.upsample(spect)
         # trim conv artifacts. maybe pad spec to kernel multiple
         time_cutoff = self.upsample.kernel_size[0] - self.upsample.stride[0]
@@ -240,39 +251,41 @@ class WaveGlow(torch.nn.Module):
 
         audio = torch.randn(spect.size(0),
                             self.n_remaining_channels,
-                            spect.size(2), device=spect.device, dtype=spect.dtype)
+                            spect.size(2), device=spect.device).to(spect.dtype)
 
-        audio = torch.autograd.Variable(sigma*audio)
+        audio = torch.autograd.Variable(sigma * audio)
 
         for k in reversed(range(self.n_flows)):
-            n_half = int(audio.size(1)/2)
-            audio_0 = audio[:,:n_half,:]
-            audio_1 = audio[:,n_half:,:]
+            n_half = int(audio.size(1) / 2)
+            audio_0 = audio[:, :n_half, :]
+            audio_1 = audio[:, n_half:, :]
 
             output = self.WN[k]((audio_0, spect))
-
             s = output[:, n_half:, :]
             b = output[:, :n_half, :]
-            audio_1 = (audio_1 - b)/torch.exp(s)
+            audio_1 = (audio_1 - b) / torch.exp(s)
             audio = torch.cat([audio_0, audio_1], 1)
 
-            audio = self.convinv[k](audio, reverse=True)
+            audio = self.convinv[k].infer(audio)
 
             if k % self.n_early_every == 0 and k > 0:
                 z = torch.randn(spect.size(0), self.n_early_size, spect.size(
-                    2), device=spect.device, dtype=spect.dtype)
-                audio = torch.cat((sigma*z, audio), 1)
+                    2), device=spect.device).to(spect.dtype)
+                audio = torch.cat((sigma * z, audio), 1)
 
-        audio = audio.permute(0, 2, 1).contiguous().view(audio.size(0), -1).data
+        audio = audio.permute(
+            0, 2, 1).contiguous().view(
+            audio.size(0), -1).data
         return audio
 
+
     @staticmethod
     def remove_weightnorm(model):
         waveglow = model
         for WN in waveglow.WN:
             WN.start = torch.nn.utils.remove_weight_norm(WN.start)
             WN.in_layers = remove(WN.in_layers)
-            WN.cond_layer = torch.nn.utils.remove_weight_norm(WN.cond_layer)
+            WN.cond_layers = remove(WN.cond_layers)
             WN.res_skip_layers = remove(WN.res_skip_layers)
         return waveglow
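Review note: the refactored `WN.forward` above feeds each layer's input conv and its new per-layer conditioning conv into `fused_add_tanh_sigmoid_multiply`. That helper is not shown in this diff; below is a minimal NumPy sketch of the gated activation it implements, assuming the standard WaveNet split (first `n_channels` channels gate through tanh, the remaining `n_channels` through sigmoid).

```python
import numpy as np

def fused_add_tanh_sigmoid_multiply(in_act, cond_act, n_channels):
    """Gated activation: sum the two activations, then multiply a tanh
    gate (first half of channels) by a sigmoid gate (second half)."""
    acts = in_act + cond_act                                  # (batch, 2*n_channels, time)
    t_act = np.tanh(acts[:, :n_channels, :])                  # tanh gate
    s_act = 1.0 / (1.0 + np.exp(-acts[:, n_channels:, :]))    # sigmoid gate
    return t_act * s_act                                      # (batch, n_channels, time)

# Zero activations give tanh(0) * sigmoid(0) = 0 everywhere,
# and the channel dimension is halved.
out = fused_add_tanh_sigmoid_multiply(np.zeros((1, 4, 3)),
                                      np.zeros((1, 4, 3)), 2)
```

This is a sketch for review purposes only; the real implementation operates on `torch.Tensor`s and takes `n_channels` as an `IntTensor`, as the call sites in the diff show.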
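The `infer()` changes in this commit (dtype handling, `convinv[k].infer`) leave the affine coupling math itself untouched: training applies `audio_1 = exp(log_s) * audio_1 + b`, and inference undoes it with `(audio_1 - b) / exp(s)`. A quick NumPy check of that invertibility, with illustrative shapes rather than the model's real dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
log_s = rng.standard_normal((1, 4, 8))    # per-element log-scale from WN
b = rng.standard_normal((1, 4, 8))        # per-element bias from WN
audio_1 = rng.standard_normal((1, 4, 8))

# Forward (training) direction, as in WaveGlow.forward
y = np.exp(log_s) * audio_1 + b

# Inverse direction, as in WaveGlow.infer
audio_1_rec = (y - b) / np.exp(log_s)

max_err = np.abs(audio_1 - audio_1_rec).max()
```

Because the coupling is elementwise affine, the inverse is exact up to floating-point round-off, which is what makes the flow invertible regardless of what the WN network outputs.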