
[FastPitch/PyT] Bump container to 22.08, update perf results

Adrian Lancucki 3 years ago
parent
commit
72a15ee698

+ 1 - 1
PyTorch/SpeechSynthesis/FastPitch/Dockerfile

@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:21.05-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.08-py3
 FROM ${FROM_IMAGE_NAME}
 
 ENV PYTHONPATH /workspace/fastpitch

+ 119 - 53
PyTorch/SpeechSynthesis/FastPitch/README.md

@@ -197,7 +197,7 @@ The following section lists the requirements that you need to meet in order to s
 
 This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
 -   [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
--   [PyTorch 21.05-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
+-   [PyTorch 22.08-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
 or newer
 - supported GPUs:
     - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
@@ -319,7 +319,7 @@ The repository is structured similarly to the [NVIDIA Tacotron2 Deep Learning ex
 In this section, we list the most important hyperparameters and command-line arguments,
 together with their default values that are used to train FastPitch.
 
-* `--epochs` - number of epochs (default: 1500)
+* `--epochs` - number of epochs (default: 1000)
 * `--learning-rate` - learning rate (default: 0.1)
 * `--batch-size` - batch size for a single forward-backward step (default: 16)
 * `--grad-accumulation` - number of steps over which gradients are accumulated (default: 2)
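Together, these flags determine the effective global batch size seen by the optimizer: per-GPU batch size times gradient-accumulation steps times the number of GPUs. A minimal sketch of this standard relationship (the function name is illustrative, not part of the repository):

```python
def effective_batch_size(batch_per_gpu: int, grad_accumulation: int, num_gpus: int) -> int:
    """Global batch size per optimizer update step."""
    return batch_per_gpu * grad_accumulation * num_gpus

# With the defaults above on a single GPU: 16 * 2 * 1
print(effective_batch_size(16, 2, 1))  # 32
```

This explains why the multi-GPU configurations in the performance tables below vary `--grad-accumulation` inversely with GPU count: the global batch size stays constant.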
@@ -542,7 +542,7 @@ and accuracy in training and inference.
 
 ##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
 
-Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the 21.05-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
+Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the PyTorch 21.05-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
 
 | Loss (Model/Epoch)   |    50 |   250 |   500 |   750 |  1000 |  1250 |  1500 |
 |:---------------------|------:|------:|------:|------:|------:|------:|------:|
@@ -570,50 +570,49 @@ All of the results were produced using the `train.py` script as described in the
 
 ##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
 
-Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the 21.05-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers, in output mel-scale spectrogram frames per second, were averaged over
+Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the PyTorch 22.08-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers, in output mel-scale spectrogram frames per second, were averaged over
 an entire training epoch.
 
-| Batch size / GPU | Grad accumulation | GPUs | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
-|---:|--:|--:|--------:|--------:|-----:|-----:|-----:|
-| 32 | 8 | 1 |  97,735 | 101,730 | 1.04 | 1.00 | 1.00 |
-| 32 | 2 | 4 | 337,163 | 352,300 | 1.04 | 3.45 | 3.46 |
-| 32 | 1 | 8 | 599,221 | 623,498 | 1.04 | 6.13 | 6.13 |
+| Batch size / GPU | GPUs | Grad accumulation | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Strong scaling - TF32 | Strong scaling - mixed precision |
+|-----:|--:|---:|--------:|----------:|--------:|-----:|------:|
+|  128 | 1 |  2 | 141,028 |   148,149 |    1.05 | 1.00 |  1.00 |
+|   64 | 4 |  1 | 525,879 |   614,857 |    1.17 | 3.73 |  4.15 |
+|   32 | 8 |  1 | 914,350 | 1,022,722 |    1.12 | 6.48 |  6.90 |
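The derived columns in the table above can be recomputed from the raw throughputs. A sketch, assuming speed-up is the ratio of mixed-precision to TF32 throughput and strong scaling is the ratio of multi-GPU to single-GPU throughput at a fixed global batch size:

```python
def speedup(mixed_precision: float, baseline: float) -> float:
    """Throughput ratio of mixed precision over the full-precision baseline."""
    return round(mixed_precision / baseline, 2)

def strong_scaling(multi_gpu: float, single_gpu: float) -> float:
    """Throughput ratio over the 1-GPU run at the same global batch size."""
    return round(multi_gpu / single_gpu, 2)

# Values from the DGX A100 table above (frames/sec)
print(speedup(148_149, 141_028))         # 1.05
print(strong_scaling(525_879, 141_028))  # 3.73
```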
 
 ###### Expected training time
 
-The following table shows the expected training time for convergence for 1500 epochs:
+The following table shows the expected training time for convergence for 1000 epochs:
 
 | Batch size / GPU | GPUs | Grad accumulation | Time to train with TF32 (Hrs) | Time to train with mixed precision (Hrs) | Speed-up with mixed precision|
-|---:|--:|--:|-----:|-----:|-----:|
-| 32 | 1 | 8 | 32.8 | 31.6 | 1.04 |
-| 32 | 4 | 2 |  9.6 |  9.2 | 1.04 |
-| 32 | 8 | 1 |  5.5 |  5.3 | 1.04 |
+|----:|--:|--:|-----:|-----:|-----:|
+| 128 | 1 | 2 | 14.5 | 13.8 | 1.05 |
+| 64  | 4 | 1 |  4.1 |  3.3 | 1.17 |
+| 32  | 8 | 1 |  2.2 |  2.0 | 1.12 |
 
 ##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
 
 Our results were obtained by running the `./platform/DGX1_FastPitch_{AMP,FP32}_8GPU.sh`
-training script in the PyTorch 21.05-py3 NGC container on NVIDIA DGX-1 with
+training script in the PyTorch 22.08-py3 NGC container on NVIDIA DGX-1 with
 8x V100 16GB GPUs. Performance numbers, in output mel-scale spectrogram frames per second, were averaged over
 an entire training epoch.
 
 | Batch size / GPU | GPUs | Grad accumulation | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Strong scaling - FP32 | Strong scaling - mixed precision |
-|---:|--:|---:|--------:|--------:|-----:|-----:|-----:|
-| 16 | 1 | 16 |  33,456 |  63,986 | 1.91 | 1.00 | 1.00 |
-| 16 | 4 |  4 | 120,393 | 209,335 | 1.74 | 3.60 | 3.27 |
-| 16 | 8 |  2 | 222,161 | 356,522 | 1.60 | 6.64 | 5.57 |
-
+|-----:|---:|-----:|---------:|----------:|--------:|-----:|------:|
+|   16 |  1 |   16 |   31,863 |    83,761 |    2.63 | 1.00 |  1.00 |
+|   16 |  4 |    4 |  117,971 |   269,143 |    2.28 | 3.70 |  3.21 |
+|   16 |  8 |    2 |  225,826 |   435,799 |    1.93 | 7.09 |  5.20 |
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
 ###### Expected training time
 
-The following table shows the expected training time for convergence for 1500 epochs:
+The following table shows the expected training time for convergence for 1000 epochs:
 
 | Batch size / GPU | GPUs | Grad accumulation | Time to train with FP32 (Hrs) | Time to train with mixed precision (Hrs) | Speed-up with mixed precision|
 |---:|--:|---:|-----:|-----:|-----:|
-| 16 | 1 | 16 | 89.3 | 47.4 | 1.91 |
-| 16 | 4 |  4 | 24.9 | 14.6 | 1.74 |
-| 16 | 8 |  2 | 13.6 |  8.6 | 1.60 |
+| 16 | 1 | 16 | 64.2 | 24.4 | 2.63 |
+| 16 | 4 |  4 | 17.4 |  7.6 | 2.28 |
+| 16 | 8 |  2 |  9.1 |  4.7 | 1.93 |
 
 Note that most of the quality is achieved after the initial 1000 epochs.
 
@@ -628,46 +627,110 @@ The used WaveGlow model is a 256-channel model.
 Note that performance numbers are related to the length of input. The numbers reported below were taken with a moderate length of 128 characters. Longer utterances yield higher RTF, as the generator is fully parallel.
 ##### Inference performance: NVIDIA DGX A100 (1x A100 80GB)
 
-Our results were obtained by running the `./scripts/inference_benchmark.sh` inferencing benchmarking script in the 21.05-py3 NGC container on NVIDIA DGX A100 (1x A100 80GB) GPU.
+Our results were obtained by running the `./scripts/inference_benchmark.sh` inference benchmarking script in the PyTorch 22.08-py3 NGC container on NVIDIA DGX A100 (1x A100 80GB) GPU.
+
+FastPitch (TorchScript, denoising)
+|   Batch size | Precision   | Avg latency (s)   | Latency tolerance interval 90% (s)   | Latency tolerance interval 95% (s)   | Latency tolerance interval 99% (s)   | Throughput (frames/sec)   | Speed-up with mixed precision   | Avg RTF   |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        | 0.005            | 0.006                               | 0.006                               | 0.006                               | 120,333                    | 0.97                           | 1397.07  |
+|            4 | FP16        | 0.006            | 0.006                               | 0.006                               | 0.006                               | 424,053                    | 1.12                           | 1230.81  |
+|            8 | FP16        | 0.008            | 0.010                               | 0.010                               | 0.011                               | 669,549                    | 1.12                           | 971.68   |
+|            1 | TF32        | 0.005            | 0.006                               | 0.006                               | 0.007                               | 123,718                    | -                               | 1436.37  |
+|            4 | TF32        | 0.007            | 0.007                               | 0.007                               | 0.007                               | 379,980                    | -                               | 1102.89  |
+|            8 | TF32        | 0.009            | 0.009                               | 0.009                               | 0.009                               | 600,435                    | -                               | 871.38   |
+
+FastPitch + HiFi-GAN (TorchScript, denoising)
+|   Batch size | Precision   | Avg latency (s)   | Latency tolerance interval 90% (s)   | Latency tolerance interval 95% (s)   | Latency tolerance interval 99% (s)   | Throughput (samples/sec)   | Speed-up with mixed precision   | Avg RTF   |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        | 0.015            | 0.016                               | 0.016                               | 0.016                               | 11,431,335                 | 1.28                           | 518.43   |
+|            4 | FP16        | 0.038            | 0.040                               | 0.040                               | 0.040                               | 17,670,528                 | 1.42                           | 200.35   |
+|            8 | FP16        | 0.069            | 0.069                               | 0.070                               | 0.070                               | 19,750,759                 | 1.46                           | 111.97   |
+|            1 | TF32        | 0.019            | 0.020                               | 0.020                               | 0.020                               | 8,912,296                  | -                               | 404.19   |
+|            4 | TF32        | 0.054            | 0.055                               | 0.055                               | 0.055                               | 12,471,624                 | -                               | 141.40   |
+|            8 | TF32        | 0.100            | 0.100                               | 0.100                               | 0.101                               | 13,543,317                 | -                               | 76.78    |
+
+FastPitch + WaveGlow (TorchScript, denoising)
+|   Batch size | Precision   | Avg latency (s)   | Latency tolerance interval 90% (s)   | Latency tolerance interval 95% (s)   | Latency tolerance interval 99% (s)   | Throughput (samples/sec)   | Speed-up with mixed precision   | Avg RTF   |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        | 0.076            | 0.077                               | 0.077                               | 0.078                               | 2,223,336                  | 1.38                           | 100.83   |
+|            4 | FP16        | 0.265            | 0.267                               | 0.267                               | 0.267                               | 2,552,577                  | 1.36                           | 28.94    |
+|            8 | FP16        | 0.515            | 0.515                               | 0.516                               | 0.516                               | 2,630,328                  | 1.37                           | 14.91    |
+|            1 | TF32        | 0.105            | 0.106                               | 0.106                               | 0.107                               | 1,610,266                  | -                               | 73.03    |
+|            4 | TF32        | 0.362            | 0.363                               | 0.363                               | 0.363                               | 1,872,327                  | -                               | 21.23    |
+|            8 | TF32        | 0.708            | 0.709                               | 0.709                               | 0.709                               | 1,915,577                  | -                               | 10.86    |
 
-|Batch size|Precision|Avg latency (s)|Latency tolerance interval 90% (s)|Latency tolerance interval 95% (s)|Latency tolerance interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg RTF|
-|-----:|-------:|----------:|--------:|--------:|--------:|---------------:|----------:|------:|
-|    1 | FP16   |     0.091 |   0.092 |   0.092 |   0.092 |      1,879,189 | 1.28      | 85.22 |
-|    4 | FP16   |     0.335 |   0.337 |   0.337 |   0.338 |      2,043,641 | 1.21      | 23.17 |
-|    8 | FP16   |     0.652 |   0.654 |   0.654 |   0.655 |      2,103,765 | 1.21      | 11.93 |
-|    1 | TF32   |     0.117 |   0.117 |   0.118 |   0.118 |      1,473,838 | -         | 66.84 |
-|    4 | TF32   |     0.406 |   0.408 |   0.408 |   0.409 |      1,688,141 | -         | 19.14 |
-|    8 | TF32   |     0.792 |   0.794 |   0.794 |   0.795 |      1,735,463 | -         |  9.84 |
 
 ##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
 
 Our results were obtained by running the `./scripts/inference_benchmark.sh` script in
-the PyTorch 21.05-py3 NGC container. The input utterance has 128 characters, synthesized audio has 8.05 s.
-
+the PyTorch 22.08-py3 NGC container. The input utterance has 128 characters, synthesized audio has 8.05 s.
+
+FastPitch (TorchScript, denoising)
+|   Batch size | Precision   | Avg latency (s)   | Latency tolerance interval 90% (s)   | Latency tolerance interval 95% (s)   | Latency tolerance interval 99% (s)   | Throughput (frames/sec)   | Speed-up with mixed precision   | Avg RTF   |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        | 0.007            | 0.008                               | 0.008                               | 0.008                               | 88,908                     | 1.10                           | 1032.23  |
+|            4 | FP16        | 0.010            | 0.010                               | 0.010                               | 0.010                               | 272,564                    | 1.73                           | 791.12   |
+|            8 | FP16        | 0.013            | 0.013                               | 0.013                               | 0.013                               | 415,263                    | 2.35                           | 602.65   |
+|            1 | FP32        | 0.008            | 0.008                               | 0.008                               | 0.009                               | 80,558                     | -                               | 935.28   |
+|            4 | FP32        | 0.017            | 0.017                               | 0.017                               | 0.017                               | 157,114                    | -                               | 456.02   |
+|            8 | FP32        | 0.030            | 0.030                               | 0.030                               | 0.030                               | 176,754                    | -                               | 256.51   |
+
+FastPitch + HiFi-GAN (TorchScript, denoising)
+|   Batch size | Precision   | Avg latency (s)   | Latency tolerance interval 90% (s)   | Latency tolerance interval 95% (s)   | Latency tolerance interval 99% (s)   | Throughput (samples/sec)   | Speed-up with mixed precision   | Avg RTF   |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        | 0.025            | 0.025                               | 0.025                               | 0.025                               | 6,788,274                  | 2.09                           | 307.86   |
+|            4 | FP16        | 0.067            | 0.068                               | 0.068                               | 0.068                               | 10,066,291                 | 2.63                           | 114.13   |
+|            8 | FP16        | 0.123            | 0.124                               | 0.124                               | 0.124                               | 10,992,774                 | 2.78                           | 62.32    |
+|            1 | FP32        | 0.052            | 0.053                               | 0.053                               | 0.053                               | 3,246,699                  | -                               | 147.24   |
+|            4 | FP32        | 0.177            | 0.178                               | 0.179                               | 0.179                               | 3,829,018                  | -                               | 43.41    |
+|            8 | FP32        | 0.343            | 0.345                               | 0.345                               | 0.346                               | 3,953,920                  | -                               | 22.41    |
+
+FastPitch + WaveGlow (TorchScript, denoising)
+|   Batch size | Precision   | Avg latency (s)   | Latency tolerance interval 90% (s)   | Latency tolerance interval 95% (s)   | Latency tolerance interval 99% (s)   | Throughput (samples/sec)   | Speed-up with mixed precision   | Avg RTF   |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        | 0.134            | 0.135                               | 0.135                               | 0.135                               | 1,259,550                  | 2.89                           | 57.12    |
+|            4 | FP16        | 0.503            | 0.504                               | 0.505                               | 0.505                               | 1,346,145                  | 2.88                           | 15.26    |
+|            8 | FP16        | 0.995            | 0.999                               | 0.999                               | 1.001                               | 1,360,952                  | 2.89                           | 7.72     |
+|            1 | FP32        | 0.389            | 0.391                               | 0.392                               | 0.393                               | 435,564                    | -                               | 19.75    |
+|            4 | FP32        | 1.453            | 1.455                               | 1.456                               | 1.457                               | 466,685                    | -                               | 5.29     |
+|            8 | FP32        | 2.875            | 2.879                               | 2.880                               | 2.882                               | 471,602                    | -                               | 2.67     |
 
-|Batch size|Precision|Avg latency (s)|Latency tolerance interval 90% (s)|Latency tolerance interval 95% (s)|Latency tolerance interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg RTF|
-|-----:|-------:|----------:|--------:|--------:|--------:|---------------:|----------:|------:|
-|    1 | FP16   |     0.149 |   0.150 |   0.150 |   0.151 |      1,154,061 | 2.64      | 52.34 |
-|    4 | FP16   |     0.535 |   0.538 |   0.538 |   0.539 |      1,282,680 | 2.71      | 14.54 |
-|    8 | FP16   |     1.055 |   1.058 |   1.059 |   1.060 |      1,300,261 | 2.71      |  7.37 |
-|    1 | FP32   |     0.393 |   0.395 |   0.395 |   0.396 |        436,961 | -         | 19.82 |
-|    4 | FP32   |     1.449 |   1.452 |   1.452 |   1.453 |        473,515 | -         |  5.37 |
-|    8 | FP32   |     2.861 |   2.865 |   2.866 |   2.867 |        479,642 | -         |  2.72 |
 
 ##### Inference performance: NVIDIA T4
 
 Our results were obtained by running the `./scripts/inference_benchmark.sh` script in
-the PyTorch 21.05-py3 NGC container.
+the PyTorch 22.08-py3 NGC container.
 The input utterance has 128 characters, synthesized audio has 8.05 s.
 
-|Batch size|Precision|Avg latency (s)|Latency tolerance interval 90% (s)|Latency tolerance interval 95% (s)|Latency tolerance interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg RTF|
-|-----:|-------:|----------:|--------:|--------:|--------:|--------------:|----------:|------:|
-|    1 | FP16   |     0.446 |   0.449 |   0.449 |   0.450 |       384,743 | 2.72      | 17.45 |
-|    4 | FP16   |     1.822 |   1.826 |   1.827 |   1.828 |       376,480 | 2.70      |  4.27 |
-|    8 | FP16   |     3.656 |   3.662 |   3.664 |   3.666 |       375,329 | 2.70      |  2.13 |
-|    1 | FP32   |     1.213 |   1.218 |   1.219 |   1.220 |       141,403 | -         |  6.41 |
-|    4 | FP32   |     4.928 |   4.937 |   4.939 |   4.942 |       139,208 | -         |  1.58 |
-|    8 | FP32   |     9.853 |   9.868 |   9.871 |   9.877 |       139,266 | -         |  0.79 |
+FastPitch (TorchScript, denoising)
+|   Batch size | Precision   |   Avg latency (s) |   Latency tolerance interval 90% (s) |   Latency tolerance interval 95% (s) |   Latency tolerance interval 99% (s) | Throughput (frames/sec)   | Speed-up with mixed precision   |   Avg RTF |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        |             0.008 |                                0.008 |                                0.008 |                                0.008 | 87,937                     | 1.69                            |   1020.95 |
+|            4 | FP16        |             0.017 |                                0.017 |                                0.017 |                                0.018 | 154,880                    | 2.55                            |    449.54 |
+|            8 | FP16        |             0.029 |                                0.030 |                                0.030 |                                0.030 | 181,776                    | 2.61                            |    263.80 |
+|            1 | FP32        |             0.013 |                                0.013 |                                0.013 |                                0.013 | 52,062                     | -                               |    604.45 |
+|            4 | FP32        |             0.044 |                                0.045 |                                0.045 |                                0.045 | 60,733                     | -                               |    176.28 |
+|            8 | FP32        |             0.076 |                                0.077 |                                0.077 |                                0.077 | 69,685                     | -                               |    101.13 |
+
+FastPitch + HiFi-GAN (TorchScript, denoising)
+|   Batch size | Precision   |   Avg latency (s) |   Latency tolerance interval 90% (s) |   Latency tolerance interval 95% (s) |   Latency tolerance interval 99% (s) | Throughput (samples/sec)   | Speed-up with mixed precision   |   Avg RTF |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        |             0.055 |                                0.056 |                                0.056 |                                0.057 | 3,076,809                  | 2.55                            |    139.54 |
+|            4 | FP16        |             0.201 |                                0.203 |                                0.204 |                                0.204 | 3,360,014                  | 2.67                            |     38.10 |
+|            8 | FP16        |             0.393 |                                0.395 |                                0.396 |                                0.397 | 3,444,245                  | 2.65                            |     19.53 |
+|            1 | FP32        |             0.140 |                                0.142 |                                0.142 |                                0.142 | 1,208,678                  | -                               |     54.82 |
+|            4 | FP32        |             0.538 |                                0.542 |                                0.543 |                                0.545 | 1,260,627                  | -                               |     14.29 |
+|            8 | FP32        |             1.045 |                                1.049 |                                1.050 |                                1.051 | 1,297,726                  | -                               |      7.36 |
+
+FastPitch + WaveGlow (TorchScript, denoising)
+|   Batch size | Precision   |   Avg latency (s) |   Latency tolerance interval 90% (s) |   Latency tolerance interval 95% (s) |   Latency tolerance interval 99% (s) | Throughput (samples/sec)   | Speed-up with mixed precision   |   Avg RTF |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        |             0.409 |                                0.411 |                                0.411 |                                0.412 | 414,019                    | 2.65                            |     18.78 |
+|            4 | FP16        |             1.619 |                                1.622 |                                1.623 |                                1.624 | 418,010                    | 2.91                            |      4.74 |
+|            8 | FP16        |             3.214 |                                3.219 |                                3.220 |                                3.222 | 421,148                    | 2.72                            |      2.39 |
+|            1 | FP32        |             1.084 |                                1.087 |                                1.088 |                                1.089 | 156,345                    | -                               |      7.09 |
+|            4 | FP32        |             4.721 |                                4.735 |                                4.738 |                                4.743 | 143,585                    | -                               |      1.63 |
+|            8 | FP32        |             8.764 |                                8.777 |                                8.779 |                                8.784 | 154,694                    | -                               |      0.88 |
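The "Avg RTF" column in the tables above is a real-time factor: seconds of audio synthesized per second of wall-clock time, so values above 1 mean faster than real time. A sketch, assuming RTF = synthesized audio duration / latency (batched runs report an averaged per-sample figure):

```python
def rtf(audio_seconds: float, latency_seconds: float) -> float:
    """Real-time factor: audio produced per second of compute; > 1 is faster than real time."""
    return audio_seconds / latency_seconds

# An 8 s utterance synthesized in 2 s of wall-clock time
print(rtf(8.0, 2.0))  # 4.0
```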
 
 ## Release notes
 
@@ -675,8 +738,11 @@ We're constantly refining and improving our performance on AI and HPC workloads
 
 ### Changelog
 
+October 2022
+- Updated performance tables
+
 July 2022
-- Performance optimizations, speedups up to 2x (DGX-1) and 2.5x (DGX A100)
+- Performance optimizations, speedups up to 1.2x (DGX-1) and 1.6x (DGX A100)
 
 June 2022
 - MHA bug fix affecting models with > 1 attention heads