
Merge: [FastPitch/PyT] Bump container to 22.08, update perf results

Krzysztof Kudrynski, 3 years ago
Parent
Commit
84be38e330
2 changed files with 120 additions and 54 deletions
  1. PyTorch/SpeechSynthesis/FastPitch/Dockerfile (+1 −1)
  2. PyTorch/SpeechSynthesis/FastPitch/README.md (+119 −53)

+ 1 - 1
PyTorch/SpeechSynthesis/FastPitch/Dockerfile

@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:21.05-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.08-py3
 FROM ${FROM_IMAGE_NAME}
 
 ENV PYTHONPATH /workspace/fastpitch
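
Because the base image is exposed as a build argument, the container can be rebuilt against the pinned tag, or the tag can be overridden at build time without editing the Dockerfile. A sketch using the standard docker CLI (the image name `fastpitch` is illustrative):

```bash
# Build with the base image pinned in the Dockerfile
docker build -t fastpitch .

# Or override the base image at build time, e.g. to try another NGC release
docker build -t fastpitch --build-arg FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:22.08-py3 .
```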

+ 119 - 53
PyTorch/SpeechSynthesis/FastPitch/README.md

@@ -197,7 +197,7 @@ The following section lists the requirements that you need to meet in order to s
 
 This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
 -   [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
--   [PyTorch 21.05-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
+-   [PyTorch 22.08-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
 or newer
 - supported GPUs:
     - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
@@ -319,7 +319,7 @@ The repository is structured similarly to the [NVIDIA Tacotron2 Deep Learning ex
 In this section, we list the most important hyperparameters and command-line arguments,
 together with their default values that are used to train FastPitch.
 
-* `--epochs` - number of epochs (default: 1500)
+* `--epochs` - number of epochs (default: 1000)
 * `--learning-rate` - learning rate (default: 0.1)
 * `--batch-size` - batch size for a single forward-backward step (default: 16)
 * `--grad-accumulation` - number of steps over which gradients are accumulated (default: 2)
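
Taken together, the defaults above correspond to an invocation along these lines (an illustrative sketch only; the platform scripts under `./platform/` set the full flag set for each system):

```bash
python train.py \
    --epochs 1000 \
    --learning-rate 0.1 \
    --batch-size 16 \
    --grad-accumulation 2
```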
@@ -542,7 +542,7 @@ and accuracy in training and inference.
 
 ##### Training accuracy: NVIDIA DGX A100 (8x A100 80GB)
 
-Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the 21.05-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
+Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the PyTorch 21.05-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs.
 
 | Loss (Model/Epoch)   |    50 |   250 |   500 |   750 |  1000 |  1250 |  1500 |
 |:---------------------|------:|------:|------:|------:|------:|------:|------:|
@@ -570,50 +570,49 @@ All of the results were produced using the `train.py` script as described in the
 
 ##### Training performance: NVIDIA DGX A100 (8x A100 80GB)
 
-Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the 21.05-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers, in output mel-scale spectrogram frames per second, were averaged over
+Our results were obtained by running the `./platform/DGXA100_FastPitch_{AMP,TF32}_8GPU.sh` training script in the PyTorch 22.08-py3 NGC container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers, in output mel-scale spectrogram frames per second, were averaged over
 an entire training epoch.
 
-| Batch size / GPU | Grad accumulation | GPUs | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
-|---:|--:|--:|--------:|--------:|-----:|-----:|-----:|
-| 32 | 8 | 1 |  97,735 | 101,730 | 1.04 | 1.00 | 1.00 |
-| 32 | 2 | 4 | 337,163 | 352,300 | 1.04 | 3.45 | 3.46 |
-| 32 | 1 | 8 | 599,221 | 623,498 | 1.04 | 6.13 | 6.13 |
+| Batch size / GPU | GPUs | Grad accumulation | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) | Strong scaling - TF32 | Strong scaling - mixed precision |
+|-----:|--:|---:|--------:|----------:|--------:|-----:|------:|
+|  128 | 1 |  2 | 141,028 |   148,149 |    1.05 | 1.00 |  1.00 |
+|   64 | 4 |  1 | 525,879 |   614,857 |    1.17 | 3.73 |  4.15 |
+|   32 | 8 |  1 | 914,350 | 1,022,722 |    1.12 | 6.48 |  6.90 |
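
The three rows in the new table vary batch size, GPU count, and gradient accumulation jointly. Assuming the effective global batch is per-GPU batch × GPUs × grad accumulation, every configuration trains with the same global batch of 256, so the runs are comparable; a quick check:

```python
# Effective global batch = per-GPU batch size * number of GPUs * grad accumulation steps.
def global_batch(per_gpu_batch: int, gpus: int, grad_accumulation: int) -> int:
    return per_gpu_batch * gpus * grad_accumulation

# (batch/GPU, GPUs, grad accumulation) from the DGX A100 throughput table
configs = [(128, 1, 2), (64, 4, 1), (32, 8, 1)]
print([global_batch(*c) for c in configs])  # -> [256, 256, 256]
```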
 
 ###### Expected training time
 
-The following table shows the expected training time for convergence for 1500 epochs:
+The following table shows the expected training time for convergence for 1000 epochs:
 
 | Batch size / GPU | GPUs | Grad accumulation | Time to train with TF32 (Hrs) | Time to train with mixed precision (Hrs) | Speed-up with mixed precision|
-|---:|--:|--:|-----:|-----:|-----:|
-| 32 | 1 | 8 | 32.8 | 31.6 | 1.04 |
-| 32 | 4 | 2 |  9.6 |  9.2 | 1.04 |
-| 32 | 8 | 1 |  5.5 |  5.3 | 1.04 |
+|----:|--:|--:|-----:|-----:|-----:|
+| 128 | 1 | 2 | 14.5 | 13.8 | 1.05 |
+| 64  | 4 | 1 |  4.1 |  3.3 | 1.17 |
+| 32  | 8 | 1 |  2.2 |  2.0 | 1.12 |
 
 ##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
 
 Our results were obtained by running the `./platform/DGX1_FastPitch_{AMP,FP32}_8GPU.sh`
-training script in the PyTorch 21.05-py3 NGC container on NVIDIA DGX-1 with
+training script in the PyTorch 22.08-py3 NGC container on NVIDIA DGX-1 with
 8x V100 16GB GPUs. Performance numbers, in output mel-scale spectrogram frames per second, were averaged over
 an entire training epoch.
 
 | Batch size / GPU | GPUs | Grad accumulation | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) | Strong scaling - FP32 | Strong scaling - mixed precision |
-|---:|--:|---:|--------:|--------:|-----:|-----:|-----:|
-| 16 | 1 | 16 |  33,456 |  63,986 | 1.91 | 1.00 | 1.00 |
-| 16 | 4 |  4 | 120,393 | 209,335 | 1.74 | 3.60 | 3.27 |
-| 16 | 8 |  2 | 222,161 | 356,522 | 1.60 | 6.64 | 5.57 |
-
+|-----:|---:|-----:|---------:|----------:|--------:|-----:|------:|
+|   16 |  1 |   16 |   31,863 |    83,761 |    2.63 | 1.00 |  1.00 |
+|   16 |  4 |    4 |  117,971 |   269,143 |    2.28 | 3.70 |  3.21 |
+|   16 |  8 |    2 |  225,826 |   435,799 |    1.93 | 7.09 |  5.20 |
 
 To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
 ###### Expected training time
 
-The following table shows the expected training time for convergence for 1500 epochs:
+The following table shows the expected training time for convergence for 1000 epochs:
 
 | Batch size / GPU | GPUs | Grad accumulation | Time to train with FP32 (Hrs) | Time to train with mixed precision (Hrs) | Speed-up with mixed precision|
 |---:|--:|---:|-----:|-----:|-----:|
-| 16 | 1 | 16 | 89.3 | 47.4 | 1.91 |
-| 16 | 4 |  4 | 24.9 | 14.6 | 1.74 |
-| 16 | 8 |  2 | 13.6 |  8.6 | 1.60 |
+| 16 | 1 | 16 | 64.2 | 24.4 | 2.63 |
+| 16 | 4 |  4 | 17.4 |  7.6 | 2.28 |
+| 16 | 8 |  2 |  9.1 |  4.7 | 1.93 |
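
The speed-up column is simply the ratio of the two training times (equivalently, of the two throughputs); a minimal check against the single-GPU DGX-1 row above:

```python
# Mixed-precision speed-up = time to train with FP32 / time with mixed precision,
# using the 1-GPU DGX-1 row from the table above (64.2 h vs. 24.4 h).
fp32_hours = 64.2
amp_hours = 24.4
speedup = fp32_hours / amp_hours
print(round(speedup, 2))  # -> 2.63
```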
 
 Note that most of the quality is achieved after the initial 1000 epochs.
 
@@ -628,46 +627,110 @@ The used WaveGlow model is a 256-channel model.
 Note that performance numbers are related to the length of input. The numbers reported below were taken with a moderate length of 128 characters. Longer utterances yield higher RTF, as the generator is fully parallel.
 ##### Inference performance: NVIDIA DGX A100 (1x A100 80GB)
 
-Our results were obtained by running the `./scripts/inference_benchmark.sh` inferencing benchmarking script in the 21.05-py3 NGC container on NVIDIA DGX A100 (1x A100 80GB) GPU.
+Our results were obtained by running the `./scripts/inference_benchmark.sh` inferencing benchmarking script in the PyTorch 22.08-py3 NGC container on NVIDIA DGX A100 (1x A100 80GB) GPU.
+
+FastPitch (TorchScript, denoising)
+|   Batch size | Precision   | Avg latency (s)   | Latency tolerance interval 90% (s)   | Latency tolerance interval 95% (s)   | Latency tolerance interval 99% (s)   | Throughput (frames/sec)   | Speed-up with mixed precision   | Avg RTF   |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        | 0.005            | 0.006                               | 0.006                               | 0.006                               | 120,333                    | 0.97                           | 1397.07  |
+|            4 | FP16        | 0.006            | 0.006                               | 0.006                               | 0.006                               | 424,053                    | 1.12                           | 1230.81  |
+|            8 | FP16        | 0.008            | 0.010                               | 0.010                               | 0.011                               | 669,549                    | 1.12                           | 971.68   |
+|            1 | TF32        | 0.005            | 0.006                               | 0.006                               | 0.007                               | 123,718                    | -                               | 1436.37  |
+|            4 | TF32        | 0.007            | 0.007                               | 0.007                               | 0.007                               | 379,980                    | -                               | 1102.89  |
+|            8 | TF32        | 0.009            | 0.009                               | 0.009                               | 0.009                               | 600,435                    | -                               | 871.38   |
+
+FastPitch + HiFi-GAN (TorchScript, denoising)
+|   Batch size | Precision   | Avg latency (s)   | Latency tolerance interval 90% (s)   | Latency tolerance interval 95% (s)   | Latency tolerance interval 99% (s)   | Throughput (samples/sec)   | Speed-up with mixed precision   | Avg RTF   |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        | 0.015            | 0.016                               | 0.016                               | 0.016                               | 11,431,335                 | 1.28                           | 518.43   |
+|            4 | FP16        | 0.038            | 0.040                               | 0.040                               | 0.040                               | 17,670,528                 | 1.42                           | 200.35   |
+|            8 | FP16        | 0.069            | 0.069                               | 0.070                               | 0.070                               | 19,750,759                 | 1.46                           | 111.97   |
+|            1 | TF32        | 0.019            | 0.020                               | 0.020                               | 0.020                               | 8,912,296                  | -                               | 404.19   |
+|            4 | TF32        | 0.054            | 0.055                               | 0.055                               | 0.055                               | 12,471,624                 | -                               | 141.40   |
+|            8 | TF32        | 0.100            | 0.100                               | 0.100                               | 0.101                               | 13,543,317                 | -                               | 76.78    |
+
+FastPitch + WaveGlow (TorchScript, denoising)
+|   Batch size | Precision   | Avg latency (s)   | Latency tolerance interval 90% (s)   | Latency tolerance interval 95% (s)   | Latency tolerance interval 99% (s)   | Throughput (samples/sec)   | Speed-up with mixed precision   | Avg RTF   |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        | 0.076            | 0.077                               | 0.077                               | 0.078                               | 2,223,336                  | 1.38                           | 100.83   |
+|            4 | FP16        | 0.265            | 0.267                               | 0.267                               | 0.267                               | 2,552,577                  | 1.36                           | 28.94    |
+|            8 | FP16        | 0.515            | 0.515                               | 0.516                               | 0.516                               | 2,630,328                  | 1.37                           | 14.91    |
+|            1 | TF32        | 0.105            | 0.106                               | 0.106                               | 0.107                               | 1,610,266                  | -                               | 73.03    |
+|            4 | TF32        | 0.362            | 0.363                               | 0.363                               | 0.363                               | 1,872,327                  | -                               | 21.23    |
+|            8 | TF32        | 0.708            | 0.709                               | 0.709                               | 0.709                               | 1,915,577                  | -                               | 10.86    |
 
-|Batch size|Precision|Avg latency (s)|Latency tolerance interval 90% (s)|Latency tolerance interval 95% (s)|Latency tolerance interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg RTF|
-|-----:|-------:|----------:|--------:|--------:|--------:|---------------:|----------:|------:|
-|    1 | FP16   |     0.091 |   0.092 |   0.092 |   0.092 |      1,879,189 | 1.28      | 85.22 |
-|    4 | FP16   |     0.335 |   0.337 |   0.337 |   0.338 |      2,043,641 | 1.21      | 23.17 |
-|    8 | FP16   |     0.652 |   0.654 |   0.654 |   0.655 |      2,103,765 | 1.21      | 11.93 |
-|    1 | TF32   |     0.117 |   0.117 |   0.118 |   0.118 |      1,473,838 | -         | 66.84 |
-|    4 | TF32   |     0.406 |   0.408 |   0.408 |   0.409 |      1,688,141 | -         | 19.14 |
-|    8 | TF32   |     0.792 |   0.794 |   0.794 |   0.795 |      1,735,463 | -         |  9.84 |
 
 ##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
 
 Our results were obtained by running the `./scripts/inference_benchmark.sh` script in
-the PyTorch 21.05-py3 NGC container. The input utterance has 128 characters, synthesized audio has 8.05 s.
-
+the PyTorch 22.08-py3 NGC container. The input utterance has 128 characters, synthesized audio has 8.05 s.
+
+FastPitch (TorchScript, denoising)
+|   Batch size | Precision   | Avg latency (s)   | Latency tolerance interval 90% (s)   | Latency tolerance interval 95% (s)   | Latency tolerance interval 99% (s)   | Throughput (frames/sec)   | Speed-up with mixed precision   | Avg RTF   |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        | 0.007            | 0.008                               | 0.008                               | 0.008                               | 88,908                     | 1.10                           | 1032.23  |
+|            4 | FP16        | 0.010            | 0.010                               | 0.010                               | 0.010                               | 272,564                    | 1.73                           | 791.12   |
+|            8 | FP16        | 0.013            | 0.013                               | 0.013                               | 0.013                               | 415,263                    | 2.35                           | 602.65   |
+|            1 | FP32        | 0.008            | 0.008                               | 0.008                               | 0.009                               | 80,558                     | -                               | 935.28   |
+|            4 | FP32        | 0.017            | 0.017                               | 0.017                               | 0.017                               | 157,114                    | -                               | 456.02   |
+|            8 | FP32        | 0.030            | 0.030                               | 0.030                               | 0.030                               | 176,754                    | -                               | 256.51   |
+
+FastPitch + HiFi-GAN (TorchScript, denoising)
+|   Batch size | Precision   | Avg latency (s)   | Latency tolerance interval 90% (s)   | Latency tolerance interval 95% (s)   | Latency tolerance interval 99% (s)   | Throughput (samples/sec)   | Speed-up with mixed precision   | Avg RTF   |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        | 0.025            | 0.025                               | 0.025                               | 0.025                               | 6,788,274                  | 2.09                           | 307.86   |
+|            4 | FP16        | 0.067            | 0.068                               | 0.068                               | 0.068                               | 10,066,291                 | 2.63                           | 114.13   |
+|            8 | FP16        | 0.123            | 0.124                               | 0.124                               | 0.124                               | 10,992,774                 | 2.78                           | 62.32    |
+|            1 | FP32        | 0.052            | 0.053                               | 0.053                               | 0.053                               | 3,246,699                  | -                               | 147.24   |
+|            4 | FP32        | 0.177            | 0.178                               | 0.179                               | 0.179                               | 3,829,018                  | -                               | 43.41    |
+|            8 | FP32        | 0.343            | 0.345                               | 0.345                               | 0.346                               | 3,953,920                  | -                               | 22.41    |
+
+FastPitch + WaveGlow (TorchScript, denoising)
+|   Batch size | Precision   | Avg latency (s)   | Latency tolerance interval 90% (s)   | Latency tolerance interval 95% (s)   | Latency tolerance interval 99% (s)   | Throughput (samples/sec)   | Speed-up with mixed precision   | Avg RTF   |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        | 0.134            | 0.135                               | 0.135                               | 0.135                               | 1,259,550                  | 2.89                           | 57.12    |
+|            4 | FP16        | 0.503            | 0.504                               | 0.505                               | 0.505                               | 1,346,145                  | 2.88                           | 15.26    |
+|            8 | FP16        | 0.995            | 0.999                               | 0.999                               | 1.001                               | 1,360,952                  | 2.89                           | 7.72     |
+|            1 | FP32        | 0.389            | 0.391                               | 0.392                               | 0.393                               | 435,564                    | -                               | 19.75    |
+|            4 | FP32        | 1.453            | 1.455                               | 1.456                               | 1.457                               | 466,685                    | -                               | 5.29     |
+|            8 | FP32        | 2.875            | 2.879                               | 2.880                               | 2.882                               | 471,602                    | -                               | 2.67     |
 
-|Batch size|Precision|Avg latency (s)|Latency tolerance interval 90% (s)|Latency tolerance interval 95% (s)|Latency tolerance interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg RTF|
-|-----:|-------:|----------:|--------:|--------:|--------:|---------------:|----------:|------:|
-|    1 | FP16   |     0.149 |   0.150 |   0.150 |   0.151 |      1,154,061 | 2.64      | 52.34 |
-|    4 | FP16   |     0.535 |   0.538 |   0.538 |   0.539 |      1,282,680 | 2.71      | 14.54 |
-|    8 | FP16   |     1.055 |   1.058 |   1.059 |   1.060 |      1,300,261 | 2.71      |  7.37 |
-|    1 | FP32   |     0.393 |   0.395 |   0.395 |   0.396 |        436,961 | -         | 19.82 |
-|    4 | FP32   |     1.449 |   1.452 |   1.452 |   1.453 |        473,515 | -         |  5.37 |
-|    8 | FP32   |     2.861 |   2.865 |   2.866 |   2.867 |        479,642 | -         |  2.72 |
 
 ##### Inference performance: NVIDIA T4
 
 Our results were obtained by running the `./scripts/inference_benchmark.sh` script in
-the PyTorch 21.05-py3 NGC container.
+the PyTorch 22.08-py3 NGC container.
 The input utterance has 128 characters, synthesized audio has 8.05 s.
 
-|Batch size|Precision|Avg latency (s)|Latency tolerance interval 90% (s)|Latency tolerance interval 95% (s)|Latency tolerance interval 99% (s)|Throughput (samples/sec)|Speed-up with mixed precision|Avg RTF|
-|-----:|-------:|----------:|--------:|--------:|--------:|--------------:|----------:|------:|
-|    1 | FP16   |     0.446 |   0.449 |   0.449 |   0.450 |       384,743 | 2.72      | 17.45 |
-|    4 | FP16   |     1.822 |   1.826 |   1.827 |   1.828 |       376,480 | 2.70      |  4.27 |
-|    8 | FP16   |     3.656 |   3.662 |   3.664 |   3.666 |       375,329 | 2.70      |  2.13 |
-|    1 | FP32   |     1.213 |   1.218 |   1.219 |   1.220 |       141,403 | -         |  6.41 |
-|    4 | FP32   |     4.928 |   4.937 |   4.939 |   4.942 |       139,208 | -         |  1.58 |
-|    8 | FP32   |     9.853 |   9.868 |   9.871 |   9.877 |       139,266 | -         |  0.79 |
+FastPitch (TorchScript, denoising)
+|   Batch size | Precision   |   Avg latency (s) |   Latency tolerance interval 90% (s) |   Latency tolerance interval 95% (s) |   Latency tolerance interval 99% (s) | Throughput (frames/sec)   | Speed-up with mixed precision   |   Avg RTF |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        |             0.008 |                                0.008 |                                0.008 |                                0.008 | 87,937                     | 1.69                            |   1020.95 |
+|            4 | FP16        |             0.017 |                                0.017 |                                0.017 |                                0.018 | 154,880                    | 2.55                            |    449.54 |
+|            8 | FP16        |             0.029 |                                0.030 |                                0.030 |                                0.030 | 181,776                    | 2.61                            |    263.80 |
+|            1 | FP32        |             0.013 |                                0.013 |                                0.013 |                                0.013 | 52,062                     | -                               |    604.45 |
+|            4 | FP32        |             0.044 |                                0.045 |                                0.045 |                                0.045 | 60,733                     | -                               |    176.28 |
+|            8 | FP32        |             0.076 |                                0.077 |                                0.077 |                                0.077 | 69,685                     | -                               |    101.13 |
+
+FastPitch + HiFi-GAN (TorchScript, denoising)
+|   Batch size | Precision   |   Avg latency (s) |   Latency tolerance interval 90% (s) |   Latency tolerance interval 95% (s) |   Latency tolerance interval 99% (s) | Throughput (samples/sec)   | Speed-up with mixed precision   |   Avg RTF |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        |             0.055 |                                0.056 |                                0.056 |                                0.057 | 3,076,809                  | 2.55                            |    139.54 |
+|            4 | FP16        |             0.201 |                                0.203 |                                0.204 |                                0.204 | 3,360,014                  | 2.67                            |     38.10 |
+|            8 | FP16        |             0.393 |                                0.395 |                                0.396 |                                0.397 | 3,444,245                  | 2.65                            |     19.53 |
+|            1 | FP32        |             0.140 |                                0.142 |                                0.142 |                                0.142 | 1,208,678                  | -                               |     54.82 |
+|            4 | FP32        |             0.538 |                                0.542 |                                0.543 |                                0.545 | 1,260,627                  | -                               |     14.29 |
+|            8 | FP32        |             1.045 |                                1.049 |                                1.050 |                                1.051 | 1,297,726                  | -                               |      7.36 |
+
+FastPitch + WaveGlow (TorchScript, denoising)
+|   Batch size | Precision   |   Avg latency (s) |   Latency tolerance interval 90% (s) |   Latency tolerance interval 95% (s) |   Latency tolerance interval 99% (s) | Throughput (samples/sec)   | Speed-up with mixed precision   |   Avg RTF |
+|--------------|-------------|-------------------|--------------------------------------|--------------------------------------|--------------------------------------|----------------------------|---------------------------------|-----------|
+|            1 | FP16        |             0.409 |                                0.411 |                                0.411 |                                0.412 | 414,019                    | 2.65                            |     18.78 |
+|            4 | FP16        |             1.619 |                                1.622 |                                1.623 |                                1.624 | 418,010                    | 2.91                            |      4.74 |
+|            8 | FP16        |             3.214 |                                3.219 |                                3.220 |                                3.222 | 421,148                    | 2.72                            |      2.39 |
+|            1 | FP32        |             1.084 |                                1.087 |                                1.088 |                                1.089 | 156,345                    | -                               |      7.09 |
+|            4 | FP32        |             4.721 |                                4.735 |                                4.738 |                                4.743 | 143,585                    | -                               |      1.63 |
+|            8 | FP32        |             8.764 |                                8.777 |                                8.779 |                                8.784 | 154,694                    | -                               |      0.88 |
 
 ## Release notes
 
@@ -675,8 +738,11 @@ We're constantly refining and improving our performance on AI and HPC workloads
 
 ### Changelog
 
+October 2022
+- Updated performance tables
+
 July 2022
-- Performance optimizations, speedups up to 2x (DGX-1) and 2.5x (DGX A100)
+- Performance optimizations, speedups up to 1.2x (DGX-1) and 1.6x (DGX A100)
 
 June 2022
 - MHA bug fix affecting models with > 1 attention heads