@@ -12,10 +12,10 @@ Server with a custom TensorRT
- [Requirements](#requirements)
* [Quick Start Guide](#quick-start-guide)
- [Export the models](#export-the-models)
- - [Setup the TRTIS server](#setup-the-trtis-server)
- - [Setup the TRTIS client](#setup-the-trtis-client)
- - [Starting the TRTIS server](#starting-the-trtis-server)
- - [Running the TRTIS client](#running-the-trtis-client)
+ - [Setup the Triton server](#setup-the-triton-server)
+ - [Setup the Triton client](#setup-the-triton-client)
+ - [Run the Triton server](#run-the-triton-server)
+ - [Run the Triton client](#run-the-triton-client)
* [Advanced](#advanced)
- [Code structure](#code-structure)
- [Precision](#precision)
@@ -93,14 +93,14 @@ mkdir models
./export_weights.sh checkpoints/nvidia_tacotron2pyt_fp16_20190427 checkpoints/nvidia_waveglow256pyt_fp16 models/
```

-### Setup the TRTIS server
+### Setup the Triton server
```bash
./build_trtis.sh models/tacotron2.json models/waveglow.onnx models/denoiser.json
```
This will take some time as TensorRT tries out different tactics for best
performance while building the engines.

-### Setup the TRTIS client
+### Setup the Triton client

Next you need to build the client docker container. To do this, enter the
`trtis_client` directory and run the script `build_trtis_client.sh`.
@@ -111,7 +111,7 @@ cd trtis_client
cd ..
```

-### Run the TRTIS server
+### Run the Triton server

To run the server locally, use the script `run_trtis_server.sh`:
```bash
@@ -119,10 +119,10 @@ To run the server locally, use the script `run_trtis_server.sh`:
```

You can use the environment variable `NVIDIA_VISIBLE_DEVICES` to set which GPUs
-the TRTIS server sees.
+the Triton server sees.


-### Run the TRTIS client
+### Run the Triton client

Leave the server running. In another terminal, type:
```bash
@@ -142,13 +142,11 @@ to detect the end of the phrase.
### Code structure

The `src/` contains the following sub-directories:
-* `trtis`: The directory containing code for the custom TRTIS backend.
+* `trtis`: The directory containing code for the custom Triton backend.
* `trt/tacotron2`: The directory containing the Tacotron2 implementation in TensorRT.
* `trt/waveglow`: The directory containing the WaveGlow implementation in TensorRT.
* `trt/denoiser`: The directory containing the Denoiser (STFT) implementation in TensorRT.
* `trt/plugins`: The directory containing plugins used by the TensorRT engines.
-* `trt/helpers`: The directory containing scripts for exporting models from
-PyTorch.

The `trtis_client/` directory contains the code for running the client.

@@ -172,21 +170,6 @@ For all tests in these tables, we used WaveGlow with 256 residual channels.

### Performance on NVIDIA T4

-#### TensorRT \w Plugins in TRTIS
-
-Latency in this table is measured from the client sending the request, to it
-receiving back the generated audio.
-
-|Batch size|Input length|Precision|Avg latency (s)|Latency std (s)| Latency interval 90% (s)|Latency interval 95% (s)|Latency interval 99% (s)|Avg mels generated |Avg audio length (s)|Avg RTF|
-|---:|----:|-----:|------:|------:|------:|------:|------:|----:|------:|-------:|
-| 1 | 128 | FP16 | 0.49 | 0.00 | 0.49 | 0.49 | 0.50 | 564 | 6.59 | 13.48 |
-| 4 | 128 | FP16 | 1.37 | 0.01 | 1.38 | 1.38 | 1.38 | 563 | 6.54 | 4.77 |
-| 1 | 128 | FP32 | 1.30 | 0.01 | 1.30 | 1.30 | 1.31 | 567 | 6.58 | 5.08 |
-| 4 | 128 | FP32 | 3.63 | 0.01 | 3.64 | 3.64 | 3.64 | 568 | 6.59 | 1.82 |
-
-To reproduce this table, see [Running the benchmark](#running-the-benchmark)
-below.
-

#### TensorRT \w Plugins vs. PyTorch

@@ -194,12 +177,12 @@ Latency in this table is measured from just before the input sequence starts
being copied from host memory to the GPU,
to just after the generated audio finishes being copied back to the host
memory.
-That is, what is taking place in the custom backend inside of TRTIS.
+That is, it measures only what takes place in the custom backend inside of Triton.

|Framework|Batch size|Input length|Precision|Avg latency (s)|Latency std (s)| Latency interval 90% (s)|Latency interval 95% (s)|Latency interval 99% (s)| Throughput (samples/sec) | Speed-up vs. PyT FP32 | Speed-up vs. PyT FP16 | Avg mels generated |Avg audio length (s)|Avg RTF|
|------:|----:|-----:|-----------:|--------:|------:|------:|------:|------:|------:|------:|----:|------:|-------:|---:|
-| TRT \w plugins | 1 | 128 | FP16 | 0.45 | 0.00 | 0.45 | 0.45 | 0.46 | 320,950 | __3.72x__ | __3.39x__ | 564 | 6.55 | 14.59 |
-| TRT \w plugins | 1 | 128 | FP32 | 1.26 | 0.01 | 1.27 | 1.27 | 1.27 | 115,150 | __1.33x__ | __1.21x__ | 567 | 6.58 | 5.22 |
+| TRT \w plugins | 1 | 128 | FP16 | 0.40 | 0.00 | 0.40 | 0.40 | 0.40 | 369,862 | __4.27x__ | __3.90x__ | 579 | 6.72 | 16.77 |
+| TRT \w plugins | 1 | 128 | FP32 | 1.20 | 0.01 | 1.21 | 1.21 | 1.21 | 123,922 | __1.43x__ | __1.31x__ | 581 | 6.74 | 5.62 |
| PyTorch | 1 | 128 | FP16 | 1.63 | 0.07 | 1.71 | 1.73 | 1.81 | 94,758 | __1.10x__ | __1.00x__ | 601 | 6.98 | 4.30 |
| PyTorch | 1 | 128 | FP32 | 1.77 | 0.08 | 1.88 | 1.92 | 2.00 | 86,705 | __1.00x__ | __0.91x__ | 600 | 6.96 | 3.92 |

@@ -207,16 +190,36 @@ That is a __3.72x__ speedup when using TensorRT FP16 with plugins when compared
PyTorch FP32, and still a __3.39x__ speedup when compared to PyTorch FP16.

The TensorRT entries in this table can be reproduced by using the output of
-the TRTIS server, when performing the steps for [Running the
+the Triton server, when performing the steps for [Running the
benchmark](#running-the-benchmark) below.
The PyTorch entries can be reproduced by following the instructions
[here](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2).


+
+#### TensorRT \w Plugins in Triton
+
+Latency in this table is measured from the client sending the request, to it
+receiving back the generated audio. This includes network time and
+request/response formatting time, as well as the backend time shown in the
+section above.
+
+|Batch size|Input length|Precision|Avg latency (s)|Latency std (s)| Latency interval 90% (s)|Latency interval 95% (s)|Latency interval 99% (s)|Avg mels generated |Avg audio length (s)|Avg RTF|
+|---:|----:|-----:|------:|------:|------:|------:|------:|----:|------:|-------:|
+| 1 | 128 | FP16 | 0.42 | 0.00 | 0.42 | 0.42 | 0.42 | 579 | 6.72 | 15.95 |
+| 8 | 128 | FP16 | 2.55 | 0.01 | 2.56 | 2.56 | 2.57 | 571 | 6.62 | 2.60 |
+| 1 | 128 | FP32 | 1.22 | 0.01 | 1.22 | 1.23 | 1.23 | 581 | 6.75 | 5.54 |
+| 8 | 128 | FP32 | 8.64 | 0.01 | 8.68 | 8.69 | 8.71 | 569 | 6.61 | 0.72 |
+
+To reproduce this table, see [Running the benchmark](#running-the-benchmark)
+below.
+
+
+
### Running the benchmark

-Once you have performed the steps in [Setup the TRTIS server](#setup-the-trtis-server) and
-[Setup the TRTIS client](#setup-the-trtis-client), you can run the benchmark by starting the TRTIS server via:
+Once you have performed the steps in [Setup the Triton server](#setup-the-triton-server) and
+[Setup the Triton client](#setup-the-triton-client), you can run the benchmark by starting the Triton server via:
```bash
./run_trtis_server.sh
```
@@ -233,15 +236,14 @@ Replace <batch size> with the desired batch size between 1 and 32. The engines a
After some time this should produce output like:
```
Performed 1000 runs.
-batch size = 1
-input size = 128
-avg latency (s) = 0.485718
-latency std (s) = 0.00448834
-latency interval 50% (s) = 0.485836
-latency interval 90% (s) = 0.489517
-latency interval 95% (s) = 0.490613
-latency interval 99% (s) = 0.494721
-average mels generated = 564
-average audio generated (s) = 6.54803
-average real-time factor = 13.4811
+batch size = 1
+avg latency (s) = 0.421375
+latency std (s) = 0.00170839
+latency interval 50% (s) = 0.421553
+latency interval 90% (s) = 0.422805
+latency interval 95% (s) = 0.423273
+latency interval 99% (s) = 0.424153
+average mels generated = 582
+average audio generated (s) = 6.72218
+average real-time factor = 15.953
```