
[Jasper/PyT] Adding TRT support + jupyter notebooks for inference

Przemek Strzelczyk 6 years ago
Parent
Commit
2de99b5fa7
41 changed files with 2521 additions and 221 deletions
  1. PyTorch/SpeechRecognition/Jasper/.dockerignore (+1 -1)
  2. PyTorch/SpeechRecognition/Jasper/Dockerfile (+1 -11)
  3. PyTorch/SpeechRecognition/Jasper/README.md (+36 -33)
  4. PyTorch/SpeechRecognition/Jasper/configs/jasper10x5dr.toml (+1 -1)
  5. PyTorch/SpeechRecognition/Jasper/configs/jasper10x5dr_nomask.toml (+203 -0)
  6. PyTorch/SpeechRecognition/Jasper/configs/jasper10x5dr_sp_offline.toml (+1 -1)
  7. PyTorch/SpeechRecognition/Jasper/configs/jasper10x5dr_sp_offline_specaugment.toml (+1 -1)
  8. PyTorch/SpeechRecognition/Jasper/dataset.py (+7 -10)
  9. PyTorch/SpeechRecognition/Jasper/helpers.py (+5 -6)
  10. PyTorch/SpeechRecognition/Jasper/inference.py (+65 -37)
  11. PyTorch/SpeechRecognition/Jasper/inference_benchmark.py (+17 -12)
  12. PyTorch/SpeechRecognition/Jasper/metrics.py (+0 -1)
  13. PyTorch/SpeechRecognition/Jasper/model.py (+105 -62)
  14. PyTorch/SpeechRecognition/Jasper/notebooks/JasperTRT.ipynb (+451 -0)
  15. PyTorch/SpeechRecognition/Jasper/notebooks/README.md (+57 -0)
  16. PyTorch/SpeechRecognition/Jasper/notebooks/keynote.wav (BIN)
  17. PyTorch/SpeechRecognition/Jasper/parts/features.py (+16 -3)
  18. PyTorch/SpeechRecognition/Jasper/parts/manifest.py (+3 -3)
  19. PyTorch/SpeechRecognition/Jasper/requirements.txt (+1 -1)
  20. PyTorch/SpeechRecognition/Jasper/scripts/docker/launch.sh (+1 -0)
  21. PyTorch/SpeechRecognition/Jasper/scripts/download_librispeech.sh (+1 -1)
  22. PyTorch/SpeechRecognition/Jasper/scripts/evaluation.sh (+1 -1)
  23. PyTorch/SpeechRecognition/Jasper/scripts/inference_benchmark.sh (+0 -5)
  24. PyTorch/SpeechRecognition/Jasper/scripts/train.sh (+0 -1)
  25. PyTorch/SpeechRecognition/Jasper/scripts/train_benchmark.sh (+0 -1)
  26. PyTorch/SpeechRecognition/Jasper/train.py (+29 -29)
  27. PyTorch/SpeechRecognition/Jasper/trt/Dockerfile (+31 -0)
  28. PyTorch/SpeechRecognition/Jasper/trt/README.md (+294 -0)
  29. PyTorch/SpeechRecognition/Jasper/trt/perf.py (+140 -0)
  30. PyTorch/SpeechRecognition/Jasper/trt/perfprocedures.py (+337 -0)
  31. PyTorch/SpeechRecognition/Jasper/trt/perfutils.py (+252 -0)
  32. PyTorch/SpeechRecognition/Jasper/trt/requirements.txt (+2 -0)
  33. PyTorch/SpeechRecognition/Jasper/trt/scripts/docker/trt_build.sh (+5 -0)
  34. PyTorch/SpeechRecognition/Jasper/trt/scripts/docker/trt_launch.sh (+39 -0)
  35. PyTorch/SpeechRecognition/Jasper/trt/scripts/download_inference_librispeech.sh (+30 -0)
  36. PyTorch/SpeechRecognition/Jasper/trt/scripts/preprocess_inference_librispeech.sh (+35 -0)
  37. PyTorch/SpeechRecognition/Jasper/trt/scripts/trt_inference.sh (+56 -0)
  38. PyTorch/SpeechRecognition/Jasper/trt/scripts/trt_inference_benchmark.sh (+162 -0)
  39. PyTorch/SpeechRecognition/Jasper/trt/scripts/walk_benchmark.sh (+38 -0)
  40. PyTorch/SpeechRecognition/Jasper/trt/trtutils.py (+92 -0)
  41. PyTorch/SpeechRecognition/Jasper/utils/inference_librispeech.csv (+5 -0)

+ 1 - 1
PyTorch/SpeechRecognition/Jasper/.dockerignore

@@ -1,4 +1,4 @@
 results/
 *__pycache__
 checkpoints/
-datasets/
+.git/

+ 1 - 11
PyTorch/SpeechRecognition/Jasper/Dockerfile

@@ -12,23 +12,13 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.06-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.09-py3 
 FROM ${FROM_IMAGE_NAME}
 
 
-WORKDIR /tmp/unique_for_apex
-RUN pip uninstall -y apex || :
-RUN pip uninstall -y apex || :
-
-RUN SHA=ToUcHMe git clone https://github.com/NVIDIA/apex.git
-WORKDIR /tmp/unique_for_apex/apex
-RUN pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
-
-
 RUN apt-get update && apt-get install -y libsndfile1 && apt-get install -y sox && rm -rf /var/lib/apt/lists/*
 
 WORKDIR /workspace/jasper
 
 COPY . .
 RUN pip install --disable-pip-version-check -U -r requirements.txt
-

+ 36 - 33
PyTorch/SpeechRecognition/Jasper/README.md

@@ -1,6 +1,6 @@
 # Jasper For PyTorch
 
-This repository provides a script and recipe to train the Jasper model to achieve state of the art  the paper accuracy of the acoustic model, and is tested and maintained by NVIDIA.
+This repository provides scripts to train the Jasper model to achieve near state of the art accuracy and perform high-performance inference using NVIDIA TensorRT. This repository is tested and maintained by NVIDIA.
 
 ## Table Of Contents
 - [Model overview](#model-overview)
@@ -23,6 +23,7 @@ This repository provides a script and recipe to train the Jasper model to achiev
    * [Training process](#training-process)
    * [Inference process](#inference-process)
    * [Evaluation process](#evaluation-process)
+   * [Inference process with TensorRT](#inference-process-with-tensorrt)
 - [Performance](#performance)
    * [Benchmarking](#benchmarking)
        * [Training performance benchmark](#training-performance-benchmark)
@@ -50,7 +51,7 @@ This repository provides an implementation of the Jasper model in PyTorch from t
 The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference, by allowing whole sub-blocks to be fused into a single GPU kernel. This is important for meeting strict real-time requirements of ASR systems in deployment.
 
 The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences
-corresponding to a given audio segment. This post-processing step is called decoding. 
+corresponding to a given audio segment. This post-processing step is called decoding.
 
 This repository is a PyTorch implementation of Jasper and provides scripts to train the Jasper 10x5 model with dense residuals from scratch on the [Librispeech](http://www.openslr.org/12) dataset to achieve the greedy decoding results of the original paper.
 The original reference code provides Jasper as part of a research toolkit in TensorFlow [openseq2seq](https://github.com/NVIDIA/OpenSeq2Seq).
@@ -85,7 +86,7 @@ Each sub-block applies the following operations in sequence: 1D-Convolution, Bat
 Each block input is connected directly to the last subblock of all following blocks via a residual connection, which is referred to as `dense residual` in the paper.
 Every block differs in kernel size and number of filters, which are increasing in size from the bottom to the top layers.
 Irrespective of the exact block configuration parameters B and R, every Jasper model has four additional convolutional blocks:
-one immediately succeeding the input layer (Prologue) and three at the end of the B blocks (Epilogue).  
+one immediately succeeding the input layer (Prologue) and three at the end of the B blocks (Epilogue).
 
 The Prologue is to decimate the audio signal
 in time in order to process a shorter time sequence for efficiency. The Epilogue with dilation captures a bigger context around an audio time step, which decreases the model word error rate (WER).
@@ -96,7 +97,7 @@ The paper achieves best results with Jasper 10x5 with dense residual connections
 The following features were implemented in this model:
 
 * GPU-supported feature extraction with data augmentation options [SpecAugment](https://arxiv.org/abs/1904.08779) and [Cutout](https://arxiv.org/pdf/1708.04552.pdf)
-* offline and online [Speed Perturbation](https://www.danielpovey.com/files/2015_interspeech_augmentation.pdf) 
+* offline and online [Speed Perturbation](https://www.danielpovey.com/files/2015_interspeech_augmentation.pdf)
 * data-parallel multi-GPU training and evaluation
 * AMP with dynamic loss scaling for Tensor Core training
 * FP16 inference with AMP
@@ -153,7 +154,7 @@ For information about:
 
 For training, mixed precision can be enabled by setting the flag: `train.py --fp16`. You can change this behavior and execute the training in
 single precision by removing the `--fp16` flag for the `train.py` training
-script. For example, in the bash scripts `scripts/train.sh`, `scripts/inference.sh`, etc. the precision can be specified with the variable `PRECISION` by setting it to either `PRECISION=’fp16’` or  `PRECISION=’fp32’`. 
+script. For example, in the bash scripts `scripts/train.sh`, `scripts/inference.sh`, etc. the precision can be specified with the variable `PRECISION` by setting it to either `PRECISION=’fp16’` or  `PRECISION=’fp32’`.
 
 Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision
 (AMP) library from [APEX](https://github.com/NVIDIA/apex) that casts variables
@@ -169,7 +170,7 @@ value to be used can be
 For an in-depth walk through on AMP, check out sample usage
 [here](https://nvidia.github.io/apex/amp.html#). [APEX](https://github.com/NVIDIA/apex) is a PyTorch extension that contains
 utility libraries, such as AMP, which require minimal network code changes to
-leverage tensor cores performance.
+leverage Tensor Cores performance.
 
 The following steps were needed to enable mixed precision training in Jasper:
 
@@ -178,7 +179,7 @@ The following steps were needed to enable mixed precision training in Jasper:
 from apex import amp
 ```
 
-* Initialize AMP and wrap the model and the optimizer 
+* Initialize AMP and wrap the model and the optimizer
 ```
    model, optimizer = amp.initialize(
      min_loss_scale=1.0,
@@ -188,7 +189,7 @@ from apex import amp
 
 ```
 
-* Apply `scale_loss` context manager 
+* Apply `scale_loss` context manager
 ```
 with amp.scale_loss(loss, optimizer) as scaled_loss:
     scaled_loss.backward()
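The dynamic loss scaling that `amp.scale_loss` performs can be illustrated without APEX. The sketch below is a pure-Python model of the idea only (the class name and growth factor are illustrative; the `min_scale=1.0` floor mirrors the `min_loss_scale=1.0` passed to `amp.initialize` in the README excerpt above): scale the loss up before backprop, unscale the gradients afterwards, and back off the scale when an overflow is detected.

```python
import math

class DynamicLossScaler:
    """Toy model of dynamic loss scaling; illustrative, not the APEX implementation."""
    def __init__(self, init_scale=2.0 ** 16, min_scale=1.0, factor=2.0):
        self.scale = init_scale
        self.min_scale = min_scale   # mirrors min_loss_scale=1.0 above
        self.factor = factor

    def scale_loss(self, loss):
        # Multiply the loss so small FP16 gradients do not underflow to zero
        return loss * self.scale

    def step(self, grads):
        """Return unscaled grads, or None (skip this optimizer step) on overflow."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            self.scale = max(self.scale / self.factor, self.min_scale)
            return None
        return [g / self.scale for g in grads]

scaler = DynamicLossScaler()
grads = scaler.step([float("inf")])   # overflow: step skipped, scale halved
print(grads, scaler.scale)            # None 32768.0
grads2 = scaler.step([65536.0])       # normal step: gradients are unscaled
print(grads2)                         # [2.0]
```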
@@ -216,11 +217,11 @@ The following section lists the requirements in order to start training and eval
 
 ### Requirements
 
-This repository contains a `Dockerfile` which extends the PyTorch 19.06-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
+This repository contains a `Dockerfile` which extends the PyTorch 19.09-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
 
 * [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 19.06-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
-* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) based GPU
+* [PyTorch 19.09-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
+* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
 
 Further required python packages are listed in `requirements.txt`, which are automatically installed with the Docker container built. To manually install them, run
 ```bash
@@ -240,7 +241,7 @@ For those unable to use the PyTorch NGC container, to set up the required enviro
 
 ## Quick Start Guide
 
-To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the Jasper model on the Librispeech dataset. For details concerning training and inference, see [Advanced](#Advanced).
+To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the Jasper model on the Librispeech dataset. For details concerning training and inference, see [Advanced](#Advanced) section.
 
 1. Clone the repository.
 ```bash
@@ -265,7 +266,7 @@ and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<
 
 4. Download and preprocess the dataset.
 
-No GPU is required for data download and preprocessing. Therefore, if GPU usage is a limited resource, launch the container for this section on a CPU machine by following Steps 2 and 3. 
+No GPU is required for data download and preprocessing. Therefore, if GPU usage is a limited resource, launch the container for this section on a CPU machine by following Steps 2 and 3.
 
 Note: Downloading and preprocessing the dataset requires 500GB of free disk space and can take several hours to complete.
 
@@ -290,7 +291,7 @@ Once the data download is complete, the following folders should exist:
    * `test-clean/`
    * `test-other/`
 
-Since `/datasets/` is mounted to `<DATA_DIR>` on the host (see Step 3),  once the dataset is downloaded it is accessible from outside of the container at `<DATA_DIR>/LibriSpeech`.
+Since `/datasets/` is mounted to `<DATA_DIR>` on the host (see Step 3),  once the dataset is downloaded it will be accessible from outside of the container at `<DATA_DIR>/LibriSpeech`.
 
 
 Next, convert the data into WAV files and add speed perturbation with 0.9 and 1.1 to the training files:
@@ -317,8 +318,8 @@ Once the data is converted, the following additional files and folders should ex
 
 5. Start training.
 
-Inside the container, use the following script to start training. 
-Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container. 
+Inside the container, use the following script to start training.
+Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
 
 ```bash
 bash scripts/train.sh [OPTIONS]
@@ -330,7 +331,7 @@ More details on available [OPTIONS] can be found in [Parameters](#parameters) an
 6. Start validation/evaluation.
 
 Inside the container, use the following script to run evaluation.
- Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container. 
+ Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
 ```bash
 bash scripts/evaluation.sh [OPTIONS]
 ```
@@ -342,7 +343,9 @@ More details on available [OPTIONS] can be found in [Parameters](#parameters) an
 7. Start inference/predictions.
 
 Inside the container, use the following script to run inference.
- Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container. 
+ Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.
+A pretrained model checkpoint can be downloaded from `NGC model repository`[https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16].
+
 ```bash
 bash scripts/inference.sh [OPTIONS]
 ```
@@ -364,7 +367,7 @@ In the `root` directory, the most important files are:
 * `model.py` - Contains the model architecture
 * `dataset.py` - Contains the data loader and related functionality
 * `optimizer.py` - Contains the optimizer
-* `inference_benchmark.py` - Serves as inference benchmarking script that measures the latency of pre-processing and the acoustic model 
+* `inference_benchmark.py` - Serves as inference benchmarking script that measures the latency of pre-processing and the acoustic model
 * `requirements.py` - Contains the required dependencies that are installed when building the Docker container
 * `Dockerfile` - Container with the basic set of dependencies to run Jasper
 
@@ -380,9 +383,9 @@ The `scripts/` folder encapsulates all the one-click scripts required for runnin
 
 
 Other folders included in the `root` directory are:
+* `notebooks/` - Contains Jupyter notebook
 * `configs/` - Model configurations
 * `utils/` - Contains the necessary files for data download and  processing
-
 * `parts/` - Contains the necessary files for data pre-processing
 
 ### Parameters
@@ -438,7 +441,7 @@ SEED: seed for random number generator and useful for ensuring reproducibility.
 BATCH_SIZE: data batch size.(default: 64)
 ```
 
-The `scripts/inference_benchmark.sh` script pads all input to the same length and computes the mean, 90%, 95%, 99% percentile of latency for the specified number of inference steps. Latency is measured in millisecond per batch. The `scripts/inference_benchmark.sh` 
+The `scripts/inference_benchmark.sh` script pads all input to the same length and computes the mean, 90%, 95%, 99% percentile of latency for the specified number of inference steps. Latency is measured in millisecond per batch. The `scripts/inference_benchmark.sh`
 measures latency for a single GPU and extends  `scripts/inference.sh` by :
 ```bash
  MAX_DURATION: filters out input audio data that exceeds a maximum number of seconds. This ensures that when all filtered audio samples are padded to maximum length that length will stay under this specified threshold (default: 36)
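The mean and 90/95/99th-percentile latency statistics that `scripts/inference_benchmark.sh` reports can be computed from a list of per-batch timings. A minimal stand-alone sketch (the helper name and the sample latencies are made up; the script's exact percentile method may differ):

```python
import statistics

def latency_stats(latencies_ms):
    """Mean and nearest-rank 90/95/99th-percentile latency, in ms per batch."""
    xs = sorted(latencies_ms)
    def pct(p):
        # nearest-rank percentile on the sorted sample
        k = max(0, min(len(xs) - 1, int(round(p / 100.0 * len(xs))) - 1))
        return xs[k]
    return {"mean": statistics.mean(xs),
            "p90": pct(90), "p95": pct(95), "p99": pct(99)}

# Made-up per-batch latencies (ms) from 10 inference steps; note how the one
# slow outlier batch dominates the tail percentiles but barely moves the mean
stats = latency_stats([12.1, 11.8, 12.4, 30.5, 12.0, 11.9, 12.2, 12.3, 12.5, 12.6])
print(stats["p90"], stats["p99"])  # 12.6 30.5
```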
@@ -538,7 +541,7 @@ Apart from the default arguments as listed in the [Parameters](#parameters) sect
 ### Evaluation process
 
 Evaluation is performed using the `inference.py` script along with parameters defined in `scripts/evaluation.sh`.
-The `scripts/evaluation.sh` script runs a job on a a single GPU, taking a pre-trained Jasper model checkpoint and running it on the specified dataset.
+The `scripts/evaluation.sh` script runs a job on a single GPU, taking a pre-trained Jasper model checkpoint and running it on the specified dataset.
 Apart from the default arguments as listed in the [Parameters](#parameters) section, by default the evaluation script:
 
 * Uses a batch size of 64
@@ -551,6 +554,9 @@ Apart from the default arguments as listed in the [Parameters](#parameters) sect
 * Has cudnn benchmark disabled
 
 
+### Inference Process with TensorRT
+NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications. Jasper’s architecture, which is of deep convolutional nature, is designed to facilitate fast GPU inference. After optimizing the compute-intensive acoustic model with NVIDIA TensorRT, inference throughput increased by up to 1.8x over native PyTorch. 
+More information on how to perform inference using TensorRT and speed up comparison between TensorRT and native PyTorch can be found in the subfolder [./trt/README.md](trt/README.md)
 
 ## Performance
 
@@ -604,12 +610,12 @@ The results for Jasper Large's word error rate from the original paper after gre
 
 ##### Training accuracy: NVIDIA DGX-1 (8x V100 32G)
 
-Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 19.06-py3 NGC container with NVIDIA DGX-1 with (8x V100 32G) GPUs.
+Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 19.09-py3 NGC container with NVIDIA DGX-1 with (8x V100 32G) GPUs.
 The following tables report the word error rate(WER) of the acoustic model with greedy decoding on all LibriSpeech dev and test datasets for mixed precision training.
 
 FP16 (seed #6)
 
-| **Number of GPUs**    | **Batch size per GPU**    | **dev-clean WER** | **dev-other WER**| **test-clean WER**| **test-other WER**| **Total time to train with FP16 (Hrs)** | 
+| **Number of GPUs**    | **Batch size per GPU**    | **dev-clean WER** | **dev-other WER**| **test-clean WER**| **test-other WER**| **Total time to train with FP16 (Hrs)** |
 |---    |---    |---    |---    |---    |---    |---    |
 |8 |64| 3.51|11.14|3.74|11.06|100
 
@@ -619,7 +625,7 @@ FP32 training matches the results of mixed precision training and takes approxim
 
 ##### Training stability test
 
-The following table compares greedy decoding word error rates across 8 different training runs with different seeds for mixed precision training. 
+The following table compares greedy decoding word error rates across 8 different training runs with different seeds for mixed precision training.
 
 | **FP16, 8x GPUs** | **seed #1** | **seed #2** | **seed #3** | **seed #4** | **seed #5** | **seed #6** | **seed #7** | **seed #8** | **mean** | **std** |
 |:-----------:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
@@ -632,7 +638,7 @@ The following table compares greedy decoding word error rates across 8 different
 
 #### Training performance results
 
-Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 19.06-py3 NGC container. Performance (in sequences per second) is the steady-state throughput.
+Our results were obtained by running the `scripts/train.sh` training script in the PyTorch 19.09-py3 NGC container. Performance (in sequences per second) is the steady-state throughput.
 
 ##### Training performance: NVIDIA DGX-1 (8x V100 16G)
 
@@ -700,7 +706,7 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
 
 #### Inference performance results
 
-Our results were obtained by running the `scripts/inference_benchmark.sh` script in the PyTorch 19.06-py3 NGC container on NVIDIA DGX-1, DGX-2 and T4 on a single GPU. Performance numbers (latency in milliseconds per batch) were averaged over 1000 iterations.
+Our results were obtained by running the `scripts/inference_benchmark.sh` script in the PyTorch 19.09-py3 NGC container on NVIDIA DGX-1, DGX-2 and T4 on a single GPU. Performance numbers (latency in milliseconds per batch) were averaged over 1000 iterations.
 
 ##### Inference performance: NVIDIA DGX-1 (1x V100 16G)
 
@@ -800,6 +806,9 @@ To achieve these same results, follow the [Quick Start Guide](#quick-start-guide
 ## Release notes
 
 ### Changelog
+September 2019
+* Inference support for TRT 6
+* Jupyter notebook for inference
 
 July 2019
 * Initial release
@@ -808,9 +817,3 @@ July 2019
 ### Known issues
 
 There are no known issues in this release.
-
-
-
-
-
-

+ 1 - 1
PyTorch/SpeechRecognition/Jasper/configs/jasper10x5dr.toml

@@ -55,7 +55,7 @@ dither = 0.00001
 feat_type = "logfbank"
 normalize_transcripts = true
 trim_silence = true
-pad_to = 16 
+pad_to = 16
 
 
 [encoder]

+ 203 - 0
PyTorch/SpeechRecognition/Jasper/configs/jasper10x5dr_nomask.toml

@@ -0,0 +1,203 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+model = "Jasper"
+
+[input]
+normalize = "per_feature"
+sample_rate = 16000
+window_size = 0.02
+window_stride = 0.01
+window = "hann"
+features = 64
+n_fft = 512
+frame_splicing = 1
+dither = 0.00001
+feat_type = "logfbank"
+normalize_transcripts = true
+trim_silence = true
+pad_to = 16
+max_duration = 16.7
+speed_perturbation = false
+
+
+cutout_rect_regions = 0
+cutout_rect_time = 60
+cutout_rect_freq = 25
+
+cutout_x_regions = 0
+cutout_y_regions = 0
+cutout_x_width = 6
+cutout_y_width = 6
+
+
+[input_eval]
+normalize = "per_feature"
+sample_rate = 16000
+window_size = 0.02
+window_stride = 0.01
+window = "hann"
+features = 64
+n_fft = 512
+frame_splicing = 1
+dither = 0.00001
+feat_type = "logfbank"
+normalize_transcripts = true
+trim_silence = true
+pad_to = 16
+
+
+[encoder]
+activation = "relu"
+convmask = false
+
+[[jasper]]
+filters = 256
+repeat = 1
+kernel = [11]
+stride = [2]
+dilation = [1]
+dropout = 0.2
+residual = false
+
+[[jasper]]
+filters = 256
+repeat = 5
+kernel = [11]
+stride = [1]
+dilation = [1]
+dropout = 0.2
+residual = true
+residual_dense = true
+
+
+[[jasper]]
+filters = 256
+repeat = 5
+kernel = [11]
+stride = [1]
+dilation = [1]
+dropout = 0.2
+residual = true
+residual_dense = true
+
+
+[[jasper]]
+filters = 384
+repeat = 5
+kernel = [13]
+stride = [1]
+dilation = [1]
+dropout = 0.2
+residual = true
+residual_dense = true
+
+
+[[jasper]]
+filters = 384
+repeat = 5
+kernel = [13]
+stride = [1]
+dilation = [1]
+dropout = 0.2
+residual = true
+residual_dense = true
+
+
+[[jasper]]
+filters = 512
+repeat = 5
+kernel = [17]
+stride = [1]
+dilation = [1]
+dropout = 0.2
+residual = true
+residual_dense = true
+
+
+[[jasper]]
+filters = 512
+repeat = 5
+kernel = [17]
+stride = [1]
+dilation = [1]
+dropout = 0.2
+residual = true
+residual_dense = true
+
+
+[[jasper]]
+filters = 640
+repeat = 5
+kernel = [21]
+stride = [1]
+dilation = [1]
+dropout = 0.3
+residual = true
+residual_dense = true
+
+
+[[jasper]]
+filters = 640
+repeat = 5
+kernel = [21]
+stride = [1]
+dilation = [1]
+dropout = 0.3
+residual = true
+residual_dense = true
+
+
+[[jasper]]
+filters = 768
+repeat = 5
+kernel = [25]
+stride = [1]
+dilation = [1]
+dropout = 0.3
+residual = true
+residual_dense = true
+
+
+[[jasper]]
+filters = 768
+repeat = 5
+kernel = [25]
+stride = [1]
+dilation = [1]
+dropout = 0.3
+residual = true
+residual_dense = true
+
+
+[[jasper]]
+filters = 896
+repeat = 1
+kernel = [29]
+stride = [1]
+dilation = [2]
+dropout = 0.4
+residual = false
+
+[[jasper]]
+filters = 1024
+repeat = 1
+kernel = [1]
+stride = [1]
+dilation = [1]
+dropout = 0.4
+residual = false
+
+[labels]
+labels = [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]

+ 1 - 1
PyTorch/SpeechRecognition/Jasper/configs/jasper10x5dr_sp_offline.toml

@@ -56,7 +56,7 @@ dither = 0.00001
 feat_type = "logfbank"
 normalize_transcripts = true
 trim_silence = true
-pad_to = 16 
+pad_to = 16
 
 
 [encoder]

+ 1 - 1
PyTorch/SpeechRecognition/Jasper/configs/jasper10x5dr_sp_offline_specaugment.toml

@@ -56,7 +56,7 @@ dither = 0.00001
 feat_type = "logfbank"
 normalize_transcripts = true
 trim_silence = true
-pad_to = 16 
+pad_to = 16
 
 
 [encoder]

+ 7 - 10
PyTorch/SpeechRecognition/Jasper/dataset.py

@@ -13,7 +13,7 @@
 # limitations under the License.
 
 """
-This file contains classes and functions related to data loading  
+This file contains classes and functions related to data loading
 """
 import torch
 import numpy as np
@@ -66,7 +66,7 @@ class DistributedBucketBatchSampler(Sampler):
             bucket_start = self.bucket_size * bucket
             bucket_end = min(bucket_start + self.bucket_size, self.index_count)
             indices[bucket_start:bucket_end] = indices[bucket_start:bucket_end][torch.randperm(bucket_end - bucket_start, generator=g)]
-        
+
         tile_indices = torch.randperm(self.index_count // self.tile_size, generator=g)
         for tile_index in tile_indices:
             start_index = self.tile_size * tile_index + self.batch_size * self.rank
@@ -93,7 +93,7 @@ class data_prefetcher():
             return
         with torch.cuda.stream(self.stream):
             self.next_input = [ x.cuda(non_blocking=True) for x in self.next_input]
-            
+
     def __next__(self):
         torch.cuda.current_stream().wait_stream(self.stream)
         input = self.next_input
@@ -133,7 +133,7 @@ def seq_collate_fn(batch):
     return batched_audio_signal, torch.stack(audio_lengths), batched_transcript, \
          torch.stack(transcript_lengths)
 
-class AudioToTextDataLayer:  
+class AudioToTextDataLayer:
     """Data layer with data loader
     """
     def __init__(self, **kwargs):
@@ -205,7 +205,7 @@ class AudioToTextDataLayer:
                 sampler=self.sampler
             )
         else:
-            raise RuntimeError("Sampler {} not supported".format(sampler_type)) 
+            raise RuntimeError("Sampler {} not supported".format(sampler_type))
 
     def __len__(self):
         return len(self._dataset)
@@ -214,9 +214,9 @@ class AudioToTextDataLayer:
     def data_iterator(self):
         return self._dataloader
 
-class AudioDataset(Dataset):  
+class AudioDataset(Dataset):
     def __init__(self, dataset_dir, manifest_filepath, labels, featurizer, max_duration=None, pad_to_max=False,
-                 min_duration=None, blank_index=0, max_utts=0, normalize=True, sort_by_duration=False, 
+                 min_duration=None, blank_index=0, max_utts=0, normalize=True, sort_by_duration=False,
                  trim=False, speed_perturbation=False):
         """Dataset that loads tensors via a json file containing paths to audio files, transcripts, and durations
         (in seconds). Each entry is a different audio sample.
@@ -264,6 +264,3 @@ class AudioDataset(Dataset):
 
     def __len__(self):
         return len(self.manifest)
-
-
-
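The bucketed shuffling that `DistributedBucketBatchSampler` applies in the diff above (shuffle indices only within fixed-size buckets, so that batches drawn from length-sorted data stay roughly length-homogeneous) can be sketched without torch. The function name, bucket size, and data below are illustrative:

```python
import random

def bucket_shuffle(indices, bucket_size, seed=0):
    """Shuffle indices within consecutive buckets, keeping bucket order fixed."""
    rng = random.Random(seed)
    out = list(indices)
    for start in range(0, len(out), bucket_size):
        end = min(start + bucket_size, len(out))
        chunk = out[start:end]
        rng.shuffle(chunk)   # randomness stays local to the bucket
        out[start:end] = chunk
    return out

# Indices assumed pre-sorted by utterance duration; with bucket_size=4, each
# index can only move within its own group of 4
shuffled = bucket_shuffle(list(range(12)), bucket_size=4)
print(shuffled)
```

Keeping similar-length utterances together reduces padding per batch, which is the point of bucketing in ASR data loaders.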

+ 5 - 6
PyTorch/SpeechRecognition/Jasper/helpers.py

@@ -43,7 +43,7 @@ def add_ctc_labels(labels):
         raise ValueError("labels must be a list of symbols")
     labels.append("<BLANK>")
     return labels
- 
+
 def __ctc_decoder_predictions_tensor(tensor, labels):
     """
     Takes output of greedy ctc decoder and performs ctc decoding algorithm to
@@ -136,7 +136,7 @@ def __gather_transcripts(transcript_list: list, transcript_len_list: list,
 
 def process_evaluation_batch(tensors: dict, global_vars: dict, labels: list):
     """
-    Processes results of an iteration and saves it in global_vars 
+    Processes results of an iteration and saves it in global_vars
     Args:
         tensors: dictionary with results of an evaluation iteration, e.g. loss, predictions, transcript, and output
         global_vars: dictionary where processes results of iteration are saved
@@ -162,11 +162,11 @@ def process_evaluation_batch(tensors: dict, global_vars: dict, labels: list):
 
 def process_evaluation_epoch(global_vars: dict, tag=None):
     """
-    Processes results from each worker at the end of evaluation and combine to final result 
+    Processes results from each worker at the end of evaluation and combine to final result
     Args:
         global_vars: dictionary containing information of entire evaluation
     Return:
-        wer: final word error rate 
+        wer: final word error rate
         loss: final loss
     """
     if 'EvalLoss' in global_vars:
@@ -200,7 +200,7 @@ def process_evaluation_epoch(global_vars: dict, tag=None):
 
 
 def norm(x):
-    if not isinstance(x, List):
+    if not isinstance(x, list):
         if not isinstance(x, tuple):
             return x
     return x[0]
@@ -220,4 +220,3 @@ def model_multi_gpu(model, multi_gpu=False):
         model = DDP(model)
         print('DDP(model)')
     return model
-
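The greedy CTC decoding that `__ctc_decoder_predictions_tensor` performs (collapse consecutive repeats, then drop the blank symbol) can be sketched in pure Python. The label set below is illustrative; in this repo `add_ctc_labels` appends `<BLANK>` as the last label, so the blank index is `len(labels) - 1`:

```python
def greedy_ctc_decode(ids, labels, blank_index):
    """Collapse consecutive repeats, then remove the CTC blank symbol."""
    out = []
    prev = None
    for i in ids:
        if i != prev and i != blank_index:
            out.append(labels[i])
        prev = i
    return "".join(out)

# Illustrative label set with blank appended last, as add_ctc_labels does
labels = [" ", "a", "b", "c", "<BLANK>"]
blank = len(labels) - 1
# "bb_aa__c" (with _ = blank) collapses to "bac"
print(greedy_ctc_decode([2, 2, 4, 1, 1, 4, 4, 3], labels, blank))  # bac
```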

+ 65 - 37
PyTorch/SpeechRecognition/Jasper/inference.py

@@ -19,14 +19,16 @@ from tqdm import tqdm
 import math
 import toml
 from dataset import AudioToTextDataLayer
-from helpers import process_evaluation_batch, process_evaluation_epoch, Optimization, add_ctc_labels, AmpOptimizations, print_dict, model_multi_gpu
+from helpers import process_evaluation_batch, process_evaluation_epoch, Optimization, add_ctc_labels, AmpOptimizations, print_dict, model_multi_gpu, __ctc_decoder_predictions_tensor
 from model import AudioPreprocessing, GreedyCTCDecoder, JasperEncoderDecoder
+from parts.features import audio_from_file
 import torch
 import apex
 from apex import amp
 import random
 import numpy as np
 import pickle
+import time
 
 def parse_args():
     parser = argparse.ArgumentParser(description='Jasper')
@@ -44,14 +46,15 @@ def parse_args():
     parser.add_argument("--save_prediction", type=str, default=None, help="if specified saves predictions in text form at this location")
     parser.add_argument("--logits_save_to", default=None, type=str, help="if specified will save logits to path")
     parser.add_argument("--seed", default=42, type=int, help='seed')
+    parser.add_argument("--wav", type=str, help='absolute path to .wav file (16KHz)')
     return parser.parse_args()
 
 def eval(
         data_layer,
-        audio_processor, 
-        encoderdecoder, 
-        greedy_decoder, 
-        labels, 
+        audio_processor,
+        encoderdecoder,
+        greedy_decoder,
+        labels,
         multi_gpu,
         args):
     """performs inference / evaluation
@@ -74,6 +77,21 @@ def eval(
             'logits' : [],
         }
 
+
+
+        if args.wav:
+            features, p_length_e = audio_processor(audio_from_file(args.wav))
+            torch.cuda.synchronize()
+            t0 = time.perf_counter()
+            t_log_probs_e = encoderdecoder(features)
+            torch.cuda.synchronize()
+            t1 = time.perf_counter()
+            t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)
+            hypotheses = __ctc_decoder_predictions_tensor(t_predictions_e, labels=labels)
+            print("INFERENCE TIME\t\t: {} ms".format((t1-t0)*1000.0))
+            print("TRANSCRIPT\t\t:", hypotheses[0])
+            return
+
         for it, data in enumerate(tqdm(data_layer.data_iterator)):
             tensors = []
             for d in data:
@@ -83,8 +101,11 @@ def eval(
 
             inp = (t_audio_signal_e, t_a_sig_length_e)
 
-            t_processed_signal, p_length_e = audio_processor(x=inp) 
-            t_log_probs_e, _ = encoderdecoder((t_processed_signal, p_length_e))
+            t_processed_signal, p_length_e = audio_processor(x=inp)
+            if args.use_conv_mask:
+                t_log_probs_e, t_encoded_len_e = encoderdecoder((t_processed_signal, p_length_e))
+            else:
+                t_log_probs_e = encoderdecoder(t_processed_signal)
             t_predictions_e = greedy_decoder(log_probs=t_log_probs_e)
 
             values_dict = dict(
@@ -98,7 +119,7 @@ def eval(
             if args.steps is not None and it + 1 >= args.steps:
                 break
         wer, _ = process_evaluation_epoch(_global_var_dict)
-        if (not multi_gpu or (multi_gpu and torch.distributed.get_rank() == 0)):    
+        if (not multi_gpu or (multi_gpu and torch.distributed.get_rank() == 0)):
             print("==========>>>>>>Evaluation WER: {0}\n".format(wer))
             if args.save_prediction is not None:
                 with open(args.save_prediction, 'w') as fp:
@@ -122,7 +143,7 @@ def main(args):
     if args.local_rank is not None:
         torch.cuda.set_device(args.local_rank)
         torch.distributed.init_process_group(backend='nccl', init_method='env://')
-    multi_gpu = args.local_rank is not None 
+    multi_gpu = args.local_rank is not None
     if multi_gpu:
         print("DISTRIBUTED with ", torch.distributed.get_world_size())
 
@@ -135,9 +156,10 @@ def main(args):
     dataset_vocab = jasper_model_definition['labels']['labels']
     ctc_vocab = add_ctc_labels(dataset_vocab)
 
-    val_manifest = args.val_manifest 
+    val_manifest = args.val_manifest
     featurizer_config = jasper_model_definition['input_eval']
     featurizer_config["optimization_level"] = optim_level
+    args.use_conv_mask = jasper_model_definition['encoder'].get('convmask', True)
 
     if args.max_duration is not None:
         featurizer_config['max_duration'] = args.max_duration
@@ -148,20 +170,22 @@ def main(args):
     print_dict(jasper_model_definition)
     print('feature_config')
     print_dict(featurizer_config)
-
-    data_layer = AudioToTextDataLayer(
-                                    dataset_dir=args.dataset_dir, 
-                                    featurizer_config=featurizer_config,
-                                    manifest_filepath=val_manifest,
-                                    labels=dataset_vocab,
-                                    batch_size=args.batch_size,
-                                    pad_to_max=featurizer_config['pad_to'] == "max",
-                                    shuffle=False,
-                                    multi_gpu=multi_gpu)
+    data_layer = None
+
+    if args.wav is None:
+        data_layer = AudioToTextDataLayer(
+            dataset_dir=args.dataset_dir, 
+            featurizer_config=featurizer_config,
+            manifest_filepath=val_manifest,
+            labels=dataset_vocab,
+            batch_size=args.batch_size,
+            pad_to_max=featurizer_config['pad_to'] == "max",
+            shuffle=False,
+            multi_gpu=multi_gpu)
     audio_preprocessor = AudioPreprocessing(**featurizer_config)
 
     encoderdecoder = JasperEncoderDecoder(jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))
- 
+
     if args.ckpt is not None:
         print("loading model from ", args.ckpt)
         checkpoint = torch.load(args.ckpt, map_location="cpu")
@@ -169,25 +193,28 @@ def main(args):
             checkpoint['state_dict'][k] = checkpoint['state_dict'].pop("audio_preprocessor." + k)
         audio_preprocessor.load_state_dict(checkpoint['state_dict'], strict=False)
         encoderdecoder.load_state_dict(checkpoint['state_dict'], strict=False)
-    
+
     greedy_decoder = GreedyCTCDecoder()
 
     # print("Number of parameters in encoder: {0}".format(model.jasper_encoder.num_weights()))
-
-    N = len(data_layer)
-    step_per_epoch = math.ceil(N / (args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
-
-    if args.steps is not None:
-        print('-----------------')
-        print('Have {0} examples to eval on.'.format(args.steps * args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
-        print('Have {0} steps / (gpu * epoch).'.format(args.steps))
-        print('-----------------')
+    if args.wav is None:
+        N = len(data_layer)
+        step_per_epoch = math.ceil(N / (args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
+
+        if args.steps is not None:
+            print('-----------------')
+            print('Have {0} examples to eval on.'.format(args.steps * args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
+            print('Have {0} steps / (gpu * epoch).'.format(args.steps))
+            print('-----------------')
+        else:
+            print('-----------------')
+            print('Have {0} examples to eval on.'.format(N))
+            print('Have {0} steps / (gpu * epoch).'.format(step_per_epoch))
+            print('-----------------')
     else:
-        print('-----------------')
-        print('Have {0} examples to eval on.'.format(N))
-        print('Have {0} steps / (gpu * epoch).'.format(step_per_epoch))
-        print('-----------------')
+        audio_preprocessor.featurizer.normalize = "per_feature"
 
+    print("audio_preprocessor.normalize: ", audio_preprocessor.featurizer.normalize)
     audio_preprocessor.cuda()
     encoderdecoder.cuda()
     if args.fp16:
@@ -197,8 +224,9 @@ def main(args):
 
     encoderdecoder = model_multi_gpu(encoderdecoder, multi_gpu)
 
+
     eval(
-        data_layer=data_layer, 
+        data_layer=data_layer,
         audio_processor=audio_preprocessor,
         encoderdecoder=encoderdecoder,
         greedy_decoder=greedy_decoder,
@@ -208,7 +236,7 @@ def main(args):
 
 if __name__=="__main__":
     args = parse_args()
-    
+
     print_dict(vars(args))
 
     main(args)

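The new `--wav` path above times a single forward pass by calling `torch.cuda.synchronize()` before each `time.perf_counter()` read, so queued GPU kernels are flushed before and after the timed region. The pattern can be sketched framework-agnostically (the `sync` callback stands in for `torch.cuda.synchronize`, and the warm-up count is an illustrative choice, not taken from the script):

```python
import time

def timed_call(fn, *args, warmup=2, sync=None):
    """Run fn a few times to warm up (kernel autotuning, caches),
    then time exactly one call.  On GPU, pass
    sync=torch.cuda.synchronize so the clock is read only after all
    queued kernels have finished, not when they were merely launched."""
    for _ in range(warmup):
        fn(*args)
    if sync is not None:
        sync()
    t0 = time.perf_counter()
    out = fn(*args)
    if sync is not None:
        sync()
    return out, (time.perf_counter() - t0) * 1000.0  # milliseconds
```

Without the synchronization barrier, `perf_counter` would only measure kernel launch overhead on an asynchronous device.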
+ 17 - 12
PyTorch/SpeechRecognition/Jasper/inference_benchmark.py

@@ -98,7 +98,11 @@ def eval(
                 t_processed_signal, p_length_e = audio_processor(x=inp)
                 torch.cuda.synchronize()
                 t1 = time.perf_counter()
-                t_log_probs_e, _ = encoderdecoder((t_processed_signal, p_length_e))
+
+                if args.use_conv_mask:
+                    t_log_probs_e, t_encoded_len_e = encoderdecoder((t_processed_signal, p_length_e))
+                else:
+                    t_log_probs_e = encoderdecoder(t_processed_signal)
                 torch.cuda.synchronize()
                 stop_time = time.perf_counter()
 
@@ -115,13 +119,13 @@ def eval(
                 durations_dnn.append(time_dnn)
                 durations_dnn_and_prep.append(time_prep_and_dnn)
                 seq_lens.append(t_processed_signal.shape[-1])
-                            
+
             if it >= steps:
-                
+
                 wer, _ = process_evaluation_epoch(_global_var_dict)
                 print("==========>>>>>>Evaluation of all iterations WER: {0}\n".format(wer))
                 break
-        
+
         ratios = [0.9,  0.95,0.99, 1.]
         latencies_dnn = take_durations_and_output_percentile(durations_dnn, ratios)
         latencies_dnn_and_prep = take_durations_and_output_percentile(durations_dnn_and_prep, ratios)
@@ -131,7 +135,7 @@ def eval(
 
 def take_durations_and_output_percentile(durations, ratios):
     durations = np.asarray(durations) * 1000 # in ms
-    latency = durations 
+    latency = durations
 
     latency = latency[5:]
     mean_latency = np.mean(latency)
@@ -167,11 +171,12 @@ def main(args):
     dataset_vocab = jasper_model_definition['labels']['labels']
     ctc_vocab = add_ctc_labels(dataset_vocab)
 
-    val_manifest = args.val_manifest 
+    val_manifest = args.val_manifest
     featurizer_config = jasper_model_definition['input_eval']
     featurizer_config["optimization_level"] = optim_level
+    args.use_conv_mask = jasper_model_definition['encoder'].get('convmask', True)
     if args.max_duration is not None:
-        featurizer_config['max_duration'] = args.max_duration  
+        featurizer_config['max_duration'] = args.max_duration
     if args.pad_to is not None:
         featurizer_config['pad_to'] = args.pad_to if args.pad_to >= 0 else "max"
 
@@ -181,7 +186,7 @@ def main(args):
     print_dict(featurizer_config)
 
     data_layer = AudioToTextDataLayer(
-                            dataset_dir=args.dataset_dir, 
+                            dataset_dir=args.dataset_dir,
                             featurizer_config=featurizer_config,
                             manifest_filepath=val_manifest,
                             labels=dataset_vocab,
@@ -226,16 +231,16 @@ def main(args):
             opt_level=AmpOptimizations[optim_level])
 
     eval(
-        data_layer=data_layer, 
+        data_layer=data_layer,
         audio_processor=audio_preprocessor,
-        encoderdecoder=encoderdecoder, 
-        greedy_decoder=greedy_decoder, 
+        encoderdecoder=encoderdecoder,
+        greedy_decoder=greedy_decoder,
         labels=ctc_vocab,
         args=args)
 
 if __name__=="__main__":
     args = parse_args()
-    
+
     print_dict(vars(args))
 
     main(args)

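The benchmark script discards the first five iterations as warm-up and reports latency percentiles via `take_durations_and_output_percentile`. That reduction can be sketched in pure Python (the nearest-rank percentile definition here is an assumption; the script's exact percentile method is not shown in this diff):

```python
import math

def latency_percentiles(durations_s, ratios=(0.9, 0.95, 0.99, 1.0), skip_first=5):
    """Convert per-iteration durations (seconds) to milliseconds,
    drop the first few warm-up iterations, and report nearest-rank
    percentile latencies keyed by ratio."""
    lat = sorted(d * 1000.0 for d in durations_s[skip_first:])
    n = len(lat)
    return {r: lat[max(0, math.ceil(n * r) - 1)] for r in ratios}
```

Dropping the leading iterations matters because the first passes include one-time costs (allocator warm-up, autotuning) that would skew the tail percentiles.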
+ 0 - 1
PyTorch/SpeechRecognition/Jasper/metrics.py

@@ -65,4 +65,3 @@ def word_error_rate(hypotheses: List[str], references: List[str]) -> float:
     else:
         wer = float('inf')
     return wer, scores, words
-

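metrics.py's `word_error_rate` returns the WER together with the summed edit distance (`scores`) and reference word count (`words`). A self-contained sketch of the standard computation, using word-level Levenshtein distance over each hypothesis/reference pair (a minimal reimplementation, not the file's exact code):

```python
def levenshtein(a, b):
    """Edit distance between two word lists via the classic
    single-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[-1] + 1,                  # insertion
                           prev[j - 1] + (wa != wb)))    # substitution
        prev = cur
    return prev[-1]

def word_error_rate(hypotheses, references):
    scores = sum(levenshtein(h.split(), r.split())
                 for h, r in zip(hypotheses, references))
    words = sum(len(r.split()) for r in references)
    wer = scores / words if words else float('inf')
    return wer, scores, words
```

Note the denominator is the reference word count, which is why a WER above 1.0 is possible when hypotheses contain many insertions.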
+ 105 - 62
PyTorch/SpeechRecognition/Jasper/model.py

@@ -13,7 +13,7 @@
 # limitations under the License.
 
 from apex import amp
-import torch 
+import torch
 import torch.nn as nn
 from parts.features import FeatureFactory
 from helpers import Optimization
@@ -50,7 +50,6 @@ def init_weights(m, mode='xavier_uniform'):
 def get_same_padding(kernel_size, stride, dilation):
     if stride > 1 and dilation > 1:
         raise ValueError("Only stride OR dilation may be greater than 1")
-
     return (kernel_size // 2) * dilation
 
 class AudioPreprocessing(nn.Module):
@@ -74,7 +73,7 @@ class AudioPreprocessing(nn.Module):
         return processed_signal, processed_length
 
 class SpectrogramAugmentation(nn.Module):
-    """Spectrogram augmentation 
+    """Spectrogram augmentation
     """
     def __init__(self, **kwargs):
         nn.Module.__init__(self)
@@ -90,11 +89,8 @@ class SpectrogramAugmentation(nn.Module):
 class SpecAugment(nn.Module):
     """Spec augment. refer to https://arxiv.org/abs/1904.08779
     """
-    def __init__(self, cfg, rng=None):
+    def __init__(self, cfg):
         super(SpecAugment, self).__init__()
-
-        self._rng = random.Random() if rng is None else rng
-
         self.cutout_x_regions = cfg.get('cutout_x_regions', 0)
         self.cutout_y_regions = cfg.get('cutout_y_regions', 0)
 
@@ -108,12 +104,12 @@ class SpecAugment(nn.Module):
         mask = torch.zeros(x.shape).byte()
         for idx in range(sh[0]):
             for _ in range(self.cutout_x_regions):
-                cutout_x_left = int(self._rng.uniform(0, sh[1] - self.cutout_x_width))
+                cutout_x_left = int(random.uniform(0, sh[1] - self.cutout_x_width))
 
                 mask[idx, cutout_x_left:cutout_x_left + self.cutout_x_width, :] = 1
 
             for _ in range(self.cutout_y_regions):
-                cutout_y_left = int(self._rng.uniform(0, sh[2] - self.cutout_y_width))
+                cutout_y_left = int(random.uniform(0, sh[2] - self.cutout_y_width))
 
                 mask[idx, :, cutout_y_left:cutout_y_left + self.cutout_y_width] = 1
 
@@ -124,11 +120,9 @@ class SpecAugment(nn.Module):
 class SpecCutoutRegions(nn.Module):
     """Cutout. refer to https://arxiv.org/pdf/1708.04552.pdf
     """
-    def __init__(self, cfg, rng=None):
+    def __init__(self, cfg):
         super(SpecCutoutRegions, self).__init__()
 
-        self._rng = random.Random() if rng is None else rng
-
         self.cutout_rect_regions = cfg.get('cutout_rect_regions', 0)
         self.cutout_rect_time = cfg.get('cutout_rect_time', 5)
         self.cutout_rect_freq = cfg.get('cutout_rect_freq', 20)
@@ -141,9 +135,9 @@ class SpecCutoutRegions(nn.Module):
 
         for idx in range(sh[0]):
             for i in range(self.cutout_rect_regions):
-                cutout_rect_x = int(self._rng.uniform(
+                cutout_rect_x = int(random.uniform(
                         0, sh[1] - self.cutout_rect_freq))
-                cutout_rect_y = int(self._rng.uniform(
+                cutout_rect_y = int(random.uniform(
                         0, sh[2] - self.cutout_rect_time))
 
                 mask[idx, cutout_rect_x:cutout_rect_x + self.cutout_rect_freq,
@@ -154,18 +148,19 @@ class SpecCutoutRegions(nn.Module):
         return x
 
 class JasperEncoder(nn.Module):
-    """Jasper encoder 
+
+    """Jasper encoder
     """
     def __init__(self, **kwargs):
         cfg = {}
         for key, value in kwargs.items():
             cfg[key] = value
 
-        nn.Module.__init__(self)    
+        nn.Module.__init__(self)
         self._cfg = cfg
 
         activation = jasper_activations[cfg['encoder']['activation']]()
-        use_conv_mask = cfg['encoder'].get('convmask', False)
+        self.use_conv_mask = cfg['encoder'].get('convmask', False)
         feat_in = cfg['input']['features'] * cfg['input'].get('frame_splicing', 1)
         init_mode = cfg.get('init_mode', 'xavier_uniform')
 
@@ -183,7 +178,7 @@ class JasperEncoder(nn.Module):
                                         kernel_size=lcfg['kernel'], stride=lcfg['stride'],
                                         dilation=lcfg['dilation'], dropout=lcfg['dropout'],
                                         residual=lcfg['residual'], activation=activation,
-                                        residual_panes=dense_res, conv_mask=use_conv_mask))
+                                        residual_panes=dense_res, use_conv_mask=self.use_conv_mask))
             feat_in = lcfg['filters']
 
         self.encoder = nn.Sequential(*encoder_layers)
@@ -193,106 +188,146 @@ class JasperEncoder(nn.Module):
         return sum(p.numel() for p in self.parameters() if p.requires_grad)
 
     def forward(self, x):
-        audio_signal, length = x
-        s_input, length = self.encoder(([audio_signal], length))
-        return s_input, length
+        if self.use_conv_mask:
+            audio_signal, length = x
+            return self.encoder(([audio_signal], length))
+        else:
+            return self.encoder([x])
 
 class JasperDecoderForCTC(nn.Module):
-    """Jasper decoder 
+    """Jasper decoder
     """
     def __init__(self, **kwargs):
-        nn.Module.__init__(self)    
+        nn.Module.__init__(self)
         self._feat_in = kwargs.get("feat_in")
         self._num_classes = kwargs.get("num_classes")
         init_mode = kwargs.get('init_mode', 'xavier_uniform')
 
         self.decoder_layers = nn.Sequential(
-            nn.Conv1d(self._feat_in, self._num_classes, kernel_size=1, bias=True),
-            nn.LogSoftmax(dim=1))
+            nn.Conv1d(self._feat_in, self._num_classes, kernel_size=1, bias=True),)
         self.apply(lambda x: init_weights(x, mode=init_mode))
 
-    
     def num_weights(self):
         return sum(p.numel() for p in self.parameters() if p.requires_grad)
 
     def forward(self, encoder_output):
-        out = self.decoder_layers(encoder_output[-1])
-        return out.transpose(1, 2)
+        out = self.decoder_layers(encoder_output[-1]).transpose(1, 2)
+        return nn.functional.log_softmax(out, dim=2)
 
 class Jasper(nn.Module):
-    """Contains data preprocessing, spectrogram augmentation, jasper encoder and decoder 
+    """Contains data preprocessing, spectrogram augmentation, jasper encoder and decoder
     """
     def __init__(self, **kwargs):
-        nn.Module.__init__(self)    
-        self.audio_preprocessor = AudioPreprocessing(**kwargs.get("feature_config"))
+        nn.Module.__init__(self)
+        if kwargs.get("no_featurizer", False):
+            self.audio_preprocessor = None
+        else:
+            self.audio_preprocessor = AudioPreprocessing(**kwargs.get("feature_config"))
+
         self.data_spectr_augmentation = SpectrogramAugmentation(**kwargs.get("feature_config"))
         self.jasper_encoder = JasperEncoder(**kwargs.get("jasper_model_definition"))
         self.jasper_decoder = JasperDecoderForCTC(feat_in=kwargs.get("feat_in"),
-                                                 num_classes=kwargs.get("num_classes"))
+                                                  num_classes=kwargs.get("num_classes"))
+        self.acoustic_model = JasperAcousticModel(self.jasper_encoder, self.jasper_decoder)
 
     def num_weights(self):
         return sum(p.numel() for p in self.parameters() if p.requires_grad)
 
     def forward(self, x):
-        input_signal, length = x
-        t_processed_signal, p_length_t = self.audio_preprocessor(x)
+
+        # Apply optional preprocessing
+        if self.audio_preprocessor is not None:
+            t_processed_signal, p_length_t = self.audio_preprocessor(x)
+        # Apply optional spectral augmentation
         if self.training:
             t_processed_signal = self.data_spectr_augmentation(input_spec=t_processed_signal)
-        t_encoded_t, t_encoded_len_t = self.jasper_encoder((t_processed_signal, p_length_t))
-        return self.jasper_decoder(encoder_output=t_encoded_t), t_encoded_len_t
+
+        if (self.jasper_encoder.use_conv_mask):
+            a_inp = (t_processed_signal, p_length_t)
+        else:
+            a_inp = t_processed_signal
+        # Forward Pass through Encoder-Decoder
+        return self.acoustic_model.forward(a_inp)
+
+
+class JasperAcousticModel(nn.Module):
+    def __init__(self, enc, dec, transpose_in=False):
+        nn.Module.__init__(self)
+        self.jasper_encoder = enc
+        self.jasper_decoder = dec
+        self.transpose_in = transpose_in
+    def forward(self, x):
+        if self.jasper_encoder.use_conv_mask:
+            t_encoded_t, t_encoded_len_t = self.jasper_encoder(x)
+        else:
+            if self.transpose_in:
+                x = x.transpose(1, 2)
+            t_encoded_t = self.jasper_encoder(x)
+
+        out = self.jasper_decoder(encoder_output=t_encoded_t)
+        if self.jasper_encoder.use_conv_mask:
+            return out, t_encoded_len_t
+        else:
+            return out
 
 class JasperEncoderDecoder(nn.Module):
-    """Contains jasper encoder and decoder 
+    """Contains jasper encoder and decoder
     """
     def __init__(self, **kwargs):
-        nn.Module.__init__(self)    
+        nn.Module.__init__(self)
         self.jasper_encoder = JasperEncoder(**kwargs.get("jasper_model_definition"))
         self.jasper_decoder = JasperDecoderForCTC(feat_in=kwargs.get("feat_in"),
-                                                                            num_classes=kwargs.get("num_classes"))
+                                                  num_classes=kwargs.get("num_classes"))
+        self.acoustic_model = JasperAcousticModel(self.jasper_encoder,
+                                                  self.jasper_decoder,
+                                                  kwargs.get("transpose_in", False))
+
     def num_weights(self):
         return sum(p.numel() for p in self.parameters() if p.requires_grad)
 
     def forward(self, x):
-            t_processed_signal, p_length_t = x
-            t_encoded_t, t_encoded_len_t = self.jasper_encoder((t_processed_signal, p_length_t))
-            return self.jasper_decoder(encoder_output=t_encoded_t), t_encoded_len_t
+        return self.acoustic_model.forward(x)
 
 class MaskedConv1d(nn.Conv1d):
-    """1D convolution with sequence masking 
+    """1D convolution with sequence masking
     """
     def __init__(self, in_channels, out_channels, kernel_size, stride=1,
-                             padding=0, dilation=1, groups=1, bias=False, use_mask=True):
+                             padding=0, dilation=1, groups=1, bias=False, use_conv_mask=True):
         super(MaskedConv1d, self).__init__(in_channels, out_channels, kernel_size,
                                                                              stride=stride,
                                                                              padding=padding, dilation=dilation,
                                                                              groups=groups, bias=bias)
-        self.use_mask = use_mask
+        self.use_conv_mask = use_conv_mask
 
     def get_seq_len(self, lens):
         return ((lens + 2 * self.padding[0] - self.dilation[0] * (
             self.kernel_size[0] - 1) - 1) / self.stride[0] + 1)
 
     def forward(self, inp):
-        x, lens = inp
-        if self.use_mask:
+        if self.use_conv_mask:
+            x, lens = inp
             max_len = x.size(2)
-            mask = torch.arange(max_len).to(lens.dtype).to(lens.device).expand(len(lens),
-                                                                                max_len) >= lens.unsqueeze(
-                1)
+            idxs = torch.arange(max_len).to(lens.dtype).to(lens.device).expand(len(lens), max_len)
+            mask = idxs >= lens.unsqueeze(1)
             x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
             del mask
-            
+            del idxs
             lens = self.get_seq_len(lens)
-        
+        else:
+            x = inp
         out = super(MaskedConv1d, self).forward(x)
-        return out, lens
+
+        if self.use_conv_mask:
+            return out, lens
+        else:
+            return out
 
 class JasperBlock(nn.Module):
     """Jasper Block. See https://arxiv.org/pdf/1904.03288.pdf
     """
     def __init__(self, inplanes, planes, repeat=3, kernel_size=11, stride=1,
                              dilation=1, padding='same', dropout=0.2, activation=None,
-                             residual=True, residual_panes=[], conv_mask=False):
+                             residual=True, residual_panes=[], use_conv_mask=False):
         super(JasperBlock, self).__init__()
 
         if padding != "same":
@@ -300,7 +335,7 @@ class JasperBlock(nn.Module):
 
 
         padding_val = get_same_padding(kernel_size[0], stride[0], dilation[0])
-        self.conv_mask = conv_mask
+        self.use_conv_mask = use_conv_mask
         self.conv = nn.ModuleList()
         inplanes_loop = inplanes
         for _ in range(repeat - 1):
@@ -334,7 +369,7 @@ class JasperBlock(nn.Module):
         layers = [
             MaskedConv1d(in_channels, out_channels, kernel_size, stride=stride,
                                      dilation=dilation, padding=padding, bias=bias,
-                                     use_mask=self.conv_mask),
+                                     use_conv_mask=self.use_conv_mask),
             nn.BatchNorm1d(out_channels, eps=1e-3, momentum=0.1)
         ]
         return layers
@@ -352,13 +387,16 @@ class JasperBlock(nn.Module):
         return sum(p.numel() for p in self.parameters() if p.requires_grad)
 
     def forward(self, input_):
-
-        xs, lens_orig = input_
+        if self.use_conv_mask:
+            xs, lens_orig = input_
+        else:
+            xs = input_
+            lens_orig = 0
         # compute forward convolutions
         out = xs[-1]
         lens = lens_orig
         for i, l in enumerate(self.conv):
-            if isinstance(l, MaskedConv1d):
+            if self.use_conv_mask and isinstance(l, MaskedConv1d):
                 out, lens = l((out, lens))
             else:
                 out = l(out)
@@ -367,7 +405,7 @@ class JasperBlock(nn.Module):
             for i, layer in enumerate(self.res):
                 res_out = xs[i]
                 for j, res_layer in enumerate(layer):
-                    if j == 0:
+                    if j == 0 and self.use_conv_mask:
                         res_out, _ = res_layer((res_out, lens_orig))
                     else:
                         res_out = res_layer(res_out)
@@ -376,9 +414,14 @@ class JasperBlock(nn.Module):
         # compute the output
         out = self.out(out)
         if self.res is not None and self.dense_residual:
-            return xs + [out], lens
+            out = xs + [out]
+        else:
+            out = [out]
 
-        return [out], lens
+        if self.use_conv_mask:
+            return out, lens
+        else:
+            return out
 
 class GreedyCTCDecoder(nn.Module):
     """ Greedy CTC Decoder

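`MaskedConv1d.get_seq_len` above applies the standard 1-D convolution length arithmetic, and `get_same_padding` picks the padding that preserves sequence length at stride 1. The relationship can be checked with a small helper (a sketch; the module computes this on tensors of lengths, and its `get_seq_len` uses true division where floor division below matches the actual convolution output size):

```python
def conv_out_len(seq_len, kernel, stride=1, dilation=1, padding=None):
    """Output length of a 1-D convolution.  With the 'same' padding
    chosen by get_same_padding, (kernel // 2) * dilation, the length
    is preserved whenever stride == 1 and halved (rounded up) at
    stride 2."""
    if padding is None:
        padding = (kernel // 2) * dilation
    return (seq_len + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1
```

This is why masking must recompute `lens` after every strided sub-block: the valid region shrinks along with the padded one.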
+ 451 - 0
PyTorch/SpeechRecognition/Jasper/notebooks/JasperTRT.ipynb

@@ -0,0 +1,451 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Copyright 2019 NVIDIA Corporation. All Rights Reserved.\n",
+    "#\n",
+    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
+    "# you may not use this file except in compliance with the License.\n",
+    "# You may obtain a copy of the License at\n",
+    "#\n",
+    "#     http://www.apache.org/licenses/LICENSE-2.0\n",
+    "#\n",
+    "# Unless required by applicable law or agreed to in writing, software\n",
+    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
+    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
+    "# See the License for the specific language governing permissions and\n",
+    "# limitations under the License.\n",
+    "# =============================================================================="
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<img src=\"img/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
+    "\n",
+    "# Jasper Inference For TensorRT 6\n",
+    "This Jupyter notebook provides scripts to perform high-performance inference using NVIDIA TensorRT. \n",
+    "Jasper is a neural acoustic model for speech recognition. Its network architecture is designed to facilitate fast GPU inference. \n",
+    "NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes an inference optimizer and runtime that deliver low latency and high throughput for inference applications.\n",
+    "After optimizing the compute-intensive acoustic model with NVIDIA TensorRT, inference throughput increased by up to 1.8x over native PyTorch."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Overview\n",
+    "\n",
+    "The Jasper model is an end-to-end neural acoustic model for automatic speech recognition (ASR) that provides near state-of-the-art results on LibriSpeech among end-to-end ASR models without any external data. The Jasper architecture of convolutional layers was designed to facilitate fast GPU inference by allowing whole sub-blocks to be fused into a single GPU kernel, which is important for meeting the strict real-time requirements of ASR systems in deployment. The results of the acoustic model are combined with the results of external language models to get the top-ranked word sequences corresponding to a given audio segment. This post-processing step is called decoding.\n",
+    "\n",
+    "The original paper is Jasper: An End-to-End Convolutional Neural Acoustic Model https://arxiv.org/pdf/1904.03288.pdf.\n",
+    "\n",
+    "### 1.1 Model architecture\n",
+    "By default the model configuration is Jasper 10x5 with dense residuals. A Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.\n",
+    "Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout. \n",
+    "In the original paper Jasper is trained with masked convolutions, which mask out the padded part of an input sequence in a batch before the 1D-Convolution.\n",
+    "For inference, masking is not used: on the test and development datasets the mask operation does not improve accuracy, while omitting it improves inference performance, especially after TensorRT optimization.\n",
+    "More information on the model architecture can be found in the [root folder](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper)\n",
+    "\n",
+    "### 1.2 TensorRT Inference pipeline\n",
+    "The Jasper inference pipeline consists of three components: a data preprocessor, an acoustic model, and a greedy decoder. The acoustic model is the most compute-intensive part, accounting for more than 90% of the end-to-end runtime. It is also the only component with learnable parameters and the one that differentiates Jasper from other models, so we focus on it here.\n",
+    "For the non-TRT Jasper inference pipeline, all 3 components are implemented and run with native PyTorch. For the TensorRT inference pipeline, we show the speedup of running the acoustic model with TensorRT, while preprocessing and decoding are reused from the native PyTorch pipeline.\n",
+    "To run a model with TensorRT, we first construct the model in PyTorch and export it to a static ONNX graph. Finally, a TensorRT engine is built from the ONNX file and used for inference."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 1.3 Learning objectives\n",
+    "\n",
+    "This notebook demonstrates:\n",
+    "- How to speed up Jasper inference with TensorRT\n",
+    "- How to download and use fine-tuned NVIDIA Jasper models\n",
+    "- How to run inference with mixed precision"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Requirements\n",
+    "\n",
+    "Please refer to README.md"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Jasper Inference\n",
+    "### 3.1  Start a detached session in the NGC container"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "DATA_DIR=\"$PWD/data\" # replace with user path to dataset root folder that contains various datasets. E.g. this path should contain LibriSpeech as subfolder\n",
+    "CHECKPOINT_DIR=\"$PWD/checkpoints\" # replace with user path to checkpoint folder. Following code assumes this folder to contain 'jasper_fp16.pt'\n",
+    "RESULT_DIR=\"$PWD/results\" # replace with user path to result folder, where log files and prediction files will be saved after inference."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-docker run -it -d --rm --name \"JasperTRT\" \\\n",
+    "  --runtime=nvidia \\\n",
+    "  --shm-size=4g \\\n",
+    "  --ulimit memlock=-1 \\\n",
+    "  --ulimit stack=67108864 \\\n",
+    "  -v $DATA_DIR:/datasets \\\n",
+    "  -v $CHECKPOINT_DIR:/checkpoints/ \\\n",
+    "  -v $RESULT_DIR:/results/ \\\n",
+    "  jasper:trt6 bash"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can also specify the GPU(s) to run the container by adding NV_GPU before the nvidia-docker run command, for example, to specify GPU 1 to run the container, add \"NV_GPU=1\" before the nvidia-docker run command."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!NV_GPU=1 nvidia-docker run -it -d --rm --name \"JasperTRT\" \\\n",
+    "  --runtime=nvidia \\\n",
+    "  --shm-size=4g \\\n",
+    "  --ulimit memlock=-1 \\\n",
+    "  --ulimit stack=67108864 \\\n",
+    "  -v $DATA_DIR:/datasets \\\n",
+    "  -v $CHECKPOINT_DIR:/checkpoints/ \\\n",
+    "  -v $RESULT_DIR:/results/ \\\n",
+    "  jasper:trt6 bash"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3.2 Download and preprocess the dataset.\n",
+    "If LibriSpeech http://www.openslr.org/12 has already been downloaded and preprocessed, no further steps in this subsection need to be taken.\n",
+    "If LibriSpeech has not been downloaded already, note that only a subset of LibriSpeech is typically used for inference (dev-* and test-*). LibriSpeech contains 1000 hours of 16kHz read English speech derived from public domain audiobooks from LibriVox project and has been carefully segmented and aligned. For more information, see paper [LIBRISPEECH: AN ASR CORPUS BASED ON PUBLIC DOMAIN AUDIO BOOKS paper](http://www.danielpovey.com/files/2015_icassp_librispeech.pdf).\n",
+    "To acquire the inference subset of LibriSpeech run (does not require GPU):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-docker exec -it JasperTRT bash trt/scripts/download_inference_librispeech.sh"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once the data download is complete, the following folders should exist:\n",
+    "* /datasets/LibriSpeech/\n",
+    "    * dev-clean/\n",
+    "    * dev-other/\n",
+    "    * test-clean/\n",
+    "    * test-other/\n",
+    "\n",
+    "Since /datasets/ is mounted to <DATA_DIR> on the host,  once the dataset is downloaded it is accessible from outside of the container at <DATA_DIR>/LibriSpeech.\n",
+    "\n",
+    "Next, preprocessing the data can be performed with the following command:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-docker exec -it JasperTRT bash trt/scripts/preprocess_inference_librispeech.sh"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once the data is preprocessed, the following additional files should now exist:\n",
+    "\n",
+    "* /datasets/LibriSpeech/\n",
+    "    * librispeech-dev-clean-wav.json\n",
+    "    * librispeech-dev-other-wav.json\n",
+    "    * librispeech-test-clean-wav.json\n",
+    "    * librispeech-test-other-wav.json\n",
+    "    * dev-clean/\n",
+    "    * dev-other/\n",
+    "    * test-clean/\n",
+    "    * test-other/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3.3 Download pretrained model checkpoint\n",
+    "A pretrained model checkpoint can be downloaded from NGC model repository https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16\n",
+    "        \n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3.4. Start TensorRT inference prediction\n",
+    "\n",
+    "Inside the container, use the following script to run inference with TensorRT.\n",
+    "You will need to set the parameters such as: \n",
+    "\n",
+    "\n",
+    "* `CHECKPOINT`: Model checkpoint path\n",
+    "* `TRT_PRECISION`: \"fp32\" or \"fp16\". Defines which precision kernels will be used for TensorRT engine (default: \"fp32\")\n",
+    "* `PYTORCH_PRECISION`: \"fp32\" or \"fp16\". Defines which precision will be used for inference in PyTorch (default: \"fp32\")\n",
+    "* `TRT_PREDICTION_PATH`: file to store inference prediction results generated with TensorRT\n",
+    "* `PYT_PREDICTION_PATH`: file to store inference prediction results generated with native PyTorch\n",
+    "* `DATASET`: LibriSpeech dataset (default: dev-clean)\n",
+    "* `NUM_STEPS`: Number of inference steps (default: -1)\n",
+    "* `BATCH_SIZE`: Mini batch size (default: 1)\n",
+    "* `NUM_FRAMES`: cuts/pads all pre-processed feature tensors to this length. 100 frames ~ 1 second of audio (default: 3600)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-docker exec -it -e CHECKPOINT=/checkpoints/jasper_fp16.pt -e TRT_PREDICTION_PATH=/results/result.txt JasperTRT bash trt/scripts/trt_inference.sh"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    " ### 3.5. Start TensorRT Inference Benchmark\n",
+    "\n",
+    "Run the following commmand to run inference benchmark with TensorRT inside the container.\n",
+    "\n",
+    "You will need to set the parameters such as:\n",
+    "\n",
+    "* `CHECKPOINT`: Model checkpoint path    \n",
+    "* `NUM_STEPS`: number of inference steps. If -1 runs inference on entire dataset. (default: -1)\n",
+    "* `NUM_FRAMES`: cuts/pads all pre-processed feature tensors to this length. 100 frames ~ 1 second of audio (default: 512)\n",
+    "* `BATCH_SIZE`: data batch size (default: 64)\n",
+    "* `TRT_PRECISION`: \"fp32\" or \"fp16\". Defines which precision kernels will be used for TensorRT engine (default: \"fp32\")\n",
+    "* `PYTORCH_PRECISION`: \"fp32\" or \"fp16\". Defines which precision will be used for inference in PyTorch (default: \"fp32\")\n",
+    "* `CSV_PATH`: file to store CSV results (default: \"/results/res.csv\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-docker exec -it -e CHECKPOINT=/checkpoints/jasper_fp16.pt -e TRT_PREDICTION_PATH=/results/benchmark.txt JasperTRT bash trt/scripts/trt_inference_benchmark.sh"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 4. Automatic Mixed Precision\n",
+    "\n",
+    "Mixed precision is the combined use of different numerical precisions in a computational method. Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of Tensor Cores in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. \n",
+    "\n",
+    "Using mixed precision training requires two steps:\n",
+    "\n",
+    "* Porting the model to use the FP16 data type where appropriate.\n",
+    "* Adding loss scaling to preserve small gradient values.\n",
+    "\n",
+    "The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in CUDA 8 in the NVIDIA Deep Learning SDK.\n",
+    "For information about:\n",
+    "\n",
+    "How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) documentation.\n",
+    "\n",
+    "Techniques used for mixed precision training, see the blog [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/).\n",
+    "\n",
+    "APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/).\n",
+    "\n",
+    "To enable mixed precision, we can specify the variables `TRT_PRECISION` and `PYTORCH_PRECISION` by setting them to `TRT_PRECISION=fp16` and `PYTORCH_PRECISION=fp16` when running the inference. To run the TensorRT inference benchmarking using automatic mixed precision:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-docker exec -it -e CHECKPOINT=/checkpoints/jasper_fp16.pt -e TRT_PREDICTION_PATH=/results/benchmark.txt -e TRT_PRECISION=fp16 -e PYTORCH_PRECISION=fp16 -e CSV_PATH=/results/res_fp16.csv JasperTRT bash trt/scripts/trt_inference_benchmark.sh"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "From the performance metrics that you get from res.csv (fp32) and res_fp16.csv (automatic mixed precision) files, you can see that automatic mixed precision can speedup the inference efficiently compared to fp32."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5. Play with audio examples\n",
+    "\n",
+    "You can perform inference using pre-trained checkpoints which takes audio file (in .wav format) as input, and produces the corresponding text file. You can customize the content of the text file. For example, there is a keynote.wav file as input and we can listen to it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import IPython.display as ipd\n",
+    "ipd.Audio('keynote.wav', rate=22050)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can run inference using the trt/perf.py script, the checkpoint is passed as `--ckpt` argument, `--model`_toml specifies the path to network configuration file (see examples in \"config\" directory), `--make_onnx`  <path> does export to ONNX file at <path> if set, `--engine_path` saves the engine (*.plan) file.\n",
+    "\n",
+    "To create a new engine file (jasper.plan) for TensorRT and run it using fp32 (building the engine for the first time can take several minutes):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-docker exec -it JasperTRT python trt/perf.py --ckpt_path /checkpoints/jasper_fp16.pt --wav=keynote.wav --model_toml=configs/jasper10x5dr_nomask.toml --make_onnx --onnx_path jasper.onnx --engine_path jasper.plan"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you already have the engine file (jasper.plan), to run an existing engine file of TensorRT using fp32: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-docker exec -it JasperTRT python trt/perf.py --wav=keynote.wav --model_toml=configs/jasper10x5dr_nomask.toml --use_existing_engine --engine_path jasper.plan --trt_fp16"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To run inference of the input audio file using automatic mixed precision, add the argument `--trt_fp16`. Using automatic mixed precision, the inference time can be reduced efficiently compared to that of using fp32 (building the engine for the first time can take several minutes):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-docker exec -it JasperTRT python trt/perf.py --ckpt_path /checkpoints/jasper_fp16.pt --wav=keynote.wav --model_toml=configs/jasper10x5dr_nomask.toml --make_onnx --onnx_path jasper.onnx --engine_path jasper_fp16.plan --trt_fp16"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you already have the engine file (jasper_fp16.plan), to run an existing engine file of TensorRT using automatic mixed precision: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!nvidia-docker exec -it JasperTRT python trt/perf.py --wav=keynote.wav --model_toml=configs/jasper10x5dr_nomask.toml --use_existing_engine --engine_path jasper_fp16.plan --trt_fp16"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can input your own audio file and generate the output text file using this way.\n",
+    "\n",
+    "For more information about TensorRT and building an engine file in Python, please see: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html#python_topics"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#stop your container in the end\n",
+    "!docker stop JasperTRT"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. What's next"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now you are familiar with running Jasper inference with TensorRT, using automatic mixed precision, you may want to play with your own dataset, or train the model using your own dataset. For information on training, please see our Github repo: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}

+ 57 - 0
PyTorch/SpeechRecognition/Jasper/notebooks/README.md

@@ -0,0 +1,57 @@
+## Overview
+
+This notebook provides step-by-step scripts for running Jasper inference with TensorRT. You can run inference on either the LibriSpeech dataset or your own audio input in .wav format, and generate the corresponding transcript.
+
+## Requirements
+
+This repository contains a Dockerfile which extends the PyTorch 19.09-py3 NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
+    
+* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+* [PyTorch 19.09-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
+* [NVIDIA Turing](https://www.nvidia.com/en-us/geforce/turing/) or [Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) based GPU
+* [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16)
+
+## Quick Start Guide
+
+Running the following scripts will build and launch the container containing all required dependencies for both TensorRT as well as native PyTorch. This is necessary for using inference with TensorRT and can also be used for data download, processing and training of the model.
+
+1. Clone the repository.
+
+```bash
+git clone https://github.com/NVIDIA/DeepLearningExamples
+cd DeepLearningExamples/PyTorch/SpeechRecognition/Jasper
+```
+2. Build the Jasper PyTorch with TRT 6 container:
+
+```bash
+bash trt/scripts/docker/trt_build.sh
+```
+3. Prepare to start a detached session in the NGC container.
+Create three directories on your local machine for the dataset, checkpoints, and results, named `data`, `checkpoint`, and `result` respectively:
+
+```bash
+mkdir data checkpoint result
+```
+Download the checkpoint file `jasperpyt_fp16` into the `checkpoint` directory from the NGC model repository: https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16
+If the dataset resides on a separate drive, for example /dev/sdb, mount it at the `data` directory (replace /dev/sdb with your own device or directory):
+
+```bash
+sudo mount /dev/sdb data
+```
+The Jasper PyTorch container will be launched from the Jupyter notebook. Within the container, the contents of the root repository are copied to the /workspace/jasper directory. The /datasets, /checkpoints, and /results directories are mounted as volumes and mapped to the corresponding host directories `data`, `checkpoint`, and `result`.
+To run the notebook on your local machine, run:
+
+```bash
+jupyter notebook notebooks/JasperTRT.ipynb
+```
+To run the notebook on a remote machine, run:
+
+```bash
+jupyter notebook --ip=0.0.0.0 --allow-root
+```
+Then navigate a web browser to the IP address or hostname of the host machine at port 8888: http://[host machine]:8888
+
+Use the token listed in the output of the jupyter command to log in, for example: http://[host machine]:8888/?token=aae96ae9387cd28151868fee318c3b3581a2d794f3b25c6b
+
+
+

BIN
PyTorch/SpeechRecognition/Jasper/notebooks/keynote.wav


+ 16 - 3
PyTorch/SpeechRecognition/Jasper/parts/features.py

@@ -21,6 +21,15 @@ from .segment import AudioSegment
 from apex import amp
 
 
+def audio_from_file(file_path, offset=0, duration=0, trim=False, target_sr=16000):
+    audio = AudioSegment.from_file(file_path,
+                                   target_sr=target_sr,
+                                   int_values=False,
+                                   offset=offset, duration=duration, trim=trim)
+    samples=torch.tensor(audio.samples, dtype=torch.float).cuda()
+    num_samples = torch.tensor(samples.shape[0]).int().cuda()
+    return (samples.unsqueeze(0), num_samples.unsqueeze(0))
+
 class WaveformFeaturizer(object):
     def __init__(self, input_cfg, augmentor=None):
         self.augmentor = augmentor if augmentor is not None else AudioAugmentor()
@@ -51,6 +60,7 @@ class WaveformFeaturizer(object):
 
 constant = 1e-5
 def normalize_batch(x, seq_len, normalize_type):
+#    print ("normalize_batch: x, seq_len, shapes: ", x.shape, seq_len, seq_len.shape)
     if normalize_type == "per_feature":
         x_mean = torch.zeros((seq_len.shape[0], x.shape[1]), dtype=x.dtype,
                                                  device=x.device)
@@ -191,10 +201,10 @@ class FilterbankFeatures(nn.Module):
                        preemph=0.97,
                        nfilt=64, lowfreq=0, highfreq=None, log=True, dither=constant,
                        pad_to=8,
-                       max_duration=16.7, 
+                       max_duration=16.7,
                        frame_splicing=1):
         super(FilterbankFeatures, self).__init__()
-        print("PADDING: {}".format(pad_to))
+#        print("PADDING: {}".format(pad_to))
 
         torch_windows = {
             'hann': torch.hann_window,
@@ -242,10 +252,13 @@ class FilterbankFeatures(nn.Module):
     @torch.no_grad()
     def forward(self, inp):
         x, seq_len = inp
+
         dtype = x.dtype
 
         seq_len = self.get_seq_len(seq_len)
 
+#        print ("forward: x, seq_len, shapes: ", x.shape, seq_len, seq_len.shape)
+        
         # dither
         if self.dither > 0:
             x += self.dither * torch.randn_like(x)
@@ -282,7 +295,7 @@ class FilterbankFeatures(nn.Module):
         max_len = x.size(-1)
         mask = torch.arange(max_len).to(seq_len.dtype).to(x.device).expand(x.size(0),
                                                                            max_len) >= seq_len.unsqueeze(1)
-        
+
         x = x.masked_fill(mask.unsqueeze(1).to(device=x.device), 0)
         del mask
         pad_to = self.pad_to

+ 3 - 3
PyTorch/SpeechRecognition/Jasper/parts/manifest.py

@@ -89,7 +89,7 @@ class Manifest(object):
                         else:
                             min_speed = min(x['speed'] for x in files_and_speeds)
                         max_duration = self.max_duration * min_speed
-                    
+
                     data['duration'] = data['original_duration']
                     if min_duration is not None and data['duration'] < min_duration:
                         filtered_duration += data['duration']
@@ -112,7 +112,7 @@ class Manifest(object):
                         filtered_duration += data['duration']
                         continue
                     data["transcript"] = self.parse_transcript(transcript_text) # convert to vocab indices
-                    
+
                     if speed_perturbation:
                         audio_paths = [x['fname'] for x in files_and_speeds]
                         data['audio_duration'] = [x['duration'] for x in files_and_speeds]
@@ -122,7 +122,7 @@ class Manifest(object):
                     data['audio_filepath'] = [os.path.join(data_dir, x) for x in audio_paths]
                     data.pop('files')
                     data.pop('original_duration')
-         
+
                     ids.append(data)
                     duration += data['duration']
 

+ 1 - 1
PyTorch/SpeechRecognition/Jasper/requirements.txt

@@ -6,4 +6,4 @@ librosa
 toml
 soundfile
 ipdb
-sox
+sox

+ 1 - 0
PyTorch/SpeechRecognition/Jasper/scripts/docker/launch.sh

@@ -27,4 +27,5 @@ docker run -it --rm \
   -v "$DATA_DIR":/datasets \
   -v "$CHECKPOINT_DIR":/checkpoints/ \
   -v "$RESULT_DIR":/results/ \
+  -v $PWD:/code \
   jasper bash

+ 1 - 1
PyTorch/SpeechRecognition/Jasper/scripts/download_librispeech.sh

@@ -25,4 +25,4 @@ then
     python utils/download_librispeech.py utils/librispeech.csv $DATA_DIR -e ${DATA_ROOT_DIR}/
 else
     echo "Directory $DATA_DIR already exists."
-fi
+fi

+ 1 - 1
PyTorch/SpeechRecognition/Jasper/scripts/evaluation.sh

@@ -89,4 +89,4 @@ else
      $CMD
    ) |& tee "$LOGFILE"
 fi
-set +x
+set +x

+ 0 - 5
PyTorch/SpeechRecognition/Jasper/scripts/inference_benchmark.sh

@@ -82,8 +82,3 @@ else
    grep 'latency' "$LOGFILE"
 fi
 set +x
-
-
-
-
-

+ 0 - 1
PyTorch/SpeechRecognition/Jasper/scripts/train.sh

@@ -108,4 +108,3 @@ else
    ) |& tee $LOGFILE
 fi
 set +x
-

+ 0 - 1
PyTorch/SpeechRecognition/Jasper/scripts/train_benchmark.sh

@@ -128,4 +128,3 @@ else
    echo "final_eval_loss: $final_eval_loss" | tee -a "$LOGFILE"
    echo "final_eval_wer: $final_eval_wer" | tee -a "$LOGFILE"
 fi
-

+ 29 - 29
PyTorch/SpeechRecognition/Jasper/train.py

@@ -28,10 +28,10 @@ from helpers import monitor_asr_train_progress, process_evaluation_batch, proces
 from model import AudioPreprocessing, CTCLossNM, GreedyCTCDecoder, Jasper
 from optimizers import Novograd, AdamW
 
-    
+
 def lr_policy(initial_lr, step, N):
     """
-    learning rate decay 
+    learning rate decay
     Args:
         initial_lr: base learning rate
         step: current iteration number
@@ -45,7 +45,7 @@ def save(model, optimizer, epoch, output_dir):
     """
     Saves model checkpoint
     Args:
-        model: model 
+        model: model
         optimizer: optimizer
         epoch: epoch of model training
         output_dir: path to save model checkpoint
@@ -57,8 +57,8 @@ def save(model, optimizer, epoch, output_dir):
     if (not torch.distributed.is_initialized() or (torch.distributed.is_initialized() and torch.distributed.get_rank() == 0)):
         model_to_save = model.module if hasattr(model, 'module') else model  # Only save the model it-self
         save_checkpoint={
-                        'epoch': epoch, 
-                        'state_dict': model_to_save.state_dict(), 
+                        'epoch': epoch,
+                        'state_dict': model_to_save.state_dict(),
                         'optimizer': optimizer.state_dict()
                         }
 
@@ -69,15 +69,15 @@ def save(model, optimizer, epoch, output_dir):
 
 
 def train(
-        data_layer, 
+        data_layer,
         data_layer_eval,
         model,
-        ctc_loss, 
-        greedy_decoder, 
-        optimizer, 
-        optim_level, 
-        labels, 
-        multi_gpu, 
+        ctc_loss,
+        greedy_decoder,
+        optimizer,
+        optim_level,
+        labels,
+        multi_gpu,
         args,
         fn_lr_policy=None):
     """Trains model
@@ -128,10 +128,10 @@ def train(
 
             # final aggregation across all workers and minibatches) and logging of results
             wer, eloss = process_evaluation_epoch(_global_var_dict)
-        
+
             print_once("==========>>>>>>Evaluation Loss: {0}\n".format(eloss))
             print_once("==========>>>>>>Evaluation WER: {0}\n".format(wer))
-            
+
     print_once("Starting .....")
     start_time = time.time()
 
@@ -157,7 +157,7 @@ def train(
             if batch_counter == 0:
 
                 if fn_lr_policy is not None:
-                    adjusted_lr = fn_lr_policy(step) 
+                    adjusted_lr = fn_lr_policy(step)
                     for param_group in optimizer.param_groups:
                             param_group['lr'] = adjusted_lr
                 optimizer.zero_grad()
@@ -165,8 +165,8 @@ def train(
 
             t_audio_signal_t, t_a_sig_length_t, t_transcript_t, t_transcript_len_t = tensors
             model.train()
+            
             t_log_probs_t, t_encoded_len_t = model(x=(t_audio_signal_t, t_a_sig_length_t))
-
             t_loss_t = ctc_loss(log_probs=t_log_probs_t, targets=t_transcript_t, input_length=t_encoded_len_t, target_length=t_transcript_len_t)
             if args.gradient_accumulation_steps > 1:
                     t_loss_t = t_loss_t / args.gradient_accumulation_steps
@@ -238,8 +238,8 @@ def main(args):
     dataset_vocab = jasper_model_definition['labels']['labels']
     ctc_vocab = add_ctc_labels(dataset_vocab)
 
-    train_manifest = args.train_manifest 
-    val_manifest = args.val_manifest 
+    train_manifest = args.train_manifest
+    val_manifest = args.val_manifest
     featurizer_config = jasper_model_definition['input']
     featurizer_config_eval = jasper_model_definition['input_eval']
     featurizer_config["optimization_level"] = optim_level
@@ -255,7 +255,7 @@ def main(args):
         featurizer_config_eval['pad_to'] = "max"
     print_once('model_config')
     print_dict(jasper_model_definition)
-         
+
     if args.gradient_accumulation_steps < 1:
         raise ValueError('Invalid gradient accumulation steps parameter {}'.format(args.gradient_accumulation_steps))
     if args.batch_size % args.gradient_accumulation_steps != 0:
@@ -282,9 +282,9 @@ def main(args):
                                     multi_gpu=multi_gpu,
                                     pad_to_max=args.pad_to_max
                                     )
- 
+
     model = Jasper(feature_config=featurizer_config, jasper_model_definition=jasper_model_definition, feat_in=1024, num_classes=len(ctc_vocab))
- 
+
     if args.ckpt is not None:
         print_once("loading model from {}".format(args.ckpt))
         checkpoint = torch.load(args.ckpt, map_location="cpu")
@@ -304,13 +304,13 @@ def main(args):
         args.step_per_epoch = math.ceil(N / (args.batch_size * (1 if not torch.distributed.is_initialized() else torch.distributed.get_world_size())))
     elif sampler_type == 'bucket':
         args.step_per_epoch = int(len(data_layer.sampler) / args.batch_size )
-    
+
     print_once('-----------------')
     print_once('Have {0} examples to train on.'.format(N))
     print_once('Have {0} steps / (gpu * epoch).'.format(args.step_per_epoch))
     print_once('-----------------')
 
-    fn_lr_policy = lambda s: lr_policy(args.lr, s, args.num_epochs * args.step_per_epoch) 
+    fn_lr_policy = lambda s: lr_policy(args.lr, s, args.num_epochs * args.step_per_epoch)
 
 
     model.cuda()
@@ -333,7 +333,7 @@ def main(args):
             models=model,
             optimizers=optimizer,
             opt_level=AmpOptimizations[optim_level])
-    
+
     if args.ckpt is not None:
         optimizer.load_state_dict(checkpoint['optimizer'])
 
@@ -341,12 +341,12 @@ def main(args):
 
     train(
         data_layer=data_layer,
-        data_layer_eval=data_layer_eval, 
-        model=model, 
-        ctc_loss=ctc_loss, 
+        data_layer_eval=data_layer_eval,
+        model=model,
+        ctc_loss=ctc_loss,
         greedy_decoder=greedy_decoder,
-        optimizer=optimizer, 
-        labels=ctc_vocab, 
+        optimizer=optimizer,
+        labels=ctc_vocab,
         optim_level=optim_level,
         multi_gpu=multi_gpu,
         fn_lr_policy=fn_lr_policy if args.lr_decay else None,

+ 31 - 0
PyTorch/SpeechRecognition/Jasper/trt/Dockerfile

@@ -0,0 +1,31 @@
+
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.09-py3
+FROM ${FROM_IMAGE_NAME}
+
+RUN apt-get update && apt-get install -y python3
+RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.243-1_amd64.deb \
+&& dpkg -i cuda-repo-*.deb \
+&& wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb \
+&& dpkg -i nvidia-machine-learning-repo-*.deb \
+&& apt-get update \
+&& apt-get install -y --no-install-recommends python-libnvinfer python3-libnvinfer
+
+
+RUN cp -r /usr/lib/python3.6/dist-packages/tensorrt /opt/conda/lib/python3.6/site-packages/tensorrt
+# Add TensorRT executable to path (trtexec)
+ENV PATH=$PATH:/usr/src/tensorrt/bin
+
+
+# Here's a good place to install pip reqs from JoC repo.
+# At the same step, also install TRT pip reqs
+WORKDIR /tmp/pipReqs
+COPY requirements.txt /tmp/pipReqs/jocRequirements.txt
+COPY trt/requirements.txt /tmp/pipReqs/trtRequirements.txt
+RUN pip install --disable-pip-version-check -U -r jocRequirements.txt -r trtRequirements.txt
+
+# These packages are required for running preprocessing on the dataset to acquire manifest files and the like
+RUN apt-get install -y libsndfile1 && apt-get install -y ffmpeg sox && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace/jasper
+COPY . .
+

+ 294 - 0
PyTorch/SpeechRecognition/Jasper/trt/README.md

@@ -0,0 +1,294 @@
+
+# Jasper Inference For TensorRT
+
+This is a subfolder of the Jasper for PyTorch repository, tested and maintained by NVIDIA, providing scripts for high-performance inference using NVIDIA TensorRT. Jasper is a neural acoustic model for speech recognition whose network architecture is designed to facilitate fast GPU inference. More information about Jasper and its training can be found in the [root directory](../README.md). 
+NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications.
+After optimizing the compute-intensive acoustic model with NVIDIA TensorRT, inference throughput increased by up to 1.8x over native PyTorch. 
+
+
+
+## Table Of Contents
+
+- [Model overview](#model-overview)
+   * [Model architecture](#model-architecture)
+   * [TRT Inference pipeline](#trt-inference-pipeline)
+   * [Version Info](#version-info)
+- [Setup](#setup)
+   * [Requirements](#requirements)
+- [Quick Start Guide](#quick-start-guide)
+- [Advanced](#advanced)
+   * [Scripts and sample code](#scripts-and-sample-code)
+   * [Parameters](#parameters)
+   * [TRT Inference Process](#trt-inference-process)
+   * [TRT Inference Benchmark Process](#trt-inference-benchmark-process)
+- [Performance](#performance)
+   * [Results](#results)
+      * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
+
+
+## Model overview
+
+### Model architecture
+By default the model configuration is Jasper 10x5 with dense residuals. A Jasper BxR model has B blocks, each consisting of R repeating sub-blocks.
+Each sub-block applies the following operations in sequence: 1D-Convolution, Batch Normalization, ReLU activation, and Dropout.
+In the original paper Jasper is trained with masked convolutions, which mask out the padded part of an input sequence in a batch before the 1D-Convolution.
+For inference, masking is not used: on the development and test datasets it does not improve accuracy, while omitting it improves inference performance, especially after TensorRT optimization.
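The sub-block structure described above can be sketched in PyTorch. This is an illustrative sketch only: the channel counts, kernel width, and dropout probability below are placeholders, not the actual Jasper 10x5 configuration.

```python
import torch
import torch.nn as nn

def jasper_subblock(in_ch, out_ch, kernel_size, dropout=0.2):
    """One Jasper sub-block: 1D-Convolution, Batch Norm, ReLU, Dropout.

    Channel sizes, kernel width, and dropout rate here are illustrative.
    """
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.Dropout(dropout),
    )

block = jasper_subblock(64, 128, 11)
x = torch.randn(2, 64, 100)   # (batch, channels, time)
y = block(x)
print(y.shape)                # torch.Size([2, 128, 100])
```

A Jasper BxR model stacks R such sub-blocks per block, with residual connections joining the blocks.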
+
+
+### TRT Inference pipeline
+
+The Jasper inference pipeline consists of 3 components: data preprocessor, acoustic model and greedy decoder. The acoustic model is the most compute-intensive component, accounting for more than 90% of the end-to-end runtime. It is also the only component with learnable parameters and the one that differentiates Jasper from other models, so we focus on the acoustic model.
+
+For the non-TRT Jasper inference pipeline, all 3 components are implemented and run with native PyTorch. For the TensorRT inference pipeline, we show the speedup of running the acoustic model with TensorRT, while preprocessing and decoding are reused from the native PyTorch pipeline.
+
+To run a model with TensorRT, we first construct the model in PyTorch, which is then exported into an ONNX static graph. Finally, a TensorRT engine is constructed from the ONNX file and can be launched to perform inference.
+
+
+### Version Info
+
+The following software version configuration has been tested and known to work:
+
+|Software|Version|
+|--------|-------|
+|Python|3.6.9|
+|PyTorch|1.2.0|
+|TensorRT|6.0.1.5|
+|CUDA|10.1.243|
+
+## Setup
+
+The following section lists the requirements in order to start inference on the Jasper model with TRT.
+
+### Requirements
+
+This repository contains a `Dockerfile` which extends the PyTorch 19.09-py3 NGC container and encapsulates some dependencies. Ensure you have the following components:
+
+* [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
+* [PyTorch 19.09-py3 NGC container](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch)
+* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
+* [Pretrained Jasper Model Checkpoint](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16)
+
+Required Python packages are listed in `requirements.txt` and `trt/requirements.txt`. These packages are automatically installed when the Docker container is built. To manually install them, run:
+
+
+```bash
+pip install -r requirements.txt
+pip install -r trt/requirements.txt
+```
+
+
+## Quick Start Guide
+
+
+Running the following scripts will build and launch a container with all required dependencies for both TensorRT and native PyTorch. The container is necessary for inference with TensorRT and can also be used for data download, preprocessing and training of the model.
+
+1. Build the Jasper PyTorch and TensorRT container:
+
+```
+bash trt/scripts/docker/trt_build.sh
+```
+2. Start an interactive session in the NGC docker container:
+
+```
+bash trt/scripts/docker/trt_launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR>
+```
+
+Alternatively, to start a script in the docker container:
+
+```
+bash trt/scripts/docker/trt_launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR> <SCRIPT_PATH>
+```
+
+The `/datasets`, `/checkpoints`, `/results` directories will be mounted as volumes and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<RESULT_DIR>` on the host. **These three paths should be absolute and should already exist.** The contents of this repository will be mounted to the `/workspace/jasper` directory. Note that `<DATA_DIR>`, `<CHECKPOINT_DIR>`, and `<RESULT_DIR>` directly correspond to the same arguments in `scripts/docker/launch.sh` mentioned in the [Quick Start Guide](../README.md).
+
+Briefly, `<DATA_DIR>` should contain, or be prepared to contain, a `LibriSpeech` sub-directory (created in [Acquiring Dataset](#acquiring-dataset)), `<CHECKPOINT_DIR>` should contain a PyTorch model checkpoint (`*.pt`) file obtained through training described in [Quick Start Guide](../README.md), and `<RESULT_DIR>` should be prepared to contain timing results, logs, serialized TRT engines, and ONNX files.
+
+
+
+3. Acquire the dataset
+
+If LibriSpeech has already been downloaded and preprocessed as defined in the [Quick Start Guide](../README.md), no further steps in this subsection need to be taken.
+
+If LibriSpeech has not been downloaded already, note that only a subset of LibriSpeech is typically used for inference (`dev-*` and `test-*`). To acquire the inference subset of LibriSpeech, run the following command inside the container (no GPU required):
+
+```
+bash trt/scripts/download_inference_librispeech.sh
+```
+
+Once the data download is complete, the following folders should exist:
+
+* `/datasets/LibriSpeech/`
+   * `dev-clean/`
+   * `dev-other/`
+   * `test-clean/`
+   * `test-other/`
+
+Next, preprocess the data with the following command:
+
+```
+bash trt/scripts/preprocess_inference_librispeech.sh
+```
+
+Once the data is preprocessed, the following additional files should now exist:
+* `/datasets/LibriSpeech/`
+   * `librispeech-dev-clean-wav.json`
+   * `librispeech-dev-other-wav.json`
+   * `librispeech-test-clean-wav.json`
+   * `librispeech-test-other-wav.json`
+   * `dev-clean-wav/`
+   * `dev-other-wav/`
+   * `test-clean-wav/`
+   * `test-other-wav/`
+
+4. Start TRT inference prediction
+
+Inside the container, use the following script to run inference with TRT.
+```
+export CHECKPOINT=<CHECKPOINT>
+export TRT_PRECISION=<PRECISION>
+export PYTORCH_PRECISION=<PRECISION>
+export TRT_PREDICTION_PATH=<TRT_PREDICTION_PATH>
+bash trt/scripts/trt_inference.sh
+```
+A pretrained model checkpoint can be downloaded from the [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16).
+More details can be found in [Advanced](#advanced) under [Scripts and sample code](#scripts-and-sample-code), [Parameters](#parameters) and [TRT Inference process](#trt-inference-process).
+
+5. Start TRT inference benchmark
+
+Inside the container, use the following script to run inference benchmark with TRT.
+```
+export CHECKPOINT=<CHECKPOINT>
+export NUM_STEPS=<NUM_STEPS>
+export NUM_FRAMES=<NUM_FRAMES>
+export BATCH_SIZE=<BATCH_SIZE>
+export TRT_PRECISION=<PRECISION>
+export PYTORCH_PRECISION=<PRECISION>
+export CSV_PATH=<CSV_PATH>
+bash trt/scripts/trt_inference_benchmark.sh
+```
+A pretrained model checkpoint can be downloaded from the [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16). 
+More details can be found in [Advanced](#advanced) under [Scripts and sample code](#scripts-and-sample-code), [Parameters](#parameters) and [TRT Inference Benchmark process](#trt-inference-benchmark-process).
+
+6. Start the Jupyter notebook to run inference interactively
+
+Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.
+The notebook located at `notebooks/JasperTRT.ipynb` offers an interactive way to run Steps 2-5. In addition, the notebook shows examples of how to use TRT to transcribe a single audio file into text. To launch the application, follow the instructions in [../notebooks/README.md](../notebooks/README.md).
+A pretrained model checkpoint can be downloaded from the [NGC model repository](https://ngc.nvidia.com/catalog/models/nvidia:jasperpyt_fp16).
+
+
+## Advanced
+The following sections provide greater details on inference benchmarking with TRT and show inference results.
+
+### Scripts and sample code
+In the `trt/` directory, the most important files are:
+* `Dockerfile`: Container to run Jasper inference with TRT.
+* `requirements.txt`: Python package dependencies. Installed when building the Docker container.
+* `perf.py`: Entry point for inference pipeline using TRT.
+* `perfprocedures.py`: Contains functionality to run inference through both the PyTorch model and TRT Engine, taking runtime measurements of each component of the inference process for comparison.
+* `trtutils.py`: Helper functions for TRT components of Jasper inference.
+* `perfutils.py`: Helper functions for non-TRT components of Jasper inference.
+
+The `trt/scripts/` directory has one-click scripts to run supported functionalities, such as:
+
+* `download_inference_librispeech.sh`: Downloads the LibriSpeech inference subset (`dev-*` and `test-*`).
+* `preprocess_inference_librispeech.sh`: Preprocesses the LibriSpeech raw data files for inference.
+* `trt_inference_benchmark.sh`: Benchmarks and compares TRT and PyTorch inference pipelines using the `perf.py` script.
+* `trt_inference.sh`: Runs TRT and PyTorch inference by delegating to `trt_inference_benchmark.sh`.
+* `walk_benchmark.sh`: Example of using `trt/scripts/trt_inference_benchmark.sh` that *walks* (sweeps over) a range of values for `BATCH_SIZE` and `NUM_FRAMES`.
+* `docker/`: Contains the scripts for building and launching the container.
+
+
+### Parameters
+
+The list of parameters available for `trt/scripts/trt_inference_benchmark.sh` is:
+
+```
+Required:
+--------
+CHECKPOINT: Model checkpoint path
+
+Arguments with Defaults:
+--------
+DATA_DIR: directory of the dataset (default: `/datasets/LibriSpeech`)
+DATASET: name of dataset to use (default: `dev-clean`)
+RESULT_DIR: directory for results including TRT engines, ONNX files, logs, and CSVs (default: `/results`)
+CREATE_LOGFILE: boolean that indicates whether to create log of session to be stored in `$RESULT_DIR` (default: "true")
+CSV_PATH: file to store CSV results (default: `/results/res.csv`)
+TRT_PREDICTION_PATH: file to store inference prediction results generated with TRT (default: `none`)
+PYT_PREDICTION_PATH: file to store inference prediction results generated with native PyTorch (default: `none`)
+VERBOSE: boolean that indicates whether to verbosely describe TRT engine building/deserialization and TRT inference (default: "false")
+TRT_PRECISION: "fp32" or "fp16". Defines which precision kernels will be used for TRT engine (default: "fp32")
+PYTORCH_PRECISION: "fp32" or "fp16". Defines which precision will be used for inference in PyTorch (default: "fp32")
+NUM_STEPS: Number of inference steps. If -1 runs inference on entire dataset (default: 100)
+BATCH_SIZE: data batch size (default: 64)
+NUM_FRAMES: cuts/pads all pre-processed feature tensors to this length. 100 frames ~ 1 second of audio (default: 512)
+FORCE_ENGINE_REBUILD: boolean; if "true", rebuilds the TRT engine even if an already-built engine of equivalent precision, batch size, and number of frames exists.
+    Engines are specific to the GPU, library versions, TRT version, and CUDA version they were built with and cannot be used in a different environment. (default: "true")
+```
+
+The parameters available for `trt/scripts/trt_inference.sh` are the same as for `trt/scripts/trt_inference_benchmark.sh`, only with different default values. In the following, only the parameters with different defaults are listed:
+
+```
+TRT_PREDICTION_PATH: file to store inference prediction results generated with TRT (default: `/results/trt_predictions.txt`)
+PYT_PREDICTION_PATH: file to store inference prediction results generated with native PyTorch (default: `/results/pyt_predictions.txt`)
+NUM_STEPS: Number of inference steps. If -1, runs inference on the entire dataset (default: -1)
+BATCH_SIZE: data batch size (default: 1)
+NUM_FRAMES: cuts/pads all pre-processed feature tensors to this length. 100 frames ~ 1 second of audio (default: 3600)
+```
+
+### TRT Inference Benchmark process
+
+The inference benchmark is performed on a single GPU by `trt/scripts/trt_inference_benchmark.sh`, which delegates to `trt/perf.py`. The script takes the following steps:
+
+
+1. Construct Jasper acoustic model in PyTorch.
+
+2. Construct TRT Engine of Jasper acoustic model
+
+   1. Perform ONNX export on the PyTorch model, if its ONNX file does not already exist.
+
+   2. Construct a TRT engine from the ONNX export, if a saved engine file does not already exist or `FORCE_ENGINE_REBUILD` is `true`.
+
+3. For each batch in the dataset, run inference through both the PyTorch model and TRT Engine, taking runtime measurements of each component of the inference process.
+
+4. Compile performance and WER accuracy results in CSV format, written to `CSV_PATH` file.
+
+`trt/perf.py` utilizes `trt/trtutils.py` and `trt/perfutils.py`, helper functions for TRT and non-TRT components of Jasper inference respectively.
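Because the engine is built with a fixed sequence length, every preprocessed feature batch is padded or cut to that length before inference (the repository implements this as `perfutils.adjust_shape`). The function below is a minimal illustrative simplification, not the repository's exact implementation:

```python
import torch

def pad_or_cut(feats, target_len):
    """Zero-pad or truncate the time axis of a (batch, features, time) tensor."""
    t = feats.shape[-1]
    if t >= target_len:
        return feats[..., :target_len]          # cut down to the engine length
    pad = feats.new_zeros(*feats.shape[:-1], target_len - t)
    return torch.cat([feats, pad], dim=-1)      # pad up to the engine length

short_feats = torch.randn(4, 64, 300)
long_feats = torch.randn(4, 64, 700)
print(pad_or_cut(short_feats, 512).shape)  # torch.Size([4, 64, 512])
print(pad_or_cut(long_feats, 512).shape)   # torch.Size([4, 64, 512])
```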
+
+### TRT Inference process
+
+The inference is performed by `trt/scripts/trt_inference.sh`, which delegates to `trt/scripts/trt_inference_benchmark.sh`. The script runs on a single GPU. To run prediction on the entire dataset, `NUM_FRAMES` is set to 3600, which corresponds to roughly 36 seconds of audio and covers the longest sentences in both the LibriSpeech dev and test datasets. By default, `BATCH_SIZE` is set to 1 to simulate the online inference scenario in deployment; other batch sizes can be tried by setting this parameter to a different value. By default, `TRT_PRECISION` is set to full precision and can be changed by setting `export TRT_PRECISION=fp16`. The prediction results are stored at `/results/trt_predictions.txt` and `/results/pyt_predictions.txt`.
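The frames-to-seconds arithmetic above can be checked directly. The helper below is hypothetical and assumes the ~10 ms frame shift used here, i.e. 100 frames per second of audio:

```python
FRAMES_PER_SECOND = 100  # 100 frames ~ 1 second of audio

def frames_for_duration(seconds, frames_per_second=FRAMES_PER_SECOND):
    """Number of feature frames needed to cover `seconds` of audio."""
    return int(seconds * frames_per_second)

# 36 seconds covers the longest dev/test sentences -> NUM_FRAMES default
print(frames_for_duration(36))  # 3600
```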
+
+
+
+## Performance
+
+To benchmark the inference performance on a specific batch size and audio length refer to [Quick-Start-Guide](#quick-start-guide). To do a sweep over multiple batch sizes and audio durations run:
+```
+bash trt/scripts/walk_benchmark.sh
+```
+The results are obtained by running inference on LibriSpeech dev-clean dataset on a single T4 GPU using half precision with AMP. We compare the throughput of the acoustic model between TensorRT and native PyTorch.   
+
+### Results
+
+
+
+#### Inference performance: NVIDIA T4
+
+| Sequence Length (in seconds) | Batch size | TRT FP16 Throughput (#sequences/second) Percentiles |     	|     	|     	| PyTorch FP16 Throughput (#sequences/second) Percentiles |     	|     	|     	| TRT/PyT Speedup |
+|---------------|------------|---------------------|---------|---------|---------|-----------------|---------|---------|---------|-----------------|
+|           	|        	| 90%             	| 95% 	| 99% 	| Avg 	| 90%         	| 95% 	| 99% 	| Avg 	|             	|
+|2|1|71.002|70.897|70.535|71.987|42.974|42.932|42.861|43.166|1.668|
+||2|136.369|135.915|135.232|139.266|81.398|77.826|57.408|81.254|1.714|
+||4|231.528|228.875|220.085|239.686|130.055|117.779|104.529|135.660|1.767|
+||8|310.224|308.870|289.132|316.536|215.401|202.902|148.240|228.805|1.383|
+||16|389.086|366.839|358.419|401.267|288.353|278.708|230.790|307.070|1.307|
+|7|1|61.792|61.273|59.842|63.537|34.098|33.963|33.785|34.639|1.834|
+||2|93.869|92.480|91.528|97.082|59.397|59.221|51.050|60.934|1.593|
+||4|113.108|112.950|112.531|114.507|66.947|66.479|59.926|67.704|1.691|
+||8|118.878|118.542|117.619|120.367|83.208|82.998|82.698|84.187|1.430|
+||16|122.909|122.718|121.547|124.190|102.212|102.000|101.187|103.049|1.205|
+|16.7|1|38.665|38.404|37.946|39.363|21.267|21.197|21.127|21.456|1.835|
+||2|44.960|44.867|44.382|45.583|30.218|30.156|29.970|30.679|1.486|
+||4|47.754|47.667|47.541|48.287|29.146|29.079|28.941|29.470|1.639|
+||8|51.051|50.969|50.620|51.489|37.565|37.497|37.373|37.834|1.361|
+||16|53.316|53.288|53.188|53.773|45.217|45.090|44.946|45.560|1.180|

+ 140 - 0
PyTorch/SpeechRecognition/Jasper/trt/perf.py

@@ -0,0 +1,140 @@
+#!/usr/bin/env python3
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#           http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+'''Constructs TensorRT engine for JASPER and evaluates inference latency'''
+import argparse
+import sys, os
+# Get local modules in parent directory and current directory (assuming this was called from root of repository)
+sys.path.append("./")
+sys.path.append("./trt")
+import perfutils
+import trtutils
+import perfprocedures
+from model import GreedyCTCDecoder
+from helpers import __ctc_decoder_predictions_tensor
+
+def main(args):
+
+    # Get shared utility across PyTorch and TRT
+    pyt_components, saved_onnx = perfutils.get_pytorch_components_and_onnx(args)
+
+    # Get a TRT engine. See function for argument parsing logic
+    engine = get_engine(args)
+
+    if args.wav:
+        audio_processor = pyt_components['audio_preprocessor']
+        audio_processor.eval()
+        greedy_decoder = GreedyCTCDecoder()
+        input_wav, seq_len = pyt_components['input_wav']
+        features = audio_processor((input_wav, seq_len))
+        features = perfutils.adjust_shape(features, args.seq_len)
+        with engine.create_execution_context() as context:
+            t_log_probs_e, copyto, inference, copyfrom = perfprocedures.do_inference(context, features[0], 1)
+        log_probs = perfutils.torchify_trt_out(t_log_probs_e, 1)
+
+        t_predictions_e = greedy_decoder(log_probs=log_probs)
+        hypotheses = __ctc_decoder_predictions_tensor(t_predictions_e, labels=perfutils.get_vocab())
+        print("INFERENCE TIME: {} ms".format(inference * 1000.0))
+        print("TRANSCRIPT: ", hypotheses[0])
+
+        return
+
+    
+    wer, preds, times = perfprocedures.compare_times_trt_pyt_exhaustive(engine,
+                                                                        pyt_components,
+                                                                        num_steps=args.num_steps)
+    string_header, string_data = perfutils.do_csv_export(wer, times, args.batch_size, args.seq_len)
+    if args.csv_path is not None:
+        with open(args.csv_path, 'a+') as f:
+            # See if header is there, if so, check that it matches
+            f.seek(0) # Read from start of file
+            existing_header = f.readline()
+            if existing_header == "":
+                f.write(string_header)
+                f.write("\n")
+            elif existing_header[:-1] != string_header:
+                raise Exception(f"Writing to existing CSV with incorrect format\nProduced:\n{string_header}\nFound:\n{existing_header}\nIf you intended to write to a new results csv, please change the csv_path argument")
+            f.seek(0,2) # Write to end of file
+            f.write(string_data)
+            f.write("\n")
+    else:
+        print(string_header)
+        print(string_data)
+
+    if args.trt_prediction_path is not None:
+        with open(args.trt_prediction_path, 'w') as fp:
+            fp.write('\n'.join(preds['trt']))
+     
+    if args.pyt_prediction_path is not None:
+        with open(args.pyt_prediction_path, 'w') as fp:
+            fp.write('\n'.join(preds['pyt']))   
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="Performance test of TRT")
+    parser.add_argument("--engine_path", default=None, type=str, help="Path to serialized TRT engine")
+    parser.add_argument("--use_existing_engine", action="store_true", default=False, help="If set, will deserialize engine at --engine_path" )
+    parser.add_argument("--engine_batch_size", default=16, type=int, help="Maximum batch size for constructed engine; needed when building")
+    parser.add_argument("--batch_size", default=16, type=int, help="Batch size for data when running inference.")
+    parser.add_argument("--dataset_dir", type=str, help="Root directory of dataset")
+    parser.add_argument("--model_toml", type=str, required=True, help="Config toml to use. A selection can be found in configs/")
+    parser.add_argument("--val_manifest", type=str, help="JSON manifest of dataset.")
+    parser.add_argument("--onnx_path", default=None, type=str, help="Path to onnx model for engine creation")
+    parser.add_argument("--seq_len", default=None, type=int, help="Generate an ONNX export with this fixed sequence length, and save to --onnx_path. Requires also using --onnx_path and --ckpt_path.")
+    parser.add_argument("--ckpt_path", default=None, type=str, help="If provided, will also construct pytorch acoustic model")
+    parser.add_argument("--max_duration", default=None, type=float, help="Maximum possible length of audio data in seconds")
+    parser.add_argument("--num_steps", default=-1, type=int, help="Number of inference steps to run")
+    parser.add_argument("--trt_fp16", action="store_true", default=False, help="If set, will allow TRT engine builder to select fp16 kernels as well as fp32")
+    parser.add_argument("--pyt_fp16", action="store_true", default=False, help="If set, will construct pytorch model with fp16 weights")
+    parser.add_argument("--make_onnx", action="store_true", default=False, help="If set, will create an ONNX model and store it at the path specified by --onnx_path")
+    parser.add_argument("--csv_path", type=str, default=None, help="File to append csv info about inference time")
+    parser.add_argument("--trt_prediction_path", type=str, default=None, help="File to write predictions inferred with trt")
+    parser.add_argument("--pyt_prediction_path", type=str, default=None, help="File to write predictions inferred with pytorch")
+    parser.add_argument("--verbose", action="store_true", default=False, help="If set, will verbosely describe TRT engine building and deserialization as well as TRT inference")
+    parser.add_argument("--wav", type=str, help='absolute path to .wav file (16KHz)')
+    parser.add_argument("--max_workspace_size", default=4*1024*1024*1024, type=int, help="Maximum workspace size (in bytes) for the TRT engine builder; needed when building")
+
+    return parser.parse_args()
+
+def get_engine(args):
+    '''Get a TRT engine
+
+    If --use_existing_engine is set, deserialize and use the engine stored at --engine_path.
+    Otherwise, if both --engine_path and --onnx_path are given, build a new engine from the
+    ONNX file and serialize it to --engine_path.
+    '''
+    engine = None
+
+    if args.engine_path is not None and args.use_existing_engine:
+        engine = trtutils.deserialize_engine(args.engine_path, args.verbose)
+    elif args.engine_path is not None and args.onnx_path is not None:
+        # Build a new engine and serialize it.
+        engine = trtutils.build_engine_from_parser(args.onnx_path, args.engine_batch_size, args.trt_fp16, args.verbose, args.max_workspace_size)
+        with open(args.engine_path, 'wb') as f:
+            f.write(engine.serialize())
+    else:
+        raise Exception("One of the following sets of arguments must be provided:\n"+
+                        "<engine_path> + --use_existing_engine\n"+
+                        "<engine_path> + <onnx_path>\n"+
+                        "in order to construct a TRT engine")
+    if engine is None:
+        raise Exception("Failed to acquire TRT engine")
+
+    return engine
+
+if __name__ == "__main__":
+    args = parse_args()
+
+    main(args)

+ 337 - 0
PyTorch/SpeechRecognition/Jasper/trt/perfprocedures.py

@@ -0,0 +1,337 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#           http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+'''A collection of accuracy and latency evaluation procedures for JASPER on PyTorch and TRT.
+'''
+
+
+import pycuda.driver as cuda
+import pycuda.autoinit
+import perfutils
+import trtutils
+import time
+import torch
+from tqdm import tqdm
+
+def time_pyt(engine, pyt_components):
+    '''Times execution of PyTorch inference
+    '''
+    baked_seq_len = engine.get_binding_shape(0)[1]
+    preprocess_times = []
+    pyt_infers = []
+    pyt_components['audio_preprocessor'].eval()
+    pyt_components['acoustic_model'].eval()
+    with torch.no_grad():
+        for data in tqdm(pyt_components['data_layer'].data_iterator):
+            tensors = []
+            for d in data:
+                tensors.append(d.to(torch.device("cuda")))
+            input_tensor = (tensors[0], tensors[1])
+            t0 = time.perf_counter()
+            am_input = pyt_components['audio_preprocessor'](x=input_tensor)
+            # Pad or cut to the necessary engine length
+            am_input = perfutils.adjust_shape(am_input, baked_seq_len)
+            batch_size = am_input[0].shape[0]
+            torch.cuda.synchronize()
+            t1 = time.perf_counter()
+            # Run PyT inference
+            pyt_out = pyt_components['acoustic_model'](x=am_input)
+            torch.cuda.synchronize()
+            t2 = time.perf_counter()
+            perfutils.global_process_batch(log_probs=pyt_out,
+                                           original_tensors=tensors,
+                                           batch_size=batch_size,
+                                           is_trt=False)
+            preprocess_times.append(t1-t0)
+            pyt_infers.append(t2-t1)
+
+    pyt_wer = perfutils.global_process_epoch(is_trt=False)
+    trt_wer = None
+    trt_preds = perfutils._global_trt_dict['predictions']
+    pyt_preds = perfutils._global_pyt_dict['predictions']
+    times = {
+        'preprocess': preprocess_times,
+        'pyt_infers': pyt_infers
+    }
+    wer = {
+        'trt': trt_wer,
+        'pyt': pyt_wer
+    }
+    preds = {
+        'trt': trt_preds,
+        'pyt': pyt_preds
+    }
+    return wer, preds, times
+
+def time_trt(engine, pyt_components):
+    '''Times execution of TRT inference
+    '''
+    baked_seq_len = engine.get_binding_shape(0)[1]
+    assemble_times = []
+    trt_copytos = []
+    trt_copyfroms = []
+    trt_infers = []
+    decodingandeval = []
+    with engine.create_execution_context() as context, torch.no_grad():
+        for data in tqdm(pyt_components['data_layer'].data_iterator):
+            tensors = []
+            for d in data:
+                tensors.append(d.to(torch.device("cuda")))
+            input_tensor = (tensors[0], tensors[1])
+            t0 = time.perf_counter()
+            am_input = pyt_components['audio_preprocessor'](x=input_tensor)
+            # Pad or cut to the necessary engine length
+            am_input = perfutils.adjust_shape(am_input, baked_seq_len)
+            batch_size = am_input[0].shape[0]
+            torch.cuda.synchronize()
+            t1 = time.perf_counter()
+            # Run TRT inference
+            trt_out, time_to, time_infer, time_from = do_inference(
+                                                                  context=context,
+                                                                  inp=am_input,
+                                                                  batch_size=batch_size)
+            t3 = time.perf_counter()
+            trt_out = perfutils.torchify_trt_out(trt_out, batch_size)
+            perfutils.global_process_batch(log_probs=trt_out,
+                                           original_tensors=tensors,
+                                           batch_size=batch_size,
+                                           is_trt=True)
+            torch.cuda.synchronize()
+            t4 = time.perf_counter()
+
+
+            assemble_times.append(t1-t0)
+            trt_copytos.append(time_to)
+            trt_copyfroms.append(time_from)
+            trt_infers.append(time_infer)
+            decodingandeval.append(t4-t3)
+
+
+    trt_wer = perfutils.global_process_epoch(is_trt=True)
+    pyt_wer = perfutils.global_process_epoch(is_trt=False)
+    trt_preds = perfutils._global_trt_dict['predictions']
+    pyt_preds = perfutils._global_pyt_dict['predictions']
+    times = {
+        'assemble': assemble_times,
+        'trt_copyto': trt_copytos,
+        'trt_copyfrom': trt_copyfroms,
+        'trt_infers': trt_infers,
+        'decodingandeval': decodingandeval
+    }
+    wer = {
+        'trt': trt_wer,
+        'pyt': pyt_wer
+    }
+    preds = {
+        'trt': trt_preds,
+        'pyt': pyt_preds
+    }
+    return wer, preds, times
+
+def run_trt(engine, pyt_components):
+    '''Runs TRT inference for accuracy evaluation
+    '''
+    baked_seq_len = engine.get_binding_shape(0)[1]
+    wers = []
+    preds = []
+    with engine.create_execution_context() as context, torch.no_grad():
+        for data in tqdm(pyt_components['data_layer'].data_iterator):
+            tensors = []
+            for d in data:
+                tensors.append(d.to(torch.device("cuda")))
+            input_tensor = (tensors[0], tensors[1])
+            am_input = pyt_components['audio_preprocessor'](x=input_tensor)
+            # Pad or cut to the necessary engine length
+            am_input = perfutils.adjust_shape(am_input, baked_seq_len)
+            batch_size = am_input[0].shape[0]
+            torch.cuda.synchronize()
+            # Run TRT inference
+            trt_out, _,_,_= do_inference(context=context, inp=am_input, batch_size=batch_size)
+            trt_out = perfutils.torchify_trt_out(trt_out, batch_size=batch_size)
+            wer, pred = perfutils.get_results(log_probs=trt_out,
+                                              original_tensors=tensors,
+                                              batch_size=batch_size)
+            wers.append(wer)
+            preds.append(pred)
+
+
+    return wers, preds
+
+def compare_times_trt_pyt_exhaustive(engine, pyt_components, num_steps):
+    '''Compares execution times and WER between TRT and PyTorch'''
+
+    # The engine has a fixed-size sequence length, which needs to be known for slicing/padding input
+    baked_seq_len = engine.get_binding_shape(0)[1]
+    preprocess_times = []
+    inputadjust_times = []
+    outputadjust_times = []
+    process_batch_times = []
+    trt_solo_times = []
+    trt_async_times = []
+    tohost_sync_times =[]
+    pyt_infer_times = []
+    step_counter = 0
+
+    with engine.create_execution_context() as context, torch.no_grad():
+        for data in tqdm(pyt_components['data_layer'].data_iterator):
+            if num_steps >= 1:
+                if step_counter > num_steps:
+                    break
+                step_counter +=1
+            tensors = []
+            for d in data:
+                tensors.append(d.to(torch.device("cuda")))
+
+            input_tensor = (tensors[0], tensors[1])
+            preprocess_start = time.perf_counter()
+            am_input = pyt_components['audio_preprocessor'](x=input_tensor)
+            torch.cuda.synchronize()
+            preprocess_end = time.perf_counter()
+
+            # Pad or cut to the necessary engine length
+            inputadjust_start = time.perf_counter()
+            am_input = perfutils.adjust_shape(am_input, baked_seq_len)
+            torch.cuda.synchronize()
+            inputadjust_end = time.perf_counter()
+
+            batch_size = am_input[0].shape[0]
+
+            # Run TRT inference 1: Async copying and inference
+            trt_out, time_taken = do_inference_overlap(context=context,
+                                                       inp=am_input,
+                                                       batch_size=batch_size)
+            torch.cuda.synchronize()
+            outputadjust_start = time.perf_counter()
+            trt_out = perfutils.torchify_trt_out(trt_out, batch_size)
+            torch.cuda.synchronize()
+            outputadjust_end = time.perf_counter()
+
+            process_batch_start = time.perf_counter()
+            perfutils.global_process_batch(log_probs=trt_out,
+                                           original_tensors=tensors,
+                                           batch_size=batch_size,
+                                           is_trt=True)
+            torch.cuda.synchronize()
+            process_batch_end = time.perf_counter()
+            # Synchronize first so the PyTorch timing below is not skewed by pending asynchronous work
+            pyt_infer_start = time.perf_counter()
+            pyt_out = pyt_components['acoustic_model'](x=am_input[0])
+            torch.cuda.synchronize()
+            pyt_infer_end = time.perf_counter()
+            perfutils.global_process_batch(log_probs=pyt_out,
+                                           original_tensors=tensors,
+                                           batch_size=batch_size,
+                                           is_trt=False)
+            # Run TRT inference 2: Synchronous copying and inference
+            _, time_to, time_infer, time_from = do_inference(
+                                                             context=context,
+                                                             inp=am_input,
+                                                             batch_size=batch_size)
+            preprocess_times.append(preprocess_end - preprocess_start)
+            inputadjust_times.append(inputadjust_end - inputadjust_start)
+            outputadjust_times.append(outputadjust_end - outputadjust_start)
+            process_batch_times.append(process_batch_end - process_batch_start)
+            trt_solo_times.append(time_infer)
+            trt_async_times.append(time_taken)
+            tohost_sync_times.append(time_from)
+            pyt_infer_times.append(pyt_infer_end - pyt_infer_start)
+
+    trt_wer = perfutils.global_process_epoch(is_trt=True)
+    pyt_wer = perfutils.global_process_epoch(is_trt=False)
+    trt_preds = perfutils._global_trt_dict['predictions']
+    pyt_preds = perfutils._global_pyt_dict['predictions']
+    times = {
+        'preprocess': preprocess_times, # Time to go through preprocessing
+        'pyt_infer': pyt_infer_times, # Time for batch completion through pytorch
+        'input_adjust': inputadjust_times, # Time to pad/cut for TRT engine size requirements
+        'output_adjust' : outputadjust_times, # Time to reshape output of TRT and copy from host to device
+        'post_process': process_batch_times, # Time to run greedy decoding and do CTC conversion
+        'trt_solo_infer': trt_solo_times, # Time to execute just TRT acoustic model
+        'to_host': tohost_sync_times, # Time to execute device to host copy synchronously
+        'trt_async_infer': trt_async_times, # Time to execute combined async TRT acoustic model + device to host copy
+
+    }
+    wer = {
+        'trt': trt_wer,
+        'pyt': pyt_wer
+    }
+    preds = {
+        'trt': trt_preds,
+        'pyt': pyt_preds
+    }
+    return wer, preds, times
+
+def do_inference(context, inp, batch_size):
+    '''Do inference using a TRT engine and time it
+    Execution and device-to-host copy are completed synchronously
+    '''
+
+    # A typical Python TRT sample would copy input data from host to device here.
+    # Because the PyTorch tensor already resides on the device, that copy is unnecessary.
+
+    # Create input array of device pointers
+    inputs = [inp[0].data_ptr()]
+    t0 = time.perf_counter()
+    # Create output buffers and stream
+    outputs, bindings, stream = trtutils.allocate_buffers_with_existing_inputs(context.engine,
+                                                                               inputs,
+                                                                               batch_size)
+    t1 = time.perf_counter()
+    # Run inference, then copy outputs back to host, synchronizing and timing each stage
+    context.execute_async(batch_size=batch_size,
+                          bindings=bindings,
+                          stream_handle=stream.handle)
+    stream.synchronize()
+    t2 = time.perf_counter()
+    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
+    stream.synchronize()
+    t3 = time.perf_counter()
+
+    copyto = t1 - t0
+    inference = t2 - t1
+    copyfrom = t3 - t2
+    out = outputs[0].host
+    return out, copyto, inference, copyfrom
+
+def do_inference_overlap(context, inp, batch_size):
+    '''Do inference using a TRT engine and time it
+    Execution and device-to-host copy are completed asynchronously
+    '''
+    # A typical Python TRT sample would copy input data from host to device here.
+    # Because the PyTorch tensor already resides on the device, that copy is unnecessary.
+
+    # Create input array of device pointers
+    inputs = [inp[0].data_ptr()]
+    t0 = time.perf_counter()
+    # Create output buffers and stream
+    outputs, bindings, stream = trtutils.allocate_buffers_with_existing_inputs(context.engine,
+                                                                               inputs,
+                                                                               batch_size)
+    t1 = time.perf_counter()
+    # Run inference and transfer outputs to host asynchronously
+    context.execute_async(batch_size=batch_size,
+                          bindings=bindings,
+                          stream_handle=stream.handle)
+    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
+    stream.synchronize()
+    t2 = time.perf_counter()
+
+    inference = t2 - t1
+    out = outputs[0].host
+    return out, inference

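The engine's fixed sequence length means every batch must be padded or truncated before inference, which is what `perfutils.adjust_shape` does for the feature tensor. A minimal pure-Python sketch of the same pad-or-cut rule, using plain lists instead of tensors (the helper name `pad_or_cut` is illustrative, not from the source):

```python
def pad_or_cut(seq, baked_length, pad_value=0):
    """Return seq trimmed or right-padded to exactly baked_length elements."""
    if len(seq) > baked_length:
        # Cut extra frames off; they are simply dropped
        return seq[:baked_length]
    # Zero-pad on the right to satisfy the fixed engine length
    return seq + [pad_value] * (baked_length - len(seq))

print(pad_or_cut([1, 2, 3, 4, 5], 3))  # [1, 2, 3]
print(pad_or_cut([1, 2], 4))           # [1, 2, 0, 0]
```

Truncation silently discards trailing frames, so accuracy comparisons are only meaningful when `NUM_FRAMES` covers the longest utterance in the evaluation set.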
+ 252 - 0
PyTorch/SpeechRecognition/Jasper/trt/perfutils.py

@@ -0,0 +1,252 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#           http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+'''Contains helper functions for non-TRT components of JASPER inference
+'''
+
+from model import GreedyCTCDecoder, AudioPreprocessing, Jasper
+from dataset import AudioToTextDataLayer
+from helpers import Optimization, AmpOptimizations, process_evaluation_batch, process_evaluation_epoch, add_ctc_labels, norm
+from apex import amp
+import torch
+import torch.nn as nn
+import toml
+from parts.features import audio_from_file
+
+_global_ctc_labels = None
+def get_vocab():
+    ''' Gets the CTC vocab
+
+    Requires calling get_pytorch_components_and_onnx() to setup global labels.
+    '''
+    if _global_ctc_labels is None:
+        raise Exception("Feature labels have not been found. Execute `get_pytorch_components_and_onnx()` first")
+
+    return _global_ctc_labels
+
+def get_results(log_probs, original_tensors, batch_size):
+    ''' Returns WER and predictions for the outputs of the acoustic model
+
+    Used for one-off batches. Epoch-wide evaluation should use
+    global_process_batch and global_process_epoch
+    '''
+    # Used to get WER and predictions for one-off batches
+    greedy_decoder = GreedyCTCDecoder()
+    predicts = norm(greedy_decoder(log_probs=log_probs))
+    values_dict = dict(
+        predictions=[predicts],
+        transcript=[original_tensors[2][0:batch_size,...]],
+        transcript_length=[original_tensors[3][0:batch_size,...]],
+    )
+    temp_dict = {
+        'predictions': [],
+        'transcripts': [],
+    }
+    process_evaluation_batch(values_dict, temp_dict, labels=get_vocab())
+    predictions = temp_dict['predictions']
+    wer, _ = process_evaluation_epoch(temp_dict)
+    return wer, predictions
+
+
+_global_trt_dict = {
+        'predictions': [],
+        'transcripts': [],
+}
+_global_pyt_dict = {
+        'predictions': [],
+        'transcripts': [],
+}
+
+def global_process_batch(log_probs, original_tensors, batch_size, is_trt=True):
+    '''Accumulates prediction evaluations for batches across an epoch
+
+    is_trt determines which global dictionary will be used.
+    To get WER at any point, use global_process_epoch.
+    For one-off WER evaluations, use get_results()
+    '''
+    # State-based approach for full WER comparison across a dataset.
+    greedy_decoder = GreedyCTCDecoder()
+    predicts = norm(greedy_decoder(log_probs=log_probs))
+    values_dict = dict(
+        predictions=[predicts],
+        transcript=[original_tensors[2][0:batch_size,...]],
+        transcript_length=[original_tensors[3][0:batch_size,...]],
+    )
+    dict_to_process = _global_trt_dict if is_trt else _global_pyt_dict
+    process_evaluation_batch(values_dict, dict_to_process, labels=get_vocab())
+
+
+def global_process_epoch(is_trt=True):
+    '''Returns WER in accumulated global dictionary
+    '''
+    dict_to_process = _global_trt_dict if is_trt else _global_pyt_dict
+    wer, _ = process_evaluation_epoch(dict_to_process)
+    return wer
+
+
+def get_onnx(path, acoustic_model, signal_shape, dtype=torch.float):
+    ''' Get an ONNX model with float weights
+
+    Requires an --onnx_save_path and --ckpt_path (so that an acoustic model can be constructed).
+    Fixed-length --seq_len must be provided as well.
+    '''
+    with torch.no_grad():
+        phony_signal = torch.zeros(signal_shape, dtype=dtype, device=torch.device("cuda"))
+        torch.onnx.export(acoustic_model, (phony_signal,), path, input_names=["FEATURES"], output_names=["LOGITS"])
+        # Also write a human-readable graph representation to a sidecar file
+        fn = path + ".readable"
+        with open(fn, 'w') as f:
+            import onnx
+            temp_model = onnx.load(path)
+            pgraph = onnx.helper.printable_graph(temp_model.graph)
+            f.write(pgraph)
+
+    return path
+
+
+def get_pytorch_components_and_onnx(args):
+    '''Returns PyTorch components used for inference
+    '''
+    model_definition = toml.load(args.model_toml)
+    dataset_vocab = model_definition['labels']['labels']
+    # Set up global labels for future vocab calls
+    global _global_ctc_labels
+    _global_ctc_labels = add_ctc_labels(dataset_vocab)
+    featurizer_config = model_definition['input_eval']
+
+    optim_level = Optimization.mxprO3 if args.pyt_fp16 else Optimization.mxprO0
+
+    featurizer_config["optimization_level"] = optim_level
+    acoustic_model = None
+    audio_preprocessor = None
+    onnx_path = None
+    data_layer = None
+    wav = None
+    seq_len = None
+    dtype = torch.float
+
+    if args.max_duration is not None:
+        featurizer_config['max_duration'] = args.max_duration
+    if args.dataset_dir is not None:
+        data_layer = AudioToTextDataLayer(dataset_dir=args.dataset_dir,
+                                          featurizer_config=featurizer_config,
+                                          manifest_filepath=args.val_manifest,
+                                          labels=dataset_vocab,
+                                          batch_size=args.batch_size,
+                                          shuffle=False)
+    if args.wav is not None:
+        args.batch_size = 1
+        args.engine_batch_size = 1
+        wav, seq_len = audio_from_file(args.wav)
+        if args.seq_len is None or args.seq_len == 0:
+            # Convert samples to feature frames (sample_rate / 100 samples per 10 ms frame)
+            args.seq_len = int(seq_len / (featurizer_config['sample_rate'] / 100))
+
+    model = Jasper(feature_config=featurizer_config,
+                   jasper_model_definition=model_definition,
+                   feat_in=1024,
+                   num_classes=len(get_vocab()))
+
+    model.cuda()
+    model.eval()
+    acoustic_model = model.acoustic_model
+    audio_preprocessor = model.audio_preprocessor
+
+    if args.ckpt_path is not None:
+        checkpoint = torch.load(args.ckpt_path, map_location="cpu")
+        model.load_state_dict(checkpoint['state_dict'], strict=False)
+        
+    if args.make_onnx:
+        if args.onnx_path is None or acoustic_model is None:
+            raise Exception("--ckpt_path, --onnx_path must be provided when using --make_onnx")
+        onnx_path = get_onnx(args.onnx_path, acoustic_model,
+                             signal_shape=(args.engine_batch_size, 64, args.seq_len), dtype=torch.float)
+
+    if args.pyt_fp16:
+        amp.initialize(models=acoustic_model, opt_level=AmpOptimizations[optim_level])
+        
+    return {'data_layer': data_layer,
+            'audio_preprocessor': audio_preprocessor,
+            'acoustic_model': acoustic_model,
+            'input_wav' : (wav, seq_len) }, onnx_path
+
+def adjust_shape(am_input, baked_length):
+    '''Pads or cuts the acoustic model input tensor to a fixed length
+    '''
+    in_seq_len = am_input[0].shape[2]
+    new_seq = am_input[0]
+    if in_seq_len > baked_length:
+        # Cut extra frames off; no inference is done on them
+        new_seq = am_input[0][..., 0:baked_length].contiguous()
+    elif in_seq_len < baked_length:
+        # Zero-pad to satisfy the engine's fixed length
+        pad_length = baked_length - in_seq_len
+        new_seq = nn.functional.pad(am_input[0], (0, pad_length), 'constant', 0)
+    return (new_seq,)
+
+def torchify_trt_out(trt_out, batch_size):
+    '''Reshapes flat data to format for greedy+CTC decoding
+
+    Used to convert numpy array on host to PyT Tensor
+    '''
+    desired_shape = (batch_size, -1, len(get_vocab()))
+
+    # Predictions must be reshaped.
+    return torch.Tensor(trt_out).reshape(desired_shape)
+
+def do_csv_export(wers, times, batch_size, num_frames):
+    '''Produces CSV header and data for input data
+
+    wers: dictionary of WER with keys={'trt', 'pyt'}
+    times: dictionary of execution times
+    '''
+    def take_durations_and_output_percentile(durations, ratios):
+        from heapq import nlargest
+        import math
+        import numpy as np
+        durations = np.asarray(durations) * 1000  # in ms
+        # The first few entries may not be representative due to warm-up effects.
+        # The last entry might not be representative if dataset_size % batch_size != 0.
+        latency = durations[5:-1]
+        mean_latency = np.mean(latency)
+        latency_worst = nlargest(math.ceil((1 - min(ratios)) * len(latency)), latency)
+        latency_ranges = get_percentile(ratios, latency_worst, len(latency))
+        latency_ranges["0.5"] = mean_latency
+        return latency_ranges
+    def get_percentile(ratios, arr, nsamples):
+        res = {}
+        for a in ratios:
+            idx = max(int(nsamples * (1 - a)), 0)
+            res[a] = arr[idx]
+        return res
+
+    ratios = [0.9, 0.95, 0.99, 1.]
+    header = []
+    data = []
+    header.append("BatchSize")
+    header.append("NumFrames")
+    data.append(f"{batch_size}")
+    data.append(f"{num_frames}")
+    for title, wer in wers.items():
+        header.append(title)
+        data.append(f"{wer}")
+    for title, durations in times.items():
+        ratio_latencies_dict = take_durations_and_output_percentile(durations, ratios)
+        for ratio, latency in ratio_latencies_dict.items():
+            header.append(f"{title}_{ratio}")
+            data.append(f"{latency}")
+    string_header = ", ".join(header)
+    string_data = ", ".join(data)
+    return string_header, string_data

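`do_csv_export` keeps only the worst `(1 - min(ratio))` fraction of latencies with `nlargest` and then indexes into that descending slice to read off each percentile. A simplified, self-contained sketch of the same indexing scheme (the function name is illustrative; this version additionally clamps the index, since an exact multiple of the sample count would otherwise step past the kept slice):

```python
import math
from heapq import nlargest

def latency_percentiles(latencies, ratios):
    """Return {ratio: latency} using a worst-slice percentile lookup."""
    n = len(latencies)
    # Keep only the worst slice needed for the smallest requested ratio
    worst = nlargest(math.ceil((1 - min(ratios)) * n), latencies)
    # For ratio r, take the value (1 - r) of the way into the full sample,
    # clamped so the index cannot run past the kept slice
    return {r: worst[min(max(int(n * (1 - r)), 0), len(worst) - 1)] for r in ratios}

res = latency_percentiles(list(range(1, 96)), [0.9, 1.0])
print(res)  # {0.9: 86, 1.0: 95}
```

Dropping the first five measurements before computing these statistics, as the exporter does, avoids counting CUDA context and allocator warm-up in the reported percentiles.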
+ 2 - 0
PyTorch/SpeechRecognition/Jasper/trt/requirements.txt

@@ -0,0 +1,2 @@
+pycuda
+pillow

+ 5 - 0
PyTorch/SpeechRecognition/Jasper/trt/scripts/docker/trt_build.sh

@@ -0,0 +1,5 @@
+#!/bin/bash
+
+# Constructs a docker image containing dependencies for execution of JASPER through TRT
+echo "docker build . -f ./trt/Dockerfile -t jasper:trt6"
+docker build . -f ./trt/Dockerfile -t jasper:trt6

+ 39 - 0
PyTorch/SpeechRecognition/Jasper/trt/scripts/docker/trt_launch.sh

@@ -0,0 +1,39 @@
+#!/bin/bash
+
+# Launch TRT JASPER container.
+
+DATA_DIR=$1
+CHECKPOINT_DIR=$2
+RESULT_DIR=$3
+PROGRAM_PATH=${PROGRAM_PATH}
+
+if [ $# -lt 3 ]; then
+    echo "Usage: PROGRAM_PATH=<script_path> ./trt_launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULT_DIR>"
+    echo "All directory paths must be absolute paths and exist"
+    exit 1
+fi
+
+for dir in $DATA_DIR $CHECKPOINT_DIR $RESULT_DIR; do
+    if [[ $dir != /* ]]; then
+        echo "All directory paths must be absolute paths!"
+        echo "${dir} is not an absolute path"
+        exit 1
+    fi
+
+    if [ ! -d $dir ]; then
+        echo "All directory paths must exist!"
+        echo "${dir} does not exist"
+        exit 1
+    fi
+done
+
+
+nvidia-docker run -it --rm \
+  --runtime=nvidia \
+  --shm-size=4g \
+  --ulimit memlock=-1 \
+  --ulimit stack=67108864 \
+  -v $DATA_DIR:/datasets \
+  -v $CHECKPOINT_DIR:/checkpoints/ \
+  -v $RESULT_DIR:/results/ \
+  jasper:trt6 bash $PROGRAM_PATH

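The launch script refuses to start unless every mount point is an absolute path to an existing directory, since Docker bind mounts require absolute host paths. The same validation expressed in Python (a sketch only; `validate_dirs` is not part of the repository):

```python
import os

def validate_dirs(paths):
    """Raise ValueError unless every path is absolute and an existing directory."""
    for p in paths:
        if not os.path.isabs(p):
            raise ValueError(f"{p} is not an absolute path")
        if not os.path.isdir(p):
            raise ValueError(f"{p} does not exist")

validate_dirs([os.getcwd()])  # the current directory always passes
```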
+ 30 - 0
PyTorch/SpeechRecognition/Jasper/trt/scripts/download_inference_librispeech.sh

@@ -0,0 +1,30 @@
+#!/usr/bin/env bash
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Downloads the inference subset of the LibriSpeech corpus.
+
+DATA_SET="LibriSpeech"
+DATA_ROOT_DIR="/datasets"
+DATA_DIR="${DATA_ROOT_DIR}/${DATA_SET}"
+if [ ! -d "$DATA_DIR" ]
+then
+    mkdir -p $DATA_DIR
+    chmod go+rx $DATA_DIR
+    python utils/download_librispeech.py utils/inference_librispeech.csv $DATA_DIR -e ${DATA_ROOT_DIR}/
+else
+    echo "Directory $DATA_DIR already exists."
+fi

+ 35 - 0
PyTorch/SpeechRecognition/Jasper/trt/scripts/preprocess_inference_librispeech.sh

@@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Constructs JSON manifests for the inference subset of the LibriSpeech corpus.
+
+python ./utils/convert_librispeech.py \
+    --input_dir /datasets/LibriSpeech/dev-clean \
+    --dest_dir /datasets/LibriSpeech/dev-clean-wav \
+    --output_json /datasets/LibriSpeech/librispeech-dev-clean-wav.json
+python ./utils/convert_librispeech.py \
+    --input_dir /datasets/LibriSpeech/dev-other \
+    --dest_dir /datasets/LibriSpeech/dev-other-wav \
+    --output_json /datasets/LibriSpeech/librispeech-dev-other-wav.json
+
+
+python ./utils/convert_librispeech.py \
+    --input_dir /datasets/LibriSpeech/test-clean \
+    --dest_dir /datasets/LibriSpeech/test-clean-wav \
+    --output_json /datasets/LibriSpeech/librispeech-test-clean-wav.json
+python ./utils/convert_librispeech.py \
+    --input_dir /datasets/LibriSpeech/test-other \
+    --dest_dir /datasets/LibriSpeech/test-other-wav \
+    --output_json /datasets/LibriSpeech/librispeech-test-other-wav.json

+ 56 - 0
PyTorch/SpeechRecognition/Jasper/trt/scripts/trt_inference.sh

@@ -0,0 +1,56 @@
+#!/bin/bash
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#           http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# Performs inference and measures latency and accuracy of TRT and PyTorch implementations of JASPER.
+
+echo "Container nvidia build = " $NVIDIA_BUILD_ID
+
+# Mandatory Arguments
+CHECKPOINT=$CHECKPOINT
+
+# Arguments with Defaults
+DATA_DIR=${DATA_DIR:-"/datasets/LibriSpeech"}
+DATASET=${DATASET:-"dev-clean"}
+RESULT_DIR=${RESULT_DIR:-"/results"}
+CREATE_LOGFILE=${CREATE_LOGFILE:-"true"}
+TRT_PRECISION=${TRT_PRECISION:-"fp32"}
+PYTORCH_PRECISION=${PYTORCH_PRECISION:-"fp32"}
+NUM_STEPS=${NUM_STEPS:-"-1"}
+BATCH_SIZE=${BATCH_SIZE:-1}
+NUM_FRAMES=${NUM_FRAMES:-3600}
+FORCE_ENGINE_REBUILD=${FORCE_ENGINE_REBUILD:-"true"}
+CSV_PATH=${CSV_PATH:-"/results/res.csv"}
+TRT_PREDICTION_PATH=${TRT_PREDICTION_PATH:-"/results/trt_predictions.txt"}
+PYT_PREDICTION_PATH=${PYT_PREDICTION_PATH:-"/results/pyt_predictions.txt"}
+VERBOSE=${VERBOSE:-"false"}
+
+
+
+export CHECKPOINT="$CHECKPOINT"
+export DATA_DIR="$DATA_DIR"
+export DATASET="$DATASET"
+export RESULT_DIR="$RESULT_DIR"
+export CREATE_LOGFILE="$CREATE_LOGFILE"
+export TRT_PRECISION="$TRT_PRECISION"
+export PYTORCH_PRECISION="$PYTORCH_PRECISION"
+export NUM_STEPS="$NUM_STEPS"
+export BATCH_SIZE="$BATCH_SIZE"
+export NUM_FRAMES="$NUM_FRAMES"
+export FORCE_ENGINE_REBUILD="$FORCE_ENGINE_REBUILD"
+export CSV_PATH="$CSV_PATH"
+export TRT_PREDICTION_PATH="$TRT_PREDICTION_PATH"
+export PYT_PREDICTION_PATH="$PYT_PREDICTION_PATH"
+export VERBOSE="$VERBOSE"
+
+bash ./trt/scripts/trt_inference_benchmark.sh $1 $2 $3 $4 $5 $6 $7

+ 162 - 0
PyTorch/SpeechRecognition/Jasper/trt/scripts/trt_inference_benchmark.sh

@@ -0,0 +1,162 @@
+#!/bin/bash
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#           http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# Measures latency and accuracy of TRT and PyTorch implementations of JASPER.
+
+echo "Container nvidia build = " $NVIDIA_BUILD_ID
+
+# Mandatory Arguments
+CHECKPOINT=$CHECKPOINT
+
+# Arguments with Defaults
+DATA_DIR=${DATA_DIR:-"/datasets/LibriSpeech"}
+DATASET=${DATASET:-"dev-clean"}
+RESULT_DIR=${RESULT_DIR:-"/results"}
+CREATE_LOGFILE=${CREATE_LOGFILE:-"true"}
+TRT_PRECISION=${TRT_PRECISION:-"fp32"}
+PYTORCH_PRECISION=${PYTORCH_PRECISION:-"fp32"}
+NUM_STEPS=${NUM_STEPS:-"100"}
+BATCH_SIZE=${BATCH_SIZE:-64}
+NUM_FRAMES=${NUM_FRAMES:-512}
+FORCE_ENGINE_REBUILD=${FORCE_ENGINE_REBUILD:-"true"}
+CSV_PATH=${CSV_PATH:-"/results/res.csv"}
+TRT_PREDICTION_PATH=${TRT_PREDICTION_PATH:-"none"}
+PYT_PREDICTION_PATH=${PYT_PREDICTION_PATH:-"none"}
+VERBOSE=${VERBOSE:-"false"}
+
+
+# Set up flag-based arguments
+TRT_PREC=""
+if [ "$TRT_PRECISION" = "fp16" ] ; then
+    TRT_PREC="--trt_fp16"
+elif [ "$TRT_PRECISION" = "fp32" ] ; then
+    TRT_PREC=""
+else
+    echo "Unknown <trt_precision> argument"
+    exit 2
+fi
+
+PYTORCH_PREC=""
+if [ "$PYTORCH_PRECISION" = "fp16" ] ; then
+    PYTORCH_PREC="--pyt_fp16"
+elif [ "$PYTORCH_PRECISION" = "fp32" ] ; then
+    PYTORCH_PREC=""
+else
+    echo "Unknown <pytorch_precision> argument"
+    exit 2
+fi
+
+SHOULD_VERBOSE=""
+if [ "$VERBOSE" = "true" ] ; then
+    SHOULD_VERBOSE="--verbose"
+fi
+
+
+STEPS=""
+if [ "$NUM_STEPS" -gt 0 ] ; then
+   STEPS=" --num_steps $NUM_STEPS"
+fi
+
+# Making engine and onnx directories in RESULT_DIR if they don't already exist
+ONNX_DIR=$RESULT_DIR/onnxs
+ENGINE_DIR=$RESULT_DIR/engines
+mkdir -p $ONNX_DIR
+mkdir -p $ENGINE_DIR
+
+
+PREFIX=BS${BATCH_SIZE}_NF${NUM_FRAMES}
+
+# Currently, TRT parser for ONNX can't parse half-precision weights, so ONNX
+# export will always be FP32. This is also enforced in perf.py
+ONNX_FILE=fp32_${PREFIX}.onnx
+ENGINE_FILE=${TRT_PRECISION}_${PREFIX}.engine
+
+
+
+# If an ONNX with the same precision and number of frames exists, don't recreate it because
+# TRT engine construction can be done on an onnx of any batch size
+# "%P" only prints filenames (rather than absolute/relative path names)
+EXISTING_ONNX=$(find $ONNX_DIR -name "fp32_BS*_NF${NUM_FRAMES}.onnx" -printf "%P\n" | head -n 1)
+SHOULD_MAKE_ONNX=""
+if [ -z "$EXISTING_ONNX" ] ; then
+    SHOULD_MAKE_ONNX="--make_onnx"
+else
+    ONNX_FILE=${EXISTING_ONNX}
+fi
+
+# Follow FORCE_ENGINE_REBUILD about reusing existing engines.
+# If false, the existing engine must match precision, batch size, and number of frames
+SHOULD_MAKE_ENGINE=""
+if [ "$FORCE_ENGINE_REBUILD" != "true" ] ; then
+    EXISTING_ENGINE=$(find $ENGINE_DIR -name "${ENGINE_FILE}")
+    if [ -n "$EXISTING_ENGINE" ] ; then
+        SHOULD_MAKE_ENGINE="--use_existing_engine"
+    fi
+fi
+
+
+
+if [ "$TRT_PREDICTION_PATH" = "none" ] ; then
+   TRT_PREDICTION_PATH=""
+else
+   TRT_PREDICTION_PATH=" --trt_prediction_path=${TRT_PREDICTION_PATH}"
+fi
+
+
+if [ "$PYT_PREDICTION_PATH" = "none" ] ; then
+   PYT_PREDICTION_PATH=""
+else
+   PYT_PREDICTION_PATH=" --pyt_prediction_path=${PYT_PREDICTION_PATH}"
+fi
+
+CMD="python trt/perf.py"
+CMD+=" --batch_size $BATCH_SIZE"
+CMD+=" --engine_batch_size $BATCH_SIZE"
+CMD+=" --model_toml configs/jasper10x5dr_nomask.toml"
+CMD+=" --dataset_dir $DATA_DIR"
+CMD+=" --val_manifest $DATA_DIR/librispeech-${DATASET}-wav.json "
+CMD+=" --ckpt $CHECKPOINT"
+CMD+=" $SHOULD_VERBOSE"
+CMD+=" $TRT_PREC"
+CMD+=" $PYTORCH_PREC"
+CMD+=" $STEPS"
+CMD+=" --engine_path ${RESULT_DIR}/engines/${ENGINE_FILE}"
+CMD+=" --onnx_path ${RESULT_DIR}/onnxs/${ONNX_FILE}"
+CMD+=" --seq_len $NUM_FRAMES"
+CMD+=" $SHOULD_MAKE_ONNX"
+CMD+=" $SHOULD_MAKE_ENGINE"
+CMD+=" --csv_path $CSV_PATH"
+CMD+=" $1 $2 $3 $4 $5 $6 $7 $8 $9"
+CMD+=" $TRT_PREDICTION_PATH"
+CMD+=" $PYT_PREDICTION_PATH"
+
+
+if [ "$CREATE_LOGFILE" == "true" ] ; then
+  export GBS=$(expr $BATCH_SIZE )
+  printf -v TAG "jasper_trt_inference_benchmark_%s_gbs%d" "$PYTORCH_PRECISION" $GBS
+  DATESTAMP=`date +'%y%m%d%H%M%S'`
+  LOGFILE=$RESULT_DIR/$TAG.$DATESTAMP.log
+  printf "Logs written to %s\n" "$LOGFILE"
+fi
+
+set -x
+if [ -z "$LOGFILE" ] ; then
+   $CMD
+else
+   (
+     $CMD
+   ) |& tee $LOGFILE
+   grep 'latency' $LOGFILE
+fi
+set +x

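The benchmark script encodes precision, batch size, and frame count into the ONNX and engine filenames so earlier artifacts can be reused: an ONNX file of any batch size can feed engine construction, so only the frame count has to match, while a reusable engine must match on all three. A small sketch of that naming convention (the helper is illustrative, not in the repo):

```python
def artifact_names(trt_precision, batch_size, num_frames):
    """Mirror the PREFIX/ONNX_FILE/ENGINE_FILE naming used by the benchmark script."""
    prefix = f"BS{batch_size}_NF{num_frames}"
    # ONNX export is always FP32: the TRT ONNX parser cannot parse half-precision weights
    onnx_file = f"fp32_{prefix}.onnx"
    engine_file = f"{trt_precision}_{prefix}.engine"
    return onnx_file, engine_file

print(artifact_names("fp16", 64, 512))
# ('fp32_BS64_NF512.onnx', 'fp16_BS64_NF512.engine')
```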
+ 38 - 0
PyTorch/SpeechRecognition/Jasper/trt/scripts/walk_benchmark.sh

@@ -0,0 +1,38 @@
+#!/bin/bash
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#           http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# A usage example of trt_inference_benchmark.sh.
+
+
+export NUM_STEPS=100
+export FORCE_ENGINE_REBUILD="true"
+export CHECKPOINT="/checkpoints/jasper.pt"
+export CREATE_LOGFILE="false"
+for prec in fp16;
+do
+    export TRT_PRECISION=$prec
+    export PYTORCH_PRECISION=$prec
+    export CSV_PATH="/results/${prec}.csv"
+    for nf in 208 304 512 704 1008 1680;
+    do
+        export NUM_FRAMES=$nf
+        for bs in 1 2 4 8 16 32 64;
+        do
+            export BATCH_SIZE=$bs
+
+            echo "Doing batch size ${bs}, sequence length ${nf}, precision ${prec}"
+            bash trt/scripts/trt_inference_benchmark.sh $1 $2 $3 $4 $5 $6
+        done
+    done
+done

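walk_benchmark.sh sweeps the full cross-product of precision, frame count, and batch size, calling trt_inference_benchmark.sh once per combination. The same sweep, sketched with `itertools.product` in the same nesting order as the shell loops:

```python
from itertools import product

precisions = ["fp16"]
num_frames = [208, 304, 512, 704, 1008, 1680]
batch_sizes = [1, 2, 4, 8, 16, 32, 64]

# Precision outermost, batch size innermost, matching the shell loops
configs = list(product(precisions, num_frames, batch_sizes))
print(len(configs))  # 42 benchmark runs
for prec, nf, bs in configs[:2]:
    print(f"Doing batch size {bs}, sequence length {nf}, precision {prec}")
```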
+ 92 - 0
PyTorch/SpeechRecognition/Jasper/trt/trtutils.py

@@ -0,0 +1,92 @@
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#           http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+'''Contains helper functions for TRT components of JASPER inference
+'''
+import pycuda.driver as cuda
+import tensorrt as trt
+
+class HostDeviceMem(object):
+    '''Type for managing host and device buffers
+
+    A simple class which is more explicit than dealing with a 2-tuple.
+    '''
+    def __init__(self, host_mem, device_mem):
+        self.host = host_mem
+        self.device = device_mem
+
+    def __str__(self):
+        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)
+
+    def __repr__(self):
+        return self.__str__()
+
+def build_engine_from_parser(model_path, batch_size, is_fp16=True, is_verbose=False, max_workspace_size=4*1024*1024*1024):
+    '''Builds TRT engine from an ONNX file
+    Note that network output 1 is unmarked so that the engine will not use
+    vestigial length calculations associated with masked_fill
+    '''
+    TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE) if is_verbose else trt.Logger(trt.Logger.WARNING)
+    with trt.Builder(TRT_LOGGER) as builder:
+        builder.max_batch_size = batch_size
+        builder.fp16_mode = is_fp16
+        builder.max_workspace_size = max_workspace_size
+        with builder.create_network() as network:
+            with trt.OnnxParser(network, TRT_LOGGER) as parser:
+                with open(model_path, 'rb') as model:
+                    if not parser.parse(model.read()):
+                        # Surface parser errors instead of failing silently later
+                        for i in range(parser.num_errors):
+                            print(parser.get_error(i))
+                        raise RuntimeError("Failed to parse ONNX model at " + model_path)
+                return builder.build_cuda_engine(network)
+
+def deserialize_engine(engine_path, is_verbose):
+    '''Deserializes TRT engine at engine_path
+    '''
+    TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE) if is_verbose else trt.Logger(trt.Logger.WARNING)
+    with open(engine_path, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
+        engine = runtime.deserialize_cuda_engine(f.read())
+    return engine
+
+
+def allocate_buffers_with_existing_inputs(engine, inp, batch_size=1):
+    '''
+    Like allocate_buffers() from the TRT Python samples, but reuses existing inputs already on the device
+
+    inp:  List of pointers to device memory. Pointers are in the same order as
+          would be produced by allocate_buffers(), i.e. the order obtained by
+          iterating through `engine`.
+    '''
+
+    # Add input to bindings
+    bindings = []
+    outputs = []
+    stream = cuda.Stream()
+    inp_idx = 0
+
+    for binding in engine:
+        if engine.binding_is_input(binding):
+            bindings.append(inp[inp_idx])
+            inp_idx += 1
+        else:
+            # Output buffers are allocated exactly as in the standard allocate_buffers()
+            size = trt.volume(engine.get_binding_shape(binding)) * batch_size
+            dtype = trt.nptype(engine.get_binding_dtype(binding))
+            # Allocate host and device buffers
+            host_mem = cuda.pagelocked_empty(size, dtype)
+            device_mem = cuda.mem_alloc(host_mem.nbytes * 2)
+            # Append the device buffer to device bindings.
+            bindings.append(int(device_mem))
+            # Append to the appropriate list.
+            outputs.append(HostDeviceMem(host_mem, device_mem))
+
+    return outputs, bindings, stream

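`allocate_buffers_with_existing_inputs` walks the engine's bindings in order, splicing in the caller's existing device pointers for input bindings and allocating fresh buffers for output bindings. That ordering logic can be sketched without TRT or pycuda, assuming a fake engine described by a list of input/output flags (all names here are illustrative):

```python
def build_bindings(binding_is_input, input_ptrs, alloc_output):
    """binding_is_input: bools in engine binding order.
    input_ptrs: pre-existing device pointers, consumed in order.
    alloc_output: callable returning a fresh (host, device) pair per output."""
    bindings, outputs = [], []
    inp_idx = 0
    for is_input in binding_is_input:
        if is_input:
            # Reuse the caller's device pointer instead of copying host -> device
            bindings.append(input_ptrs[inp_idx])
            inp_idx += 1
        else:
            host, device = alloc_output()
            bindings.append(device)
            outputs.append((host, device))
    return bindings, outputs

# One input binding followed by one output binding, as in the Jasper engine
bindings, outputs = build_bindings([True, False], [111], lambda: ("host_buf", 222))
print(bindings)  # [111, 222]
```

The returned `bindings` list is what would be handed to `context.execute_async`, while `outputs` holds the host/device pairs for the device-to-host copies afterwards.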
+ 5 - 0
PyTorch/SpeechRecognition/Jasper/utils/inference_librispeech.csv

@@ -0,0 +1,5 @@
+url,md5
+http://www.openslr.org/resources/12/dev-clean.tar.gz,42e2234ba48799c1f50f24a7926300a1
+http://www.openslr.org/resources/12/dev-other.tar.gz,c8d0bcc9cca99d4f8b62fcc847357931
+http://www.openslr.org/resources/12/test-clean.tar.gz,32fa31d27d2e1cad72775fee3f4849a9
+http://www.openslr.org/resources/12/test-other.tar.gz,fb5a50374b501bb3bac4815ee91d3135