
trtis to triton update

Swetha Mandava, 5 years ago
parent
commit 58a3ed6bab

+ 3 - 2
TensorFlow/LanguageModeling/BERT/Dockerfile

@@ -1,4 +1,4 @@
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:19.10-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.03-tf1-py3
 
 FROM ${FROM_IMAGE_NAME}
 
@@ -13,10 +13,11 @@ RUN git clone https://github.com/attardi/wikiextractor.git
 RUN git clone https://github.com/soskek/bookcorpus.git
 RUN git clone https://github.com/titipata/pubmed_parser
 
+
 RUN pip3 install /workspace/pubmed_parser
 
 #Copy the perf_client over
-ARG TRTIS_CLIENTS_URL=https://github.com/NVIDIA/tensorrt-inference-server/releases/download/v1.5.0/v1.5.0_ubuntu1804.clients.tar.gz
+ARG TRTIS_CLIENTS_URL=https://github.com/NVIDIA/triton-inference-server/releases/download/v1.12.0/v1.12.0_ubuntu1804.clients.tar.gz
 RUN mkdir -p /workspace/install \
     && curl -L ${TRTIS_CLIENTS_URL} | tar xvz -C /workspace/install
 

+ 4 - 4
TensorFlow/LanguageModeling/BERT/README.md

@@ -28,7 +28,7 @@ This repository provides a script and recipe to train the BERT model for TensorF
     * [Multi-node](#multi-node)
   * [Inference process](#inference-process)
   * [Inference Process With TensorRT](#inference-process-with-tensorrt)
-  * [Deploying the BERT model using TensorRT Inference Server](#deploying-the-bert-model-using-tensorrt-inference-server)
+  * [Deploying the BERT model using Triton Inference Server](#deploying-the-bert-model-using-triton-inference-server)
   * [BioBERT](#biobert)
 - [Performance](#performance)
   * [Benchmarking](#benchmarking)
@@ -619,9 +619,9 @@ I0312 23:14:00.550973 140287431493376 run_squad.py:1397] 0 Inference Performance
 ### Inference Process With TensorRT
 NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high-throughput for deep learning inference applications. More information on how to perform inference using TensorRT can be found in the subfolder [./trt/README.md](trt/README.md)
 
-### Deploying the BERT model using TensorRT Inference Server
+### Deploying the BERT model using Triton Inference Server
 
-The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server. More information on how to perform inference using `TensorRT Inference Server` can be found in the subfolder `./trtis/README.md`.
+The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server. More information on how to perform inference using `Triton Inference Server` can be found in the subfolder `./triton/README.md`.
 
 ### BioBERT
 
@@ -1153,7 +1153,7 @@ September 2019
 
 July 2019
 - Results obtained using 19.06
-- Inference Studies using TensorRT Inference Server
+- Inference Studies using Triton Inference Server
 
 March 2019
 - Initial release

+ 36 - 34
TensorFlow/LanguageModeling/BERT/trtis/README.md → TensorFlow/LanguageModeling/BERT/triton/README.md

@@ -1,18 +1,18 @@
-# Deploying the BERT model using TensorRT Inference Server
+# Deploying the BERT model using Triton Inference Server
 
-The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
-This folder contains detailed performance analysis as well as scripts to run SQuAD fine-tuning on BERT model using TensorRT Inference Server.
+The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
+This folder contains a detailed performance analysis as well as scripts to run SQuAD fine-tuning on the BERT model using Triton Inference Server.
 
 ## Table Of Contents
 
-- [TensorRT Inference Server Overview](#tensorrt-inference-server-overview)
-- [Performance analysis for TensorRT Inference Server](#performance-analysis-for-tensorrt-inference-server)
+- [Triton Inference Server Overview](#triton-inference-server-overview)
+- [Running the Triton Inference Server and client](#running-the-triton-inference-server-and-client)
+- [Performance analysis for Triton Inference Server](#performance-analysis-for-triton-inference-server)
   * [Advanced Details](#advanced-details)
-- [Running the TensorRT Inference Server and client](#running-the-tensorrt-inference-server-and-client)
 
-## TensorRT Inference Server Overview
+## Triton Inference Server Overview
 
-A typical TensorRT Inference Server pipeline can be broken down into the following 8 steps:
+A typical Triton Inference Server pipeline can be broken down into the following 8 steps:
 1. Client serializes the inference request into a message and sends it to the server (Client Send)
 2. Message travels over the network from the client to the server (Network)
 3. Message arrives at server, and is deserialized (Server Receive)
@@ -23,30 +23,40 @@ A typical TensorRT Inference Server pipeline can be broken down into the followi
 8. Completed message is deserialized by the client and processed as a completed inference request (Client Receive)
 
 Generally, for local clients, steps 1-4 and 6-8 will only occupy a small fraction of time, compared to steps 5-6. As backend deep learning systems like BERT are rarely exposed directly to end users, but instead only interfacing with local front-end servers, for the sake of BERT, we can consider that all clients are local.
-In this section, we will go over how to launch TensorRT Inference Server and client and get the best performant solution that fits your specific application needs.
+In this section, we will go over how to launch the Triton Inference Server and client and arrive at the most performant configuration for your specific application needs.
 
 Note: The following instructions are run from outside the container and call `docker run` commands as required.
 
-## Performance analysis for TensorRT Inference Server
+## Running the Triton Inference Server and client
+
+The `run_triton.sh` script exports the TensorFlow BERT model as a `tensorflow_savedmodel` that Triton Inference Server accepts, builds a matching [Triton Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state, runs the client on the SQuAD v1.1 dataset, and then evaluates the validity of the predictions by exact match and F1 score, all in one step.
+
+```bash
+bash triton/scripts/run_triton.sh <init_checkpoint> <batch_size> <precision> <use_xla> <seq_length> <doc_stride> <bert_model> <squad_version> <triton_version_name> <triton_model_name> <triton_export_model> <triton_dyn_batching_delay> <triton_engine_count> <triton_model_overwrite>
+```
+
+You can also run inference with a sample by passing `--question` and `--context` arguments to the client.
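For a single sample, the client wraps the question and context into SQuAD-format JSON before feature conversion (see `run_squad_triton_client.py` below). A minimal sketch of that wrapping, with made-up sample strings:

```python
# How the client wraps a single --question/--context pair into
# SQuAD-style input_data before convert_examples_to_features runs.
question = "What does Triton provide?"  # stand-in sample text
context = ("Triton Inference Server provides an inference service "
           "via an HTTP or gRPC endpoint.")

input_data = [{"paragraphs": [{"context": context,
                               "qas": [{"id": 0, "question": question}]}]}]

qa = input_data[0]["paragraphs"][0]["qas"][0]
print(qa["id"], qa["question"] == question)  # -> 0 True
```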
+
+## Performance analysis for Triton Inference Server
 
 Based on the figures 1 and 2 below, we recommend using the Dynamic Batcher with `max_batch_size = 8`, `max_queue_delay_microseconds` as large as possible to fit within your latency window (the values used below are extremely large to exaggerate their effect), and only 1 instance of the engine. The largest improvements to both throughput and latency come from increasing the batch size due to efficiency gains in the GPU with larger batches. The Dynamic Batcher combines the best of both worlds by efficiently batching together a large number of simultaneous requests, while also keeping latency down for infrequent requests. We recommend only 1 instance of the engine due to the negligible improvement to throughput at the cost of significant increases in latency. Many models can benefit from multiple engine instances but as the figures below show, that is not the case for this model.
 
-![](../data/images/trtis_base_summary.png?raw=true)
+![](../data/images/triton_base_summary.png?raw=true)
 
-Figure 1: Latency vs Throughput for BERT Base, FP16, Sequence Length = 128 using various configurations available in TensorRT Inference Server
+Figure 1: Latency vs Throughput for BERT Base, FP16, Sequence Length = 128 using various configurations available in Triton Inference Server
 
-![](../data/images/trtis_large_summary.png?raw=true)
+![](../data/images/triton_large_summary.png?raw=true)
 
-Figure 2: Latency vs Throughput for BERT Large, FP16, Sequence Length = 384 using various configurations available in TensorRT Inference Server
+Figure 2: Latency vs Throughput for BERT Large, FP16, Sequence Length = 384 using various configurations available in Triton Inference Server
 
 ### Advanced Details
 
-This section digs deeper into the performance numbers and configurations corresponding to running TensorRT Inference Server for BERT fine tuning for Question Answering. It explains the tradeoffs in selecting maximum batch sizes, batching techniques and number of inference engines on the same GPU to understand how we arrived at the optimal configuration specified previously.
+This section digs deeper into the performance numbers and configurations corresponding to running Triton Inference Server for BERT fine tuning for Question Answering. It explains the tradeoffs in selecting maximum batch sizes, batching techniques and number of inference engines on the same GPU to understand how we arrived at the optimal configuration specified previously.
 
-Results can be reproduced by running `generate_figures.sh`. It exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state and runs [perf_client](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/client.html#performance-example-application) for various configurations.
+Results can be reproduced by running `generate_figures.sh`. It exports the TensorFlow BERT model as a `tensorflow_savedmodel` that Triton Inference Server accepts, builds a matching [Triton Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/model_configuration.html#), starts the server on localhost in a detached state and runs [perf_client](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-guide/docs/client.html#performance-example-application) for various configurations.
 
 ```bash
-bash trtis/scripts/generate_figures.sh <bert_model> <seq_length> <precision> <init_checkpoint>
+bash triton/scripts/generate_figures.sh <bert_model> <seq_length> <precision> <init_checkpoint>
 ```
 
 All results below are obtained on a single DGX-1 V100 32GB GPU for BERT Base, Sequence Length = 128 and FP16 precision running on a local server. Latencies are indicated by bar plots using the left axis. Throughput is indicated by the blue line plot using the right axis. X-axis indicates the concurrency - the maximum number of inference requests that can be in the pipeline at any given time. For example, when the concurrency is set to 1, the client waits for an inference request to be completed (Step 8) before it sends another to the server (Step 1).  A high number of concurrent requests can reduce the impact of network latency on overall throughput.
@@ -59,11 +69,11 @@ Note: We compare BS=1, Client Concurrent Requests = 64 to BS=8, Client Concurren
 
 Increasing the batch size from 1 to 8 results in an increase in compute time by 1.8x (8.38ms to 15.46ms) showing that computation is more efficient at higher batch sizes. Hence, an optimal batch size would be the maximum batch size that can both fit in memory and is within the preferred latency threshold.
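The efficiency claim above can be checked with quick arithmetic, using the compute times quoted from the figures:

```python
# Compute-time figures quoted above: 8.38 ms at batch size 1,
# 15.46 ms at batch size 8.
t_bs1, t_bs8 = 8.38, 15.46

compute_increase = t_bs8 / t_bs1      # ~1.8x more time for 8x the work
throughput_gain = 8 * t_bs1 / t_bs8   # ~4.3x more sentences per second

print(round(compute_increase, 1), round(throughput_gain, 1))  # -> 1.8 4.3
```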
 
-![](../data/images/trtis_bs_1.png?raw=true)
+![](../data/images/triton_bs_1.png?raw=true)
 
 Figure 3: Latency & Throughput vs Concurrency at Batch size = 1
 
-![](../data/images/trtis_bs_8.png?raw=true)
+![](../data/images/triton_bs_8.png?raw=true)
 
 Figure 4: Latency & Throughput vs Concurrency at Batch size = 8
 
@@ -71,38 +81,30 @@ Figure 4: Latency & Throughput vs Concurrency at Batch size = 8
 
 Static batching is a feature of the inference server that allows inference requests to be served as they are received. It is preferred in scenarios where low latency is desired at the cost of throughput when the GPU is under utilized.
 
-Dynamic batching is a feature of the inference server that allows inference requests to be combined by the server, so that a batch is created dynamically, resulting in an increased throughput. It is preferred in scenarios where we would like to maximize throughput and GPU utilization at the cost of higher latencies. You can set the [Dynamic Batcher parameters](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-master-branch-guide/docs/model_configuration.html#dynamic-batcher) `max_queue_delay_microseconds` to indicate the maximum amount of time you are willing to wait and ‘preferred_batchsize’ to indicate your optimal batch sizes in the TensorRT Inference Server model config.
+Dynamic batching is a feature of the inference server that allows inference requests to be combined by the server, so that a batch is created dynamically, resulting in increased throughput. It is preferred in scenarios where we would like to maximize throughput and GPU utilization at the cost of higher latencies. You can set the [Dynamic Batcher parameters](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-master-branch-guide/docs/model_configuration.html#dynamic-batcher) `max_queue_delay_microseconds` to indicate the maximum amount of time you are willing to wait and `preferred_batchsize` to indicate your optimal batch sizes in the Triton Inference Server model config.
 
 Figures 5 and 6 emphasize the increase in overall throughput with dynamic batching. At low numbers of concurrent requests, the increased throughput comes at the cost of increasing latency as the requests are queued up to `max_queue_delay_microseconds`. The effect of `preferred_batchsize` for dynamic batching is visually depicted by the dip in Server Queue time at integer multiples of the preferred batch sizes. At higher numbers of concurrent requests, observe that the throughput approach a maximum limit as we saturate the GPU utilization.
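For reference, the dynamic batcher is enabled through a stanza in the model's `config.pbtxt`; a fragment matching the settings used in Figure 6 might look like the following (note the config field is spelled `preferred_batch_size`, which the prose abbreviates as `preferred_batchsize`):

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 5000
}
```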
 
-![](../data/images/trtis_static.png?raw=true)
+![](../data/images/triton_static.png?raw=true)
 
 Figure 5: Latency & Throughput vs Concurrency using Static Batching at `Batch size` = 1
 
-![](../data/images/trtis_dynamic.png?raw=true)
+![](../data/images/triton_dynamic.png?raw=true)
 
 Figure 6: Latency & Throughput vs Concurrency using Dynamic Batching at `Batch size` = 1, `preferred_batchsize` = [4, 8] and `max_queue_delay_microseconds` = 5000
 
 #### Model execution instance count
 
-TensorRT Inference Server enables us to launch multiple engines in separate CUDA streams by setting the `instance_group_count` parameter to improve both latency and throughput. Multiple engines are useful when the model doesn’t saturate the GPU allowing the GPU to run multiple instances of the model in parallel.
+Triton Inference Server enables us to launch multiple engines in separate CUDA streams by setting the `instance_group_count` parameter to improve both latency and throughput. Multiple engines are useful when the model doesn’t saturate the GPU allowing the GPU to run multiple instances of the model in parallel.
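In the model config this is expressed as an `instance_group` stanza; a fragment requesting a single GPU instance (the setting recommended above) might look like:

```
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
```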
 
 Figures 7 and 8 show a drop in queue time as more models are available to serve an inference request. However, this is countered by an increase in compute time as multiple models compete for resources. Since BERT is a large model which utilizes the majority of the GPU, the benefit to running multiple engines is not seen.
 
-![](../data/images/trtis_ec_1.png?raw=true)
+![](../data/images/triton_ec_1.png?raw=true)
 
 Figure 7: Latency & Throughput vs Concurrency at Batch size = 1, Engine Count = 1
 (One copy of the model loaded in GPU memory)
 
-![](../data/images/trtis_ec_4.png?raw=true)
+![](../data/images/triton_ec_4.png?raw=true)
 
 Figure 8: Latency & Throughput vs Concurrency at Batch size = 1, Engine count = 4
-(Four copies the model loaded in GPU memory)
-
-## Running the TensorRT Inference Server and client
-
-The `run_trtis.sh` script exports the TensorFlow BERT model as a `tensorflow_savedmodel` that TensorRT Inference Server accepts, builds a matching [TensorRT Inference Server model config](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#), starts the server on local host in a detached state, runs client and then evaluates the validity of predictions on the basis of exact match and F1 score all in one step.
-
-```bash
-bash trtis/scripts/run_trtis.sh <init_checkpoint> <batch_size> <precision> <use_xla> <seq_length> <doc_stride> <bert_model> <squad_version> <trtis_version_name> <trtis_model_name> <trtis_export_model> <trtis_dyn_batching_delay> <trtis_engine_count> <trtis_model_overwrite>
-```
+(Four copies of the model loaded in GPU memory)

+ 323 - 0
TensorFlow/LanguageModeling/BERT/triton/run_squad_triton_client.py

@@ -0,0 +1,323 @@
+# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import modeling
+import tokenization
+from tensorrtserver.api import ProtocolType, InferContext, ServerStatusContext, grpc_service_pb2_grpc, grpc_service_pb2, model_config_pb2
+from utils.create_squad_data import *
+import grpc
+from run_squad import write_predictions, get_predictions, RawResult
+import numpy as np
+import tqdm
+from functools import partial
+import os
+import time
+import tensorflow as tf
+
+import sys
+if sys.version_info >= (3, 0):
+  import queue
+else:
+  import Queue as queue
+
+
+flags = tf.flags
+FLAGS = flags.FLAGS
+
+## Required parameters
+flags.DEFINE_string(
+    "bert_config_file", None,
+    "The config json file corresponding to the pre-trained BERT model. "
+    "This specifies the model architecture.")
+
+flags.DEFINE_string("vocab_file", None,
+                    "The vocabulary file that the BERT model was trained on.")
+
+flags.DEFINE_string(
+    "output_dir", None,
+    "The output directory where the model checkpoints will be written.")
+
+flags.DEFINE_bool(
+    "do_lower_case", True,
+    "Whether to lower case the input text. Should be True for uncased "
+    "models and False for cased models.")
+
+flags.DEFINE_integer(
+    "max_seq_length", 384,
+    "The maximum total input sequence length after WordPiece tokenization. "
+    "Sequences longer than this will be truncated, and sequences shorter "
+    "than this will be padded.")
+
+flags.DEFINE_integer(
+    "doc_stride", 128,
+    "When splitting up a long document into chunks, how much stride to "
+    "take between chunks.")
+
+flags.DEFINE_integer(
+    "max_query_length", 64,
+    "The maximum number of tokens for the question. Questions longer than "
+    "this will be truncated to this length.")
+
+flags.DEFINE_integer("predict_batch_size", 8,
+                     "Total batch size for predictions.")
+
+flags.DEFINE_integer(
+    "n_best_size", 20,
+    "The total number of n-best predictions to generate in the "
+    "nbest_predictions.json output file.")
+
+flags.DEFINE_integer(
+    "max_answer_length", 30,
+    "The maximum length of an answer that can be generated. This is needed "
+    "because the start and end predictions are not conditioned on one another.")
+
+flags.DEFINE_bool(
+    "version_2_with_negative", False,
+    "If true, the SQuAD examples contain some that do not have an answer.")
+
+flags.DEFINE_bool(
+    "verbose_logging", False,
+    "If true, all of the warnings related to data processing will be printed. "
+    "A number of warnings are expected for a normal SQuAD evaluation.")
+
+# Triton Specific flags
+flags.DEFINE_string("triton_model_name", "bert", "Name of the model as deployed on Triton")
+flags.DEFINE_integer("triton_model_version", 1, "Version of the model to query on Triton")
+flags.DEFINE_string("triton_server_url", "localhost:8001", "URL (host:port) of the Triton server")
+
+# Input Text for Inference
+flags.DEFINE_string("question", None, "Question for Inference")
+flags.DEFINE_string("context", None, "Context for Inference")
+flags.DEFINE_string(
+    "predict_file", None,
+    "SQuAD json for predictions. E.g., dev-v1.1.json or test-v1.1.json")
+
+
+# Set this to either 'label_ids' for Google bert or 'unique_ids' for JoC
+label_id_key = "unique_ids"
+
+# User defined class to store infer_ctx and request id
+# from callback function and let main thread to handle them
+class UserData:
+    def __init__(self):
+        self._completed_requests = queue.Queue()
+
+# Callback function used for async_run(), it can capture
+# additional information using functools.partial as long as the last
+# two arguments are reserved for InferContext and request id
+def completion_callback(user_data, idx, start_time, inputs, infer_ctx, request_id):
+    user_data._completed_requests.put((infer_ctx, request_id, idx, start_time, inputs))
+
+def batch(iterable, n=1):
+    l = len(iterable)
+    for ndx in range(0, l, n):
+        label_ids_data = ()
+        input_ids_data = ()
+        input_mask_data = ()
+        segment_ids_data = ()
+        for i in range(0, min(n, l-ndx)):
+            label_ids_data = label_ids_data + (np.array([iterable[ndx + i].unique_id], dtype=np.int32),)
+            input_ids_data = input_ids_data+ (np.array(iterable[ndx + i].input_ids, dtype=np.int32),)
+            input_mask_data = input_mask_data+ (np.array(iterable[ndx + i].input_mask, dtype=np.int32),)
+            segment_ids_data = segment_ids_data+ (np.array(iterable[ndx + i].segment_ids, dtype=np.int32),)
+
+        inputs_dict = {label_id_key: label_ids_data,
+                       'input_ids': input_ids_data,
+                       'input_mask': input_mask_data,
+                       'segment_ids': segment_ids_data}
+        yield inputs_dict
+
+def main(_):
+    """
+    Run SQuAD inference against a Triton server, either over the full
+    --predict_file dataset or for a single --question/--context pair.
+    """
+    tokenizer = tokenization.FullTokenizer(vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
+
+    # Get the Data
+    if FLAGS.predict_file:
+        eval_examples = read_squad_examples(
+            input_file=FLAGS.predict_file, is_training=False,
+            version_2_with_negative=FLAGS.version_2_with_negative)
+    elif FLAGS.question and FLAGS.context:
+        input_data = [{"paragraphs":[{"context":FLAGS.context,
+                        "qas":[{"id":0, "question":FLAGS.question}]}]}]
+
+        eval_examples = read_squad_examples(input_file=None, is_training=False,
+            version_2_with_negative=FLAGS.version_2_with_negative, input_data=input_data)
+    else:
+        raise ValueError("Either predict_file or question+context needs to be defined")
+    
+    # Get Eval Features = Preprocessing
+    eval_features = []
+    def append_feature(feature):
+        eval_features.append(feature)
+
+    convert_examples_to_features(
+        examples=eval_examples[0:],
+        tokenizer=tokenizer,
+        max_seq_length=FLAGS.max_seq_length,
+        doc_stride=FLAGS.doc_stride,
+        max_query_length=FLAGS.max_query_length,
+        is_training=False,
+        output_fn=append_feature)
+
+    protocol_str = 'grpc' # http or grpc
+    url = FLAGS.triton_server_url
+    verbose = True
+    model_name = FLAGS.triton_model_name
+    model_version = FLAGS.triton_model_version
+    batch_size = FLAGS.predict_batch_size
+
+    protocol = ProtocolType.from_str(protocol_str)  # convert the 'http'/'grpc' string to a ProtocolType
+
+    ctx = InferContext(url, protocol, model_name, model_version, verbose)
+
+    status_ctx = ServerStatusContext(url, protocol, model_name=model_name, verbose=verbose)
+
+    model_config_pb2.ModelConfig()
+
+    status_result = status_ctx.get_server_status()
+    user_data = UserData()
+
+    max_outstanding = 20
+    # Number of outstanding requests
+    outstanding = 0
+
+    sent_prog = tqdm.tqdm(desc="Send Requests", total=len(eval_features))
+    recv_prog = tqdm.tqdm(desc="Recv Requests", total=len(eval_features))
+
+    def process_outstanding(do_wait, outstanding):
+
+        if (outstanding == 0 or do_wait is False):
+            return outstanding
+
+        # Wait for deferred items from callback functions
+        (infer_ctx, ready_id, idx, start_time, inputs) = user_data._completed_requests.get()
+
+        if (ready_id is None):
+            return outstanding
+
+        # If we are here, we got an id
+        result = ctx.get_async_run_results(ready_id)
+        stop = time.time()
+
+        if (result is None):
+            raise ValueError("Context returned null for async id marked as done")
+
+        outstanding -= 1
+
+        time_list.append(stop - start_time)
+
+        batch_count = len(inputs[label_id_key])
+
+        for i in range(batch_count):
+            unique_id = int(inputs[label_id_key][i][0])
+            start_logits = [float(x) for x in result["start_logits"][i].flat]
+            end_logits = [float(x) for x in result["end_logits"][i].flat]
+            all_results.append(
+                RawResult(
+                    unique_id=unique_id,
+                    start_logits=start_logits,
+                    end_logits=end_logits))
+
+        recv_prog.update(n=batch_count)
+        return outstanding
+
+    all_results = []
+    time_list = []
+
+    print("Starting Sending Requests....\n")
+
+    all_results_start = time.time()
+    idx = 0
+    for inputs_dict in batch(eval_features, batch_size):
+
+        present_batch_size = len(inputs_dict[label_id_key])
+
+        outputs_dict = {'start_logits': InferContext.ResultFormat.RAW,
+                        'end_logits': InferContext.ResultFormat.RAW}
+
+        start_time = time.time()
+        ctx.async_run(partial(completion_callback, user_data, idx, start_time, inputs_dict),
+        	inputs_dict, outputs_dict, batch_size=present_batch_size)
+        outstanding += 1
+        idx += 1
+
+        sent_prog.update(n=present_batch_size)
+
+        # Try to process at least one response per request
+        outstanding = process_outstanding(outstanding >= max_outstanding, outstanding)
+
+    tqdm.tqdm.write("All Requests Sent! Waiting for responses. Outstanding: {}.\n".format(outstanding))
+
+    # Now process all outstanding requests
+    while (outstanding > 0):
+        outstanding = process_outstanding(True, outstanding)
+
+    all_results_end = time.time()
+    all_results_total = (all_results_end - all_results_start) * 1000.0
+
+    print("-----------------------------")
+    print("Total Time: {} ms".format(all_results_total))
+    print("-----------------------------")
+
+    print("-----------------------------")
+    print("Total Inference Time = %0.2f for "
+          "Sentences processed = %d" % (sum(time_list), len(eval_features)))
+    print("Throughput Average (sentences/sec) = %0.2f" % (len(eval_features) / all_results_total * 1000.0))
+    print("-----------------------------")
+
+    if FLAGS.output_dir and FLAGS.predict_file:
+        # When inferencing on a dataset, get inference statistics and write results to json file
+        time_list.sort()
+
+        avg = np.mean(time_list)
+        cf_95 = max(time_list[:int(len(time_list) * 0.95)])
+        cf_99 = max(time_list[:int(len(time_list) * 0.99)])
+        cf_100 = max(time_list[:int(len(time_list) * 1)])
+        print("-----------------------------")
+        print("Summary Statistics")
+        print("Batch size =", FLAGS.predict_batch_size)
+        print("Sequence Length =", FLAGS.max_seq_length)
+        print("Latency Confidence Level 95 (ms) =", cf_95 * 1000)
+        print("Latency Confidence Level 99 (ms)  =", cf_99 * 1000)
+        print("Latency Confidence Level 100 (ms)  =", cf_100 * 1000)
+        print("Latency Average (ms)  =", avg * 1000)
+        print("-----------------------------")
+
+
+        output_prediction_file = os.path.join(FLAGS.output_dir, "predictions.json")
+        output_nbest_file = os.path.join(FLAGS.output_dir, "nbest_predictions.json")
+        output_null_log_odds_file = os.path.join(FLAGS.output_dir, "null_odds.json")
+
+        write_predictions(eval_examples, eval_features, all_results,
+                          FLAGS.n_best_size, FLAGS.max_answer_length,
+                          FLAGS.do_lower_case, output_prediction_file,
+                          output_nbest_file, output_null_log_odds_file,
+                          FLAGS.version_2_with_negative, FLAGS.verbose_logging)
+    else:
+        # When inferencing on a single example, write best answer to stdout
+        all_predictions, all_nbest_json, scores_diff_json = get_predictions(
+                  eval_examples, eval_features, all_results,
+                  FLAGS.n_best_size, FLAGS.max_answer_length,
+                  FLAGS.do_lower_case, FLAGS.version_2_with_negative, 
+                  FLAGS.verbose_logging)
+        print("Context is: %s \n\nQuestion is: %s \n\nPredicted Answer is: %s" %(FLAGS.context, FLAGS.question, all_predictions[0]))
+
+
+if __name__ == "__main__":
+  flags.mark_flag_as_required("vocab_file")
+  flags.mark_flag_as_required("bert_config_file")
+  tf.compat.v1.app.run()
+
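The `batch` helper above yields dicts of per-example numpy arrays grouped into batches of at most `n`. A standalone sketch of the same logic, using a stub in place of the real SQuAD feature objects:

```python
from collections import namedtuple
import numpy as np

# Stub standing in for the SQuAD feature objects the real client consumes.
Feature = namedtuple("Feature", ["unique_id", "input_ids", "input_mask", "segment_ids"])

def batch(iterable, n=1):
    # Same grouping logic as run_squad_triton_client.py: yield dicts of
    # tuples of per-example int32 arrays, at most n examples per batch.
    l = len(iterable)
    for ndx in range(0, l, n):
        out = {"unique_ids": (), "input_ids": (), "input_mask": (), "segment_ids": ()}
        for i in range(min(n, l - ndx)):
            f = iterable[ndx + i]
            out["unique_ids"] += (np.array([f.unique_id], dtype=np.int32),)
            out["input_ids"] += (np.array(f.input_ids, dtype=np.int32),)
            out["input_mask"] += (np.array(f.input_mask, dtype=np.int32),)
            out["segment_ids"] += (np.array(f.segment_ids, dtype=np.int32),)
        yield out

feats = [Feature(i, [1, 2], [1, 1], [0, 0]) for i in range(5)]
batches = list(batch(feats, n=2))
print([len(b["unique_ids"]) for b in batches])  # -> [2, 2, 1]
```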

+ 9 - 9
TensorFlow/LanguageModeling/BERT/trtis/scripts/export_model.sh → TensorFlow/LanguageModeling/BERT/triton/scripts/export_model.sh

@@ -20,15 +20,15 @@ use_xla=${4:-"true"}
 seq_length=${5:-"384"}
 doc_stride=${6:-"128"}
 BERT_DIR=${7:-"data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16"}
-trtis_model_version=${8:-1}
-trtis_model_name=${9:-"bert"}
-trtis_dyn_batching_delay=${10:-0}
-trtis_engine_count=${11:-1}
-trtis_model_overwrite=${12:-"False"}
+triton_model_version=${8:-1}
+triton_model_name=${9:-"bert"}
+triton_dyn_batching_delay=${10:-0}
+triton_engine_count=${11:-1}
+triton_model_overwrite=${12:-"False"}
 
-additional_args="--trtis_model_version=$trtis_model_version --trtis_model_name=$trtis_model_name --trtis_max_batch_size=$batch_size \
-                 --trtis_model_overwrite=$trtis_model_overwrite --trtis_dyn_batching_delay=$trtis_dyn_batching_delay \
-                 --trtis_engine_count=$trtis_engine_count"
+additional_args="--triton_model_version=$triton_model_version --triton_model_name=$triton_model_name --triton_max_batch_size=$batch_size \
+                 --triton_model_overwrite=$triton_model_overwrite --triton_dyn_batching_delay=$triton_dyn_batching_delay \
+                 --triton_engine_count=$triton_engine_count"
 
 if [ "$precision" = "fp16" ] ; then
    echo "fp16 activated!"
@@ -51,7 +51,7 @@ bash scripts/docker/launch.sh \
        --doc_stride=${doc_stride} \
        --predict_batch_size=${batch_size} \
        --output_dir=/results \
-       --export_trtis=True \
+       --export_triton=True \
        ${additional_args}
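The renamed flags above are filled from bash positional parameters with defaults (`${8:-1}` and so on). A minimal Python sketch of that resolution order — positions and defaults are taken from the script; the sample argument list is hypothetical:

```python
# Positions and defaults as in export_model.sh above (1-indexed like bash).
TRITON_ARG_DEFAULTS = {
    8: ("triton_model_version", "1"),
    9: ("triton_model_name", "bert"),
    10: ("triton_dyn_batching_delay", "0"),
    11: ("triton_engine_count", "1"),
    12: ("triton_model_overwrite", "False"),
}

def resolve_triton_args(args):
    """args: the positional parameters $1..$N, as strings."""
    return {
        name: (args[pos - 1] if len(args) >= pos else default)
        for pos, (name, default) in TRITON_ARG_DEFAULTS.items()
    }
```

With nine arguments given, `triton_model_version` and `triton_model_name` come from the command line and the remaining three fall back to their defaults, matching `${10:-0}`, `${11:-1}`, and `${12:-"False"}`.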
 
 

+ 146 - 0
TensorFlow/LanguageModeling/BERT/triton/scripts/generate_figures.sh

@@ -0,0 +1,146 @@
+#!/bin/bash
+
+# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Set the number of devices to use
+export NVIDIA_VISIBLE_DEVICES=0
+
+# Always overwrite models to keep memory use low
+export TRITON_MODEL_OVERWRITE=True
+
+bert_model=${1:-small}
+seq_length=${2:-128}
+precision=${3:-fp16}
+init_checkpoint=${4:-"/results/models/bert_${bert_model}_${precision}_${seq_length}_v1/model.ckpt-5474"}
+
+MODEL_NAME="bert_${bert_model}_${seq_length}_${precision}"
+
+if [ "$bert_model" = "large" ] ; then
+    export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
+else
+    export BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
+fi
+
+doc_stride=128
+use_xla=true
+EXPORT_MODEL_ARGS="${precision} ${use_xla} ${seq_length} ${doc_stride} ${BERT_DIR} 1 ${MODEL_NAME}"
+PERF_CLIENT_ARGS="1000 10 20 localhost"
+
+# Start Server
+bash triton/scripts/launch_server.sh $precision
+
+# Restart Server
+restart_server() {
+docker kill triton_server_cont
+bash triton/scripts/launch_server.sh $precision
+}
+
+############## Dynamic Batching Comparison ##############
+SERVER_BATCH_SIZE=8
+CLIENT_BATCH_SIZE=1
+TRITON_ENGINE_COUNT=1
+
+# Dynamic batching 10 ms
+TRITON_DYN_BATCHING_DELAY=10
+bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
+restart_server
+sleep 15
+bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
+
+# Dynamic batching 5 ms
+TRITON_DYN_BATCHING_DELAY=5
+bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
+restart_server
+sleep 15
+bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
+
+# Dynamic batching 2 ms
+TRITON_DYN_BATCHING_DELAY=2
+bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
+restart_server
+sleep 15
+bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
+
+
+# Static Batching (i.e. Dynamic batching 0 ms)
+TRITON_DYN_BATCHING_DELAY=0
+bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
+restart_server
+sleep 15
+bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
+
+
+# ############## Engine Count Comparison ##############
+SERVER_BATCH_SIZE=1
+CLIENT_BATCH_SIZE=1
+TRITON_DYN_BATCHING_DELAY=0
+
+# Engine Count = 4
+TRITON_ENGINE_COUNT=4
+bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
+restart_server
+sleep 15
+bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
+
+# Engine Count = 2
+TRITON_ENGINE_COUNT=2
+bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
+restart_server
+sleep 15
+bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
+
+# Engine Count = 1
+TRITON_ENGINE_COUNT=1
+bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
+restart_server
+sleep 15
+bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
+
+
+############## Batch Size Comparison ##############
+# BATCH=1 Generate model and perf
+SERVER_BATCH_SIZE=1
+CLIENT_BATCH_SIZE=1
+TRITON_ENGINE_COUNT=1 
+TRITON_DYN_BATCHING_DELAY=0 
+
+bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
+restart_server
+sleep 15
+bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 64 localhost
+
+# BATCH=2 Generate model and perf
+SERVER_BATCH_SIZE=2
+CLIENT_BATCH_SIZE=2
+bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
+restart_server
+sleep 15
+bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 32 localhost
+
+# BATCH=4 Generate model and perf
+SERVER_BATCH_SIZE=4
+CLIENT_BATCH_SIZE=4
+bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
+restart_server
+sleep 15
+bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 16 localhost
+
+# BATCH=8 Generate model and perf
+SERVER_BATCH_SIZE=8
+CLIENT_BATCH_SIZE=8
+bash triton/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRITON_DYN_BATCHING_DELAY} ${TRITON_ENGINE_COUNT} ${TRITON_MODEL_OVERWRITE}
+restart_server
+sleep 15
+bash triton/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 8 localhost
+
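In the batch-size comparison above, the perf client's max concurrency is halved each time the server batch size doubles (64, 32, 16, 8), so the number of in-flight sequences stays constant at 64 across runs. A quick sketch of that invariant:

```python
# (server_batch_size, max_concurrency) pairs used in the sweep above.
sweep = [(bs, 64 // bs) for bs in (1, 2, 4, 8)]
assert all(bs * conc == 64 for bs, conc in sweep)
print(sweep)  # → [(1, 64), (2, 32), (4, 16), (8, 8)]
```

Holding the in-flight work fixed makes the latency/throughput figures comparable across batch sizes.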

+ 5 - 5
TensorFlow/LanguageModeling/BERT/trtis/scripts/launch_server.sh → TensorFlow/LanguageModeling/BERT/triton/scripts/launch_server.sh

@@ -9,16 +9,16 @@ else
    export TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE=0
 fi
 
-# Start TRTIS server in detached state
-nvidia-docker run -d --rm \
+# Start Triton server in detached state
+docker run --gpus all -d --rm \
    --shm-size=1g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -p8000:8000 \
    -p8001:8001 \
    -p8002:8002 \
-   --name trt_server_cont \
+   --name triton_server_cont \
    -e NVIDIA_VISIBLE_DEVICES=$NV_VISIBLE_DEVICES \
    -e TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE \
-   -v $PWD/results/trtis_models:/models \
-   nvcr.io/nvidia/tensorrtserver:19.08-py3 trtserver --model-store=/models --strict-model-config=false
+   -v $PWD/results/triton_models:/models \
+   nvcr.io/nvidia/tritonserver:20.03-py3 trtserver --model-store=/models --strict-model-config=false

+ 6 - 24
TensorFlow/LanguageModeling/BERT/trtis/scripts/run_client.sh → TensorFlow/LanguageModeling/BERT/triton/scripts/run_client.sh

@@ -16,36 +16,18 @@
 batch_size=${1:-"8"}
 seq_length=${2:-"384"}
 doc_stride=${3:-"128"}
-trtis_version_name=${4:-"1"}
-trtis_model_name=${5:-"bert"}
+triton_version_name=${4:-"1"}
+triton_model_name=${5:-"bert"}
 BERT_DIR=${6:-"data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16"}
-squad_version=${7:-"1.1"}
-
-export SQUAD_DIR=data/download/squad/v${squad_version}
-if [ "$squad_version" = "1.1" ] ; then
-    version_2_with_negative="False"
-else
-    version_2_with_negative="True"
-fi
-
-echo "Squad directory set as " $SQUAD_DIR
-if [ ! -d "$SQUAD_DIR" ] ; then
-   echo "Error! $SQUAD_DIR directory missing. Please mount SQuAD dataset."
-   exit -1
-fi
 
 bash scripts/docker/launch.sh \
-   "python trtis/run_squad_trtis_client.py \
-      --trtis_model_name=$trtis_model_name \
-      --trtis_model_version=$trtis_version_name \
+   "python triton/run_squad_triton_client.py \
+      --triton_model_name=$triton_model_name \
+      --triton_model_version=$triton_version_name \
       --vocab_file=$BERT_DIR/vocab.txt \
       --bert_config_file=$BERT_DIR/bert_config.json \
-      --predict_file=$SQUAD_DIR/dev-v${squad_version}.json \
       --predict_batch_size=$batch_size \
       --max_seq_length=${seq_length} \
       --doc_stride=${doc_stride} \
       --output_dir=/results \
-      --version_2_with_negative=${version_2_with_negative}"
-
-bash scripts/docker/launch.sh "python $SQUAD_DIR/evaluate-v${squad_version}.py \
-    $SQUAD_DIR/dev-v${squad_version}.json /results/predictions.json"
+      ${@:7}"
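`run_client.sh` now takes six fixed positional arguments and forwards everything else (`${@:7}`) untouched to the Python client; this is how `run_triton.sh` passes `--predict_file` and `--version_2_with_negative`. A sketch of that contract (the function name is ours, the flag names are from the script above):

```python
def build_client_argv(batch_size, seq_length, doc_stride,
                      model_version, model_name, bert_dir, *extra):
    # The six fixed args mirror $1..$6 of run_client.sh; *extra mirrors "${@:7}".
    fixed = [
        "--triton_model_name=%s" % model_name,
        "--triton_model_version=%s" % model_version,
        "--vocab_file=%s/vocab.txt" % bert_dir,
        "--bert_config_file=%s/bert_config.json" % bert_dir,
        "--predict_batch_size=%s" % batch_size,
        "--max_seq_length=%s" % seq_length,
        "--doc_stride=%s" % doc_stride,
    ]
    return fixed + list(extra)
```

Keeping dataset-specific flags out of the fixed list is what lets the same client script serve both the SQuAD-evaluation path and the single-question path.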

+ 7 - 7
TensorFlow/LanguageModeling/BERT/trtis/scripts/run_perf_client.sh → TensorFlow/LanguageModeling/BERT/triton/scripts/run_perf_client.sh

@@ -23,21 +23,21 @@ MAX_CONCURRENCY=${7:-50}
 SERVER_HOSTNAME=${8:-"localhost"}
 
 if [[ $SERVER_HOSTNAME == *":"* ]]; then
-  echo "ERROR! Do not include the port when passing the Server Hostname. These scripts require that the TRTIS HTTP endpoint is on Port 8000 and the gRPC endpoint is on Port 8001. Exiting..."
+  echo "ERROR! Do not include the port when passing the Server Hostname. These scripts require that the Triton HTTP endpoint is on Port 8000 and the gRPC endpoint is on Port 8001. Exiting..."
   exit 1
 fi
 
 if [ "$SERVER_HOSTNAME" = "localhost" ]
 then
-    if [ ! "$(docker inspect -f "{{.State.Running}}" trt_server_cont)" = "true" ] ; then
+    if [ ! "$(docker inspect -f "{{.State.Running}}" triton_server_cont)" = "true" ] ; then
 
-        echo "Launching TRTIS server"
-        bash trtis/scripts/launch_server.sh $precision
+        echo "Launching Triton server"
+        bash triton/scripts/launch_server.sh $precision
         SERVER_LAUNCHED=true
 
         function cleanup_server {
-            echo "Killing TRTIS server"
-            docker kill trt_server_cont
+            echo "Killing Triton server"
+            docker kill triton_server_cont
         }
 
         # Ensure we cleanup the server on exit
@@ -47,7 +47,7 @@ then
 fi
 
 # Wait until server is up. curl on the health of the server and sleep until its ready
-bash trtis/scripts/wait_for_trtis_server.sh $SERVER_HOSTNAME
+bash triton/scripts/wait_for_triton_server.sh $SERVER_HOSTNAME
 
 TIMESTAMP=$(date "+%y%m%d_%H%M")
 

+ 38 - 21
TensorFlow/LanguageModeling/BERT/trtis/scripts/run_trtis.sh → TensorFlow/LanguageModeling/BERT/triton/scripts/run_triton.sh

@@ -21,12 +21,13 @@ seq_length=${5:-"384"}
 doc_stride=${6:-"128"}
 bert_model=${7:-"large"}
 squad_version=${8:-"1.1"}
-trtis_version_name=${9:-1}
-trtis_model_name=${10:-"bert"}
-trtis_export_model=${11:-"false"}
-trtis_dyn_batching_delay=${12:-0}
-trtis_engine_count=${13:-1}
-trtis_model_overwrite=${14:-"False"}
+triton_version_name=${9:-1}
+triton_model_name=${10:-"bert"}
+triton_export_model=${11:-"true"}
+triton_dyn_batching_delay=${12:-0}
+triton_engine_count=${13:-1}
+triton_model_overwrite=${14:-"False"}
 
 if [ "$bert_model" = "large" ] ; then
     export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
@@ -39,8 +40,21 @@ if [ ! -d "$BERT_DIR" ] ; then
    exit -1
 fi
 
+export SQUAD_DIR=data/download/squad/v${squad_version}
+if [ "$squad_version" = "1.1" ] ; then
+    version_2_with_negative="False"
+else
+    version_2_with_negative="True"
+fi
+
+echo "Squad directory set as " $SQUAD_DIR
+if [ ! -d "$SQUAD_DIR" ] ; then
+   echo "Error! $SQUAD_DIR directory missing. Please mount SQuAD dataset."
+   exit -1
+fi
+
 # Need to ignore case on some variables
-trtis_export_model=$(echo "$trtis_export_model" | tr '[:upper:]' '[:lower:]')
+triton_export_model=$(echo "$triton_export_model" | tr '[:upper:]' '[:lower:]')
 
 # Explicitly save this variable to pass down to new containers
 NV_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:-"all"}
@@ -56,33 +70,36 @@ echo "   seq_length      = $seq_length"
 echo "   doc_stride      = $doc_stride"
 echo "   bert_model      = $bert_model"
 echo "   squad_version   = $squad_version"
-echo "   version_name    = $trtis_version_name"
-echo "   model_name      = $trtis_model_name"
-echo "   export_model    = $trtis_export_model"
+echo "   version_name    = $triton_version_name"
+echo "   model_name      = $triton_model_name"
+echo "   export_model    = $triton_export_model"
 echo
 echo "Env: "
 echo "   NVIDIA_VISIBLE_DEVICES = $NV_VISIBLE_DEVICES"
 echo
 
 # Export Model in SavedModel format if enabled
-if [ "$trtis_export_model" = "true" ] ; then
-   echo "Exporting model as: Name - $trtis_model_name Version - $trtis_version_name"
+if [ "$triton_export_model" = "true" ] ; then
+   echo "Exporting model as: Name - $triton_model_name Version - $triton_version_name"
 
-      bash trtis/scripts/export_model.sh $init_checkpoint $batch_size $precision $use_xla $seq_length \
-         $doc_stride $BERT_DIR $RESULTS_DIR $trtis_version_name $trtis_model_name \
-         $trtis_dyn_batching_delay $trtis_engine_count $trtis_model_overwrite
+      bash triton/scripts/export_model.sh $init_checkpoint $batch_size $precision $use_xla $seq_length \
+         $doc_stride $BERT_DIR $triton_version_name $triton_model_name \
+         $triton_dyn_batching_delay $triton_engine_count $triton_model_overwrite
 fi
 
 # Start TRTIS server in detached state
-bash trtis/scripts/launch_server.sh $precision
+bash triton/scripts/launch_server.sh $precision
 
 # Wait until server is up. curl on the health of the server and sleep until its ready
-bash trtis/scripts/wait_for_trtis_server.sh localhost
+bash triton/scripts/wait_for_triton_server.sh localhost
 
-# Start TRTIS client for inference and evaluate results
-bash trtis/scripts/run_client.sh $batch_size $seq_length $doc_stride $trtis_version_name $trtis_model_name \
-    $BERT_DIR $squad_version
+# Start Triton client for inference on the SQuAD dataset
+bash triton/scripts/run_client.sh $batch_size $seq_length $doc_stride $triton_version_name $triton_model_name \
+    $BERT_DIR --predict_file=$SQUAD_DIR/dev-v${squad_version}.json --version_2_with_negative=${version_2_with_negative}
 
+# Evaluate SQuAD results
+bash scripts/docker/launch.sh "python $SQUAD_DIR/evaluate-v${squad_version}.py \
+    $SQUAD_DIR/dev-v${squad_version}.json /results/predictions.json"
 
 #Kill the TRTIS Server
-docker kill trt_server_cont
+docker kill triton_server_cont
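The SQuAD handling moved into `run_triton.sh` reduces to a small mapping: the dataset directory follows the version string, and `version_2_with_negative` is true only for v2.0 (which contains unanswerable questions). A sketch of the shell logic above:

```python
def squad_settings(squad_version="1.1"):
    # Mirrors run_triton.sh: v1.1 has no unanswerable questions, v2.0 does.
    return {
        "squad_dir": "data/download/squad/v%s" % squad_version,
        "version_2_with_negative": squad_version != "1.1",
    }
```

The resulting flags are exactly what `run_client.sh` receives via its pass-through arguments.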

+ 2 - 2
TensorFlow/LanguageModeling/BERT/trtis/scripts/wait_for_trtis_server.sh → TensorFlow/LanguageModeling/BERT/triton/scripts/wait_for_triton_server.sh

@@ -15,7 +15,7 @@
 
 SERVER_URI=${1:-"localhost"}
 
-echo "Waiting for TRTIS Server to be ready at http://$SERVER_URI:8000..."
+echo "Waiting for Triton server to be ready at http://$SERVER_URI:8000..."
 
 live_command="curl -m 1 -L -s -o /dev/null -w %{http_code} http://$SERVER_URI:8000/api/health/live"
 ready_command="curl -m 1 -L -s -o /dev/null -w %{http_code} http://$SERVER_URI:8000/api/health/ready"
@@ -30,4 +30,4 @@ while [[ ${current_status} != "200" ]] || [[ $($ready_command) != "200" ]]; do
    current_status=$($live_command)
 done
 
-echo "TRTIS Server is ready!"
+echo "Triton server is ready!"
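`wait_for_triton_server.sh` polls two endpoints of the v1 HTTP API (`/api/health/live` and `/api/health/ready`) until both return 200. The same readiness predicate, factored over an injected status-fetching callable so it can be exercised without a running server:

```python
HEALTH_PATHS = ("/api/health/live", "/api/health/ready")

def server_ready(fetch_status):
    """fetch_status maps a URL path to an HTTP status code (int)."""
    return all(fetch_status(path) == 200 for path in HEALTH_PATHS)

# In practice fetch_status could wrap urllib.request.urlopen against
# http://<host>:8000, matching the curl commands in the script above.
```

Requiring both liveness and readiness avoids sending requests while the server is up but still loading models.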

+ 0 - 222
TensorFlow/LanguageModeling/BERT/trtis/run_squad_trtis_client.py

@@ -1,222 +0,0 @@
-# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import modeling
-import tokenization
-from tensorrtserver.api import ProtocolType, InferContext, ServerStatusContext, grpc_service_pb2_grpc, grpc_service_pb2, model_config_pb2
-from utils.create_squad_data import *
-import grpc
-from run_squad import *
-import numpy as np
-import tqdm
-
-# Set this to either 'label_ids' for Google bert or 'unique_ids' for JoC
-label_id_key = "unique_ids"
-
-PendingResult = collections.namedtuple("PendingResult",
-                                   ["async_id", "start_time", "inputs"])
-
-def batch(iterable, n=1):
-    l = len(iterable)
-    for ndx in range(0, l, n):
-        label_ids_data = ()
-        input_ids_data = ()
-        input_mask_data = ()
-        segment_ids_data = ()
-        for i in range(0, min(n, l-ndx)):
-            label_ids_data = label_ids_data + (np.array([iterable[ndx + i].unique_id], dtype=np.int32),)
-            input_ids_data = input_ids_data+ (np.array(iterable[ndx + i].input_ids, dtype=np.int32),)
-            input_mask_data = input_mask_data+ (np.array(iterable[ndx + i].input_mask, dtype=np.int32),)
-            segment_ids_data = segment_ids_data+ (np.array(iterable[ndx + i].segment_ids, dtype=np.int32),)
-
-        inputs_dict = {label_id_key: label_ids_data,
-                       'input_ids': input_ids_data,
-                       'input_mask': input_mask_data,
-                       'segment_ids': segment_ids_data}
-        yield inputs_dict
-
-def run_client():
-    """
-    Ask a question of context on TRTIS.
-    :param context: str
-    :param question: str
-    :param question_id: int
-    :return:
-    """
-
-    tokenizer = tokenization.FullTokenizer(vocab_file=FLAGS.vocab_file, do_lower_case=FLAGS.do_lower_case)
-
-
-    eval_examples = read_squad_examples(
-        input_file=FLAGS.predict_file, is_training=False,
-        version_2_with_negative=FLAGS.version_2_with_negative)
-
-    eval_features = []
-
-    def append_feature(feature):
-        eval_features.append(feature)
-
-    convert_examples_to_features(
-        examples=eval_examples[0:],
-        tokenizer=tokenizer,
-        max_seq_length=FLAGS.max_seq_length,
-        doc_stride=FLAGS.doc_stride,
-        max_query_length=FLAGS.max_query_length,
-        is_training=False,
-        output_fn=append_feature)
-
-    protocol_str = 'grpc' # http or grpc
-    url = FLAGS.trtis_server_url
-    verbose = True
-    model_name = FLAGS.trtis_model_name
-    model_version = FLAGS.trtis_model_version
-    batch_size = FLAGS.predict_batch_size
-
-    protocol = ProtocolType.from_str(protocol_str) # or 'grpc'
-
-    ctx = InferContext(url, protocol, model_name, model_version, verbose)
-
-    channel = grpc.insecure_channel(url)
-
-    stub = grpc_service_pb2_grpc.GRPCServiceStub(channel)
-
-    prof_request = grpc_service_pb2.server__status__pb2.model__config__pb2.ModelConfig()
-
-    prof_response = stub.Profile(prof_request)
-
-    status_ctx = ServerStatusContext(url, protocol, model_name=model_name, verbose=verbose)
-
-    model_config_pb2.ModelConfig()
-
-    status_result = status_ctx.get_server_status()
-
-    outstanding = {}
-    max_outstanding = 20
-
-    sent_prog = tqdm.tqdm(desc="Send Requests", total=len(eval_features))
-    recv_prog = tqdm.tqdm(desc="Recv Requests", total=len(eval_features))
-
-    def process_outstanding(do_wait):
-
-        if (len(outstanding) == 0):
-            return
-        
-        ready_id = ctx.get_ready_async_request(do_wait)
-
-        if (ready_id is None):
-            return
-
-        # If we are here, we got an id
-        result = ctx.get_async_run_results(ready_id, False)
-        stop = time.time()
-
-        if (result is None):
-            raise ValueError("Context returned null for async id marked as done")
-
-        outResult = outstanding.pop(ready_id)
-
-        time_list.append(stop - outResult.start_time)
-
-        batch_count = len(outResult.inputs[label_id_key])
-
-        for i in range(batch_count):
-            unique_id = int(outResult.inputs[label_id_key][i][0])
-            start_logits = [float(x) for x in result["start_logits"][i].flat]
-            end_logits = [float(x) for x in result["end_logits"][i].flat]
-            all_results.append(
-                RawResult(
-                    unique_id=unique_id,
-                    start_logits=start_logits,
-                    end_logits=end_logits))
-
-        recv_prog.update(n=batch_count)
-
-    all_results = []
-    time_list = []
-
-    print("Starting Sending Requests....\n")
-
-    all_results_start = time.time()
-
-    for inputs_dict in batch(eval_features, batch_size):
-
-        present_batch_size = len(inputs_dict[label_id_key])
-
-        outputs_dict = {'start_logits': InferContext.ResultFormat.RAW,
-                        'end_logits': InferContext.ResultFormat.RAW}
-
-        start = time.time()
-        async_id = ctx.async_run(inputs_dict, outputs_dict, batch_size=present_batch_size)
-
-        outstanding[async_id] = PendingResult(async_id=async_id, start_time=start, inputs=inputs_dict)
-
-        sent_prog.update(n=present_batch_size)
-
-        # Try to process at least one response per request
-        process_outstanding(len(outstanding) >= max_outstanding)
-
-    tqdm.tqdm.write("All Requests Sent! Waiting for responses. Outstanding: {}.\n".format(len(outstanding)))
-
-    # Now process all outstanding requests
-    while (len(outstanding) > 0):
-        process_outstanding(True)
-
-    all_results_end = time.time()
-    all_results_total = (all_results_end - all_results_start) * 1000.0
-
-    print("-----------------------------")
-    print("Individual Time Runs - Ignoring first two iterations")
-    print("Total Time: {} ms".format(all_results_total))
-    print("-----------------------------")
-
-    print("-----------------------------")
-    print("Total Inference Time = %0.2f for"
-          "Sentences processed = %d" % (sum(time_list), len(eval_features)))
-    print("Throughput Average (sentences/sec) = %0.2f" % (len(eval_features) / all_results_total * 1000.0))
-    print("-----------------------------")
-
-    time_list.sort()
-
-    avg = np.mean(time_list)
-    cf_95 = max(time_list[:int(len(time_list) * 0.95)])
-    cf_99 = max(time_list[:int(len(time_list) * 0.99)])
-    cf_100 = max(time_list[:int(len(time_list) * 1)])
-    print("-----------------------------")
-    print("Summary Statistics")
-    print("Batch size =", FLAGS.predict_batch_size)
-    print("Sequence Length =", FLAGS.max_seq_length)
-    print("Latency Confidence Level 95 (ms) =", cf_95 * 1000)
-    print("Latency Confidence Level 99 (ms)  =", cf_99 * 1000)
-    print("Latency Confidence Level 100 (ms)  =", cf_100 * 1000)
-    print("Latency Average (ms)  =", avg * 1000)
-    print("-----------------------------")
-
-
-    output_prediction_file = os.path.join(FLAGS.output_dir, "predictions.json")
-    output_nbest_file = os.path.join(FLAGS.output_dir, "nbest_predictions.json")
-    output_null_log_odds_file = os.path.join(FLAGS.output_dir, "null_odds.json")
-
-    write_predictions(eval_examples, eval_features, all_results,
-                      FLAGS.n_best_size, FLAGS.max_answer_length,
-                      FLAGS.do_lower_case, output_prediction_file,
-                      output_nbest_file, output_null_log_odds_file)
-
-
-
-if __name__ == "__main__":
-  flags.mark_flag_as_required("vocab_file")
-  flags.mark_flag_as_required("bert_config_file")
-  flags.mark_flag_as_required("output_dir")
-
-  run_client()
-
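The removed client computed tail latencies by sorting the per-request times and taking the largest value inside the first 95% (or 99%) of samples. The same scheme in isolation, with a guard for tiny sample counts (the original's `max(time_list[:int(len * 0.95)])` raises on very small lists):

```python
def latency_percentile(times, pct):
    # Sort, then take the max of the first `pct` fraction of samples,
    # as in the removed run_squad_trtis_client.py summary statistics.
    s = sorted(times)
    k = max(1, int(len(s) * pct))
    return s[k - 1]
```

For 100 evenly spread samples this returns the 95th-smallest value at `pct=0.95` and the maximum at `pct=1.0`, matching the "Latency Confidence Level" lines printed above.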

+ 0 - 146
TensorFlow/LanguageModeling/BERT/trtis/scripts/generate_figures.sh

@@ -1,146 +0,0 @@
-#!/bin/bash
-
-# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# Set the number of devices to use
-export NVIDIA_VISIBLE_DEVICES=0
-
-# Always need to be overwriting models to keep memory use low
-export TRTIS_MODEL_OVERWRITE=True
-
-bert_model=${1:-small}
-seq_length=${2:-128}
-precision=${3:-fp16}
-init_checkpoint=${4:-"/results/models/bert_tf_${bert_model}_${precision}_${seq_length}_v1/model.ckpt-5474"}
-
-MODEL_NAME="bert_${bert_model}_${seq_length}_${precision}"
-
-if [ "$bert_model" = "large" ] ; then
-    export BERT_DIR=data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16
-else
-    export BERT_DIR=data/download/google_pretrained_weights/uncased_L-12_H-768_A-12
-fi
-
-doc_stride=128
-use_xla=true
-EXPORT_MODEL_ARGS="${precision} ${use_xla} ${seq_length} ${doc_stride} ${BERT_DIR} 1 ${MODEL_NAME}"
-PERF_CLIENT_ARGS="1000 10 20 localhost"
-
-# Start Server
-bash trtis/scripts/launch_server.sh $precision
-
-# Restart Server
-restart_server() {
-docker kill trt_server_cont
-bash trtis/scripts/launch_server.sh $precision
-}
-
-############## Dynamic Batching Comparison ##############
-SERVER_BATCH_SIZE=8
-CLIENT_BATCH_SIZE=1
-TRTIS_ENGINE_COUNT=1
-
-# Dynamic batching 10 ms
-TRTIS_DYN_BATCHING_DELAY=10
-bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
-restart_server
-sleep 15
-bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
-
-# Dynamic batching 5 ms
-TRTIS_DYN_BATCHING_DELAY=5
-bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
-restart_server
-sleep 15
-bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
-
-# Dynamic batching 2 ms
-TRTIS_DYN_BATCHING_DELAY=2
-bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
-restart_server
-sleep 15
-bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
-
-
-# Static Batching (i.e. Dynamic batching 0 ms)
-TRTIS_DYN_BATCHING_DELAY=0
-bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
-restart_server
-sleep 15
-bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
-
-
-# ############## Engine Count Comparison ##############
-SERVER_BATCH_SIZE=1
-CLIENT_BATCH_SIZE=1
-TRTIS_DYN_BATCHING_DELAY=0
-
-# Engine Count = 4
-TRTIS_ENGINE_COUNT=4
-bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
-restart_server
-sleep 15
-bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
-
-# Engine Count = 2
-TRTIS_ENGINE_COUNT=2
-bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
-restart_server
-sleep 15
-bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
-
-# Engine Count = 1
-TRTIS_ENGINE_COUNT=1
-bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
-restart_server
-sleep 15
-bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} ${PERF_CLIENT_ARGS}
-
-
-############## Batch Size Comparison ##############
-# BATCH=1 Generate model and perf
-SERVER_BATCH_SIZE=1
-CLIENT_BATCH_SIZE=1
-TRTIS_ENGINE_COUNT=1 
-TRTIS_DYN_BATCHING_DELAY=0 
-
-bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
-restart_server
-sleep 15
-bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 64 localhost
-
-# BATCH=2 Generate model and perf
-SERVER_BATCH_SIZE=2
-CLIENT_BATCH_SIZE=2
-bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
-restart_server
-sleep 15
-bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 32 localhost
-
-# BATCH=4 Generate model and perf
-SERVER_BATCH_SIZE=4
-CLIENT_BATCH_SIZE=4
-bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
-restart_server
-sleep 15
-bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 16 localhost
-
-# BATCH=8 Generate model and perf
-SERVER_BATCH_SIZE=8
-CLIENT_BATCH_SIZE=8
-bash trtis/scripts/export_model.sh ${init_checkpoint} ${SERVER_BATCH_SIZE} ${EXPORT_MODEL_ARGS} ${TRTIS_DYN_BATCHING_DELAY} ${TRTIS_ENGINE_COUNT} ${TRTIS_MODEL_OVERWRITE}
-restart_server
-sleep 15
-bash trtis/scripts/run_perf_client.sh ${MODEL_NAME} 1 ${precision} ${CLIENT_BATCH_SIZE} 1000 10 8 localhost
-

+ 5 - 4
TensorFlow/LanguageModeling/BERT/utils/create_squad_data.py

@@ -149,10 +149,11 @@ class InputFeatures(object):
     self.end_position = end_position
     self.is_impossible = is_impossible
 
-def read_squad_examples(input_file, is_training, version_2_with_negative=False):
-  """Read a SQuAD json file into a list of SquadExample."""
-  with tf.gfile.Open(input_file, "r") as reader:
-    input_data = json.load(reader)["data"]
+def read_squad_examples(input_file, is_training, version_2_with_negative=False, input_data=None):
+  """Return list of SquadExample from input_data or input_file (SQuAD json file)"""
+  if input_data is None:
+    with tf.gfile.Open(input_file, "r") as reader:
+      input_data = json.load(reader)["data"]
 
   def is_whitespace(c):
     if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:

+ 55 - 0
TensorFlow/LanguageModeling/BERT/utils/dllogger_class.py

@@ -0,0 +1,55 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from dllogger import Logger, StdOutBackend, JSONStreamBackend, Verbosity
+import numpy
+
+class dllogger_class():
+
+    def format_step(self, step):
+        if isinstance(step, str):
+            return step
+        elif isinstance(step, int):
+            return "Iteration: {} ".format(step)
+        elif len(step) > 0:
+            return "Iteration: {} ".format(step[0])
+        else:
+            return ""
+
+    def __init__(self, log_path="bert_dllog.json"):
+        self.logger = Logger([
+            StdOutBackend(Verbosity.DEFAULT, step_format=self.format_step),
+            JSONStreamBackend(Verbosity.VERBOSE, log_path),
+            ])
+        self.logger.metadata("mlm_loss", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
+        self.logger.metadata("nsp_loss", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
+        self.logger.metadata("avg_loss_step", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
+        self.logger.metadata("total_loss", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
+        self.logger.metadata("loss", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "TRAIN"})
+        self.logger.metadata("f1", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
+        self.logger.metadata("precision", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
+        self.logger.metadata("recall", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
+        self.logger.metadata("mcc", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
+        self.logger.metadata("exact_match", {"format": ":.4f", "GOAL": "MINIMIZE", "STAGE": "VAL"})
+        self.logger.metadata(
+            "throughput_train",
+            {"unit": "seq/s", "format": ":.3f", "GOAL": "MAXIMIZE", "STAGE": "TRAIN"},
+        )
+        self.logger.metadata(
+            "throughput_inf",
+            {"unit": "seq/s", "format": ":.3f", "GOAL": "MAXIMIZE", "STAGE": "VAL"},
+        )