|
|
@@ -0,0 +1,726 @@
|
|
|
+{
|
|
|
+ "cells": [
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 1,
|
|
|
+ "metadata": {
|
|
|
+ "colab": {},
|
|
|
+ "colab_type": "code",
|
|
|
+ "id": "Gwt7z7qdmTbW"
|
|
|
+ },
|
|
|
+ "outputs": [],
|
|
|
+ "source": [
|
|
|
+ "# Copyright 2019 NVIDIA Corporation. All Rights Reserved.\n",
|
|
|
+ "#\n",
|
|
|
+ "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
|
|
|
+ "# you may not use this file except in compliance with the License.\n",
|
|
|
+ "# You may obtain a copy of the License at\n",
|
|
|
+ "#\n",
|
|
|
+ "# http://www.apache.org/licenses/LICENSE-2.0\n",
|
|
|
+ "#\n",
|
|
|
+ "# Unless required by applicable law or agreed to in writing, software\n",
|
|
|
+ "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
|
|
|
+ "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
|
|
|
+ "# See the License for the specific language governing permissions and\n",
|
|
|
+ "# limitations under the License.\n",
|
|
|
+ "# =============================================================================="
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {
|
|
|
+ "colab_type": "text",
|
|
|
+ "id": "i4NKCp2VmTbn"
|
|
|
+ },
|
|
|
+ "source": [
|
|
|
+ "<img src=\"http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
|
|
|
+ "\n",
|
|
|
+ "# DLRM Triton Inference Demo"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {
|
|
|
+ "colab_type": "text",
|
|
|
+ "id": "fW0OKDzvmTbt"
|
|
|
+ },
|
|
|
+ "source": [
|
|
|
+ "## Overview\n",
|
|
|
+ "\n",
|
|
|
+ "Recomendation system (RecSys) inference involves determining an ordered list of items with which the query user will most likely interact with. For very large commercial databases with millions to hundreds of millions of items to choose from (like advertisements, apps), usually an item retrieval procedure is carried out to reduce the number of items to a more manageable quantity, e.g. a few hundreds to a few thousands. The methods include computationally-light algorithms such as approximate neighborhood search, random forest and filtering based on user preferences. From thereon, a deep learning based RecSys is invoked to re-rank the items and those with the highest scores are presented to the users. This process is well demonstrated in the Google AppStore recommendation system in Figure 1. \n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "Figure 1: Google’s app recommendation process. [Source](https://arxiv.org/pdf/1606.07792.pdf).\n",
|
|
|
+ "\n",
|
|
|
+ "As we can see, for each query user, the number of user-item pairs to score can be as large as a few thousands. This places an extremely heavy duty on RecSys inference server, which must handle high throughput to serve many users concurrently yet at low latency to satisfy stringent latency thresholds of online commerce engines.\n",
|
|
|
+ "\n",
|
|
|
+ "The NVIDIA Triton Inference Server [9] provides a cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or GRPC endpoint, allowing remote clients to request inferencing for any model being managed by the server. Triton automatically manages and makes use of all the available GPUs.\n",
|
|
|
+ "\n",
|
|
|
+ "We will next see how to prepare the DLRM model for inference with the Triton inference server and see how Triton is up to the task. \n",
|
|
|
+ "\n",
|
|
|
+ "### Learning objectives\n",
|
|
|
+ "\n",
|
|
|
+ "This notebook demonstrates the steps for preparing a pre-trained DLRM model for deployment and inference with the NVIDIA [Triton inference server](https://github.com/NVIDIA/triton-inference-server). \n",
|
|
|
+ "\n",
|
|
|
+ "## Content\n",
|
|
|
+ "1. [Requirements](#1)\n",
|
|
|
+ "1. [Prepare model for inference](#2)\n",
|
|
|
+ "1. [Start the Triton inference server](#3)\n",
|
|
|
+ "1. [Testing server with the performance client](#4)\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {
|
|
|
+ "colab_type": "text",
|
|
|
+ "id": "aDFrE4eqmTbv"
|
|
|
+ },
|
|
|
+ "source": [
|
|
|
+ "<a id=\"1\"></a>\n",
|
|
|
+ "## 1. Requirements\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "### 1.1 Docker container\n",
|
|
|
+ "The most convenient way to make use of the NVIDIA DLRM model is via a docker container, which provides a self-contained, isolated and re-producible environment for all experiments.\n",
|
|
|
+ "\n",
|
|
|
+ "First, clone the repository:\n",
|
|
|
+ "\n",
|
|
|
+ "```\n",
|
|
|
+ "git clone https://github.com/NVIDIA/DeepLearningExamples\n",
|
|
|
+ "cd DeepLearningExamples/PyTorch/Recommendation/DLRM\n",
|
|
|
+ "```\n",
|
|
|
+ "\n",
|
|
|
+ "To execute this notebook, first build the following inference container:\n",
|
|
|
+ "\n",
|
|
|
+ "```\n",
|
|
|
+ "docker build -t dlrm-inference . -f triton/Dockerfile\n",
|
|
|
+ "```\n",
|
|
|
+ "\n",
|
|
|
+ "Start in interactive docker session with:\n",
|
|
|
+ "\n",
|
|
|
+ "```\n",
|
|
|
+ "docker run -it --rm --gpus device=0 --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --net=host -v <PATH_TO_SAVED_MODEL>:/models -v <PATH_TO_EXPORT_MODEL>:/repository <PATH_TO_PREPROCESSED_DATA>:/data dlrm-inference bash\n",
|
|
|
+ "```\n",
|
|
|
+ "where:\n",
|
|
|
+ "\n",
|
|
|
+ "- PATH_TO_SAVED_MODEL: directory containing the trained DLRM models with `.pt` extension.\n",
|
|
|
+ " \n",
|
|
|
+ "- PATH_TO_EXPORT_MODEL: directory which will contain the converted model to be used with the NVIDIA Triton inference server.\n",
|
|
|
+ "\n",
|
|
|
+ "- PATH_TO_PREPROCESSED_DATA: path to the preprocessed Criteo Terabyte dataset containing 3 binary data files: `test_data.bin`, `train_data.bin` and `val_data.bin` and a JSON `file model_size.json` totalling ~650GB.\n",
|
|
|
+ "\n",
|
|
|
+ "Within the docker interactive bash session, start Jupyter with\n",
|
|
|
+ "\n",
|
|
|
+ "```\n",
|
|
|
+ "export PYTHONPATH=/workspace/dlrm\n",
|
|
|
+ "jupyter notebook --ip 0.0.0.0 --port 8888\n",
|
|
|
+ "```\n",
|
|
|
+ "\n",
|
|
|
+ "Then open the Jupyter GUI interface on your host machine at http://localhost:8888. Within the container, this demo notebook is located at `/workspace/dlrm/notebooks`.\n",
|
|
|
+ "\n",
|
|
|
+ "### 1.2 Hardware\n",
|
|
|
+ "This notebook can be executed on any CUDA-enabled NVIDIA GPU with at least 24GB of GPU memory, although for efficient mixed precision inference, a [Tensor Core NVIDIA GPU](https://www.nvidia.com/en-us/data-center/tensorcore/) is desired (Volta, Turing or newer architectures). "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 3,
|
|
|
+ "metadata": {
|
|
|
+ "colab": {},
|
|
|
+ "colab_type": "code",
|
|
|
+ "id": "k7RLEcKhmTb0"
|
|
|
+ },
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "Sat Apr 4 00:55:05 2020 \r\n",
|
|
|
+ "+-----------------------------------------------------------------------------+\r\n",
|
|
|
+ "| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |\r\n",
|
|
|
+ "|-------------------------------+----------------------+----------------------+\r\n",
|
|
|
+ "| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |\r\n",
|
|
|
+ "| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |\r\n",
|
|
|
+ "|===============================+======================+======================|\r\n",
|
|
|
+ "| 0 Tesla V100-PCIE... On | 00000000:1A:00.0 Off | 0 |\r\n",
|
|
|
+ "| N/A 30C P0 37W / 250W | 19757MiB / 32510MiB | 0% Default |\r\n",
|
|
|
+ "+-------------------------------+----------------------+----------------------+\r\n",
|
|
|
+ " \r\n",
|
|
|
+ "+-----------------------------------------------------------------------------+\r\n",
|
|
|
+ "| Processes: GPU Memory |\r\n",
|
|
|
+ "| GPU PID Type Process name Usage |\r\n",
|
|
|
+ "|=============================================================================|\r\n",
|
|
|
+ "+-----------------------------------------------------------------------------+\r\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "!nvidia-smi"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {
|
|
|
+ "colab_type": "text",
|
|
|
+ "id": "HqSUGePjmTb9"
|
|
|
+ },
|
|
|
+ "source": [
|
|
|
+ "<a id=\"2\"></a>\n",
|
|
|
+ "## 2. Prepare model for inference\n",
|
|
|
+ "\n",
|
|
|
+ "We first convert model to a format accepted by the NVIDIA Triton inference server. Triton can accept TorchScript, ONNX amongst other formats. \n",
|
|
|
+ "\n",
|
|
|
+ "To deploy model into Triton compatible format, we provide the deployer.py [script](../triton/deployer.py).\n",
|
|
|
+ "\n",
|
|
|
+ "### TorchScript\n",
|
|
|
+ "TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency.\n",
|
|
|
+ "\n",
|
|
|
+ "We provide two options to convert models to TorchScript:\n",
|
|
|
+ "- --ts-script convert to torchscript using torch.jit.script\n",
|
|
|
+ "- --ts-trace convert to torchscript using torch.jit.trace\n",
|
|
|
+ "\n",
|
|
|
+ "\n",
|
|
|
+ "In the conversion below, we assume:\n",
|
|
|
+ "\n",
|
|
|
+ "- The trained model is stored at /models/dlrm_model_fp16.pt\n",
|
|
|
+ "\n",
|
|
|
+ "- The maximum batchsize that Triton will handle is 65536.\n",
|
|
|
+ "\n",
|
|
|
+ "- The processed dataset directory is /data which contain a `model_size.json` file."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 12,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "deploying model dlrm-ts-script-16 in format pytorch_libtorch\n",
|
|
|
+ "done\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "%%bash\n",
|
|
|
+ "python ../triton/deployer.py \\\n",
|
|
|
+ "--ts-script \\\n",
|
|
|
+ "--triton-model-name dlrm-ts-script-16 \\\n",
|
|
|
+ "--triton-max-batch-size 65536 \\\n",
|
|
|
+ "--save-dir /repository \\\n",
|
|
|
+ "-- --model_checkpoint /models/dlrm_model_fp16.pt \\\n",
|
|
|
+ "--fp16 \\\n",
|
|
|
+ "--batch_size 4096 \\\n",
|
|
|
+ "--num_numerical_features 13 \\\n",
|
|
|
+ "--embedding_dim 128 \\\n",
|
|
|
+ "--top_mlp_sizes 1024 1024 512 256 1 \\\n",
|
|
|
+ "--bottom_mlp_sizes 512 256 128 \\\n",
|
|
|
+ "--interaction_op dot \\\n",
|
|
|
+ "--hash_indices \\\n",
|
|
|
+ "--dataset /data \\\n",
|
|
|
+ "--dump_perf_data ./perfdata"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {
|
|
|
+ "colab_type": "text",
|
|
|
+ "id": "EQAIszkxmTcT"
|
|
|
+ },
|
|
|
+ "source": [
|
|
|
+ "### ONNX\n",
|
|
|
+ "\n",
|
|
|
+ "[ONNX](https://onnx.ai/) is an open format built to represent machine learning models. ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.\n",
|
|
|
+ "\n",
|
|
|
+ "Conversion of DLRM pre-trained PyTorch model to ONNX model can be done with:"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 6,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "deploying model dlrm-onnx-16 in format onnxruntime_onnx\n",
|
|
|
+ "done\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "name": "stderr",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "/opt/conda/lib/python3.6/site-packages/torch/onnx/symbolic_opset9.py:2044: UserWarning: Exporting aten::index operator of advanced indexing in opset 11 is achieved by combination of multiple ONNX operators, including Reshape, Transpose, Concat, and Gather. If indices include negative values, the exported graph will produce incorrect results.\n",
|
|
|
+ " \"If indices include negative values, the exported graph will produce incorrect results.\")\n",
|
|
|
+ "/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:915: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input input__0\n",
|
|
|
+ " 'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))\n",
|
|
|
+ "/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:915: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input input__1\n",
|
|
|
+ " 'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))\n",
|
|
|
+ "/opt/conda/lib/python3.6/site-packages/torch/onnx/utils.py:915: UserWarning: No names were found for specified dynamic axes of provided input.Automatically generated names will be applied to each dynamic axes of input output__0\n",
|
|
|
+ " 'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "%%bash\n",
|
|
|
+ "python ../triton/deployer.py \\\n",
|
|
|
+ "--onnx \\\n",
|
|
|
+ "--triton-model-name dlrm-onnx-16 \\\n",
|
|
|
+ "--triton-max-batch-size 4096 \\\n",
|
|
|
+ "--save-dir /repository \\\n",
|
|
|
+ "-- --model_checkpoint /models/dlrm_model_fp16.pt \\\n",
|
|
|
+ "--fp16 \\\n",
|
|
|
+ "--batch_size 4096 \\\n",
|
|
|
+ "--num_numerical_features 13 \\\n",
|
|
|
+ "--embedding_dim 128 \\\n",
|
|
|
+ "--top_mlp_sizes 1024 1024 512 256 1 \\\n",
|
|
|
+ "--bottom_mlp_sizes 512 256 128 \\\n",
|
|
|
+ "--interaction_op dot \\\n",
|
|
|
+ "--hash_indices \\\n",
|
|
|
+ "--dataset /data \\\n",
|
|
|
+ "--dump_perf_data ./perfdata"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {
|
|
|
+ "colab_type": "text",
|
|
|
+ "id": "RL8d9IwzmTcV"
|
|
|
+ },
|
|
|
+ "source": [
|
|
|
+ "<a id=\"3\"></a>\n",
|
|
|
+ "## 3. Start the Triton inference server"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {
|
|
|
+ "colab_type": "text",
|
|
|
+ "id": "o6wayGf1mTcX"
|
|
|
+ },
|
|
|
+ "source": [
|
|
|
+ "*Note: this step must be done outside the of the current docker container.*\n",
|
|
|
+ "\n",
|
|
|
+ "Open a bash window on the **host machine** and execute the following commands:\n",
|
|
|
+ "\n",
|
|
|
+ "```\n",
|
|
|
+ "docker pull nvcr.io/nvidia/tensorrtserver:20.03-py3\n",
|
|
|
+ "docker run -d --rm --gpus device=0 --ipc=host --network=host -p 8000:8000 -p 8001:8001 -p 8002:8002 -v <PATH_TO_MODEL_REPOSITORY>:/repository nvcr.io/nvidia/tensorrtserver:20.03-py3 trtserver --model-store=/repository --log-verbose=1 --model-control-mode=explicit\n",
|
|
|
+ "```\n",
|
|
|
+ "\n",
|
|
|
+ "where:\n",
|
|
|
+ "\n",
|
|
|
+ "- PATH_TO_MODEL_REPOSITORY: directory on the host machine containing the converted models in section 2 above. \n",
|
|
|
+ "\n",
|
|
|
+ "Note that each DLRM model will require ~19GB of GPU memory.\n",
|
|
|
+ "\n",
|
|
|
+ "Within the `/models` directory on the inference server, the structure should look similar to the below:\n",
|
|
|
+ "\n",
|
|
|
+ "```\n",
|
|
|
+ "/models\n",
|
|
|
+ "`-- dlrm-onnx-16\n",
|
|
|
+ " |-- 1\n",
|
|
|
+ " | `-- model.onnx\n",
|
|
|
+ " | |-- bottom_mlp.0.weight\n",
|
|
|
+ " | |-- bottom_mlp.2.weight\n",
|
|
|
+ " | |-- bottom_mlp.4.weight\n",
|
|
|
+ " | |-- embeddings.0.weight\n",
|
|
|
+ " | |-- embeddings.1.weight\n",
|
|
|
+ " | |-- embeddings.10.weight\n",
|
|
|
+ " | |-- embeddings.11.weight\n",
|
|
|
+ " | |-- embeddings.12.weight\n",
|
|
|
+ " | |-- embeddings.13.weight\n",
|
|
|
+ " | |-- embeddings.14.weight\n",
|
|
|
+ " | |-- embeddings.15.weight\n",
|
|
|
+ " | |-- embeddings.17.weight\n",
|
|
|
+ " | |-- embeddings.18.weight\n",
|
|
|
+ " | |-- embeddings.19.weight\n",
|
|
|
+ " | |-- embeddings.2.weight\n",
|
|
|
+ " | |-- embeddings.20.weight\n",
|
|
|
+ " | |-- embeddings.21.weight\n",
|
|
|
+ " | |-- embeddings.22.weight\n",
|
|
|
+ " | |-- embeddings.23.weight\n",
|
|
|
+ " | |-- embeddings.24.weight\n",
|
|
|
+ " | |-- embeddings.25.weight\n",
|
|
|
+ " | |-- embeddings.3.weight\n",
|
|
|
+ " | |-- embeddings.4.weight\n",
|
|
|
+ " | |-- embeddings.6.weight\n",
|
|
|
+ " | |-- embeddings.7.weight\n",
|
|
|
+ " | |-- embeddings.8.weight\n",
|
|
|
+ " | |-- embeddings.9.weight\n",
|
|
|
+ " | |-- model.onnx\n",
|
|
|
+ " | |-- top_mlp.0.weight\n",
|
|
|
+ " | |-- top_mlp.2.weight\n",
|
|
|
+ " | |-- top_mlp.4.weight\n",
|
|
|
+ " | `-- top_mlp.6.weight\n",
|
|
|
+ " `-- config.pbtxt\n",
|
|
|
+ "```"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {
|
|
|
+ "colab_type": "text",
|
|
|
+ "id": "X959LYwjmTcw"
|
|
|
+ },
|
|
|
+ "source": [
|
|
|
+ "<a id=\"4\"></a>\n",
|
|
|
+ "## 4. Testing server with the performance client\n",
|
|
|
+ "\n",
|
|
|
+ "After model deployment has completed, we can test the deployed model against the Criteo test dataset. \n",
|
|
|
+ "\n",
|
|
|
+ "Note: This requires mounting the Criteo test data to, e.g. `/data/test_data.bin`. Within the dataset directory, there must also be a `model_size.json` file."
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 9,
|
|
|
+ "metadata": {},
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "Process is terminated.\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "%%bash\n",
|
|
|
+ "python ../triton/client.py \\\n",
|
|
|
+ "--triton-server-url localhost:8000 \\\n",
|
|
|
+ "--protocol HTTP \\\n",
|
|
|
+ "--triton-model-name dlrm-onnx-16 \\\n",
|
|
|
+ "--num_numerical_features 13 \\\n",
|
|
|
+ "--dataset_config /data/model_size.json \\\n",
|
|
|
+ "--inference_data /data/test_data.bin \\\n",
|
|
|
+ "--batch_size 4096 \\\n",
|
|
|
+ "--fp16"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "The Triton inference server comes with a [performance client](https://docs.nvidia.com/deeplearning/sdk/triton-inference-server-master-branch-guide/docs/optimization.html#perf-client) which is designed to stress test the server using multiple client threads.\n",
|
|
|
+ "\n",
|
|
|
+ "The perf_client generates inference requests to your model and measures the throughput and latency of those requests. To get representative results, the perf_client measures the throughput and latency over a time window, and then repeats the measurements until it gets stable values. By default the perf_client uses average latency to determine stability but you can use the --percentile flag to stabilize results based on that confidence level. For example, if --percentile=95 is used the results will be stabilized using the 95-th percentile request latency. \n",
|
|
|
+ "\n",
|
|
|
+ "### Request Concurrency\n",
|
|
|
+ "\n",
|
|
|
+ "By default perf_client measures your model’s latency and throughput using the lowest possible load on the model. To do this perf_client sends one inference request to the server and waits for the response. When that response is received, the perf_client immediately sends another request, and then repeats this process during the measurement windows. The number of outstanding inference requests is referred to as the request concurrency, and so by default perf_client uses a request concurrency of 1.\n",
|
|
|
+ "\n",
|
|
|
+ "Using the --concurrency-range <start>:<end>:<step> option you can have perf_client collect data for a range of request concurrency levels. Use the --help option to see complete documentation for this and other options.\n",
|
|
|
+ " \n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 13,
|
|
|
+ "metadata": {
|
|
|
+ "scrolled": false
|
|
|
+ },
|
|
|
+ "outputs": [
|
|
|
+ {
|
|
|
+ "name": "stdout",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "*** Measurement Settings ***\n",
|
|
|
+ " Batch size: 4096\n",
|
|
|
+ " Measurement window: 5000 msec\n",
|
|
|
+ " Latency limit: 5000 msec\n",
|
|
|
+ " Concurrency limit: 10 concurrent requests\n",
|
|
|
+ " Using synchronous calls for inference\n",
|
|
|
+ " Stabilizing using average latency\n",
|
|
|
+ "\n",
|
|
|
+ "Request concurrency: 1\n",
|
|
|
+ " Pass [1] throughput: 67993.6 infer/sec. Avg latency: 60428 usec (std 22260 usec)\n",
|
|
|
+ " Pass [2] throughput: 61440 infer/sec. Avg latency: 66310 usec (std 21723 usec)\n",
|
|
|
+ " Pass [3] throughput: 68812.8 infer/sec. Avg latency: 59617 usec (std 22128 usec)\n",
|
|
|
+ " Client: \n",
|
|
|
+ " Request count: 84\n",
|
|
|
+ " Throughput: 68812.8 infer/sec\n",
|
|
|
+ " Avg latency: 59617 usec (standard deviation 22128 usec)\n",
|
|
|
+ " p50 latency: 71920 usec\n",
|
|
|
+ " p90 latency: 80018 usec\n",
|
|
|
+ " p95 latency: 83899 usec\n",
|
|
|
+ " p99 latency: 88054 usec\n",
|
|
|
+ " Avg gRPC time: 58773 usec (marshal 274 usec + response wait 58458 usec + unmarshal 41 usec)\n",
|
|
|
+ " Server: \n",
|
|
|
+ " Request count: 102\n",
|
|
|
+ " Avg request latency: 57208 usec (overhead 6 usec + queue 20184 usec + compute 37018 usec)\n",
|
|
|
+ "\n",
|
|
|
+ "Request concurrency: 2\n",
|
|
|
+ " Pass [1] throughput: 154010 infer/sec. Avg latency: 53139 usec (std 22418 usec)\n",
|
|
|
+ " Pass [2] throughput: 155648 infer/sec. Avg latency: 52483 usec (std 24768 usec)\n",
|
|
|
+ " Pass [3] throughput: 150733 infer/sec. Avg latency: 54271 usec (std 23803 usec)\n",
|
|
|
+ " Client: \n",
|
|
|
+ " Request count: 184\n",
|
|
|
+ " Throughput: 150733 infer/sec\n",
|
|
|
+ " Avg latency: 54271 usec (standard deviation 23803 usec)\n",
|
|
|
+ " p50 latency: 57022 usec\n",
|
|
|
+ " p90 latency: 83000 usec\n",
|
|
|
+ " p95 latency: 84782 usec\n",
|
|
|
+ " p99 latency: 88989 usec\n",
|
|
|
+ " Avg gRPC time: 55692 usec (marshal 274 usec + response wait 55374 usec + unmarshal 44 usec)\n",
|
|
|
+ " Server: \n",
|
|
|
+ " Request count: 216\n",
|
|
|
+ " Avg request latency: 53506 usec (overhead 244 usec + queue 19818 usec + compute 33444 usec)\n",
|
|
|
+ "\n",
|
|
|
+ "Request concurrency: 3\n",
|
|
|
+ " Pass [1] throughput: 189235 infer/sec. Avg latency: 64917 usec (std 21807 usec)\n",
|
|
|
+ " Pass [2] throughput: 201523 infer/sec. Avg latency: 60425 usec (std 24622 usec)\n",
|
|
|
+ " Pass [3] throughput: 203981 infer/sec. Avg latency: 60661 usec (std 24397 usec)\n",
|
|
|
+ " Client: \n",
|
|
|
+ " Request count: 249\n",
|
|
|
+ " Throughput: 203981 infer/sec\n",
|
|
|
+ " Avg latency: 60661 usec (standard deviation 24397 usec)\n",
|
|
|
+ " p50 latency: 72344 usec\n",
|
|
|
+ " p90 latency: 87765 usec\n",
|
|
|
+ " p95 latency: 91976 usec\n",
|
|
|
+ " p99 latency: 95775 usec\n",
|
|
|
+ " Avg gRPC time: 57213 usec (marshal 291 usec + response wait 56875 usec + unmarshal 47 usec)\n",
|
|
|
+ " Server: \n",
|
|
|
+ " Request count: 315\n",
|
|
|
+ " Avg request latency: 55254 usec (overhead 545 usec + queue 19408 usec + compute 35301 usec)\n",
|
|
|
+ "\n",
|
|
|
+ "Request concurrency: 4\n",
|
|
|
+ " Pass [1] throughput: 273613 infer/sec. Avg latency: 59555 usec (std 22608 usec)\n",
|
|
|
+ " Pass [2] throughput: 288358 infer/sec. Avg latency: 56895 usec (std 21886 usec)\n",
|
|
|
+ " Pass [3] throughput: 285082 infer/sec. Avg latency: 57494 usec (std 21833 usec)\n",
|
|
|
+ " Client: \n",
|
|
|
+ " Request count: 348\n",
|
|
|
+ " Throughput: 285082 infer/sec\n",
|
|
|
+ " Avg latency: 57494 usec (standard deviation 21833 usec)\n",
|
|
|
+ " p50 latency: 62012 usec\n",
|
|
|
+ " p90 latency: 83694 usec\n",
|
|
|
+ " p95 latency: 84966 usec\n",
|
|
|
+ " p99 latency: 93177 usec\n",
|
|
|
+ " Avg gRPC time: 59042 usec (marshal 317 usec + response wait 58669 usec + unmarshal 56 usec)\n",
|
|
|
+ " Server: \n",
|
|
|
+ " Request count: 404\n",
|
|
|
+ " Avg request latency: 56316 usec (overhead 569 usec + queue 19140 usec + compute 36607 usec)\n",
|
|
|
+ "\n",
|
|
|
+ "Request concurrency: 5\n",
|
|
|
+ " Pass [1] throughput: 335872 infer/sec. Avg latency: 60666 usec (std 22599 usec)\n",
|
|
|
+ " Pass [2] throughput: 308838 infer/sec. Avg latency: 65721 usec (std 22284 usec)\n",
|
|
|
+ " Pass [3] throughput: 339968 infer/sec. Avg latency: 59920 usec (std 22992 usec)\n",
|
|
|
+ " Client: \n",
|
|
|
+ " Request count: 415\n",
|
|
|
+ " Throughput: 339968 infer/sec\n",
|
|
|
+ " Avg latency: 59920 usec (standard deviation 22992 usec)\n",
|
|
|
+ " p50 latency: 67406 usec\n",
|
|
|
+ " p90 latency: 84561 usec\n",
|
|
|
+ " p95 latency: 86191 usec\n",
|
|
|
+ " p99 latency: 94862 usec\n",
|
|
|
+ " Avg gRPC time: 61127 usec (marshal 304 usec + response wait 60771 usec + unmarshal 52 usec)\n",
|
|
|
+ " Server: \n",
|
|
|
+ " Request count: 490\n",
|
|
|
+ " Avg request latency: 58036 usec (overhead 696 usec + queue 18923 usec + compute 38417 usec)\n",
|
|
|
+ "\n",
|
|
|
+ "Request concurrency: 6\n",
|
|
|
+ " Pass [1] throughput: 368640 infer/sec. Avg latency: 66037 usec (std 20247 usec)\n",
|
|
|
+ " Pass [2] throughput: 348979 infer/sec. Avg latency: 71309 usec (std 20236 usec)\n",
|
|
|
+ " Pass [3] throughput: 334234 infer/sec. Avg latency: 72704 usec (std 18491 usec)\n",
|
|
|
+ " Client: \n",
|
|
|
+ " Request count: 408\n",
|
|
|
+ " Throughput: 334234 infer/sec\n",
|
|
|
+ " Avg latency: 72704 usec (standard deviation 18491 usec)\n",
|
|
|
+ " p50 latency: 80327 usec\n",
|
|
|
+ " p90 latency: 87164 usec\n",
|
|
|
+ " p95 latency: 91824 usec\n",
|
|
|
+ " p99 latency: 95617 usec\n",
|
|
|
+ " Avg gRPC time: 71989 usec (marshal 315 usec + response wait 71617 usec + unmarshal 57 usec)\n",
|
|
|
+ " Server: \n",
|
|
|
+ " Request count: 504\n",
|
|
|
+ " Avg request latency: 68951 usec (overhead 957 usec + queue 18350 usec + compute 49644 usec)\n",
|
|
|
+ "\n",
|
|
|
+ "Request concurrency: 7\n",
|
|
|
+ " Pass [1] throughput: 395674 infer/sec. Avg latency: 72406 usec (std 18789 usec)\n",
|
|
|
+ " Pass [2] throughput: 407142 infer/sec. Avg latency: 69909 usec (std 19644 usec)\n",
|
|
|
+ " Pass [3] throughput: 355533 infer/sec. Avg latency: 81048 usec (std 12687 usec)\n",
|
|
|
+ " Client: \n",
|
|
|
+ " Request count: 434\n",
|
|
|
+ " Throughput: 355533 infer/sec\n",
|
|
|
+ " Avg latency: 81048 usec (standard deviation 12687 usec)\n",
|
|
|
+ " p50 latency: 84046 usec\n",
|
|
|
+ " p90 latency: 91642 usec\n",
|
|
|
+ " p95 latency: 94089 usec\n",
|
|
|
+ " p99 latency: 100453 usec\n",
|
|
|
+ " Avg gRPC time: 79919 usec (marshal 313 usec + response wait 79552 usec + unmarshal 54 usec)\n",
|
|
|
+ " Server: \n",
|
|
|
+ " Request count: 525\n",
|
|
|
+ " Avg request latency: 76078 usec (overhead 1042 usec + queue 17815 usec + compute 57221 usec)\n",
|
|
|
+ "\n",
|
|
|
+ "Request concurrency: 8\n",
|
|
|
+ " Pass [1] throughput: 524288 infer/sec. Avg latency: 62235 usec (std 15989 usec)\n",
|
|
|
+ " Pass [2] throughput: 524288 infer/sec. Avg latency: 62741 usec (std 15967 usec)\n",
|
|
|
+ " Pass [3] throughput: 517734 infer/sec. Avg latency: 63449 usec (std 15144 usec)\n",
|
|
|
+ " Client: \n",
|
|
|
+ " Request count: 632\n",
|
|
|
+ " Throughput: 517734 infer/sec\n",
|
|
|
+ " Avg latency: 63449 usec (standard deviation 15144 usec)\n",
|
|
|
+ " p50 latency: 68562 usec\n",
|
|
|
+ " p90 latency: 75212 usec\n",
|
|
|
+ " p95 latency: 77256 usec\n",
|
|
|
+ " p99 latency: 79685 usec\n",
|
|
|
+ " Avg gRPC time: 62683 usec (marshal 304 usec + response wait 62321 usec + unmarshal 58 usec)\n",
|
|
|
+ " Server: \n",
|
|
|
+ " Request count: 768\n",
|
|
|
+ " Avg request latency: 58942 usec (overhead 1574 usec + queue 2167 usec + compute 55201 usec)\n",
|
|
|
+ "\n",
|
|
|
+ "Request concurrency: 9\n",
|
|
|
+ " Pass [1] throughput: 376832 infer/sec. Avg latency: 98868 usec (std 34719 usec)\n",
|
|
|
+ " Pass [2] throughput: 407142 infer/sec. Avg latency: 90421 usec (std 35435 usec)\n",
|
|
|
+ " Pass [3] throughput: 346522 infer/sec. Avg latency: 106082 usec (std 33649 usec)\n",
|
|
|
+ " Client: \n",
|
|
|
+ " Request count: 423\n",
|
|
|
+ " Throughput: 346522 infer/sec\n",
|
|
|
+ " Avg latency: 106082 usec (standard deviation 33649 usec)\n",
|
|
|
+ " p50 latency: 122774 usec\n",
|
|
|
+ " p90 latency: 139616 usec\n",
|
|
|
+ " p95 latency: 143511 usec\n",
|
|
|
+ " p99 latency: 148324 usec\n",
|
|
|
+ " Avg gRPC time: 106566 usec (marshal 323 usec + response wait 106177 usec + unmarshal 66 usec)\n",
|
|
|
+ " Server: \n",
|
|
|
+ " Request count: 505\n",
|
|
|
+ " Avg request latency: 102100 usec (overhead 1046 usec + queue 43598 usec + compute 57456 usec)\n",
|
|
|
+ "\n",
|
|
|
+ "Request concurrency: 10\n",
|
|
|
+ " Pass [1] throughput: 407962 infer/sec. Avg latency: 100260 usec (std 27654 usec)\n",
|
|
|
+ " Pass [2] throughput: 403866 infer/sec. Avg latency: 101427 usec (std 34082 usec)\n",
|
|
|
+ " Pass [3] throughput: 412058 infer/sec. Avg latency: 99376 usec (std 31125 usec)\n",
|
|
|
+ " Client: \n",
|
|
|
+ " Request count: 503\n",
|
|
|
+ " Throughput: 412058 infer/sec\n",
|
|
|
+ " Avg latency: 99376 usec (standard deviation 31125 usec)\n",
|
|
|
+ " p50 latency: 100025 usec\n",
|
|
|
+ " p90 latency: 137764 usec\n",
|
|
|
+ " p95 latency: 141030 usec\n",
|
|
|
+ " p99 latency: 144104 usec\n",
|
|
|
+ " Avg gRPC time: 98137 usec (marshal 348 usec + response wait 97726 usec + unmarshal 63 usec)\n",
|
|
|
+ " Server: \n",
|
|
|
+ " Request count: 612\n",
|
|
|
+ " Avg request latency: 94377 usec (overhead 1417 usec + queue 40909 usec + compute 52051 usec)\n",
|
|
|
+ "\n",
|
|
|
+ "Inferences/Second vs. Client Average Batch Latency\n",
|
|
|
+ "Concurrency: 1, throughput: 68812.8 infer/sec, latency 59617 usec\n",
|
|
|
+ "Concurrency: 2, throughput: 150733 infer/sec, latency 54271 usec\n",
|
|
|
+ "Concurrency: 3, throughput: 203981 infer/sec, latency 60661 usec\n",
|
|
|
+ "Concurrency: 4, throughput: 285082 infer/sec, latency 57494 usec\n",
|
|
|
+ "Concurrency: 5, throughput: 339968 infer/sec, latency 59920 usec\n",
|
|
|
+ "Concurrency: 6, throughput: 334234 infer/sec, latency 72704 usec\n",
|
|
|
+ "Concurrency: 7, throughput: 355533 infer/sec, latency 81048 usec\n",
|
|
|
+ "Concurrency: 8, throughput: 517734 infer/sec, latency 63449 usec\n",
|
|
|
+ "Concurrency: 9, throughput: 346522 infer/sec, latency 106082 usec\n",
|
|
|
+ "Concurrency: 10, throughput: 412058 infer/sec, latency 99376 usec\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "name": "stderr",
|
|
|
+ "output_type": "stream",
|
|
|
+ "text": [
|
|
|
+ "WARNING: Overriding max_threads specification to ensure requested concurrency range.\n"
|
|
|
+ ]
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "source": [
|
|
|
+ "%%bash\n",
|
|
|
+ "/workspace/install/bin/perf_client \\\n",
|
|
|
+ "--max-threads 10 \\\n",
|
|
|
+ "-m dlrm-onnx-16 \\\n",
|
|
|
+ "-x 1 \\\n",
|
|
|
+ "-p 5000 \\\n",
|
|
|
+ "-v -i gRPC \\\n",
|
|
|
+ "-u localhost:8001 \\\n",
|
|
|
+ "-b 4096 \\\n",
|
|
|
+ "-l 5000 \\\n",
|
|
|
+ "--concurrency-range 1:10 \\\n",
|
|
|
+ "--input-data ./perfdata \\\n",
|
|
|
+ "-f result.csv"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {},
|
|
|
+ "source": [
|
|
|
+ "### Visualizing Latency vs. Throughput\n",
|
|
|
+ "\n",
|
|
|
+ "The perf_client provides the -f option to generate a file containing CSV output of the results.\n",
|
|
|
+ "You can import the CSV file into a spreadsheet to help visualize the latency vs inferences/second tradeoff as well as see some components of the latency. Follow these steps:\n",
|
|
|
+ "- Open this [spreadsheet](https://docs.google.com/spreadsheets/d/1IsdW78x_F-jLLG4lTV0L-rruk0VEBRL7Mnb-80RGLL4)\n",
|
|
|
+ "\n",
|
|
|
+ "- Make a copy from the File menu “Make a copy…”\n",
|
|
|
+ "\n",
|
|
|
+ "- Open the copy\n",
|
|
|
+ "\n",
|
|
|
+ "- Select the A1 cell on the “Raw Data” tab\n",
|
|
|
+ "\n",
|
|
|
+ "- From the File menu select “Import…”\n",
|
|
|
+ "\n",
|
|
|
+ "- Select “Upload” and upload the file\n",
|
|
|
+ "\n",
|
|
|
+ "- Select “Replace data at selected cell” and then select the “Import data” button\n",
|
|
|
+ "\n",
|
|
|
+ "\n"
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "markdown",
|
|
|
+ "metadata": {
|
|
|
+ "colab_type": "text",
|
|
|
+ "id": "g8MxXY5GmTc8"
|
|
|
+ },
|
|
|
+ "source": [
|
|
|
+ "# Conclusion\n",
|
|
|
+ "\n",
|
|
|
+ "In this notebook, we have walked through the complete process of preparing the pretrained DLRM for inference with the Triton inference server. Then, we stress test the server with the performance client to verify inference throughput.\n",
|
|
|
+ "\n",
|
|
|
+ "## What's next\n",
|
|
|
+ "Now it's time to deploy your own DLRM model with Triton. "
|
|
|
+ ]
|
|
|
+ },
|
|
|
+ {
|
|
|
+ "cell_type": "code",
|
|
|
+ "execution_count": 0,
|
|
|
+ "metadata": {
|
|
|
+ "colab": {},
|
|
|
+ "colab_type": "code",
|
|
|
+ "id": "249yGNLmmTc_"
|
|
|
+ },
|
|
|
+ "outputs": [],
|
|
|
+ "source": []
|
|
|
+ }
|
|
|
+ ],
|
|
|
+ "metadata": {
|
|
|
+ "colab": {
|
|
|
+ "include_colab_link": true,
|
|
|
+ "name": "TensorFlow_UNet_Industrial_Colab_train_and_inference.ipynb",
|
|
|
+ "provenance": []
|
|
|
+ },
|
|
|
+ "kernelspec": {
|
|
|
+ "display_name": "Python 3",
|
|
|
+ "language": "python",
|
|
|
+ "name": "python3"
|
|
|
+ },
|
|
|
+ "language_info": {
|
|
|
+ "codemirror_mode": {
|
|
|
+ "name": "ipython",
|
|
|
+ "version": 3
|
|
|
+ },
|
|
|
+ "file_extension": ".py",
|
|
|
+ "mimetype": "text/x-python",
|
|
|
+ "name": "python",
|
|
|
+ "nbconvert_exporter": "python",
|
|
|
+ "pygments_lexer": "ipython3",
|
|
|
+ "version": "3.6.9"
|
|
|
+ }
|
|
|
+ },
|
|
|
+ "nbformat": 4,
|
|
|
+ "nbformat_minor": 1
|
|
|
+}
|