5 years ago · 0d15a95c8f
--- a/PyTorch/Recommendation/DLRM/README.md
+++ b/PyTorch/Recommendation/DLRM/README.md
@@ -139,7 +139,7 @@ TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by defaul
 
 ### Hybrid-parallel multiGPU with all-2-all communication
 
-Many recommendation models contain very large embedding tables. As a result the model is often too large to fit onto a single device. This could be easily. This could be easily solved by training in a model-parallel way, using either the CPU or other GPUs as "memory donors". However, this approach is suboptimal as the "memory donor" devices' compute is not utilized. In this repository we use the model-parallel approach for the bottom part of the model (Embedding Tables + Bottom MLP) while using a usual data parallel approach for the top part of the model (Dot Interaction + Top MLP). This way we can train models much larger than what would normally fit into a single GPU while at the same time making the training faster by using multiple GPUs. We call this approach hybrid-parallel.
+Many recommendation models contain very large embedding tables. As a result the model is often too large to fit onto a single device. This could be easily solved by training in a model-parallel way, using either the CPU or other GPUs as "memory donors". However, this approach is suboptimal as the "memory donor" devices' compute is not utilized. In this repository we use the model-parallel approach for the bottom part of the model (Embedding Tables + Bottom MLP) while using a usual data parallel approach for the top part of the model (Dot Interaction + Top MLP). This way we can train models much larger than what would normally fit into a single GPU while at the same time making the training faster by using multiple GPUs. We call this approach hybrid-parallel.
 
 The transition from model-parallel to data-parallel in the middle of the neural net needs a specific multiGPU communication pattern called [all-2-all](https://en.wikipedia.org/wiki/All-to-all_\(parallel_pattern\)) which is available in our [PyTorch 20.06-py3] NGC docker container. In the [original DLRM whitepaper](https://arxiv.org/abs/1906.00091) this has been also referred to as "butterlfy shuffle". 
 
--- a/PyTorch/Recommendation/DLRM/triton/README.md
+++ b/PyTorch/Recommendation/DLRM/triton/README.md
@@ -1,6 +1,6 @@
 # Deploying the DLRM model using Triton Inference Server
 
-The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/trtis-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
+The [NVIDIA Triton Inference Server](https://github.com/NVIDIA/triton-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server.
 
 This folder contains instructions for deploment and exemplary client application to run inference on
 Triton Inference Server as well as detailed performance analysis.