This document provides detailed instructions on running DLRM training as well as benchmark results for this model.
The Deep Learning Recommendation Model (DLRM) is a recommendation model designed to make use of both categorical and numerical inputs. It was first described in Deep Learning Recommendation Model for Personalization and Recommendation Systems. This repository provides a reimplementation of the code base provided originally here. The scripts enable you to train DLRM on a synthetic dataset or on the Criteo Terabyte Dataset.
For the Criteo 1TB Dataset, we use a slightly different preprocessing procedure than the one found in the original implementation. Most importantly, we use a technique called frequency thresholding to demonstrate models of different sizes. The smallest model can be trained on a single V100-32GB GPU, while the largest one needs 8xA100-80GB GPUs.
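Frequency thresholding can be sketched as follows: categorical values that occur fewer than the threshold number of times in the training data are all mapped to one shared index, which shrinks the embedding tables. This is an illustrative sketch only, not the repository's actual preprocessing code:

```python
# Illustrative sketch of frequency thresholding (assumed behavior).
from collections import Counter

def build_mapping(values, threshold):
    counts = Counter(values)
    mapping, next_id = {}, 1  # index 0 is reserved for thresholded (rare) values
    for value, count in counts.items():
        if count >= threshold:
            mapping[value] = next_id
            next_id += 1
    return mapping

# Values below the threshold all share embedding row 0 at lookup time:
mapping = build_mapping(["a", "a", "a", "b", "b", "c"], threshold=2)
row_for_c = mapping.get("c", 0)  # 0 -> shared "rare" embedding row
```

A higher threshold removes more rare values, which is why the T15 variant needs far less embedding memory than T0.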
The table below summarizes the model sizes and frequency thresholds used in this repository, for both the synthetic and real datasets supported.
| Dataset | Frequency Threshold | Final dataset size | Intermediate preprocessing storage required | Suitable for accuracy tests | Total download & preprocess time | GPU Memory required for training | Total embedding size | Number of model parameters |
|:-------|:-------|:-------|:-------------|:-------------------|:-------------------|:-------------------|:-------------------|:-------------------|
| Synthetic T15 | 15 | 6 GiB | None | No | ~Minutes | 15.6 GiB | 15.6 GiB | 4.2B |
| Synthetic T3 | 3 | 6 GiB | None | No | ~Minutes | 84.9 GiB | 84.9 GiB | 22.8B |
| Synthetic T0 | 0 | 6 GiB | None | No | ~Minutes | 421 GiB | 421 GiB | 113B |
| Real Criteo T15 | 15 | 370 GiB | ~Terabytes | Yes | ~Hours | 15.6 GiB | 15.6 GiB | 4.2B |
| Real Criteo T3 | 3 | 370 GiB | ~Terabytes | Yes | ~Hours | 84.9 GiB | 84.9 GiB | 22.8B |
| Real Criteo T0 | 0 | 370 GiB | ~Terabytes | Yes | ~Hours | 421 GiB | 421 GiB | 113B |
You can find a detailed description of the Criteo dataset preprocessing in the preprocessing documentation.
DLRM accepts two types of features: categorical and numerical. For each categorical feature, an embedding table provides a dense representation of each unique value. The numerical (dense) features enter the model and are transformed by a simple neural network referred to as the "bottom MLP."
This part of the network consists of a series of linear layers with ReLU activations. The output of the bottom MLP and the embedding vectors are then fed into the "dot interaction" operation. The output of "dot interaction" is then concatenated with the output of the bottom MLP and fed into the "top MLP," which is a series of dense layers with activations. The model outputs a single number which can be interpreted as the likelihood of a certain user clicking an ad.
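The data flow above can be sketched in plain Python. This is a toy illustration of the dot-interaction step only, not the repository's TensorFlow implementation:

```python
# Toy sketch of DLRM's "dot interaction": pairwise dot products between
# the bottom-MLP output and the embedding vectors (illustration only).

def dot_interaction(vectors):
    # vectors: a list of equal-length feature vectors
    out = []
    for i in range(len(vectors)):
        for j in range(i):
            out.append(sum(a * b for a, b in zip(vectors[i], vectors[j])))
    return out

# Example: one bottom-MLP output and two embedding vectors, all of width 2.
bottom_mlp_out = [1.0, 0.0]
embeddings = [[0.0, 1.0], [1.0, 1.0]]
interactions = dot_interaction([bottom_mlp_out] + embeddings)
# The interaction output is concatenated with bottom_mlp_out and fed into
# the top MLP, which produces a single click-likelihood logit.
```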
Figure 1. The architecture of DLRM.
To train DLRM, perform the following steps. For the specifics concerning training and inference, refer to the Advanced section.
Clone the repository.
git clone https://github.com/NVIDIA/DeepLearningExamples
cd DeepLearningExamples/TensorFlow2/Recommendation/DLRM
Build and run a DLRM Docker container.
docker build -t train_docker_image .
docker run --cap-add SYS_NICE --runtime=nvidia -it --rm --ipc=host -v ${PWD}/data:/data train_docker_image bash
Generate a synthetic dataset.
Downloading and preprocessing the Criteo 1TB dataset requires a lot of time and disk space. Because of this, we provide a synthetic dataset generator that roughly matches the characteristics of Criteo 1TB. This will enable you to benchmark quickly. If you prefer to benchmark on the real data, please follow these instructions to download and preprocess the dataset.
python -m dataloading.generate_feature_spec --variant criteo_t15_synthetic --dst feature_spec.yaml
python -m dataloading.transcribe --src_dataset_type synthetic --src_dataset_path . \
--dst_dataset_path /data/preprocessed --max_batches_train 1000 --max_batches_test 100 --dst_dataset_type tf_raw
After running tree /data/preprocessed, you should see the following directory structure:
$ tree /data/preprocessed
/data/preprocessed
├── feature_spec.yaml
├── test
│ ├── cat_0.bin
│ ├── cat_1.bin
│ ├── ...
│ ├── label.bin
│ └── numerical.bin
└── train
├── cat_0.bin
├── cat_1.bin
├── ...
├── label.bin
└── numerical.bin
2 directories, 57 files
single-GPU:
horovodrun -np 1 -H localhost:1 --mpi-args=--oversubscribe numactl --interleave=all -- python -u dlrm.py --dataset_path /data/preprocessed --amp --xla --save_checkpoint_path /data/checkpoint/
multi-GPU:
horovodrun -np 8 -H localhost:8 --mpi-args=--oversubscribe numactl --interleave=all -- python -u dlrm.py --dataset_path /data/preprocessed --amp --xla --save_checkpoint_path /data/checkpoint/
To evaluate a previously trained checkpoint, append --restore_checkpoint_path <path> --mode eval to the command used for training. For example, to test a checkpoint trained on 8xA100 80GB, run:
horovodrun -np 8 -H localhost:8 --mpi-args=--oversubscribe numactl --interleave=all -- python -u dlrm.py --dataset_path /data/preprocessed --amp --xla --restore_checkpoint_path /data/checkpoint --mode eval
The performance measurements in this document were conducted at the time of publication and may not reflect the performance achieved from NVIDIA’s latest software release. For the most up-to-date performance measurements, go to NVIDIA Data Center Deep Learning Product Performance.
The following section shows how to run benchmarks measuring the model performance in training and inference modes.
To benchmark the training performance on a specific batch size, follow the instructions
in the Quick Start Guide. You can also add the --max_steps 1000 flag
if you want to get a reliable throughput measurement without running the entire training.
You can also use synthetic data by running with the --dataset_type synthetic option if you haven't downloaded the dataset yet.
To benchmark the inference performance on a specific batch size, run:
horovodrun -np 1 -H localhost:1 --mpi-args=--oversubscribe numactl --interleave=all -- python -u dlrm.py --dataset_path /data/preprocessed/ --amp --restore_checkpoint_path <checkpoint_path> --mode inference
The main training script resides in dlrm.py. The training speed is measured by throughput, i.e.,
the number of samples processed per second.
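Concretely, throughput is derived from the global batch size and the measured wall-clock time for a number of steps. The numbers below are hypothetical, for illustration only:

```python
# Hypothetical figures -- not measured results.
global_batch_size = 65536   # 64k samples per training step (assumed)
steps = 1000
elapsed_seconds = 16.0      # assumed wall-clock time for 1000 steps

throughput = global_batch_size * steps / elapsed_seconds  # samples / second
print(f"{throughput / 1e6:.2f}M samples/s")
```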
We use mixed precision training with static loss scaling for the bottom and top MLPs,
while embedding tables are stored in FP32 format.
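Static loss scaling works as sketched below: the loss is multiplied by a fixed constant before backpropagation so that small FP16 gradients do not underflow to zero, and the gradients are divided by the same constant before the optimizer step. The scale value here is a hypothetical example, not necessarily the one used by the scripts:

```python
LOSS_SCALE = 1024.0  # hypothetical static scale, for illustration

def scale_loss(loss):
    # Applied before backprop to keep small gradients representable in FP16.
    return loss * LOSS_SCALE

def unscale_grads(grads):
    # Applied to the computed gradients before the optimizer step,
    # restoring their true magnitude.
    return [g / LOSS_SCALE for g in grads]
```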
The following sections provide details on how we achieved our performance and accuracy in training and inference.
We used three model size variants to show memory scalability in a multi-GPU setup (4.2B params, 22.8B params, and 113B params). Refer to the Model overview section for detailed information about the model variants.
Our results were obtained by running training scripts as described in the Quick Start Guide in the DLRM Docker container.
| GPUs | Model size | Batch size / GPU | Accuracy (AUC) - TF32 | Accuracy (AUC) - mixed precision | Time to train - TF32 [minutes] | Time to train - mixed precision [minutes] | Time to train speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|---|---|
| 1 | small | 64k | 0.8025 | 0.8025 | 26.75 | 16.27 | 1.64 |
| 8 | large | 8k | 0.8027 | 0.8026 | 8.77 | 6.57 | 1.33 |
| 8 | extra large | 8k | 0.8026 | 0.8026 | 10.47 | 9.08 | 1.15 |
Our results were obtained by running training scripts as described in the Quick Start Guide in the DLRM Docker container.
| GPUs | Model size | Batch size / GPU | Accuracy (AUC) - FP32 | Accuracy (AUC) - mixed precision | Time to train - FP32 [minutes] | Time to train - mixed precision [minutes] | Time to train speedup (FP32 to mixed precision) |
|---|---|---|---|---|---|---|---|
| 1 | small | 64k | 0.8027 | 0.8025 | 109.63 | 34.83 | 3.15 |
| 8 | large | 8k | 0.8028 | 0.8026 | 26.01 | 13.73 | 1.89 |
Our results were obtained by running training scripts as described in the Quick Start Guide in the DLRM Docker container.
| GPUs | Model size | Batch size / GPU | Accuracy (AUC) - FP32 | Accuracy (AUC) - mixed precision | Time to train - FP32 [minutes] | Time to train - mixed precision [minutes] | Time to train speedup (FP32 to mixed precision) |
|---|---|---|---|---|---|---|---|
| 1 | small | 64k | 0.8026 | 0.8026 | 105.13 | 33.37 | 3.15 |
| 8 | large | 8k | 0.8027 | 0.8027 | 21.21 | 11.43 | 1.86 |
| 16 | large | 4k | 0.8025 | 0.8026 | 15.52 | 10.88 | 1.43 |
The histograms below show the distribution of ROC AUC results achieved at the end of the training for each precision/hardware platform tested. No statistically significant differences exist between precision, number of GPUs, or hardware platform. Using the larger dataset has a modest, positive impact on the final AUC score.
Figure 4. Results of stability tests for DLRM.
We used throughput in items processed per second as the performance metric.
Our results were obtained by following the commands from the Quick Start Guide in the DLRM Docker container on NVIDIA DGX A100 (8x A100 80GB) GPUs. Performance numbers (in items per second) were averaged over 1000 training steps.
| GPUs | Model size | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Throughput speedup (TF32 to mixed precision) |
|---|---|---|---|---|---|
| 1 | small | 64k | 2.84M | 4.55M | 1.60 |
| 8 | large | 8k | 10.9M | 13.8M | 1.27 |
| 8 | extra large | 8k | 9.76M | 11.5M | 1.17 |
To achieve these results, follow the steps in the Quick Start Guide.
For the "extra large" model (113B parameters), we also obtained CPU results for comparison using the same source code
(using the --cpu command line flag for the CPU-only experiments).
We compare three hardware setups:
| Hardware | Throughput [samples / second] | Speedup over CPU |
|---|---|---|
| 2xAMD EPYC 7742 | 17.7k | 1x |
| 1xA100-80GB + 2xAMD EPYC 7742 (large embeddings on CPU) | 768k | 43x |
| DGX A100 (8xA100-80GB) (hybrid parallel) | 11.5M | 649x |
| GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) |
|---|---|---|---|---|---|
| 1 | small | 64k | 0.663M | 2.23M | 3.37 |
| 8 | large | 8k | 3.13M | 6.31M | 2.02 |
To achieve the same results, follow the steps in the Quick Start Guide.
| GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Throughput speedup (FP32 to mixed precision) |
|---|---|---|---|---|---|
| 1 | small | 64k | 0.698M | 2.44M | 3.49 |
| 8 | large | 8k | 3.79M | 7.82M | 2.06 |
| 16 | large | 4k | 6.43M | 10.5M | 1.64 |
To achieve the same results, follow the steps in the Quick Start Guide.
| GPUs | Model size | Batch size / GPU | Throughput - TF32 | Throughput - mixed precision | Average latency - TF32 [ms] | Average latency - mixed precision [ms] | Throughput speedup (mixed precision to TF32) |
|---|---|---|---|---|---|---|---|
| 1 | small | 2048 | 1.38M | 1.48M | 1.49 | 1.38 | 1.07 |
| GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Average latency - FP32 [ms] | Average latency - mixed precision [ms] | Throughput speedup (mixed precision to FP32) |
|---|---|---|---|---|---|---|---|
| 1 | small | 2048 | 0.871M | 0.951M | 2.35 | 2.15 | 1.09 |
| GPUs | Model size | Batch size / GPU | Throughput - FP32 | Throughput - mixed precision | Average latency - FP32 [ms] | Average latency - mixed precision [ms] | Throughput speedup (mixed precision to FP32) |
|---|---|---|---|---|---|---|---|
| 1 | small | 2048 | 1.15M | 1.37M | 1.78 | 1.50 | 1.19 |