|
|
@@ -735,10 +735,10 @@ Our results were obtained by running the `scripts/run_squad.sh` and `scripts/run
|
|
|
|
|
|
##### Pre-training loss results: NVIDIA DGX A100 (8x A100 80GB)
|
|
|
|
|
|
-| DGX System | GPUs / Node | Accumulated Batch size / GPU (Phase 1 and Phase 2) | Accumulation steps (Phase 1 and Phase 2) | Final Loss - TF32 | Final Loss - mixed precision | Time to train(hours) - TF32 | Time to train(hours) - mixed precision | Time to train speedup (TF32 to mixed precision) |
|
|
|
-|--------------------|-------------|----------------------------------------------------|------------------------------------------|-------------------|------------------------------|-----------------------------|----------------------------------------|-------------------------------------------------|
|
|
|
-| 32 x DGX A100 80GB | 8 | 256 and 128 | 1 and 4 | --- | 1.2437 | --- | 1.2 | 1.9 |
|
|
|
-| 32 x DGX A100 80GB | 8 | 256 and 128 | 2 and 8 | 1.2465 | --- | 2.4 | --- | --- |
|
|
|
+| DGX System | GPUs / Node | Batch size / GPU (Phase 1 and Phase 2) | Accumulated Batch size / GPU (Phase 1 and Phase 2) | Accumulation steps (Phase 1 and Phase 2) | Final Loss - TF32 | Final Loss - mixed precision | Time to train (hours) - TF32 | Time to train (hours) - mixed precision | Time to train speedup (TF32 to mixed precision) |
|
|
|
+|--------------------|-------------|----------------------------------------------------|------------------------------------------|-------------------|------------------------------|-----------------------------|----------------------------------------|-------------------------------------------------|-----|
|
|
|
+| 32 x DGX A100 80GB | 8 | 256 and 32 | 256 and 128 | 1 and 4 | --- | 1.2437 | --- | 1.2 | 1.9 |
|
|
|
+| 32 x DGX A100 80GB | 8 | 128 and 16 | 256 and 128 | 2 and 8 | 1.2465 | --- | 2.4 | --- | --- |
|
|
|
|
|
|
|
|
|
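The three batch-related columns in the table above appear to be tied together by a simple product: accumulated batch size per GPU equals batch size per GPU times accumulation steps. The following is a minimal sketch that checks this against the four table entries. The function names are hypothetical, and the global batch size formula (per-GPU accumulated batch size times GPUs per node times nodes) is an assumption inferred from the column headers, not something the table states.

```python
# Values copied from the pre-training loss table above:
# (batch size / GPU, accumulation steps, accumulated batch size / GPU).
TABLE_ROWS = {
    ("mixed precision", "phase 1"): (256, 1, 256),
    ("mixed precision", "phase 2"): (32, 4, 128),
    ("TF32", "phase 1"): (128, 2, 256),
    ("TF32", "phase 2"): (16, 8, 128),
}


def accumulated_batch_size(batch_per_gpu: int, accumulation_steps: int) -> int:
    """Samples one GPU processes per optimizer step (assumed relationship)."""
    return batch_per_gpu * accumulation_steps


def global_batch_size(accumulated_per_gpu: int, gpus_per_node: int, nodes: int) -> int:
    """Samples per optimizer step across the whole cluster (assumed relationship)."""
    return accumulated_per_gpu * gpus_per_node * nodes


if __name__ == "__main__":
    for (precision, phase), (batch, steps, accumulated) in TABLE_ROWS.items():
        computed = accumulated_batch_size(batch, steps)
        status = "ok" if computed == accumulated else "MISMATCH"
        # 32 nodes x 8 GPUs, as listed in the "DGX System" and "GPUs / Node" columns.
        total = global_batch_size(computed, gpus_per_node=8, nodes=32)
        print(f"{precision:16s} {phase}: {batch} x {steps} = {computed} "
              f"({status}), global batch = {total}")
```

Under these assumptions, both precisions reach the same accumulated batch size per GPU in each phase; TF32 simply uses a smaller per-GPU batch with more accumulation steps.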
##### Pre-training loss curves
|
|
|
@@ -808,29 +808,29 @@ Our results were obtained by running the `scripts/run_pretraining.sh` training s
|
|
|
|
|
|
###### Pre-training NVIDIA DGX A100 (8x A100 80GB)
|
|
|
|
|
|
-| GPUs | Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Throughput - TF32(sequences/sec) | Throughput - mixed precision(sequences/sec) | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|
|
|
-|------|----------------------------------|------------------------------------|-----------------|----------------------------------|---------------------------------------------|---------------------------------------------|---------------------|--------------------------------|
|
|
|
-| 1 | 8192 and 8192 | 64 and 32 | 128 | 317 | 580 | 1.83 | 1.00 | 1.00 |
|
|
|
-| 8 | 8192 and 8192 | 64 and 32 | 128 | 2505 | 4591 | 1.83 | 7.90 | 7.91 |
|
|
|
-| 1 | 4096 and 4096 | 256 and 128 | 512 | 110 | 210 | 1.90 | 1.00 | 1.00 |
|
|
|
-| 8 | 4096 and 4096 | 256 and 128 | 512 | 860 | 1657 | 1.92 | 7.81 | 7.89 |
|
|
|
+| GPUs | Batch size / GPU (TF32 and FP16) | Accumulated Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Throughput - TF32 (sequences/sec) | Throughput - mixed precision (sequences/sec) | Throughput speedup (TF32 to mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
|
|
|
+|------|----------------------------------|------------------------------------|-----------------|----------------------------------|---------------------------------------------|---------------------------------------------|---------------------|--------------------------------|----|
|
|
|
+| 1 | 128 and 256 | 8192 and 8192 | 64 and 32 | 128 | 317 | 580 | 1.83 | 1.00 | 1.00 |
|
|
|
+| 8 | 128 and 256 | 8192 and 8192 | 64 and 32 | 128 | 2505 | 4591 | 1.83 | 7.90 | 7.91 |
|
|
|
+| 1 | 16 and 32 | 4096 and 4096 | 256 and 128 | 512 | 110 | 210 | 1.90 | 1.00 | 1.00 |
|
|
|
+| 8 | 16 and 32 | 4096 and 4096 | 256 and 128 | 512 | 860 | 1657 | 1.92 | 7.81 | 7.89 |
|
|
|
|
|
|
###### Pre-training NVIDIA DGX A100 (8x A100 80GB) Multi-node Scaling
|
|
|
|
|
|
-| Nodes | GPUs / node | Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Mixed Precision Throughput | Mixed Precision Strong Scaling | TF32 Throughput | TF32 Strong Scaling | Speedup (Mixed Precision to TF32) |
|
|
|
-|-------|-------------|----------------------------------|------------------------------------|-----------------|----------------------------|--------------------------------|-----------------|---------------------|-----------------------------------|
|
|
|
-| 1 | 8 | 8192 and 8192 | 32 and 64 | 128 | 4553 | 1 | 2486 | 1 | 1.83 |
|
|
|
-| 2 | 8 | 4096 and 4096 | 16 and 32 | 128 | 9191 | 2.02 | 4979 | 2.00 | 1.85 |
|
|
|
-| 4 | 8 | 2048 and 2048 | 8 and 16 | 128 | 18119 | 3.98 | 9859 | 3.97 | 1.84 |
|
|
|
-| 8 | 8 | 1024 and 1024 | 4 and 8 | 128 | 35774 | 7.86 | 19815 | 7.97 | 1.81 |
|
|
|
-| 16 | 8 | 512 and 512 | 2 and 4 | 128 | 70555 | 15.50 | 38866 | 15.63 | 1.82 |
|
|
|
-| 32 | 8 | 256 and 256 | 1 and 2 | 128 | 138294 | 30.37 | 75706 | 30.45 | 1.83 |
|
|
|
-| 1 | 8 | 4096 and 4096 | 128 and 256 | 512 | 1648 | 1 | 854 | 1 | 1.93 |
|
|
|
-| 2 | 8 | 2048 and 2048 | 64 and 128 | 512 | 3291 | 2.00 | 1684 | 1.97 | 1.95 |
|
|
|
-| 4 | 8 | 1024 and 1024 | 32 and 64 | 512 | 6464 | 3.92 | 3293 | 3.86 | 1.96 |
|
|
|
-| 8 | 8 | 512 and 512 | 16 and 32 | 512 | 13005 | 7.89 | 6515 | 7.63 | 2.00 |
|
|
|
-| 16 | 8 | 256 and 256 | 8 and 16 | 512 | 25570 | 15.51 | 12131 | 14.21 | 2.11 |
|
|
|
-| 32 | 8 | 128 and 128 | 4 and 8 | 512 | 49663 | 30.13 | 21298 | 24.95 | 2.33 |
|
|
|
+| Nodes | GPUs / node | Batch size / GPU (TF32 and FP16) | Accumulated Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Mixed Precision Throughput | Mixed Precision Strong Scaling | TF32 Throughput | TF32 Strong Scaling | Speedup (Mixed Precision to TF32) |
|
|
|
+|-------|-------------|----------------------------------|------------------------------------|-----------------|----------------------------|--------------------------------|-----------------|---------------------|-----------------------------------|-----|
|
|
|
+| 1 | 8 | 128 and 256 | 8192 and 8192 | 64 and 32 | 128 | 4553 | 1 | 2486 | 1 | 1.83 |
|
|
|
+| 2 | 8 | 128 and 256 | 4096 and 4096 | 32 and 16 | 128 | 9191 | 2.02 | 4979 | 2.00 | 1.85 |
|
|
|
+| 4 | 8 | 128 and 256 | 2048 and 2048 | 16 and 8 | 128 | 18119 | 3.98 | 9859 | 3.97 | 1.84 |
|
|
|
+| 8 | 8 | 128 and 256 | 1024 and 1024 | 8 and 4 | 128 | 35774 | 7.86 | 19815 | 7.97 | 1.81 |
|
|
|
+| 16 | 8 | 128 and 256 | 512 and 512 | 4 and 2 | 128 | 70555 | 15.50 | 38866 | 15.63 | 1.82 |
|
|
|
+| 32 | 8 | 128 and 256 | 256 and 256 | 2 and 1 | 128 | 138294 | 30.37 | 75706 | 30.45 | 1.83 |
|
|
|
+| 1 | 8 | 16 and 32 | 4096 and 4096 | 256 and 128 | 512 | 1648 | 1 | 854 | 1 | 1.93 |
|
|
|
+| 2 | 8 | 16 and 32 | 2048 and 2048 | 128 and 64 | 512 | 3291 | 2.00 | 1684 | 1.97 | 1.95 |
|
|
|
+| 4 | 8 | 16 and 32 | 1024 and 1024 | 64 and 32 | 512 | 6464 | 3.92 | 3293 | 3.86 | 1.96 |
|
|
|
+| 8 | 8 | 16 and 32 | 512 and 512 | 32 and 16 | 512 | 13005 | 7.89 | 6515 | 7.63 | 2.00 |
|
|
|
+| 16 | 8 | 16 and 32 | 256 and 256 | 16 and 8 | 512 | 25570 | 15.51 | 12131 | 14.21 | 2.11 |
|
|
|
+| 32 | 8 | 16 and 32 | 128 and 128 | 8 and 4 | 512 | 49663 | 30.13 | 21298 | 24.95 | 2.33 |
|
|
|
|
|
|
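The derived columns of the multi-node table above can be recomputed from the raw throughputs. The sketch below assumes that strong scaling is the throughput at N nodes divided by the single-node throughput at the same precision, and that the speedup is mixed-precision throughput divided by TF32 throughput. Both definitions are inferred from the column names and values rather than stated in the source, and the function names are hypothetical.

```python
# Sequence-length-128 rows copied from the multi-node table above:
# (nodes, mixed-precision throughput, TF32 throughput) in sequences/sec.
SEQ128 = [
    (1, 4553, 2486),
    (2, 9191, 4979),
    (4, 18119, 9859),
    (8, 35774, 19815),
    (16, 70555, 38866),
    (32, 138294, 75706),
]


def strong_scaling(throughputs: list[float]) -> list[float]:
    """Ratio of each throughput to the first (single-node) entry."""
    baseline = throughputs[0]
    return [t / baseline for t in throughputs]


def speedup(mixed_precision: float, tf32: float) -> float:
    """Mixed-precision throughput relative to TF32 throughput."""
    return mixed_precision / tf32


if __name__ == "__main__":
    mixed_scaling = strong_scaling([row[1] for row in SEQ128])
    tf32_scaling = strong_scaling([row[2] for row in SEQ128])
    for (nodes, mixed, tf32), ms, ts in zip(SEQ128, mixed_scaling, tf32_scaling):
        print(f"{nodes:2d} nodes: mixed scaling {ms:.2f}, "
              f"TF32 scaling {ts:.2f}, speedup {speedup(mixed, tf32):.2f}")
```

For the sequence-length-128 rows, these definitions should reproduce the tabulated ratios to two decimal places. Some sequence-length-512 entries differ in the last digit, presumably because the published throughputs are themselves rounded.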
###### Fine-tuning NVIDIA DGX A100 (8x A100 80GB)
|
|
|
|