3 years ago · a07d20a124
--- a/PyTorch/LanguageModeling/BERT/README.md
+++ b/PyTorch/LanguageModeling/BERT/README.md
@@ -735,10 +735,10 @@ Our results were obtained by running the `scripts/run_squad.sh` and `scripts/run
 
				  
			
 
				 ##### Pre-training loss results: NVIDIA DGX A100 (8x A100 80GB)
			
 
				 
			
 
				-| DGX System         | GPUs / Node | Accumulated Batch size / GPU (Phase 1 and Phase 2) | Accumulation steps (Phase 1 and Phase 2) | Final Loss - TF32 | Final Loss - mixed precision | Time to train(hours) - TF32 | Time to train(hours) - mixed precision | Time to train speedup (TF32 to mixed precision) |
			
 
				-|--------------------|-------------|----------------------------------------------------|------------------------------------------|-------------------|------------------------------|-----------------------------|----------------------------------------|-------------------------------------------------|
			
 
				-| 32 x DGX A100 80GB | 8           | 256 and 128                                        | 1 and 4                                  | ---               | 1.2437                       | ---                         | 1.2                                    | 1.9                                             |
			
 
				-| 32 x DGX A100 80GB | 8           | 256 and 128                                        | 2 and 8                                  | 1.2465            | ---                          | 2.4                         | ---                                    | ---                                             |
			
 
				+| DGX System         | GPUs / Node | Batch size / GPU (Phase 1 and Phase 2) | Accumulated Batch size / GPU (Phase 1 and Phase 2) | Accumulation steps (Phase 1 and Phase 2) | Final Loss - TF32 | Final Loss - mixed precision | Time to train(hours) - TF32 | Time to train(hours) - mixed precision | Time to train speedup (TF32 to mixed precision) |
			
 
				+|--------------------|-------------|----------------------------------------------------|------------------------------------------|-------------------|------------------------------|-----------------------------|----------------------------------------|-------------------------------------------------|-----|
			
 
				+| 32 x DGX A100 80GB | 8           | 256 and 32 | 256 and 128                                        | 1 and 4                                  | ---               | 1.2437                       | ---                         | 1.2                                    | 1.9                                             |
			
 
				+| 32 x DGX A100 80GB | 8           | 128 and 16 | 256 and 128                                        | 2 and 8                                  | 1.2465            | ---                          | 2.4                         | ---                                    | ---                                             |
			
 
				 
			
 
				 
			
 
				 ##### Pre-training loss curves
			
@@ -808,29 +808,29 @@ Our results were obtained by running the `scripts run_pretraining.sh` training s
 
				 
			
 
				 ###### Pre-training NVIDIA DGX A100 (8x A100 80GB)
			
 
				 
			
 
				-| GPUs | Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Throughput - TF32(sequences/sec) | Throughput - mixed precision(sequences/sec) | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
			
 
				-|------|----------------------------------|------------------------------------|-----------------|----------------------------------|---------------------------------------------|---------------------------------------------|---------------------|--------------------------------|
			
 
				-| 1    | 8192 and 8192                    | 64 and 32                          | 128             | 317                              | 580                                         | 1.83                                        | 1.00                | 1.00                           |
			
 
				-| 8    | 8192 and 8192                    | 64 and 32                          | 128             | 2505                             | 4591                                        | 1.83                                        | 7.90                | 7.91                           |
			
 
				-| 1    | 4096 and 4096                    | 256 and 128                        | 512             | 110                              | 210                                         | 1.90                                        | 1.00                | 1.00                           |
			
 
				-| 8    | 4096 and 4096                    | 256 and 128                        | 512             | 860                              | 1657                                        | 1.92                                        | 7.81                | 7.89                           |
			
 
				+| GPUs | Batch size / GPU (TF32 and FP16) | Accumulated Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Throughput - TF32(sequences/sec) | Throughput - mixed precision(sequences/sec) | Throughput speedup (TF32 - mixed precision) | Weak scaling - TF32 | Weak scaling - mixed precision |
			
 
				+|------|----------------------------------|------------------------------------|-----------------|----------------------------------|---------------------------------------------|---------------------------------------------|---------------------|--------------------------------|----|
			
 
				+| 1    | 128 and 256 | 8192 and 8192                    | 64 and 32                          | 128             | 317                              | 580                                         | 1.83                                        | 1.00                | 1.00                           |
			
 
				+| 8    | 128 and 256 | 8192 and 8192                    | 64 and 32                          | 128             | 2505                             | 4591                                        | 1.83                                        | 7.90                | 7.91                           |
			
 
				+| 1    | 16 and 32   | 4096 and 4096                    | 256 and 128                        | 512             | 110                              | 210                                         | 1.90                                        | 1.00                | 1.00                           |
			
 
				+| 8    | 16 and 32   | 4096 and 4096                    | 256 and 128                        | 512             | 860                              | 1657                                        | 1.92                                        | 7.81                | 7.89                           |
			
 
				 
			
 
				 ###### Pre-training NVIDIA DGX A100 (8x A100 80GB) Multi-node Scaling
			
 
				 
			
 
				-| Nodes | GPUs / node | Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Mixed Precision Throughput | Mixed Precision Strong Scaling | TF32 Throughput | TF32 Strong Scaling | Speedup (Mixed Precision to TF32) |
			
 
				-|-------|-------------|----------------------------------|------------------------------------|-----------------|----------------------------|--------------------------------|-----------------|---------------------|-----------------------------------|
			
 
				-| 1     | 8           | 8192 and 8192                    | 32 and 64                          | 128             | 4553                       | 1                              | 2486            | 1                   | 1.83                              |
			
 
				-| 2     | 8           | 4096 and 4096                    | 16 and 32                          | 128             | 9191                       | 2.02                           | 4979            | 2.00                | 1.85                              |
			
 
				-| 4     | 8           | 2048 and 2048                    | 8 and 16                           | 128             | 18119                      | 3.98                           | 9859            | 3.97                | 1.84                              |
			
 
				-| 8     | 8           | 1024 and 1024                    | 4 and 8                            | 128             | 35774                      | 7.86                           | 19815           | 7.97                | 1.81                              |
			
 
				-| 16    | 8           | 512 and 512                      | 2 and 4                            | 128             | 70555                      | 15.50                          | 38866           | 15.63               | 1.82                              |
			
 
				-| 32    | 8           | 256 and 256                      | 1 and 2                            | 128             | 138294                     | 30.37                          | 75706           | 30.45               | 1.83                              |
			
 
				-| 1     | 8           | 4096 and 4096                    | 128 and 256                        | 512             | 1648                       | 1                              | 854             | 1                   | 1.93                              |
			
 
				-| 2     | 8           | 2048 and 2048                    | 64 and 128                         | 512             | 3291                       | 2.00                           | 1684            | 1.97                | 1.95                              |
			
 
				-| 4     | 8           | 1024 and 1024                    | 32 and 64                          | 512             | 6464                       | 3.92                           | 3293            | 3.86                | 1.96                              |
			
 
				-| 8     | 8           | 512 and 512                      | 16 and 32                          | 512             | 13005                      | 7.89                           | 6515            | 7.63                | 2.00                              |
			
 
				-| 16    | 8           | 256 and 256                      | 8 and 16                           | 512             | 25570                      | 15.51                          | 12131           | 14.21               | 2.11                              |
			
 
				-| 32    | 8           | 128 and 128                      | 4 and 8                            | 512             | 49663                      | 30.13                          | 21298           | 24.95               | 2.33                              |
			
 
				+| Nodes | GPUs / node | Batch size / GPU (TF32 and FP16) | Accumulated Batch size / GPU (TF32 and FP16) | Accumulation steps (TF32 and FP16) | Sequence length | Mixed Precision Throughput | Mixed Precision Strong Scaling | TF32 Throughput | TF32 Strong Scaling | Speedup (Mixed Precision to TF32) |
			
 
				+|-------|-------------|----------------------------------|------------------------------------|-----------------|----------------------------|--------------------------------|-----------------|---------------------|-----------------------------------|-----|
			
 
				+| 1     | 8           | 128 and 256 | 8192 and 8192                    | 64 and 32                          | 128             | 4553                       | 1                              | 2486            | 1                   | 1.83                              |
			
 
				+| 2     | 8           | 128 and 256 | 4096 and 4096                    | 32 and 16                          | 128             | 9191                       | 2.02                           | 4979            | 2.00                | 1.85                              |
			
 
				+| 4     | 8           | 128 and 256 | 2048 and 2048                    | 16 and 8                           | 128             | 18119                      | 3.98                           | 9859            | 3.97                | 1.84                              |
			
 
				+| 8     | 8           | 128 and 256 | 1024 and 1024                    | 8 and 4                            | 128             | 35774                      | 7.86                           | 19815           | 7.97                | 1.81                              |
			
 
				+| 16    | 8           | 128 and 256 | 512 and 512                      | 4 and 2                            | 128             | 70555                      | 15.50                          | 38866           | 15.63               | 1.82                              |
			
 
				+| 32    | 8           | 128 and 256 | 256 and 256                      | 2 and 1                            | 128             | 138294                     | 30.37                          | 75706           | 30.45               | 1.83                              |
			
 
				+| 1     | 8           | 16  and 32  | 4096 and 4096                    | 256 and 128                        | 512             | 1648                       | 1                              | 854             | 1                   | 1.93                              |
			
 
				+| 2     | 8           | 16  and 32  | 2048 and 2048                    | 128 and 64                         | 512             | 3291                       | 2.00                           | 1684            | 1.97                | 1.95                              |
			
 
				+| 4     | 8           | 16  and 32  | 1024 and 1024                    | 64 and 32                          | 512             | 6464                       | 3.92                           | 3293            | 3.86                | 1.96                              |
			
 
				+| 8     | 8           | 16  and 32  | 512 and 512                      | 32 and 16                          | 512             | 13005                      | 7.89                           | 6515            | 7.63                | 2.00                              |
			
 
				+| 16    | 8           | 16  and 32  | 256 and 256                      | 16 and 8                           | 512             | 25570                      | 15.51                          | 12131           | 14.21               | 2.11                              |
			
 
				+| 32    | 8           | 16  and 32  | 128 and 128                      | 8 and 4                            | 512             | 49663                      | 30.13                          | 21298           | 24.95               | 2.33                              |
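
The columns in the pre-training tables appear to be related by simple arithmetic: the accumulated batch size per GPU looks like the per-GPU batch size multiplied by the accumulation steps, the strong-scaling figures look like multi-node throughput divided by single-node throughput, and the speedup looks like mixed-precision throughput divided by TF32 throughput. The sketch below checks those relations against a few values taken from the tables. The function names are hypothetical and are not part of the repository's scripts, and the rounding to two decimals is an assumption about how the published figures were produced.

```python
# Hypothetical helpers for sanity-checking the table columns.
# None of these names come from the BERT repository; they are illustrative only.


def accumulated_batch_size(batch_per_gpu: int, accumulation_steps: int) -> int:
    """Assumed relation: accumulated batch = per-GPU batch x accumulation steps."""
    return batch_per_gpu * accumulation_steps


def strong_scaling(throughput: float, single_node_throughput: float) -> float:
    """Assumed relation: scaling factor = N-node throughput / 1-node throughput."""
    return round(throughput / single_node_throughput, 2)


def precision_speedup(mixed_throughput: float, tf32_throughput: float) -> float:
    """Assumed relation: speedup = mixed-precision throughput / TF32 throughput."""
    return round(mixed_throughput / tf32_throughput, 2)


if __name__ == "__main__":
    # Sequence length 128, single GPU: TF32 uses 128 x 64, FP16 uses 256 x 32.
    print(accumulated_batch_size(128, 64))
    print(accumulated_batch_size(256, 32))
    # Sequence length 128, 2 nodes vs. 1 node, mixed precision.
    print(strong_scaling(9191, 4553))
    # Sequence length 128, 1 node, mixed precision vs. TF32.
    print(precision_speedup(4553, 2486))
```

A check of this kind is one way to catch rows where the batch size and accumulation steps do not multiply out to the stated accumulated batch size.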
			
 
				 
			
 
				 ###### Fine-tuning NVIDIA DGX A100 (8x A100 80GB)