GPU Fabric Preflight Checks for MultiNode Training
Grace Boyle
Given the scale of modern LLMs, the GPU network fabric is one of the most important factors for your training speed. The ability to communicate efficiently across hundreds of nodes lets you maximize GPU utilization and explore more complex training parallelisms such as those in DeepSpeed or Fully Sharded Data Parallel (FSDP). These critically rely on high-performance networking, which requires specialized hardware and software such as InfiniBand or RDMA over Converged Ethernet (RoCE), with current vendor offerings now reaching up to 3200 Gbps.
Estimating Step Time for GPT Language Models
To illustrate how important GPU fabric is to training, let's consider training a 7B GPT-style language model at a 2048 context length on a pair of H100x8 nodes, using an all-reduce operation to average gradients across GPUs. Typical model FLOPs utilization (MFU) falls around 40%. For an H100, that roughly translates to 396 TFLOP/second. For a forward and backward pass of the model with activation recomputation in BF16, the number of TFLOP per GPU can be estimated as:
tflop = model_size_in_B * 4 * 2 * seqlen * global_batch_size / (total_gpus * 1e3)
We already have seqlen = 2048 and model_size_in_B = 7, and a typical batch size might be global_batch_size = 16. This gives us about 114 TFLOP per GPU, so we expect the compute for each step to take roughly 114 TFLOP / 396 TFLOP/second ≈ 0.29 seconds.
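For a sanity check, here is the same arithmetic as a small Python snippet (a sketch of the estimate above; the 989 TFLOP/s BF16 peak and 40% MFU are the assumptions carried over from the text, not measurements):

```python
# Back-of-the-envelope compute time per step for a 7B model on 2 x H100x8.
model_size_in_B = 7
seqlen = 2048
global_batch_size = 16
total_gpus = 16

h100_bf16_peak_tflops = 989        # dense BF16 peak of an H100 SXM (assumed)
mfu = 0.40                         # assumed model FLOPs utilization
achievable_tflops = h100_bf16_peak_tflops * mfu   # ~396 TFLOP/s

# forward + backward with activation recomputation ~ 4 * 2 FLOPs per parameter per token
tflop_per_gpu = model_size_in_B * 4 * 2 * seqlen * global_batch_size / (total_gpus * 1e3)
compute_time = tflop_per_gpu / achievable_tflops

print(f"{tflop_per_gpu:.1f} TFLOP per GPU, ~{compute_time:.2f} s of compute per step")
# 114.7 TFLOP per GPU, ~0.29 s of compute per step
```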
Now we need to estimate the time it takes to average the gradients over all the GPUs. For a 7B model, we'll pass gradients in fp32, meaning we need to communicate S := 7 billion x 4 bytes = 28 GB of gradients every iteration. With n GPUs, a bandwidth-optimal (ring) all-reduce splits the gradients into n chunks and performs 2(n - 1) transfer steps, with each GPU sending a chunk of size S/n over a link of bandwidth B at each step. So the time to perform one round of all-reduce can be estimated as
t = (S/B) * (2*(n-1)/n)
Consider networking with B = 1000 Gbps = 125 GB/second and n = 16: we get t = 0.42 seconds to perform one reduction. So we are network bound, since the time to compute one step is shorter than the time spent communicating between GPUs (0.29 seconds < 0.42 seconds). If we had instead chosen B = 3200 Gbps = 400 GB/second, t = 0.13 seconds and we enter the compute-bound regime. Advanced algorithms will try to overlap the communication of gradients with the computation of later layers. But even if communication and computation overlapped perfectly every step, in the network-bound case there is at least 0.42 - 0.29 = 0.13 seconds of idling, which is almost 0.13 seconds / 0.42 seconds ≈ 30% of the total step time! Very wasteful.
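Here is that comparison in code (a sketch reusing the same numbers as above):

```python
# Ring all-reduce time for the fp32 gradients of a 7B model.
grad_bytes = 7e9 * 4       # S: 7B parameters x 4 bytes (fp32) = 28 GB
n_gpus = 16
compute_time = 0.29        # seconds of compute per step, from the estimate above

for bandwidth_gbps in (1000, 3200):
    bandwidth_bytes = bandwidth_gbps / 8 * 1e9             # Gbps -> bytes/s
    t_allreduce = (grad_bytes / bandwidth_bytes) * 2 * (n_gpus - 1) / n_gpus
    regime = "network bound" if t_allreduce > compute_time else "compute bound"
    print(f"{bandwidth_gbps} Gbps: all-reduce ~{t_allreduce:.2f} s ({regime})")

# 1000 Gbps: all-reduce ~0.42 s (network bound)
# 3200 Gbps: all-reduce ~0.13 s (compute bound)
```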
NCCL Test
NCCL (NVIDIA Collective Communications Library) is a library developed by NVIDIA that provides collective communication primitives for GPU-to-GPU communication, which are used while training large models. If you've profiled your model training before, you've probably noticed the 'ncclKernel_AllReduce' call. It's common to run the all-reduce benchmark included in NVIDIA's nccl-tests to QA a GPU fabric. We ran this on a pair of 1600 Gbps machines, which showed the following:
out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 29.73 0.00 0.00 0 29.65 0.00 0.00 0
16 4 float sum -1 28.43 0.00 0.00 0 28.02 0.00 0.00 0
32 8 float sum -1 28.60 0.00 0.00 0 28.48 0.00 0.00 0
64 16 float sum -1 29.02 0.00 0.00 0 29.01 0.00 0.00 0
128 32 float sum -1 29.14 0.00 0.01 0 29.13 0.00 0.01 0
256 64 float sum -1 29.47 0.01 0.02 0 29.56 0.01 0.02 0
512 128 float sum -1 31.29 0.02 0.03 0 30.83 0.02 0.03 0
1024 256 float sum -1 32.81 0.03 0.06 0 32.54 0.03 0.06 0
2048 512 float sum -1 35.57 0.06 0.11 0 35.71 0.06 0.11 0
4096 1024 float sum -1 37.27 0.11 0.21 0 36.39 0.11 0.21 0
8192 2048 float sum -1 38.52 0.21 0.40 0 38.15 0.21 0.40 0
16384 4096 float sum -1 42.18 0.39 0.73 0 41.70 0.39 0.74 0
32768 8192 float sum -1 43.50 0.75 1.41 0 42.53 0.77 1.44 0
65536 16384 float sum -1 48.09 1.36 2.56 0 45.70 1.43 2.69 0
131072 32768 float sum -1 56.49 2.32 4.35 0 55.72 2.35 4.41 0
262144 65536 float sum -1 103.7 2.53 4.74 0 93.15 2.81 5.28 0
524288 131072 float sum -1 97.47 5.38 10.09 0 97.48 5.38 10.08 0
1048576 262144 float sum -1 101.6 10.32 19.35 0 101.4 10.34 19.39 0
2097152 524288 float sum -1 115.6 18.15 34.03 0 110.6 18.97 35.57 0
4194304 1048576 float sum -1 148.7 28.21 52.88 0 149.5 28.06 52.62 0
8388608 2097152 float sum -1 223.3 37.57 70.44 0 221.7 37.84 70.96 0
16777216 4194304 float sum -1 303.1 55.35 103.77 0 303.8 55.23 103.55 0
33554432 8388608 float sum -1 511.7 65.58 122.96 0 510.9 65.68 123.15 0
67108864 16777216 float sum -1 953.1 70.41 132.03 0 945.6 70.97 133.07 0
134217728 33554432 float sum -1 1581.3 84.88 159.15 0 1578.5 85.03 159.43 0
268435456 67108864 float sum -1 2945.4 91.14 170.88 0 2950.6 90.98 170.58 0
536870912 134217728 float sum -1 5852.4 91.73 172.00 0 5821.6 92.22 172.91 0
1073741824 268435456 float sum -1 11567 92.83 174.05 0 11580 92.72 173.86 0
2147483648 536870912 float sum -1 22518 95.37 178.81 0 22437 95.71 179.46 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 48.8808
For training large models, we really only care about the busbw value in the last row (which in this case is 178.81 GB/s), since during training gradients are bucketed and sent in large payloads such as the 2147483648-byte = 2 GB message size in the last row. It's possible to achieve better efficiency through further tuning, with 95% of the theoretical bandwidth commonly achieved. Network tuning will usually be handled by your cloud provider, but both due diligence and the possibility of silent faults in GPU fabrics mean you should health check your fabric periodically.
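As a cross-check, nccl-tests derives busbw from algbw using the same 2(n-1)/n all-reduce factor from the earlier estimate, so the last row can be reproduced by hand (a quick sketch; the 16 ranks are an assumption consistent with two 8-GPU machines and with the busbw/algbw ratio in the output):

```python
# Reproduce the last-row busbw from the nccl-tests output above.
size_bytes = 2147483648        # 2 GiB message, last row
time_us = 22518                # out-of-place time for that row
n_ranks = 16                   # assumed: 2 machines x 8 GPUs

algbw = size_bytes / (time_us * 1e-6) / 1e9        # GB/s moved by each rank
busbw = algbw * 2 * (n_ranks - 1) / n_ranks        # all-reduce bus bandwidth
print(f"algbw ~{algbw:.2f} GB/s, busbw ~{busbw:.2f} GB/s")
# algbw ~95.37 GB/s, busbw ~178.81 GB/s, matching the table
```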
Torch DDP Throughput
Next, a quick DDP test helps ensure we can efficiently train models end-to-end. To iterate through datasets faster, Distributed Data Parallel (DDP) is commonly used to shard batches across data-parallel workers by copying the model onto each GPU and communicating gradients between workers. To make this easier to test across clouds, we added this torch DDP benchmark to SkyPilot. After running it, we get the following summary:
Benchmark: resnet101 with batch size 32
sec/iter ex/sec sec/iter ex/sec sec/iter ex/sec sec/iter ex/sec
1 GPUs -- no ddp: p50: 0.064s 501/s p75: 0.064s 499/s p90: 0.064s 497/s p95: 0.065s 491/s
1 GPUs -- 1M/1G: p50: 0.064s 502/s p75: 0.064s 502/s p90: 0.064s 502/s p95: 0.064s 501/s
2 GPUs -- 1M/2G: p50: 0.066s 486/s p75: 0.066s 486/s p90: 0.066s 484/s p95: 0.066s 482/s
4 GPUs -- 1M/4G: p50: 0.068s 468/s p75: 0.069s 464/s p90: 0.070s 457/s p95: 0.077s 417/s
8 GPUs -- 1M/8G: p50: 0.069s 465/s p75: 0.069s 464/s p90: 0.069s 463/s p95: 0.069s 463/s
16 GPUs -- 2M/8G: p50: 0.089s 359/s p75: 0.090s 356/s p90: 0.091s 350/s p95: 0.094s 340/s
Here we see that throughput does not scale linearly: per-GPU throughput drops from roughly 500 examples/second on a single GPU to about 360 examples/second at 16 GPUs across two nodes, which is indicative of poor inter-node communication bandwidth.
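If you want to reproduce this kind of check without the SkyPilot harness, a minimal DDP throughput loop looks roughly like the following (a sketch, assuming PyTorch, torchvision, and a torchrun launch on each node; the script name and step counts are arbitrary):

```python
# Minimal DDP throughput check (sketch). Launch on each node with e.g.:
#   torchrun --nproc_per_node=8 --nnodes=2 --node_rank=<0|1> \
#            --master_addr=<head-ip> --master_port=29500 ddp_bench.py
import os
import time
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torchvision.models.resnet101().cuda(), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    # Synthetic data so only compute + gradient communication are measured.
    batch = torch.randn(32, 3, 224, 224, device="cuda")
    labels = torch.randint(0, 1000, (32,), device="cuda")

    times = []
    for _ in range(60):
        torch.cuda.synchronize()
        start = time.time()
        opt.zero_grad()
        loss_fn(model(batch), labels).backward()   # gradients all-reduced here
        opt.step()
        torch.cuda.synchronize()
        times.append(time.time() - start)

    if dist.get_rank() == 0:
        steady = sorted(times[10:])                # drop warmup iterations
        p50 = steady[len(steady) // 2]
        print(f"p50: {p50:.3f}s/iter, {32 / p50:.0f} examples/s per GPU")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```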
Conclusion
The dream of any ML engineer is linear scaling: 2x the GPUs should mean half the training time. However, the only way to practically achieve this is with networking that can communicate gradients at least as fast as your GPUs can compute them.
Is your team or organization struggling with distributed training? I'd love to chat and see if Trainy can help! Send me a message on LinkedIn or reach out at founders@trainy.ai