
I've mostly done single-GPU training with PyTorch. I recently decided I want to use a distributed approach for model training on a cluster with GPUs, but I'm unsure which framework to use. I gather that, while Spark is often the preferred tool for many distributed computing needs, it doesn't work well with PyTorch.

What frameworks are most commonly used? Please rank them so I know which one is most common.

I asked Google Bard and it replied with the following:

  • DeepSpeed is a distributed training framework that is built on top of PyTorch. It is designed to scale PyTorch models to large-scale clusters. DeepSpeed includes a number of features that can improve training performance, such as model parallelism, mixed precision training, and pipeline parallelism.
  • Horovod is a distributed training framework that is compatible with a variety of deep learning frameworks, including PyTorch. Horovod provides a high-performance communication layer that can significantly improve the speed of distributed training.
  • NCCL is a high-performance library for collective communication that is used by a number of deep learning frameworks, including PyTorch. NCCL can be used to improve the speed of distributed training by reducing the amount of time spent communicating between processes.
  • PyTorch Lightning is a high-level library that makes it easy to build and train PyTorch models. PyTorch Lightning provides a number of features that can improve the speed and reproducibility of distributed training, such as automatic checkpointing and distributed logging.
  • Ray is a distributed computing framework that can be used to train PyTorch models on large-scale clusters. Ray provides a number of features that can improve the speed and scalability of distributed training, such as automatic resource management and fault tolerance.
  • Fairscale is a distributed training library that is built on top of torch.distributed. It provides a number of features that are not available in torch.distributed out of the box, such as sharded optimizers and parameters, activation checkpointing helpers, and memory-efficient data parallelism. Fairscale is a good choice for users who want to take advantage of the latest features and optimizations for distributed training.
  • torch.distributed is PyTorch's built-in distributed training package. It provides communication primitives and APIs that support a variety of training patterns, including data parallelism (via DistributedDataParallel), model parallelism, and pipeline parallelism.
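For context on what the built-in option looks like in practice, here is a minimal sketch of data parallelism with torch.distributed's DistributedDataParallel (DDP). To keep it runnable anywhere, it runs a single process on CPU with the gloo backend; the toy Linear model, learning rate, and port are placeholders, and a real multi-GPU job would launch one process per GPU with torchrun and use the nccl backend instead.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step() -> float:
    # Single-process demo; torchrun normally sets these env vars
    # and launches one worker process per GPU.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

    model = torch.nn.Linear(10, 1)   # toy model, stands in for yours
    ddp_model = DDP(model)           # gradients are all-reduced across ranks

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    optimizer.zero_grad()
    loss = ddp_model(torch.randn(8, 10)).sum()
    loss.backward()                  # DDP syncs gradients here
    optimizer.step()

    dist.destroy_process_group()
    return float(loss.item())

if __name__ == "__main__":
    print(train_step())
```

The training loop itself is unchanged from single-GPU code; DDP hooks into `backward()` to average gradients across workers, which is why it is the usual starting point before reaching for the higher-level frameworks above.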

I recognize that different use cases may call for different solutions. But just as one can use both TensorFlow and PyTorch, yet PyTorch has grown far more common (source), I imagine these frameworks aren't used equally either, so I'm looking for the most common one so I can follow industry best practices.

Let me know! Thanks.
