References
For more information on the topics covered in this chapter, refer to the following resources:
- SMP library: https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-sm-sdk.html#model-parallel-customize-container
- Amazon SageMaker example: https://github.com/aws/amazon-sagemaker-examples/blob/main/training/distributed_training/pytorch/model_parallel/gpt2/smp-train-gpt-simple.ipynb
- SM DDP library: https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html
- Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training: https://arxiv.org/pdf/2111.05972.pdf
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning: https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer: https://arxiv.org/pdf/1701.06538.pdf
- Pangu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing: https://arxiv.org/abs/2303.10845