GPUs are designed for parallel computing
A CPU (central processing unit) is optimized for serial processing. If we ran the attention heads one after another on a CPU, training an efficient transformer model would take far longer. Very small educational transformers can run on CPUs, but they do not qualify as state-of-the-art models.
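A GPU (graphics processing unit) is designed for parallel processing. Transformer models were designed with parallel hardware such as GPUs in mind, not serial processors such as CPUs: the attention heads of a layer do not depend on one another, so they can be computed at the same time instead of one by one.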
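
The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a benchmark: the function names (`attention_serial`, `attention_batched`), the tensor sizes, and the use of NumPy are all choices made for this example, and NumPy itself runs on the CPU. The point is the structure of the computation: the loop handles one head at a time, while the batched version expresses all heads as a single operation, which is the form a GPU can spread across its many cores.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_serial(q, k, v):
    """Scaled dot-product attention, one head at a time.

    q, k, v have shape (heads, seq_len, d_head). Hypothetical helper
    written for this illustration.
    """
    d_head = q.shape[-1]
    outputs = []
    for h in range(q.shape[0]):  # each head waits for the previous one
        scores = q[h] @ k[h].T / np.sqrt(d_head)
        outputs.append(softmax(scores) @ v[h])
    return np.stack(outputs)

def attention_batched(q, k, v):
    """The same computation for all heads in one batched operation."""
    d_head = q.shape[-1]
    # matmul treats the leading axis as a batch of independent matrices.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    return softmax(scores) @ v

# Illustrative sizes only (assumed for this sketch).
rng = np.random.default_rng(0)
heads, seq_len, d_head = 8, 16, 64
q, k, v = (rng.standard_normal((heads, seq_len, d_head)) for _ in range(3))

serial_out = attention_serial(q, k, v)
batched_out = attention_batched(q, k, v)

# Both paths compute the same values; only the execution pattern differs.
print(np.allclose(serial_out, batched_out))
```

Because no head needs the result of another head, nothing in the batched version forces an ordering. On parallel hardware, that independence is what allows the heads to be processed simultaneously.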