Freezing layers
The first technique we introduce here is called layer freezing. At a high level, it rests on the assumption that different layers of a model may converge at different stages of the training process. Thus, we can freeze the layers that converge earlier and stop updating them.
Here, freezing refers to the following two operations:
- We do not store the intermediate results (activations) of the frozen layers during forward propagation.
- We also skip gradient computation for the frozen layers during backward propagation (see the sketch after this list).
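The following is a minimal sketch of these two operations in PyTorch. The model structure, layer sizes, and the `freeze_layer` helper are illustrative assumptions, not the book's exact setup: setting `requires_grad=False` skips gradient computation for the frozen parameters, and running the frozen layer under `torch.no_grad()` avoids storing its intermediate activations for backpropagation.

```python
import torch
import torch.nn as nn

# Hypothetical three-layer model; d_model and nhead are illustrative choices.
model = nn.Sequential(
    nn.TransformerEncoderLayer(d_model=512, nhead=8),  # layer 0
    nn.TransformerEncoderLayer(d_model=512, nhead=8),  # layer 1
    nn.TransformerEncoderLayer(d_model=512, nhead=8),  # layer 2
)

def freeze_layer(layer: nn.Module) -> None:
    """Mark a layer as frozen: its parameters receive no gradients."""
    for param in layer.parameters():
        param.requires_grad = False

# Suppose layer 0 has converged early, so we freeze it.
freeze_layer(model[0])

x = torch.randn(10, 32, 512)  # (sequence, batch, d_model)

# Forward pass: run the frozen layer under no_grad() so its intermediate
# activations are not stored (first operation), then run the rest normally.
with torch.no_grad():
    hidden = model[0](x)
out = model[2](model[1](hidden))

# Backward pass: gradients are computed only for the unfrozen layers
# (second operation).
out.sum().backward()
print(model[0].self_attn.in_proj_weight.grad)              # None -> frozen
print(model[1].self_attn.in_proj_weight.grad is not None)  # True -> trainable
```

Because the frozen layer is the earliest one, stopping autograd there is safe: no layer upstream of it needs gradients.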
We illustrate this technique in the following diagram:
As shown in the preceding diagram, we assume the input data has already been tokenized and can be fed directly into the model, for either training or serving. The model has three layers; each is an independent transformer layer, and each transformer layer is allocated on a separate GPU.
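A sketch of this setup is shown below, assuming three GPUs are available. The device ids, layer sizes, and the use of embeddings as the "tokenized" input are assumptions made for illustration; activations are simply moved to the next layer's GPU at each step.

```python
import torch
import torch.nn as nn

# One independent transformer layer per GPU, matching the diagram.
devices = ["cuda:0", "cuda:1", "cuda:2"]
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=512, nhead=8).to(dev) for dev in devices]
)

# Tokenized input, represented here as embeddings on the first GPU.
x = torch.randn(10, 32, 512, device=devices[0])  # (sequence, batch, d_model)

# Forward pass in pipeline order, moving activations between GPUs.
hidden = x
for layer, dev in zip(layers, devices):
    hidden = layer(hidden.to(dev))
```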
Now, let's discuss...