iTransformer
We have already talked at length about the inadequacies of Transformer architectures in handling multivariate time series, namely the inefficient capture of locality, the order-agnostic attention mechanism muddling information across time steps, and so on. In 2024, Yong Liu et al. took a slightly different view of this problem and proposed what they describe, in their own words, as “an extreme case of patching.”
The architecture of iTransformer
They argued that the Transformer architecture is not ineffective for time series forecasting; rather, it is improperly used. The authors suggested that we flip the inputs to the Transformer so that attention is applied not across time steps but across variates, that is, the different series or features of the multivariate input. Figure 16.6 shows the difference clearly.
Figure 16.6: Transformers vs iTransformers—the difference
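To make the inversion concrete, here is a minimal sketch of the idea, assuming a PyTorch-style setup: each variate’s whole look-back window is embedded as a single token, and attention then mixes information across variates rather than across time steps. The class name, layer sizes, and shapes are illustrative assumptions, not the authors’ reference implementation.

```python
import torch
import torch.nn as nn


class InvertedEncoderSketch(nn.Module):
    def __init__(self, seq_len: int, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # Each variate's entire look-back window becomes one token.
        self.variate_embedding = nn.Linear(seq_len, d_model)
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_steps, variates) -- the usual Transformer layout
        x = x.permute(0, 2, 1)               # -> (batch, variates, time_steps)
        tokens = self.variate_embedding(x)   # -> (batch, variates, d_model)
        # Attention now operates over variate tokens, not time-step tokens.
        out, _ = self.attention(tokens, tokens, tokens)
        return out                           # (batch, variates, d_model)


# Quick shape check: 32 samples, 96 time steps, 7 variates.
batch = torch.randn(32, 96, 7)
print(InvertedEncoderSketch(seq_len=96)(batch).shape)  # torch.Size([32, 7, 64])
```

The only structural change from a vanilla encoder is the `permute` before embedding: the tokens the attention mechanism sees now correspond to series, not to time steps.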
In vanilla Transformers, we use the input as (Batch x Time steps...