In all the deep learning models covered in this book, irrespective of the learning paradigm, three basic calculations were necessary: multiplication, addition, and application of an activation function.
The first two are the components of matrix multiplication: the weight matrix W must be multiplied by the input matrix X, generally expressed as WᵀX. Matrix multiplication is computationally expensive on a CPU, and although a GPU parallelizes the operation, there is still scope for improvement.
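As a minimal sketch of how these three operations combine, consider the forward pass of a single fully connected layer, activation(WᵀX + b). The NumPy snippet below is illustrative only; the shapes, the bias term, and the choice of ReLU as the activation are assumptions made for the example, not taken from the text:

```python
import numpy as np

def relu(z):
    # Example activation function, applied elementwise after the matrix product
    return np.maximum(0.0, z)

# Illustrative shapes: 4 input features, 3 output units, batch of 2 examples
W = np.random.randn(4, 3)   # weight matrix W
X = np.random.randn(4, 2)   # input matrix X (one column per example)
b = np.zeros((3, 1))        # bias vector

# The two expensive steps: multiply (W.T @ X) and add (+ b),
# followed by the comparatively cheap elementwise activation
output = relu(W.T @ X + b)
print(output.shape)         # (3, 2)
```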
The TPU has a matrix multiplier unit (MXU) containing 65,536 8-bit integer multipliers, which gives a peak throughput of 92 TOPS. The major difference between GPU and TPU multiplication is that GPUs use floating-point multipliers, while TPUs use 8-bit integer multipliers. TPUs also contain a Unified Buffer (UB), 24 MB of SRAM that works as registers, and an Activation Unit (AU) that applies the activation functions in hardware.
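To see why 8-bit integer multipliers can stand in for floating-point ones, the sketch below shows the basic idea of symmetric linear quantization: floats are mapped to int8 with a scale factor, multiplied as integers, and rescaled afterwards. This is a simplified illustration of the general technique, not the TPU's actual quantization pipeline, and all names and shapes here are assumptions for the example:

```python
import numpy as np

def quantize_int8(x):
    # Symmetric linear quantization: map floats into the int8 range [-127, 127]
    scale = np.max(np.abs(x)) / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

W = np.random.randn(4, 3).astype(np.float32)
X = np.random.randn(4, 2).astype(np.float32)

Wq, w_scale = quantize_int8(W)
Xq, x_scale = quantize_int8(X)

# Integer matrix multiply; accumulate in int32, since products of
# 8-bit values would overflow an 8-bit accumulator
int_product = Wq.T.astype(np.int32) @ Xq.astype(np.int32)

# Rescale the integer result back to floating point
approx = int_product.astype(np.float32) * (w_scale * x_scale)

exact = W.T @ X
print(np.max(np.abs(exact - approx)))  # small quantization error
```

The integer product approximates the floating-point one to within a small quantization error, which is typically acceptable for inference workloads.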