Optimizing SelectionDAG
After the IR has been converted into a SelectionDAG, many opportunities may arise to optimize the DAG itself. These optimizations take place in the DAGCombiner phase. Such opportunities can arise from the set of architecture-specific instructions that a target provides.
Let's take an example:
    #include <arm_neon.h>

    unsigned hadd(uint32x4_t a) {
      return a[0] + a[1] + a[2] + a[3];
    }
The preceding example in IR looks like the following:
    define i32 @hadd(<4 x i32> %a) nounwind {
      %vecext = extractelement <4 x i32> %a, i32 3
      %vecext1 = extractelement <4 x i32> %a, i32 2
      %add = add i32 %vecext, %vecext1
      %vecext2 = extractelement <4 x i32> %a, i32 1
      %add3 = add i32 %add, %vecext2
      %vecext4 = extractelement <4 x i32> %a, i32 0
      %add5 = add i32 %add3, %vecext4
      ret i32 %add5
    }
The example extracts each element from a vector of type <4 x i32>, one at a time, and adds the elements together to give a scalar result.
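To make the chain of operations in the IR concrete, here is a minimal sketch in C that performs the same extracts and adds in the same order, one value per statement. It is only an illustration: the type u32x4 and the function hadd_scalar_chain are hypothetical names, and the type is built with the GCC/Clang vector_size extension as a stand-in for uint32x4_t so that the sketch compiles without <arm_neon.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for uint32x4_t (assumption: GCC or Clang,
   which support the vector_size attribute and lane subscripting). */
typedef uint32_t u32x4 __attribute__((vector_size(16)));

/* Mirrors the IR one statement per instruction: extract lanes 3 and 2,
   add them, then fold in lane 1 and finally lane 0. */
uint32_t hadd_scalar_chain(u32x4 a) {
    uint32_t vecext  = a[3];            /* extractelement ..., i32 3 */
    uint32_t vecext1 = a[2];            /* extractelement ..., i32 2 */
    uint32_t add     = vecext + vecext1;
    uint32_t vecext2 = a[1];            /* extractelement ..., i32 1 */
    uint32_t add3    = add + vecext2;
    uint32_t vecext4 = a[0];            /* extractelement ..., i32 0 */
    uint32_t add5    = add3 + vecext4;
    return add5;                        /* ret i32 %add5 */
}
```

Written this way, the function needs four extracts and three dependent adds, which is the shape the code generator sees in the DAG before any combining is done.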
Advanced architectures such as ARM have one single...