When it comes to performance, executing one instruction at a time on a single-core processor is essentially the slowest way to implement an algorithm or any other functionality. From there, the first step up is to scale this single execution flow into multiple flows by scheduling instructions simultaneously across the individual functional units of a single processor core, as superscalar designs do.
The next step to increase performance is to add more cores, which of course complicates scheduling even further and introduces potential latency issues, with critical tasks being postponed because less critical tasks are blocking shared resources. The use of general-purpose processors is also very limiting for certain workloads, especially those that are embarrassingly parallel.
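As a minimal sketch of what this multi-core scaling looks like in practice, the following C++ snippet splits a dataset across worker threads. The chunking scheme and the summation workload are illustrative assumptions, not anything prescribed here; they merely show how a single execution flow becomes several:

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1'000'000, 1);
    // hardware_concurrency() may return 0, so fall back to one thread.
    const unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;

    // Give each core its own contiguous chunk of the dataset.
    const std::size_t chunk = data.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end =
                (t == num_threads - 1) ? data.size() : begin + chunk;
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& w : workers) { w.join(); }

    // Combine the per-thread results in the main thread.
    std::cout << std::accumulate(partial.begin(), partial.end(), 0LL) << '\n';
}
```

Even in this simple sketch, the coordination cost is visible: the work must be partitioned, the threads joined, and the partial results merged, and none of this guarantees that a latency-critical task elsewhere on the system will get a core when it needs one.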
For tasks where a single large dataset has to be processed with the same algorithm applied to each element of the set, the use of general-purpose processors leaves much of the potential performance untapped; hardware designed for massively parallel, data-parallel workloads is a far better match.
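To make the pattern concrete, here is a hedged sketch of the same-algorithm-per-element idea using the standard C++17 parallel algorithms; the squaring operation and the par_unseq policy are illustrative choices, and on dedicated hardware the same map pattern would instead be expressed through an API such as CUDA or OpenCL:

```cpp
#include <algorithm>
#include <execution>
#include <iostream>
#include <vector>

int main() {
    std::vector<float> data(1'000'000, 2.0f);

    // The same operation is applied independently to every element,
    // so the runtime is free to spread the work across cores and
    // SIMD lanes; this is the pattern a GPU accelerates at massive scale.
    std::transform(std::execution::par_unseq,
                   data.begin(), data.end(), data.begin(),
                   [](float x) { return x * x; });

    std::cout << data.front() << '\n';  // prints 4
}
```

Because each element is independent, there is no scheduling contention between tasks here; the only limits are how many lanes of execution the hardware provides and how fast the data can be fed to them.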