We started this chapter by learning about dynamic parallelism, which is a paradigm that allows us to launch and manage kernels directly on the GPU from other kernels. We saw how we can use this to implement a quicksort algorithm on the GPU directly. We then learned about vectorized datatypes in CUDA, and saw how we can use these to speed up memory reads from global device memory. We then learned about CUDA Warps, which are small units of 32 threads or less on the GPU, and we saw how threads within a single Warp can directly read and write to each other's registers using Warp Shuffling. We then looked at how we can write a few basic operations in PTX assembly, including import operations such as determining the lane ID and splitting a 64-bit variable into two 32-bit variables. Finally, we ended this chapter by writing a new performance-optimized summation kernel that...
United States
Great Britain
India
Germany
France
Canada
Russia
Spain
Brazil
Australia
Singapore
Hungary
Philippines
Mexico
Thailand
Ukraine
Luxembourg
Estonia
Lithuania
Norway
Chile
South Korea
Ecuador
Colombia
Taiwan
Switzerland
Indonesia
Cyprus
Denmark
Finland
Poland
Malta
Czechia
New Zealand
Austria
Turkey
Sweden
Italy
Egypt
Belgium
Portugal
Slovenia
Ireland
Romania
Greece
Argentina
Malaysia
South Africa
Netherlands
Bulgaria
Latvia
Japan
Slovakia