This article is an excerpt from the book, "Mastering Graphics Programming with Vulkan", by Marco Castorina, Gabriel Sassone. Mastering Graphics Programming with Vulkan starts by familiarizing you with the foundations of a modern rendering engine. This book will guide you through GPU-driven rendering and show you how to drive culling and rendering from the GPU to minimize CPU overhead. Finally, you’ll explore advanced rendering techniques like temporal anti-aliasing and ray tracing.
In modern graphics pipelines, optimizing the geometry stage can have a significant impact on overall rendering performance. This article delves into the concept of meshlets—an approach to breaking down large meshes into smaller, more manageable chunks for efficient GPU processing. By subdividing meshes into meshlets, we can enhance culling techniques, reduce unnecessary shading, and better handle complex geometry. Join us as we explore how meshlets work, their benefits, and practical steps to implement them.
Breaking down large meshes into meshlets
In this article, we are going to focus primarily on the geometry stage of the pipeline, the one before the shading stage. Adding some complexity to the geometry stage of the pipeline will pay dividends in later stages as we’ll reduce the number of pixels that need to be shaded.
Note
When we refer to the geometry stage of the graphics pipeline, we don’t mean geometry shaders. The geometry stage of the pipeline refers to input assembly (IA), vertex processing, and primitive assembly (PA). Vertex processing can, in turn, run one or more of the following shaders: vertex, geometry, tessellation, task, and mesh shaders.
Content geometry comes in many shapes, sizes, and levels of complexity. A rendering engine must be able to deal with meshes ranging from small, detailed objects to large terrains. Large meshes (think terrain or buildings) are usually broken down by artists so that the rendering engine can pick different levels of detail based on each object's distance from the camera.
Breaking down meshes into smaller chunks can help cull geometry that is not visible, but some of these meshes are still large enough that we need to process them in full, even if only a small portion is visible.
Meshlets have been developed to address these problems. Each mesh is subdivided into groups of vertices (usually 64) that can be more easily processed on the GPU.
The following image illustrates how meshes can be broken down into meshlets:
Figure 6.1 – A meshlet subdivision example
These vertices can make up an arbitrary number of triangles, but we usually tune this value according to the hardware we are running on. In Vulkan, the recommended value is 126 triangles per meshlet. As explained in https://developer.nvidia.com/blog/introduction-turing-mesh-shaders/, this number leaves some memory free for writing the primitive count with each meshlet.
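The arithmetic behind that limit can be sketched as follows. This is only an illustration based on our reading of the linked NVIDIA article, which describes primitive indices being allocated in 128-byte blocks with 4 bytes reserved for the primitive count; the constant and function names are hypothetical.

```cpp
#include <cstddef>

// Assumed from the linked NVIDIA article: index memory is allocated in
// 128-byte blocks and 4 bytes are reserved for the primitive count.
constexpr std::size_t k_block_bytes = 128;
constexpr std::size_t k_count_bytes = 4;

// Bytes used by a meshlet that stores one byte per triangle corner.
constexpr std::size_t meshlet_index_bytes( std::size_t triangle_count ) {
    return triangle_count * 3 + k_count_bytes;
}

// Number of 128-byte blocks the meshlet occupies, rounded up.
constexpr std::size_t meshlet_blocks( std::size_t triangle_count ) {
    return ( meshlet_index_bytes( triangle_count ) + k_block_bytes - 1 ) / k_block_bytes;
}

static_assert( meshlet_index_bytes( 126 ) == 382, "126 triangles use 382 bytes" );
static_assert( meshlet_blocks( 126 ) == 3, "126 triangles fit in three blocks" );
static_assert( meshlet_blocks( 127 ) == 4, "127 triangles spill into a fourth block" );
```

Under these assumptions, 126 is the largest triangle count that still fits in three blocks, which is why going one triangle higher wastes most of an extra block.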
Note
At the time of writing, mesh and task shaders are only available on Nvidia hardware through its extension. While some of the APIs described in this chapter are specific to this extension, the concepts can be generally applied and implemented using generic compute shaders. A more generic version of this extension is currently being worked on by the Khronos committee, so mesh and task shaders should soon be available from other vendors!
Now that each chunk contains a much smaller number of triangles, we gain much finer-grained control: we can cull individual meshlets that are not visible or are occluded by other objects.
Together with the list of vertices and triangles, we also generate some additional data for each meshlet that will be very useful later on to perform back-face, frustum, and occlusion culling.
One additional possibility (that will be added in the future) is to choose the level of detail (LOD) of a mesh and, thus, a different subset of meshlets based on any desired heuristic.
The first piece of additional data is the bounding sphere of a meshlet, as shown in the following screenshot:
Figure 6.2 – A meshlet bounding spheres example; some of the larger spheres have been hidden for clarity
Some of you might ask: why not axis-aligned bounding boxes (AABBs)? AABBs require at least two vec3 of data: one for the center and one for the half-size vector. Another encoding could be to store the minimum and maximum corners, which also takes two vec3. Spheres, instead, can be encoded with a single vec4: a vec3 for the center plus the radius.
Given that we might need to process millions of meshlets, each saved byte counts! Spheres can also be more easily tested for frustum and occlusion culling, as we will describe later in the chapter.
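To make the sphere test concrete, here is a minimal CPU-side sketch of a sphere-versus-frustum check. The Vec3, Sphere, and Plane types and the sphere_in_frustum helper are hypothetical names introduced for illustration, and the sketch assumes the six planes have unit-length normals pointing towards the inside of the frustum.

```cpp
#include <array>

struct Vec3 { float x, y, z; };

// Bounding sphere packed as described above: center plus radius (16 bytes).
struct Sphere { Vec3 center; float radius; };

// Plane in the form dot(normal, p) + d = 0. This sketch assumes a
// unit-length normal that points towards the inside of the frustum.
struct Plane { Vec3 normal; float d; };

inline float dot( const Vec3& a, const Vec3& b ) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Returns false only when the sphere lies completely outside one plane.
// The test is conservative: it may keep a sphere that is just outside a
// frustum corner, but it never rejects a sphere that is visible.
inline bool sphere_in_frustum( const Sphere& s, const std::array<Plane, 6>& planes ) {
    for ( const Plane& p : planes ) {
        // Signed distance from the center to the plane.
        if ( dot( p.normal, s.center ) + p.d < -s.radius ) {
            return false;
        }
    }
    return true;
}

static_assert( sizeof( Sphere ) == 4 * sizeof( float ), "sphere fits in a vec4" );
```

Each plane costs one dot product and one comparison, whereas a box test typically has to consider the box extents along every plane normal.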
The next additional piece of data that we’re going to use is the meshlet cone, as shown in the following screenshot:
Figure 6.3 – A meshlet cone example; not all cones are displayed for clarity
The cone indicates the direction a meshlet is facing and will be used for back-face culling.
Now that we have a better understanding of why meshlets are useful and how we can use them to improve the culling of larger meshes, let’s see how we generate them in code!
We are using an open source library called MeshOptimizer (https://github.com/zeux/meshoptimizer) to generate the meshlets. An alternative library is meshlete (https://github.com/JarkkoPFC/meshlete), and we encourage you to try both to find the one that best suits your needs.
After we have loaded the data (vertices and indices) for a given mesh, we are going to generate the list of meshlets. First, we determine the maximum number of meshlets that could be generated for our mesh and allocate memory for the vertices and indices arrays that will describe the meshlets:
const sizet max_meshlets = meshopt_buildMeshletsBound(
    indices_accessor.count, max_vertices, max_triangles );

Array<meshopt_Meshlet> local_meshlets;
local_meshlets.init( temp_allocator, max_meshlets, max_meshlets );

Array<u32> meshlet_vertex_indices;
meshlet_vertex_indices.init( temp_allocator,
    max_meshlets * max_vertices, max_meshlets * max_vertices );

Array<u8> meshlet_triangles;
meshlet_triangles.init( temp_allocator,
    max_meshlets * max_triangles * 3, max_meshlets * max_triangles * 3 );
Notice the types for the indices and triangle arrays. We are not modifying the original vertex or index buffer, but only generating a list of indices in the original buffers. Another interesting aspect is that we only need 1 byte to store the triangle indices. Again, saving memory is very important to keep meshlet processing efficient!
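The saving can be made concrete with a small sketch of the worst-case allocation sizes. The MeshletScratchSizes struct and meshlet_scratch_sizes function are hypothetical helpers, not part of MeshOptimizer; they simply mirror the element counts passed to the init calls above.

```cpp
#include <cstddef>
#include <cstdint>

struct MeshletScratchSizes {
    std::size_t vertex_index_bytes;  // u32 indices into the original vertex buffer
    std::size_t triangle_bytes;      // u8 indices local to each meshlet
};

// Worst-case sizes in bytes for the two arrays that describe the meshlets.
constexpr MeshletScratchSizes meshlet_scratch_sizes( std::size_t max_meshlets,
                                                     std::size_t max_vertices,
                                                     std::size_t max_triangles ) {
    return {
        max_meshlets * max_vertices * sizeof( std::uint32_t ),
        max_meshlets * max_triangles * 3 * sizeof( std::uint8_t )
    };
}

// 1,000 meshlets with 64 vertices and 124 triangles each.
static_assert( meshlet_scratch_sizes( 1000, 64, 124 ).vertex_index_bytes == 256000, "" );
static_assert( meshlet_scratch_sizes( 1000, 64, 124 ).triangle_bytes == 372000, "" );
```

One byte per triangle index is enough because each index refers to a vertex inside its own meshlet, and a meshlet holds at most max_vertices entries, well below 256. Storing the same triangle indices as 32-bit values would take four times as much memory.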
The next step is to generate our meshlets:
const sizet max_vertices = 64;
const sizet max_triangles = 124;
const f32 cone_weight = 0.0f;
sizet meshlet_count = meshopt_buildMeshlets(
    local_meshlets.data,
    meshlet_vertex_indices.data,
    meshlet_triangles.data,
    indices,
    indices_accessor.count,
    vertices,
    position_buffer_accessor.count,
    sizeof( vec3s ),
    max_vertices,
    max_triangles,
    cone_weight );
As mentioned in the preceding step, we need to tell the library the maximum number of vertices and triangles that a meshlet can contain. In our case, we use 64 vertices and 124 triangles, which stay within the values recommended for the Vulkan API. The other parameters include the original vertex and index buffers, and the arrays we have just created that will contain the data for the meshlets.
Let’s have a better look at the data structure of each meshlet:
struct meshopt_Meshlet
{
    unsigned int vertex_offset;
    unsigned int triangle_offset;
    unsigned int vertex_count;
    unsigned int triangle_count;
};
Each meshlet is described by two offsets and two counts, one pair for the vertex indices and one pair for the indices of the triangles. Note that these offsets refer to meshlet_vertex_indices and meshlet_triangles, which are populated by the library, not to the original vertex and index buffers of the mesh.
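The double indirection is easier to see in code. The following sketch expands one meshlet back into indices of the original vertex buffer. The Meshlet struct repeats the layout shown above so that the example is self-contained, and expand_meshlet is a hypothetical helper rather than a MeshOptimizer function.

```cpp
#include <cstdint>
#include <vector>

// Same layout as meshopt_Meshlet, repeated to keep the sketch self-contained.
struct Meshlet {
    unsigned int vertex_offset;
    unsigned int triangle_offset;
    unsigned int vertex_count;
    unsigned int triangle_count;
};

// Expands one meshlet into indices of the mesh's original vertex buffer.
inline std::vector<std::uint32_t> expand_meshlet(
    const Meshlet& meshlet,
    const std::vector<std::uint32_t>& meshlet_vertex_indices,
    const std::vector<std::uint8_t>& meshlet_triangles ) {

    std::vector<std::uint32_t> indices;
    indices.reserve( meshlet.triangle_count * 3 );

    for ( unsigned int i = 0; i < meshlet.triangle_count * 3; ++i ) {
        // Local index in the range [0, vertex_count).
        const std::uint8_t local = meshlet_triangles[ meshlet.triangle_offset + i ];
        // Index into the original vertex buffer.
        indices.push_back( meshlet_vertex_indices[ meshlet.vertex_offset + local ] );
    }
    return indices;
}
```

A mesh shader performs the same two lookups per vertex: first the 1-byte local index, then the 32-bit index into the original vertex buffer.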
Now that we have the meshlet data, we need to upload it to the GPU. To keep the data size to a minimum, we store the positions at full resolution while we compress the normals to 1 byte for each dimension and UV coordinates to half-float for each dimension. In pseudocode, this is as follows:
meshlet_vertex_data.normal = ( normal + 1.0 ) * 127.0;
meshlet_vertex_data.uv_coords = quantize_half( uv_coords );
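As a minimal sketch, the normal compression in the pseudocode above could be written as follows. The function names are hypothetical, and the clamp is an addition that the pseudocode does not show. For the UV coordinates, MeshOptimizer provides a meshopt_quantizeHalf helper that could play the role of quantize_half.

```cpp
#include <array>
#include <cstdint>

// Maps one normal component from [-1, 1] to an unsigned byte in [0, 254].
inline std::uint8_t quantize_normal_component( float n ) {
    // Clamp first so that slightly out-of-range input cannot overflow the byte.
    const float clamped = n < -1.0f ? -1.0f : ( n > 1.0f ? 1.0f : n );
    return static_cast<std::uint8_t>( ( clamped + 1.0f ) * 127.0f );
}

// Inverse mapping, as a shader would apply it after loading the byte.
inline float dequantize_normal_component( std::uint8_t q ) {
    return static_cast<float>( q ) / 127.0f - 1.0f;
}

// Packs a full normal into three bytes.
inline std::array<std::uint8_t, 3> quantize_normal( float x, float y, float z ) {
    return { quantize_normal_component( x ),
             quantize_normal_component( y ),
             quantize_normal_component( z ) };
}
```

With this mapping, -1, 0, and 1 are stored as 0, 127, and 254, and any other value is reconstructed with an error below 1/127.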
The next step is to extract the additional data (bounding sphere and cone) for each meshlet:
for ( u32 m = 0; m < meshlet_count; ++m ) {
    meshopt_Meshlet& local_meshlet = local_meshlets[ m ];
    meshopt_Bounds meshlet_bounds = meshopt_computeMeshletBounds(
        meshlet_vertex_indices.data + local_meshlet.vertex_offset,
        meshlet_triangles.data + local_meshlet.triangle_offset,
        local_meshlet.triangle_count,
        vertices,
        position_buffer_accessor.count,
        sizeof( vec3s ) );
    ...
}
We loop over all the meshlets and we call the MeshOptimizer API that computes the bounds for each meshlet. Let’s see in more detail the structure of the data that is returned:
struct meshopt_Bounds
{
    float center[3];
    float radius;
    float cone_apex[3];
    float cone_axis[3];
    float cone_cutoff;
    signed char cone_axis_s8[3];
    signed char cone_cutoff_s8;
};
The first four floats represent the bounding sphere. Next, we have the cone definition, which is composed of the cone direction (cone_axis) and the cone cutoff (cone_cutoff), which the MeshOptimizer header documents as the cosine of half the cone angle. We are not using the cone_apex value as it makes the back-face culling computation more expensive. However, using it can lead to better results.
Once again, notice that quantized values (cone_axis_s8 and cone_cutoff_s8) help us reduce the size of the data required for each meshlet.
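To show how these values are consumed, here is a CPU-side sketch of the apex-free back-face test that the MeshOptimizer header documents for this structure. The Vec3 type and the cone_backfacing helper are hypothetical names, and the sketch assumes that all inputs are in the same space as the camera position.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

inline Vec3 sub( const Vec3& a, const Vec3& b ) {
    return { a.x - b.x, a.y - b.y, a.z - b.z };
}

inline float dot( const Vec3& a, const Vec3& b ) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

inline float length( const Vec3& v ) {
    return std::sqrt( dot( v, v ) );
}

// Returns true when the whole meshlet faces away from the camera, so it can
// be skipped. Uses the bounding sphere instead of cone_apex.
inline bool cone_backfacing( const Vec3& center, float radius,
                             const Vec3& cone_axis, float cone_cutoff,
                             const Vec3& camera_position ) {
    const Vec3 to_center = sub( center, camera_position );
    return dot( to_center, cone_axis ) >=
           cone_cutoff * length( to_center ) + radius;
}
```

Because the test reuses the sphere center and radius that are already loaded for frustum culling, it needs no extra per-meshlet data beyond the axis and the cutoff.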
Finally, meshlet data is copied into GPU buffers and will be used during the execution of task and mesh shaders.
For each processed mesh, we will also save an offset and count of meshlets to add a coarse culling based on the parent mesh: if the mesh is visible, then its meshlets will be added.
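The coarse step can be sketched on the CPU as follows. This is an illustration only: MeshRange and gather_visible_meshlets are hypothetical names, and the per-mesh visibility flags are assumed to come from an earlier test against the parent mesh.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Range of meshlets that belong to one mesh: an offset and a count.
struct MeshRange {
    std::uint32_t meshlet_offset;
    std::uint32_t meshlet_count;
};

// Collects the meshlet indices of every visible mesh. mesh_visible[i] is
// assumed to hold the result of a per-mesh visibility test for meshes[i].
inline std::vector<std::uint32_t> gather_visible_meshlets(
    const std::vector<MeshRange>& meshes,
    const std::vector<bool>& mesh_visible ) {

    std::vector<std::uint32_t> result;
    for ( std::size_t i = 0; i < meshes.size(); ++i ) {
        if ( !mesh_visible[ i ] ) {
            continue;  // Skips all meshlets of a hidden mesh in one step.
        }
        for ( std::uint32_t m = 0; m < meshes[ i ].meshlet_count; ++m ) {
            result.push_back( meshes[ i ].meshlet_offset + m );
        }
    }
    return result;
}
```

Only the meshlets that survive this step need the finer per-meshlet sphere and cone tests.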
In this article, we have described what meshlets are and why they are useful to improve the culling of geometry on the GPU.
Meshlets represent a powerful tool for optimizing the rendering of complex geometries. By subdividing meshes into small, efficient chunks and incorporating additional data like bounding spheres and cones, we can achieve finer-grained control over visibility and culling processes. Whether you're leveraging advanced shader technologies or applying these concepts with compute shaders, adopting meshlets can lead to significant performance improvements in your graphics pipeline. With libraries like MeshOptimizer at your disposal, implementing this technique has never been more accessible.
Marco Castorina first became familiar with Vulkan while working as a driver developer at Samsung. Later, he developed a 2D and 3D renderer in Vulkan from scratch for a leading media server company. He recently joined the games graphics performance team at AMD. In his spare time, he keeps up to date with the latest techniques in real-time graphics. He also likes cooking and playing guitar.
Gabriel Sassone is a rendering enthusiast currently working as a principal rendering engineer at The Multiplayer Group. He previously worked at Avalanche Studios, where he first encountered Vulkan and where the team developed the Vulkan layer for the proprietary Apex Engine and its Google Stadia port. Before that, he worked at ReadyAtDawn, Codemasters, FrameStudios, and some other non-gaming tech companies. His spare time is filled with music and rendering, gaming, and outdoor activities.