When Fast Code Runs Slow in Game Engines




When you’re building software that needs to be really fast, especially for game engines, you spend a lot of time making your code efficient. You might tweak algorithms, improve how graphics are drawn, or use multiple CPU cores to get things done quicker. But there’s a hidden problem that can quietly slow down even well-made programs: how your computer’s CPU handles memory. This is where the idea of locality of reference becomes super important. It’s all about how efficiently your code talks to the computer’s memory, and that directly affects how smoothly your game runs. If you don’t pay attention to it, your hard work on optimizations might be undone by your computer constantly having to fetch data slowly from main memory.

Let’s look at a simple-looking C++ code example. Imagine we have a large grid, like one that represents a game world or a texture. In C++, a 2D array like float grid[HEIGHT][WIDTH] stores its data row by row in memory, meaning all the numbers in one row are placed right next to each other.

// Imagine a large 2D array representing a game world grid or a texture
const int WIDTH = 4096;
const int HEIGHT = 4096;
float grid[HEIGHT][WIDTH]; // Stored row-major: each row is contiguous in memory

void processGridBadly() {
    // This loop order fights the way the data is laid out in memory
    for (int j = 0; j < WIDTH; ++j) {       // Go through columns first
        for (int i = 0; i < HEIGHT; ++i) {  // Then go through rows
            grid[i][j] *= 0.5f;
        }
    }
}

void processGridWell() {
    // This loop order matches the way the data is laid out in memory
    for (int i = 0; i < HEIGHT; ++i) {      // Go through rows first
        for (int j = 0; j < WIDTH; ++j) {   // Then go through columns
            grid[i][j] *= 0.5f;
        }
    }
}

Both processGridBadly() and processGridWell() do the exact same calculations on the same set of numbers. You might expect them to take about the same amount of time. But on modern computers, processGridWell() will almost always run much, much faster (sometimes many times quicker) than processGridBadly(). Why is this the case? What’s going on behind the scenes that makes such a big difference? We’ll dive deep into locality of reference, show its big effect on game engine math and graphics, and give you practical ways to use it to make your games really fast and smooth.
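
If you want to see the gap for yourself, a minimal timing harness like the sketch below is usually enough (it assumes the grid and the two functions above are in scope); the exact numbers will vary with your hardware, compiler, and optimization flags.

#include <chrono>
#include <cstdio>

// Times a single call to the given function and returns milliseconds.
template <typename Fn>
double timeMs(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
    std::printf("column-major traversal: %.2f ms\n", timeMs(processGridBadly));
    std::printf("row-major traversal:    %.2f ms\n", timeMs(processGridWell));
}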

Understanding Locality of Reference

Understanding why our “well-behaved” grid processing function performed so much better requires us to look at how modern CPUs handle data. At the heart of it are two key principles of memory access: temporal locality and spatial locality. These concepts explain how CPU caches, which are small, super-fast memory areas right on the CPU chip, try to predict what data your program will need next.

First, let’s talk about temporal locality. This principle suggests that if your program uses a piece of data, it’s very likely to use that same piece of data again very soon. Think of it like this: if you just picked up a specific tool from your toolbox, you’re probably going to use that tool again for the next few steps of your task. CPUs exploit this by keeping recently used data in their fast caches. When your program asks for data that’s already in the cache, it’s a “cache hit,” and the CPU gets it almost instantly. If it’s not there, it’s a “cache miss,” and the CPU has to go to slower main memory, which takes much longer.

Here’s a simple example illustrating good temporal locality:

#include <vector>

// Good Temporal Locality: 'sum' is repeatedly accessed and stays in cache
long long calculateSum(const std::vector<int>& data) {
    long long sum = 0; // 'sum' is accessed in every iteration
    for (int x : data) {
        sum += x; // 'sum' is read and written repeatedly, keeping it in cache
    }
    return sum;
}

In this calculateSum function, the sum variable is accessed in every single loop iteration. Because it’s used so frequently and consecutively, the CPU’s cache keeps it readily available, leading to very fast updates. While it’s harder to show a truly “bad” temporal locality example without a complex scenario involving unpredictable jumps in data access, imagine a situation where a variable is used, then not used for a very long time, then used again. It would likely be evicted from the cache in between uses, leading to a cache miss when it’s finally needed again.
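
As a rough sketch of that scenario, consider making two separate passes over a large array versus one fused pass. For data bigger than the cache, the second pass finds the early elements already evicted, while the fused version reuses each element while it is still hot:

#include <vector>

// Two separate passes: by the time the second loop returns to data[0],
// the early elements have likely been evicted from the cache.
void scaleThenBias(std::vector<float>& data) {
    for (float& x : data) x *= 2.0f; // first full pass
    for (float& x : data) x += 1.0f; // second full pass over now-cold data
}

// Fused pass: each element is reused immediately, while it is still cached.
void scaleAndBiasFused(std::vector<float>& data) {
    for (float& x : data) x = x * 2.0f + 1.0f;
}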

Next, we have spatial locality. This principle states that if your program accesses a particular piece of data, it’s highly probable that it will soon access data located near it in memory. This is a crucial concept because CPUs don’t just fetch one byte of data at a time. Instead, when a cache miss occurs, the CPU brings in a whole chunk of memory, typically 64 bytes, called a cache line. It assumes that if you needed one byte from that chunk, you’ll probably need other bytes from the same chunk very soon.

Let’s revisit our 2D array example to see spatial locality in action:

// Good Spatial Locality (iterating through contiguous memory)
void processGridWell() {
    for (int i = 0; i < HEIGHT; ++i) {      // Go through rows first
        for (int j = 0; j < WIDTH; ++j) {
            // Accessing numbers that are right next to each other
            grid[i][j] *= 0.5f;
        }
    }
}

In processGridWell(), the inner loop iterates through j, meaning it accesses grid[i][0], then grid[i][1], then grid[i][2], and so on. Since these elements are stored contiguously in memory for a row-major array, when grid[i][0] is loaded, the entire cache line containing grid[i][0], grid[i][1], grid[i][2], etc., is brought into the cache. Subsequent accesses to grid[i][1], grid[i][2], etc., become fast cache hits.

Now, compare that to processGridBadly():

// Bad Spatial Locality (accessing elements with a large stride)
void processGridBadly() {
    for (int j = 0; j < WIDTH; ++j) {       // Go through columns first
        for (int i = 0; i < HEIGHT; ++i) {
            // Accessing numbers that are far apart in memory
            grid[i][j] *= 0.5f;
        }
    }
}

Here, the inner loop iterates through i. This means it accesses grid[0][j], then grid[1][j], then grid[2][j], and so on. Because the array is row-major, grid[0][j] and grid[1][j] are not next to each other in memory; they are separated by an entire row's worth of data: WIDTH * sizeof(float) = 4096 * 4 bytes = 16 KB, which spans 256 of those 64-byte cache lines. Each access to grid[i][j] (for a fixed j and changing i) will therefore likely result in a new cache miss, as the data needed is not within the currently loaded cache line. The CPU constantly has to fetch new cache lines, leading to a significant slowdown.

AoS vs SoA and ECS

When we talk about game engines, mathematics is fundamental. And just like with our simple grid example, how you organize and access your data can have a huge impact on performance due to locality of reference. This brings us to a very common and important design choice in game development: how you lay out your data in memory, specifically the difference between Array of Structs (AoS) and Struct of Arrays (SoA). This choice directly impacts how data is stored and, consequently, how efficiently your CPU’s cache can work.

Let’s imagine we’re building a particle system, where each particle has a position, a velocity, and a remaining lifetime.

In the Array of Structs (AoS) approach, you create a list (or array) where each item in the list is a complete “struct” or “object” representing one particle. So, all the data for a single particle — its position, velocity, and life — are grouped together in memory.

#include <vector>
#include <glm/glm.hpp>

struct ParticleAoS {
    glm::vec3 position; // x, y, z coordinates
    glm::vec3 velocity; // x, y, z components of speed and direction
    float life;         // How much time left before it disappears
    // ... other data for this particle
};

std::vector<ParticleAoS> particlesAoS;

void updateParticlesAoS(float deltaTime) {
    for (auto& p : particlesAoS) {
        // Update position and decrease remaining life
        p.position += p.velocity * deltaTime;
        p.life -= deltaTime;
        // ... potentially other operations that use all of p's data
    }
}

When you use AoS and access one particle (like p in the loop above), all of its members (position, velocity, life) are typically loaded into the CPU’s cache at the same time, because they are stored right next to each other in memory. This is fantastic if your code frequently needs to work with all the data of a single particle together. For example, if your particle update function always modifies position, velocity, and life, then AoS provides excellent spatial locality. You get a “cache hit” for all the data you need for that particle.

In contrast, the Struct of Arrays (SoA) approach organizes data by component type. Instead of a list of full particle structs, you’d have separate lists (or arrays) for all the positions, all the velocities, and all the lifetimes.

struct ParticleSoA {
    std::vector<glm::vec3> positions;  // All particle positions
    std::vector<glm::vec3> velocities; // All particle velocities
    std::vector<float> lives;          // All particle lifetimes
    // ... other vectors for other attributes
};

ParticleSoA particlesSoA;

void updateParticlePositionsSoA(float deltaTime) {
    for (size_t i = 0; i < particlesSoA.positions.size(); ++i) {
        // Accessing only positions and velocities for this update
        particlesSoA.positions[i] += particlesSoA.velocities[i] * deltaTime;
    }
}

SoA really shines when you only need to process a subset of a particle’s data across many particles. For instance, if you have a function that only updates particle positions based on their velocity (like updateParticlePositionsSoA above), using SoA means you only load the positions and velocities arrays into the CPU’s cache. The lives data, which isn’t needed for this specific task, stays out of the cache. This saves valuable cache space and memory bandwidth, as the CPU isn’t wasting time bringing in data it doesn’t immediately need.

This is why patterns like the Entity-Component-System (ECS) architecture are so popular in modern, high-performance game engines. While we won’t delve into the full workings of ECS here (you can find a detailed explanation in my previous article on ECS), it’s important to understand why it’s so beneficial for locality of reference. ECS naturally encourages an SoA-like data layout. For instance, when a Movement System needs to update all positions based on velocities, it can iterate through contiguous blocks of PositionComponent data and VelocityComponent data. The crucial advantage of this separation is that the system only loads the data it needs for that specific operation. Any other components, like a RenderComponent or a HealthComponent, are entirely ignored by the Movement System, meaning their data is never pulled into the CPU’s precious cache. This avoids “cache pollution” and ensures that the cache is filled only with the most relevant information, leading to highly efficient processing.
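
To make that concrete, here is a deliberately simplified, hypothetical sketch of such a movement system walking two tightly packed component arrays; real ECS frameworks layer entity management, archetypes, and chunked storage on top of this idea.

#include <vector>
#include <glm/glm.hpp>

struct PositionComponent { glm::vec3 value; };
struct VelocityComponent { glm::vec3 value; };

struct MovementSystem {
    // Parallel, densely packed arrays: index i refers to the same entity in both.
    std::vector<PositionComponent> positions;
    std::vector<VelocityComponent> velocities;

    void update(float deltaTime) {
        // Only position and velocity data streams through the cache here;
        // render, health, and other components live in separate arrays and
        // are never touched by this system.
        for (size_t i = 0; i < positions.size(); ++i) {
            positions[i].value += velocities[i].value * deltaTime;
        }
    }
};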

It’s worth noting, however, that while ECS promotes granular component separation, in real-world applications, it can sometimes be a good idea to group frequently co-accessed data together into a single component. For example, if PositionComponent and VelocityComponent are almost always read and written together by a movement system, it might be more efficient to combine them into a single KinematicsComponent that contains both position and velocity.

// A component that groups position and velocity data.
// This is useful when these two pieces of data are almost always
// accessed and modified together by movement or physics systems.
struct KinematicsComponent {
    float position[3];
    float velocity[3];
};

This practical approach is something you’ll frequently see in action across leading game engines and frameworks. Consider Unity’s Data-Oriented Technology Stack (DOTS), for instance. Their ECS implementation features a LocalTransform component that combines position, rotation, and scale data. While it’s certainly possible to have these as separate components, grouping them together is a very sensible design choice because these three pieces of information are almost always accessed simultaneously when you’re performing any kind of spatial transformation.

Vertex Buffer Data Layout

The way you arrange the various pieces of data within your vertex buffer significantly influences how efficiently the GPU’s internal caches operate. There are two primary approaches: Interleaved and Separated layouts.

In an Interleaved Vertex Data Layout, all the attributes for a single vertex (its position, normal, texture coordinates, and so on) are stored contiguously in memory. This approach is conceptually very similar to the Array of Structs (AoS) pattern we discussed earlier for CPU-side data. Each “struct” in this context is a complete vertex, packed with all its relevant attributes.

// Interleaved Vertex Buffer Layout.
// All attributes for one vertex are packed together in memory.
struct VertexInterleaved {
    glm::vec3 position;
    glm::vec3 normal;
    glm::vec2 texCoord;
    // ... potentially other attributes like color, tangent, etc.
};

// A single large vector holding many such interleaved vertex structs
std::vector<VertexInterleaved> interleavedVertices;

This layout offers excellent spatial locality when your vertex shader needs all of a vertex’s attributes. When the GPU fetches the position of a vertex, it’s highly likely to pull in its normal and texture coordinates in the same cache line because they’re all packed together. This approach is often simpler to manage for general rendering where all vertex attributes are consistently used.
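
As an illustration, here is an OpenGL-flavoured sketch of how an interleaved layout is typically described to the GPU: every attribute shares the same stride (the size of the whole vertex struct) and differs only by its byte offset within that struct. The attribute locations are assumptions made for the example.

#include <cstddef> // offsetof
// (OpenGL headers are assumed to be included already.)

void describeInterleavedLayout() {
    const GLsizei stride = sizeof(VertexInterleaved);

    glEnableVertexAttribArray(0); // position
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, stride,
                          reinterpret_cast<void*>(offsetof(VertexInterleaved, position)));

    glEnableVertexAttribArray(1); // normal
    glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, stride,
                          reinterpret_cast<void*>(offsetof(VertexInterleaved, normal)));

    glEnableVertexAttribArray(2); // texture coordinates
    glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, stride,
                          reinterpret_cast<void*>(offsetof(VertexInterleaved, texCoord)));
}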

Alternatively, with a Separated Vertex Data Layout, each attribute type is stored in its own distinct, contiguous array. So, all vertex positions would be in one array, all normals in another, all texture coordinates in a third, and so on. This directly mirrors the Struct of Arrays (SoA) pattern we discussed earlier.

// Separated Vertex Buffer Layout.
// Each vector holds one specific attribute type for all vertices.
struct MeshDataSeparated {
    std::vector<glm::vec3> positions;
    std::vector<glm::vec3> normals;
    std::vector<glm::vec2> texCoords;
    // ... other attribute vectors
};

// An instance grouping these separate attribute vectors
MeshDataSeparated separatedMeshData;

This layout provides superior spatial locality when only specific attributes are needed across many vertices. For example, if you have a render pass that only reads positions, such as during shadow map generation or a depth-only pass, the GPU can fetch just the positions array. This is indeed a beneficial use case for separated layouts, as it avoids the overhead of loading normals or texture coordinates that aren’t relevant for that particular pass. This can significantly improve memory bandwidth utilization and cache efficiency for specialized rendering passes. The trade-off is that it can be more complex to manage, as you’re dealing with multiple distinct buffers that need to be synchronized.
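
For the separated case, a depth-only pass might bind just the position buffer, so normals and texture coordinates never consume bandwidth or cache space. The following is again an OpenGL-flavoured sketch, with the buffer handle and attribute location assumed for illustration.

// (OpenGL headers are assumed to be included; 'positionVbo' is a placeholder
// handle for a buffer that holds only the tightly packed positions array.)
void bindPositionsForDepthPass(GLuint positionVbo) {
    glBindBuffer(GL_ARRAY_BUFFER, positionVbo);
    glEnableVertexAttribArray(0);
    // A stride of 0 tells GL the vec3 positions are tightly packed.
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, nullptr);
    // ... issue the depth-only draw calls; normals and UVs are never fetched.
}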

Index Buffer Order

Beyond just how we arrange vertex data, the order in which we instruct the GPU to draw triangles using the index buffer also significantly impacts rendering performance. This is primarily due to the GPU Vertex Cache.

Modern GPUs are equipped with a small, high-speed cache specifically designed to store the output of vertices that have recently passed through the vertex shader. When your vertex shader finishes its work on a particular vertex, the computed results are temporarily stored in this cache. The key benefit here is that if that exact same vertex is needed again very soon for another triangle, and its processed data is still available in this specialized cache, the GPU can retrieve these pre-computed results instantly. This means the GPU completely bypasses the need to re-execute the vertex shader for that specific vertex, which is a significant win for performance, as vertex shader execution can be computationally intensive.

The core optimization strategy involves ordering your triangles, and consequently the indices within your index buffer, such that vertices are reused as frequently as possible within a tight sequence of operations. This approach maximizes the likelihood of a cache hit in the GPU’s vertex cache. When the GPU processes a series of triangles, if a vertex is shared between consecutively drawn triangles, its processed data remains readily available in the cache. This allows the GPU to efficiently retrieve those pre-computed results for shared vertices, only requiring full processing for newly encountered ones. Conversely, if triangles are ordered arbitrarily, a jump to an unrelated triangle might cause previously processed vertices to be evicted from the cache before they can be reused, thereby necessitating their re-processing from scratch.

// Original indices for a simple quad made of two triangles:
// 0---1
// | / |
// 2---3
std::vector<unsigned int> originalIndices = {
    1, 0, 2, // First triangle
    1, 3, 2  // Second triangle; in a larger mesh an arbitrary order like this
             // can revisit vertices long after they have left the cache
};

// The same quad reordered so that the shared vertices (1 and 2) are reused
// immediately, while their shader results are still "hot" in the vertex cache.
// Note that the winding order of each triangle is preserved.
std::vector<unsigned int> optimizedIndices = {
    1, 0, 2, // First triangle
    2, 1, 3  // Second triangle, starting with the vertices just used
};

To improve cache efficiency, developers often use algorithms that optimize the order of vertex indices. These tools are typically run offline as part of your asset pipeline, transforming your raw mesh data into a format that’s much more friendly to the GPU’s internal workings.
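
One practical option, shown here only as an illustrative sketch, is the open-source meshoptimizer library, whose meshopt_optimizeVertexCache function can be called from your asset pipeline:

#include <vector>
#include <meshoptimizer.h>

void optimizeIndexOrder(std::vector<unsigned int>& indices, size_t vertexCount) {
    // Writes a cache-friendlier ordering of the same triangles into 'reordered'.
    std::vector<unsigned int> reordered(indices.size());
    meshopt_optimizeVertexCache(reordered.data(), indices.data(),
                                indices.size(), vertexCount);
    indices.swap(reordered);
}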

One of the most widely adopted algorithms for this task is Tom Forsyth’s Linear-Speed Vertex Cache Optimisation. This algorithm is designed to work efficiently even with very large meshes, scaling linearly with the number of triangles, which is crucial for the complex models found in modern games.

The core of Forsyth’s algorithm is a greedy approach driven by a dynamic scoring system. It simulates a simplified GPU vertex cache (often a Least Recently Used or First-In, First-Out queue) to predict which vertices are “hot” or “cold.” Each vertex receives a score based on its recency in this simulated cache, with higher scores for more recently used vertices. A small boost is also given to vertices with few remaining triangles that still need them, encouraging their “completion” before cache eviction. Triangles are then scored based on the combined scores of their three vertices. The algorithm repeatedly selects the un-drawn triangle with the highest score, adds its indices to the optimized output list, and updates the simulated cache and affected vertex scores. This greedy, score-based mechanism ensures the algorithm always picks the triangle offering the best immediate cache benefit, avoiding complex and slow look-aheads.

In addition to Tom Forsyth’s algorithm, another notable approach that has significantly contributed to vertex cache optimization is Tipsify. It offers a different strategy for optimizing triangle lists. While Forsyth’s algorithm builds its sequence by scoring and picking the “best” next triangle from the entire remaining set, Tipsify often employs a more vertex-centric “fanning out” approach. It typically works by selecting a vertex and then processing all triangles that share that vertex, effectively “fanning out” from that central point. After exhausting the triangles around one vertex, it moves to an adjacent vertex and continues the process. This local grouping helps maintain cache locality.

A key distinguishing feature of Tipsify is its frequent combination of vertex cache optimization with overdraw reduction. Overdraw occurs when multiple triangles are rendered to the same pixel, wasting GPU cycles. Tipsify aims to minimize this by sorting triangles in a way that reduces redundant pixel processing, often by prioritizing geometry that is likely to be visible first. Tipsify is known for being very fast and scalable, particularly effective for larger meshes. It has been integrated into practical tools like AMD’s Triangle Order Optimization Tool (Tootle), making it a popular choice, especially within the AMD ecosystem.

Wrapping up

I hope this deep dive has offered valuable insights into how understanding hardware fundamentals can truly elevate your game development. It’s all about working smarter with the systems at hand, aligning your code and data to unlock impressive performance gains. Keep experimenting, keep profiling, and enjoy pushing the boundaries of what’s possible in your creations!

