This content originally appeared on DEV Community and was authored by Walid LAGGOUNE
Hi, I'm Walid, a backend developer currently learning Go and sharing the journey by writing about it along the way.
Resources:
- The Go Programming Language by Alan A. A. Donovan & Brian W. Kernighan
- Matt Holiday's Go course
In high-performance computing, mechanical sympathy refers to writing software that aligns with the way hardware operates, ensuring that the code leverages CPU architecture efficiently rather than working against it. Go, while being a high-level language, allows developers to write performance-sensitive code when necessary.
This article dives deep into key performance principles, covering CPU architecture, memory hierarchies, data structures, false sharing, allocation strategies, and execution patterns—all with a focus on writing Go code that maximizes hardware efficiency.
Understanding CPU Cores and Memory Hierarchy
Modern CPUs are built with multiple cores, each with its own cache hierarchy:
- L1 Cache (Fastest, ~1ns latency) – Closest to the CPU core but small in size (32KB–64KB per core).
- L2 Cache (Fast, ~4-10ns latency) – Larger (256KB–1MB per core), shared between threads within the same core.
- L3 Cache (Slower, ~10-30ns latency) – Shared across multiple cores, much larger (4MB–128MB).
- RAM (Much slower, ~100ns latency) – Main system memory, much larger but orders of magnitude slower than caches.
When a program accesses data, the CPU first looks for it in L1, then L2, then L3, before going to RAM. Cache misses (when the requested data isn’t in L1/L2/L3) cause significant slowdowns.
GOAL: Structure data for locality of reference so that frequently accessed data stays in cache as long as possible.
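To see locality of reference in action, here is a small sketch (the sizes are arbitrary) that traverses the same 2D slice in two orders. The row-major loop typically benchmarks much faster because it reads each row's memory sequentially, staying within already-loaded cache lines:

```go
package main

import "fmt"

func main() {
	const n = 1024
	grid := make([][]int, n)
	for i := range grid {
		grid[i] = make([]int, n)
	}

	// Row-major traversal: walks each inner slice sequentially,
	// so consecutive accesses hit the same cache lines.
	sum := 0
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			sum += grid[i][j]
		}
	}

	// Column-major traversal: jumps to a different row on every
	// access, touching a new cache line almost every time.
	for j := 0; j < n; j++ {
		for i := 0; i < n; i++ {
			sum += grid[i][j]
		}
	}
	fmt.Println(sum)
}
```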
Dynamic Dispatch: Why Too Many Short Method Calls Are Expensive
In Go, calling methods through an interface involves dynamic dispatch, which introduces an extra layer of indirection:
- The method-table lookup (Go's itab, analogous to a vtable) adds overhead to every call.
- The indirect call is harder for the CPU's branch predictor to anticipate, leading to branch mispredictions, and it blocks compiler inlining.
- The call may jump to a memory location outside the instruction cache, causing cache misses.
Example: Indirect vs. Direct Method Calls
```go
package main

import "math"

type Shape interface {
	Area() float64
}

type Circle struct {
	radius float64
}

// Area has a value receiver, so Circle satisfies Shape directly.
func (c Circle) Area() float64 {
	return math.Pi * c.radius * c.radius
}

func main() {
	c := Circle{radius: 10}

	var s Shape = c // interface value: calls are dispatched dynamically
	_ = s.Area()    // indirect call through the itab (slower)
	_ = c.Area()    // direct call on the concrete type (faster, inlinable)
}
```
Optimizing for Mechanical Sympathy:
- Avoid interfaces in hot loops if possible.
- Use concrete types when performance matters.
- Keep small functions simple so the compiler can inline them and remove call overhead.
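To compare the two call styles, here is a minimal benchmark sketch; it assumes the Shape and Circle definitions from the example above, placed in a _test.go file in the same package (the file and benchmark names are made up):

```go
package main

import "testing"

func BenchmarkDirectCall(b *testing.B) {
	c := Circle{radius: 10}
	for i := 0; i < b.N; i++ {
		_ = c.Area() // concrete call: eligible for inlining
	}
}

func BenchmarkInterfaceCall(b *testing.B) {
	var s Shape = Circle{radius: 10}
	for i := 0; i < b.N; i++ {
		_ = s.Area() // interface call: dispatched through the itab
	}
}
```

Run it with go test -bench=. and keep in mind that recent compilers can devirtualize or optimize away trivial calls, so treat the numbers as indicative rather than exact.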
Struct Layout: Contiguous Fields vs. Pointers
A struct with contiguous fields performs better than a struct containing multiple pointers due to cache locality.
Bad Example: Struct with Pointers (Fragmented in Memory)
```go
type Node struct {
	data int
	next *Node // pointer to another memory location
}
```

Each Node might be scattered in memory, causing cache misses when traversing the linked list.
Good Example: Struct with Contiguous Fields (Better Cache Usage)
```go
type Employee struct {
	id    int
	age   int
	score int
}
```
Since fields are stored together in memory, accessing them stays within the CPU cache, leading to fewer cache misses and faster execution.
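As a small illustrative sketch (the slice length is made up), summing a field across a slice of Employee values walks a single contiguous block of memory:

```go
employees := make([]Employee, 1024) // one contiguous allocation

total := 0
for i := range employees {
	total += employees[i].score // sequential reads stay in cache
}
```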
Best Practices:
- Store data contiguously in memory.
- Minimize pointers in frequently accessed structs.
- Use arrays or slices over linked lists when possible.
Why Slices Beat Linked Lists
A slice (backed by an array) is better than a pointer-based linked list because of cache efficiency:
- Slices store elements sequentially in memory → better cache utilization.
- Linked lists scatter nodes across memory → cache misses on every node access.
Example: Iterating Over a Slice vs. Linked List
```go
numbers := []int{1, 2, 3, 4, 5} // contiguous in memory (cache-friendly)
for _, n := range numbers {
	fmt.Println(n)
}
```
Takeaway: Use slices (or arrays) when performance matters, avoid linked lists unless there’s a strong reason (e.g., frequent insertions/removals).
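A rough way to measure the difference is a pair of benchmarks like the following sketch (the node type and element count are made up for illustration); on most machines the slice version wins by a wide margin:

```go
package listbench

import "testing"

type node struct {
	val  int
	next *node
}

func BenchmarkSliceSum(b *testing.B) {
	s := make([]int, 100_000)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sum := 0
		for _, v := range s {
			sum += v // contiguous memory: prefetcher-friendly
		}
		_ = sum
	}
}

func BenchmarkListSum(b *testing.B) {
	var head *node
	for i := 0; i < 100_000; i++ {
		head = &node{val: i, next: head}
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sum := 0
		for n := head; n != nil; n = n.next {
			sum += n.val // pointer chase: likely a cache miss per node
		}
		_ = sum
	}
}
```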
False Sharing: When Cores Fight Over a Cache Line
False sharing occurs when multiple CPU cores modify different variables that happen to share the same cache line, causing unnecessary cache invalidations.
Example: False Sharing Issue
```go
var data [2]int64 // adjacent elements likely share one 64-byte cache line

func worker(i int, wg *sync.WaitGroup) {
	defer wg.Done()
	for j := 0; j < 1000000; j++ {
		data[i]++ // two goroutines modify different indices on the same line
	}
}
```
Fix: Use padding or align variables on separate cache lines to prevent contention.
```go
type PaddedData struct {
	val int64
	_   [56]byte // padding to fill a 64-byte cache line (8 + 56 = 64)
}
```
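A sketch of the worker rewritten against the padded layout; assuming the array itself starts on a cache-line boundary, each val now lives on its own line and the cores stop invalidating each other:

```go
var counters [2]PaddedData // each element occupies a full cache line

func paddedWorker(i int, wg *sync.WaitGroup) {
	defer wg.Done()
	for j := 0; j < 1000000; j++ {
		counters[i].val++ // no cache-line ping-pong between cores
	}
}
```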
Optimizing Memory: Reducing Allocations and Embedded Pointers
1. Reduce Unnecessary Allocations
- Avoid frequently creating short-lived objects; reuse them through sync.Pool where possible (see the sketch below).
- Prefer value types over heap-allocated types for small objects.
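A minimal sync.Pool sketch (the process function and its behavior are made up for illustration): buffers are reused across calls instead of being allocated fresh each time.

```go
package main

import (
	"bytes"
	"sync"
)

var bufPool = sync.Pool{
	// New is called only when the pool has no buffer to hand out.
	New: func() any { return new(bytes.Buffer) },
}

func process(data []byte) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // reset before returning the buffer to the pool
		bufPool.Put(buf)
	}()
	buf.Write(data)
	return buf.String()
}
```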
2. Reduce Embedded Pointers in Objects
- Every pointer access incurs an extra memory lookup, which can slow things down.
- If a struct is small, pass it by value instead of by pointer.
Bad (Heap Allocation, Extra Indirection):
```go
type User struct {
	name *string // pointer requires an extra memory fetch
}
```
Better (Stored Inline, More Efficient):
```go
type User struct {
	name string // directly stored, no pointer overhead
}
```
Heap Considerations: You Want a Larger Heap
If your program frequently allocates memory, ensure that the heap is large enough to reduce GC (Garbage Collection) overhead.
- Raise GOGC above its default of 100 (e.g., GOGC=200) so the heap can grow larger between collections, trading memory for fewer GC cycles.
- Use GOGC=off only in extreme cases (e.g., short-lived, low-latency programs that can tolerate never collecting).
- Avoid excessive short-lived allocations; use object pools instead.
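The same knob can also be set from code via runtime/debug; a minimal sketch:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to GOGC=200: let the heap grow to roughly 3x the
	// live set before the next collection, so the GC runs less often.
	old := debug.SetGCPercent(200)
	fmt.Println("previous GOGC value:", old)

	// debug.SetGCPercent(-1) disables the collector, like GOGC=off.
}
```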
Execution Efficiency: Do Less, Do Often, Do It Faster
1. Do Less → Optimize algorithms to reduce redundant computations.
2. Do Often → Keep workloads balanced across CPU cores (e.g., use worker pools, as sketched below).
3. Do It Faster → Minimize memory latency, leverage CPU caches, and use efficient data structures.
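A minimal worker-pool sketch (the squaring job is a stand-in for real work): one goroutine per logical CPU drains a shared job channel, keeping all cores busy without oversubscribing them.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	jobs := make(chan int)
	results := make(chan int)

	var wg sync.WaitGroup
	// One worker per logical CPU.
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range jobs {
				results <- j * j // placeholder for real work
			}
		}()
	}

	// Feed the pool, then signal that no more jobs are coming.
	go func() {
		for i := 0; i < 100; i++ {
			jobs <- i
		}
		close(jobs)
	}()

	// Close results once every worker has finished.
	go func() {
		wg.Wait()
		close(results)
	}()

	sum := 0
	for r := range results {
		sum += r
	}
	fmt.Println(sum)
}
```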
Final Thoughts: Writing Go Code with Mechanical Sympathy
By understanding the CPU, memory hierarchy, and performance trade-offs, Go developers can write code that runs efficiently with the machine rather than against it.
Key Takeaways:
- Keep data contiguous → fewer cache misses.
- Avoid excessive dynamic dispatch → prefer direct calls.
- Reduce pointer indirection → avoid unnecessary heap allocations.
- Prevent false sharing → align data correctly.
- Use slices over linked lists → leverage cache-friendly structures.
- Optimize heap allocations → use sync.Pool and minimize GC pressure.
By applying these principles, your Go applications will not only run faster but will also make optimal use of modern CPU architectures—embodying true mechanical sympathy.