This content originally appeared on Level Up Coding – Medium and was authored by Wadix Technologies

1. Introduction
Why carry a backpack full of tools when all you need is a screwdriver?
When an exception occurs, the CPU saves the current context by pushing certain register values onto the stack. This creates a snapshot of its state so it can work with those registers during the interrupt and then restore execution exactly where it left off.
But here’s the thing: not all of those registers are actually needed in the Interrupt Service Routine (ISR). Saving and restoring unused registers wastes time and energy. To fix this, ARM designers came up with a smart trick called lazy stacking.
2. The Stack in ARM Cortex-M Exceptions
When an interrupt occurs on an ARM Cortex-M processor, the hardware automatically pushes a specific set of registers onto the stack. These include R0 to R3, R12, LR (link register), PC (program counter), and xPSR (program status register).
This process is handled entirely by the CPU’s exception mechanism — no extra instructions are needed. On exception entry, it saves these registers to the stack; on exception return, it restores them so the program can resume exactly where it was interrupted.
The group of registers saved at exception entry is called the stack frame.
If the Floating Point Unit (FPU) is enabled and in use, the stack frame gets bigger: it also includes S0 to S15 (floating-point registers) and FPSCR (floating-point status and control register). This extra work increases interrupt latency, because more data has to be pushed to and popped from the stack.
Figure 1: Cortex-M exception stack frames: with FPU registers saved (left) and without FPU registers (right).
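The two layouts in Figure 1 can be sketched as C structs, lowest stack address first. This is only an illustration, not code from the original project: the type names `basic_frame_t` and `extended_frame_t` are hypothetical, and the extended frame is shown with the one reserved word that follows FPSCR, which brings it to 26 words.

```c
#include <stddef.h>
#include <stdint.h>

/* Basic frame: the 8 words the hardware pushes on every exception entry. */
typedef struct {
    uint32_t r0, r1, r2, r3;
    uint32_t r12;
    uint32_t lr;
    uint32_t return_address;  /* PC to resume at */
    uint32_t xpsr;
} basic_frame_t;

/* Extended frame: basic frame followed by the FP context area. */
typedef struct {
    basic_frame_t core;
    uint32_t s[16];           /* S0..S15 */
    uint32_t fpscr;
    uint32_t reserved;        /* padding word after FPSCR */
} extended_frame_t;

/* Layout checks: 8 words vs 26 words, FP area starts right after xPSR. */
_Static_assert(sizeof(basic_frame_t) == 8 * 4, "basic frame is 8 words");
_Static_assert(sizeof(extended_frame_t) == 26 * 4, "extended frame is 26 words");
_Static_assert(offsetof(extended_frame_t, s) == 0x20, "S0 follows xPSR");
_Static_assert(offsetof(extended_frame_t, fpscr) == 0x60, "FPSCR follows S15");
```

The size difference, 104 bytes against 32, is the extra stack space each exception needs once the FP context is part of the frame.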

For a deeper dive into what interrupt latency is and how it’s measured, see our earlier article on
3. What is Lazy Stacking?
Lazy stacking concerns how the Floating Point Unit (FPU) registers are stacked, so it only applies to Cortex-M devices that actually have an FPU. When the FPU is available, enabled, and has been used, its register bank contains data that might need to be saved during an exception. If the processor had to stack these floating-point registers on every exception, it would need to perform an extra 17 memory pushes (S0 to S15 plus FPSCR), which would increase the interrupt latency from around 12 cycles to about 29. To avoid this penalty, ARM added lazy stacking, which is enabled by default.
When an exception occurs with the FPU enabled and in use, indicated by bit 2 (FPCA) in the CONTROL register, the processor switches to the longer stack frame format. Instead of immediately writing all floating-point registers into memory, it only reserves the space for them and pushes just the core registers: R0–R3, R12, LR, the return address, and xPSR. This keeps the entry latency at 12 cycles. The processor also sets the LSPACT (Lazy State Preservation Active) bit in FPCCR and stores the address of the reserved stack space in FPCAR (Floating-Point Context Address Register).
If the exception handler never uses floating-point instructions, the reserved space is left untouched and no restore happens on exit. If the handler does execute a floating-point instruction, the processor detects this, stalls execution, writes the FPU registers into the reserved space, clears LSPACT, and then resumes.
Lazy stacking can itself be interrupted. If another interrupt arrives while the deferred save is in progress, the processor abandons it and performs normal stacking for the new interrupt instead. At that point the floating-point instruction that triggered lazy stacking has not yet been executed, so the program counter saved for the new interrupt points back to that instruction, and when execution resumes it triggers lazy stacking again.
If the current thread or handler is not using the FPU (FPCA = 0), the shorter stack frame format is used from the start.
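The three outcomes described in this section (short frame, long frame with the save still pending, long frame with the FP registers actually written) can be told apart from two values a handler can read: the EXC_RETURN value in LR and the FPCCR register. The sketch below is a minimal illustration under that assumption; `frame_state_t` and `classify_frame` are hypothetical names, and the function takes raw register values so it can be exercised off-target.

```c
#include <stdint.h>

/* EXC_RETURN bit 4: 1 = basic (short) frame, 0 = extended (long) frame. */
#define EXC_RETURN_FTYPE  ((uint32_t)1u << 4)
/* FPCCR bit 0: LSPACT, set while the reserved FP area is still unwritten. */
#define FPCCR_LSPACT      ((uint32_t)1u << 0)

typedef enum {
    FRAME_BASIC,                  /* FPCA was 0: no FP space on the stack  */
    FRAME_EXTENDED_LAZY_PENDING,  /* space reserved, FP registers not saved */
    FRAME_EXTENDED_SAVED          /* FP registers written to the stack      */
} frame_state_t;

/* Classify the current exception frame from raw register values. */
static inline frame_state_t classify_frame(uint32_t exc_return, uint32_t fpccr)
{
    if (exc_return & EXC_RETURN_FTYPE) {
        return FRAME_BASIC;
    }
    return (fpccr & FPCCR_LSPACT) ? FRAME_EXTENDED_LAZY_PENDING
                                  : FRAME_EXTENDED_SAVED;
}
```

On a real target the two arguments would come from LR at handler entry and from `FPU->FPCCR`; an FP-free handler entered with lazy stacking enabled would be expected to see the pending state for its whole duration.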

4. Benefits of Lazy Stacking: Measuring FPU Context Save Overhead
To verify the practical benefits of lazy stacking, we measured the entry latency of an SVC exception handler that does not use the FPU, with lazy stacking disabled and enabled. The test was run on an STM32H7 (Cortex-M7 with FPU) using the DWT cycle counter for high-resolution timing. The SVC handler was written to be completely FP-free and to read DWT->CYCCNT before doing anything else, capturing the entry time as closely as possible. Two measurements were taken: one with lazy stacking disabled (ASPEN=1, LSPEN=0), which forces the hardware to always push the full extended FP context on exception entry, and one with lazy stacking enabled (ASPEN=1, LSPEN=1), which only reserves space for the FP registers without writing them unless an FP instruction is executed.
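#include <stdint.h>
#include "stm32h7xx_hal.h"  /* assumed STM32H7 HAL project header; pulls in the CMSIS core definitions */

void SystemClock_Config(void);  /* assumed to be provided elsewhere in the project */

/* t1 and seen_EXCRET are referenced by name from the naked handler below. */
static volatile uint32_t t0;
static volatile uint32_t t1 __attribute__((used));
static volatile uint32_t seen_EXCRET __attribute__((used));

typedef struct
{
    uint32_t dt_cycles;   /* latency = t1 - t0 */
    uint32_t control;     /* CONTROL sampled just before the SVC */
    uint32_t fpccr;       /* FPCCR sampled just before the SVC */
    uint32_t exc_return;  /* EXC_RETURN (LR) captured in handler */
} meas_t;

volatile meas_t res_lazy_off, res_lazy_on;

/* FP-free handler. It only touches R2 and R3, which the hardware has already stacked. */
__attribute__((naked)) void SVC_Handler(void)
{
    __ASM volatile(
        "ldr r2,=0xE0001004  \n" /* DWT_CYCCNT */
        "ldr r3,[r2]         \n"
        "ldr r2,=t1          \n"
        "str r3,[r2]         \n" /* t1 = first handler timestamp */
        "mov r2,lr           \n"
        "ldr r3,=seen_EXCRET \n"
        "str r2,[r3]         \n" /* record EXC_RETURN */
        "bx  lr              \n"
    );
}

static inline void dwt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the trace block */
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             /* start the cycle counter */
}

static inline uint32_t rdcycle(void)
{
    return DWT->CYCCNT;
}

static inline void fpu_enable(void)
{
    SCB->CPACR |= (0xFU << 20);  /* full access to CP10 and CP11 */
    __DSB();
    __ISB();
}

/* Added step: execute one FP instruction so CONTROL.FPCA is set and the
 * test does not depend on earlier code having used the FPU. */
static inline void fpu_touch(void)
{
    __ASM volatile("vmov.f32 s0, #1.0" ::: "s0");
}

static inline void lazy_on(void)
{
    FPU->FPCCR |= (FPU_FPCCR_ASPEN_Msk | FPU_FPCCR_LSPEN_Msk);
    __DSB();
    __ISB();
}

static inline void lazy_off(void)
{
    FPU->FPCCR = (FPU->FPCCR | FPU_FPCCR_ASPEN_Msk) & ~FPU_FPCCR_LSPEN_Msk;
    __DSB();
    __ISB();
}

static inline void call_svc(void)
{
    __ASM volatile("svc #0" ::: "memory");
}

/* Takes a volatile pointer so the result variables can be passed without a cast. */
static void measure(volatile meas_t *out)
{
    fpu_touch();
    out->control = __get_CONTROL();
    out->fpccr = FPU->FPCCR;
    t0 = rdcycle();
    call_svc();
    out->dt_cycles = t1 - t0;
    out->exc_return = seen_EXCRET;
}

int main(void)
{
    HAL_Init();
    SystemClock_Config();
    dwt_init();
    fpu_enable();

    /* A) lazy OFF */
    lazy_off();
    measure(&res_lazy_off);

    /* B) lazy ON */
    lazy_on();
    measure(&res_lazy_on);

    while (1)
    {
    }
}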
Build flags: -mcpu=cortex-m7 -mfpu=fpv5-d16 -mfloat-abi=hard
In the lazy-off case, the measured latency from the SVC trigger to the first handler timestamp was 67 cycles, reflecting the additional memory pushes required to save the FP context. With lazy stacking enabled, the latency dropped to 50 cycles because the FP registers were not actually stored: the reserved stack space remained untouched since the ISR never executed floating-point instructions. Both figures include the measurement overhead itself (storing t0, executing the SVC instruction, and loading the counter address in the handler), so it is the difference between them that matters. This 17-cycle difference clearly illustrates the time savings that lazy stacking provides when the FPU is not used in the exception handler.
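The comparison is only meaningful if both runs really used the extended frame and differed only in LSPEN, which is what the control, fpccr, and exc_return fields recorded by the test are for. A possible sanity check is sketched below. It repeats the `meas_t` layout so it stands alone; the helper names `meas_had_fp_context` and `lazy_saving_cycles` are hypothetical, and the bit positions are written out as plain masks rather than CMSIS macros so it can be compiled off-target.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t dt_cycles;   /* latency = t1 - t0 */
    uint32_t control;     /* CONTROL sampled before the SVC */
    uint32_t fpccr;       /* FPCCR sampled before the SVC */
    uint32_t exc_return;  /* EXC_RETURN captured in the handler */
} meas_t;

#define CONTROL_FPCA      ((uint32_t)1u << 2)   /* FP context active        */
#define FPCCR_ASPEN       ((uint32_t)1u << 31)  /* automatic state preserve */
#define FPCCR_LSPEN       ((uint32_t)1u << 30)  /* lazy state preserve      */
#define EXC_RETURN_FTYPE  ((uint32_t)1u << 4)   /* 0 = extended frame       */

/* True if this run entered the handler with an extended (FP) frame. */
static inline bool meas_had_fp_context(const meas_t *m)
{
    return (m->control & CONTROL_FPCA) != 0u
        && (m->fpccr & FPCCR_ASPEN) != 0u
        && (m->exc_return & EXC_RETURN_FTYPE) == 0u;
}

/* Cycles saved by lazy stacking, or false if the two runs are not comparable. */
static inline bool lazy_saving_cycles(const meas_t *off, const meas_t *on,
                                      int32_t *saved)
{
    if (!meas_had_fp_context(off) || !meas_had_fp_context(on)) {
        return false;               /* one run used the basic frame */
    }
    if ((off->fpccr & FPCCR_LSPEN) != 0u || (on->fpccr & FPCCR_LSPEN) == 0u) {
        return false;               /* LSPEN not set as intended */
    }
    *saved = (int32_t)(off->dt_cycles - on->dt_cycles);
    return true;
}
```

If the check fails because FPCA was clear, both runs used the short frame and any difference between them would not be due to lazy stacking.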

5. When to Use Lazy Stacking — And When Not To
Lazy stacking is great when exceptions happen frequently but most of them don’t use the FPU. In this case, it avoids unnecessary memory pushes, reduces exception entry time, and frees up cycles for real work. This is especially useful in applications where interrupt latency matters, such as motor control loops, fast ADC sampling, or networking stacks that run at high interrupt rates.
However, lazy stacking offers no benefit if your interrupt service routines always perform floating-point operations. In that case, the FPU registers will be saved anyway — just later — so you don’t actually save any work, and the extra stall when the first FP instruction is hit might even make timing less predictable. Lazy stacking can also be undesirable in systems where deterministic interrupt entry time is critical, because the “lazy” save is triggered mid-ISR, not at entry, introducing a delay right where you might not expect it.
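To get a rough feel for this trade-off, the saving can be estimated with a back-of-the-envelope model. The sketch below makes simplifying assumptions that are not from the measurements: every handler that uses the FPU pays the full save cost anyway, every FP-free handler avoids it entirely, and unstacking on exit is ignored. The function name and parameters are hypothetical.

```c
#include <stdint.h>

/* Entry cycles avoided per second by lazy stacking under the simple model:
 * only handlers that never touch the FPU skip the FP save. */
static inline uint64_t lazy_cycles_saved_per_second(uint32_t irq_per_second,
                                                    uint32_t fp_handler_percent,
                                                    uint32_t fp_save_cycles)
{
    if (fp_handler_percent > 100u) {
        fp_handler_percent = 100u;  /* clamp invalid input */
    }
    return ((uint64_t)irq_per_second * (100u - fp_handler_percent)
            * fp_save_cycles) / 100u;
}
```

With a hypothetical 20 kHz interrupt rate, 10% of handlers using the FPU, and the 17-cycle difference measured above, the model gives 306,000 cycles per second. With 100% FP-using handlers it gives zero, matching the point that lazy stacking then saves no work.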
6. Conclusion
Lazy stacking is a small hardware feature that can make a big difference in interrupt-heavy systems with infrequent FPU use. By knowing when it helps and when it hurts, you can tune your Cortex-M system for both speed and predictability.
Lazy Stacking on ARM Cortex-M: Smarter FPU Context Saving was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.