2026-03-04 · numa / tail latency / rust

When NUMA actually matters: a recipe for the p99 hunter

Most workloads don’t need explicit NUMA control. The ones that do need it badly. A practical recipe for figuring out which kind you’re running.

There is a particular kind of performance bug that wastes weeks of engineering time before anyone says the word NUMA out loud. The shape of it is always the same: median latency looks fine; p99 wanders; p99.9 looks like noise; and the people who own the service start adding retries. By the time someone runs numactl --hardware, the team has shipped three patches that made nothing worse, and the on-call is convinced the issue is “the network”.

This piece is a recipe for noticing earlier. It is not an argument that every workload needs NUMA-aware code. Most do not. The point is to know, in your bones, which one you’re running.

The two-question filter

Before you change a single line of Rust, answer two questions about the machine.

1. Is the box actually NUMA? On a single-socket server with one memory controller, there is exactly one node. The kernel reports one node. Cross-node traffic is zero by definition. If numactl --hardware shows available: 1 nodes (0), you can close this tab. Your tail comes from somewhere else — probably the allocator, the scheduler, or a lock that wakes the wrong CPU.
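If you would rather have the service answer question 1 itself at startup, the same information is in sysfs. This is a minimal sketch using only the standard library, assuming a Linux box that exposes /sys/devices/system/node/online. The function names are mine and are not part of numaperf.

```rust
use std::fs;

/// Parse a kernel list string such as "0", "0-1" or "0-3,8-11" into ids.
/// Returns None on malformed input.
fn parse_node_list(s: &str) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    for part in s.trim().split(',').filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo = lo.parse::<u32>().ok()?;
                let hi = hi.parse::<u32>().ok()?;
                if lo > hi {
                    return None;
                }
                ids.extend(lo..=hi);
            }
            None => ids.push(part.parse::<u32>().ok()?),
        }
    }
    Some(ids)
}

/// Read the online NUMA nodes from sysfs (Linux only).
fn online_nodes() -> Option<Vec<u32>> {
    let raw = fs::read_to_string("/sys/devices/system/node/online").ok()?;
    parse_node_list(&raw)
}

/// True only when the kernel reports more than one online node.
/// An unreadable sysfs file is treated as "not NUMA".
fn is_numa() -> bool {
    online_nodes().map_or(false, |nodes| nodes.len() > 1)
}
```

Calling is_numa() once at startup and logging the result is usually enough; if it returns false, skip the rest of the recipe.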

2. Is the workload latency-critical? Throughput-bound workloads (batch ETL, training, compilation) usually do not care about NUMA placement enough to justify the code. The kernel’s default policy of allocating on the node where the faulting thread runs (first touch), plus a NUMA-aware allocator, buys you most of the win. NUMA primitives earn their keep at the latency end: HFT, packet processing, real-time control loops, latency SLOs measured in microseconds with tail percentiles that matter.

If you answered “yes” to both, keep going. Otherwise, this is not your bottleneck.

What the README tells you and what it doesn’t

numaperf describes its problem this way: “On multi-socket servers, memory access latency varies by 2-3x depending on which CPU accesses which memory.” That ratio is the upper bound on what you can save by getting placement right. If your hot loop is bottlenecked on memory bandwidth or random reads from a buffer that lives on the wrong node, you can be looking at a 2x slowdown on every access. That compounds.

What the README does not promise is a specific number on your workload. We have no benchmark from the project tree to quote, and we are not going to make one up. If you want to know what numaperf buys you on your service, you have to measure it on your service. The library gives you the primitives to do that measurement honestly; see Reading numastat without lying to yourself.

The recipe

Here is the order I run things in when I suspect NUMA on a Rust service.

Step 1. Map the topology, then trust it

Before any tuning, ask the box what it is.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0-23 48-71
node 0 size: 191921 MB
node 1 cpus: 24-47 72-95
node 1 size: 193487 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

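That 21 is the relative cost of touching node 1 memory from node 0. The kernel calls 10 “local”, so 21 reads as roughly 2.1x the local cost. Treat it as the firmware’s estimate (it comes from the ACPI distance table), not a measured latency. numaperf’s Topology::discover() surfaces this as numa_nodes() with inter-node distances; you can branch on the distance table at startup to decide whether to bother with sharding at all.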
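To make "branch on the distance table" concrete, here is a sketch that works on the raw numbers rather than on numaperf’s types, whose exact shape I am not going to guess at. It assumes each row looks like the contents of /sys/devices/system/node/nodeN/distance. The 1.5 threshold is a hypothetical starting point, not a recommendation from the library.

```rust
/// Parse one row of the distance table, e.g. "10 21".
fn parse_distance_row(s: &str) -> Option<Vec<u32>> {
    s.split_whitespace().map(|t| t.parse::<u32>().ok()).collect()
}

/// Worst remote-to-local distance ratio across the table.
/// Returns None for a single-node table or a malformed one.
fn worst_remote_ratio(table: &[Vec<u32>]) -> Option<f64> {
    let mut worst: Option<f64> = None;
    for (i, row) in table.iter().enumerate() {
        // The diagonal entry is the node's distance to itself ("local").
        let local = *row.get(i)? as f64;
        if local == 0.0 {
            return None;
        }
        for (j, &d) in row.iter().enumerate() {
            if i != j {
                let ratio = d as f64 / local;
                worst = Some(worst.map_or(ratio, |w| w.max(ratio)));
            }
        }
    }
    worst
}

/// Hypothetical cut-off: below this, remote access is cheap enough to ignore.
const SHARD_THRESHOLD: f64 = 1.5;

fn worth_sharding(table: &[Vec<u32>]) -> bool {
    worst_remote_ratio(table).map_or(false, |r| r >= SHARD_THRESHOLD)
}
```

For the two-node box above the worst ratio is 21 / 10, so worth_sharding returns true; a one-node table returns false because there is no remote entry to compare.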

If the distances are weirdly uniform (every entry the same, say) or you only see one node on a box you know is dual-socket, suspect a BIOS setting (Node Interleaving = Enabled is the classic). Fix the firmware before you fix the code.

Step 2. Pin one thread, measure the difference

The cheapest experiment in the world: pin your hottest worker thread to a single node, allocate its hot buffer with MemPolicy::Bind to the same node, prefault the pages, and rerun your latency test.

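// Discover the machine's NUMA layout once, at startup.
let topo = Topology::discover()?;
let node0 = topo.numa_nodes()[0].id();
// Keep the pin in a named binding so it lives to the end of the scope.
let _pin = ScopedPin::to_node(&topo, node0)?;
// Back the hot buffer with pages bound to the same node, touched up front.
let region = NumaRegion::anon(
    HOT_BUFFER_BYTES,
    MemPolicy::Bind(NodeMask::single(node0)),
    Default::default(),
    Prefault::Touch,
)?;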

If your p99 moves — in either direction — you’ve found a real signal. If it doesn’t move at all, NUMA is not your problem on this workload. Either way, you’ve learned something in an afternoon.

The Prefault::Touch is doing serious work in that snippet. Without it, the kernel allocates physical pages lazily on first write, and the policy you set may not bind the way you expect under memory pressure. Touching the pages at allocation time makes the placement decision sticky.
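"Did p99 move" deserves a precise definition, because two tools will happily report two different p99s from the same samples. Here is a sketch using the nearest-rank method on raw latency samples (in whatever unit you recorded). The helper names are mine; your benchmark harness may already do this for you.

```rust
/// Nearest-rank percentile: the smallest sample such that at least
/// `p` percent of all samples are less than or equal to it.
fn percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Rank is 1-based; clamp so that p = 0 maps to the smallest sample.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Signed change in p99 between two runs (negative means the tail improved).
fn p99_delta(before: &[u64], after: &[u64]) -> Option<i64> {
    let b = percentile(before, 99.0)? as i64;
    let a = percentile(after, 99.0)? as i64;
    Some(a - b)
}
```

Run the unpinned and pinned variants several times each and compare the deltas between runs of the same variant first; if that noise is as large as the pinned-versus-unpinned delta, you have not found a signal yet.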

Step 3. Decide whether you’re sharding or migrating

Once you know NUMA matters here, you have a strategic choice.

Shard. Partition state per node. Each worker owns its slice of memory, runs only on its node’s CPUs, and never reads remote data. numaperf’s NumaSharded<T> exists for this; ShardedCounter is the canonical example of why you want per-node lock-free state instead of one global atomic.
Migrate. Keep the data central, but route work to the node where the data lives. Useful when your data set is too big to shard cleanly, or when access patterns are too skewed for a static partition. numaperf’s NumaExecutor with per-node worker pools is the tool here — you submit work tagged with the node you want it to run on, and the executor lands it there.
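To show what "per-node state instead of one global atomic" means in the shard option, here is a standard-library sketch of the idea. It is not numaperf’s ShardedCounter and makes no claim about how that type is implemented. It only shows the layout: one padded slot per node so that writers on different nodes never share a cache line. The 128-byte alignment is an assumption that covers common cache-line sizes, and the Vec itself is not placed on any particular node here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// One counter slot, padded so neighbouring slots do not share a cache line.
#[repr(align(128))]
struct PaddedCounter(AtomicU64);

/// A counter split into one slot per NUMA node.
struct PerNodeCounter {
    shards: Vec<PaddedCounter>,
}

impl PerNodeCounter {
    fn new(nodes: usize) -> Self {
        assert!(nodes > 0, "need at least one node");
        let shards = (0..nodes)
            .map(|_| PaddedCounter(AtomicU64::new(0)))
            .collect();
        Self { shards }
    }

    /// Increment the slot for `node`. The caller passes the node it runs on.
    fn add(&self, node: usize, n: u64) {
        let slot = &self.shards[node % self.shards.len()];
        slot.0.fetch_add(n, Ordering::Relaxed);
    }

    /// Sum all slots. Not a consistent snapshot under concurrent writes.
    fn sum(&self) -> u64 {
        self.shards.iter().map(|s| s.0.load(Ordering::Relaxed)).sum()
    }
}
```

The write path touches only the caller’s slot; the read path pays for visiting every slot, which is the right trade when increments vastly outnumber reads.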

In practice, services that need NUMA at all usually do both: shard the state that can be sharded, migrate the work that has to cross.
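The migrate half can be sketched with plain channels: one queue and one worker thread per node, and a submit call that takes the target node. This is a toy under stated assumptions, not NumaExecutor. The workers here are ordinary threads, and the pinning step is left as a comment because it depends on the library call you use for it.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};

/// A unit of work; it receives the node index of the worker that runs it.
type Job = Box<dyn FnOnce(usize) + Send + 'static>;

struct NodeRouter {
    senders: Vec<Sender<Job>>,
    workers: Vec<JoinHandle<()>>,
}

impl NodeRouter {
    fn new(nodes: usize) -> Self {
        let mut senders = Vec::with_capacity(nodes);
        let mut workers = Vec::with_capacity(nodes);
        for node in 0..nodes {
            let (tx, rx) = channel::<Job>();
            senders.push(tx);
            workers.push(thread::spawn(move || {
                // A real worker would pin itself to `node` before this loop.
                for job in rx {
                    job(node);
                }
            }));
        }
        Self { senders, workers }
    }

    /// Queue `f` on the worker for `node`. Returns false if the node
    /// index is out of range or the worker has gone away.
    fn submit<F>(&self, node: usize, f: F) -> bool
    where
        F: FnOnce(usize) + Send + 'static,
    {
        match self.senders.get(node) {
            Some(tx) => tx.send(Box::new(f)).is_ok(),
            None => false,
        }
    }

    /// Close all queues and wait for the workers to drain them.
    fn shutdown(self) {
        drop(self.senders);
        for worker in self.workers {
            let _ = worker.join();
        }
    }
}
```

One worker per node is the simplest shape that demonstrates routing; a per-node pool follows the same pattern with several threads sharing each queue.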

Step 4. Make device locality non-optional

If you’re doing packet processing or NVMe-heavy work, the NIC or the SSD is wired to a specific node. Reading the buffer from the wrong node turns “process this packet” into “process this packet, after a round trip across the socket interconnect” (QPI or UPI on Intel, Infinity Fabric on AMD). numaperf exposes this through device locality APIs so you can ask the system which node the NIC lives on, and allocate your ring buffers there. If you do not do this, your p99 will be dominated by whichever cores happened to win the wakeup lottery.
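If you want to sanity-check what a device locality API tells you, Linux exposes the same fact in sysfs for PCI-backed network interfaces. This is a sketch that reads it directly, assuming the interface has a device/numa_node file. The function names are mine, and the interface name is whatever your box uses.

```rust
use std::fs;
use std::path::PathBuf;

/// Parse the contents of a sysfs numa_node file.
/// The kernel writes -1 when the platform did not report a node.
fn parse_numa_node(raw: &str) -> Option<u32> {
    let n = raw.trim().parse::<i64>().ok()?;
    if n < 0 {
        None
    } else {
        Some(n as u32)
    }
}

/// Look up the NUMA node a network interface is attached to (Linux only).
/// Returns None for virtual interfaces, unknown nodes, or read errors.
fn nic_numa_node(iface: &str) -> Option<u32> {
    let path = PathBuf::from("/sys/class/net")
        .join(iface)
        .join("device/numa_node");
    let raw = fs::read_to_string(path).ok()?;
    parse_numa_node(&raw)
}
```

A None here is worth logging loudly: it means you are about to guess where the ring buffers should go.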

Step 5. Observe locality, not just latency

The trap is shipping the code and declaring victory when median improves. The thing that actually tells you the fix landed is the locality ratio: of all the memory accesses from worker N, what fraction hit local memory? numaperf’s observability surface tracks this; you want it on a dashboard alongside latency. When the ratio drops, your tail will follow within a release or two.
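I will not guess at the shape of numaperf’s observability API, but a coarse proxy for the locality ratio is available from the kernel’s per-node counters. The sketch below parses /sys/devices/system/node/nodeN/numastat and computes local_node / (local_node + other_node). Note the assumption: these counters track page allocations, not individual memory accesses, so this is a proxy for the ratio described above rather than the ratio itself.

```rust
use std::collections::HashMap;
use std::fs;

/// Parse "key value" lines as found in a per-node numastat file.
/// Lines that do not fit the pattern are skipped.
fn parse_numastat(raw: &str) -> HashMap<String, u64> {
    raw.lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let key = parts.next()?;
            let value = parts.next()?.parse::<u64>().ok()?;
            Some((key.to_string(), value))
        })
        .collect()
}

/// Share of this node's allocations that were made by tasks running
/// on this node. Returns None if counters are missing or both are zero.
fn allocation_locality(raw: &str) -> Option<f64> {
    let stats = parse_numastat(raw);
    let local = *stats.get("local_node")?;
    let other = *stats.get("other_node")?;
    let total = local.checked_add(other)?;
    if total == 0 {
        return None;
    }
    Some(local as f64 / total as f64)
}

/// Read and compute the proxy for one node (Linux only).
fn node_allocation_locality(node: u32) -> Option<f64> {
    let path = format!("/sys/devices/system/node/node{}/numastat", node);
    allocation_locality(&fs::read_to_string(path).ok()?)
}
```

The counters are cumulative since boot, so for a dashboard you want the ratio of the deltas between two scrapes, not the ratio of the raw values.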

When to leave it alone

Some honest cases for not using NUMA primitives:

Single-socket boxes. As above. One node, no problem.
You haven’t measured anything yet. Adding ScopedPin to code that does not have a latency SLO is just extra surface area for bugs.
Your runtime is a black box. If you are deep inside a framework that owns its own thread pool and you cannot intercept worker startup, you cannot pin reliably. Fix the runtime story first, or work outside it.
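You’re bandwidth-bound within a single node. If the working set and the threads already sit on one node and you are saturating its memory controller, no amount of placement helps you; you need either more memory bandwidth or a smaller working set.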

The short version

NUMA is one of those topics where the cost of getting it wrong is high and the cost of getting it right is “a weekend, plus humility”. The recipe above won’t make your service fast on its own. What it will do is keep you from inventing the wrong fix — the retry, the larger pool, the bigger box — when the real problem is that the memory is in the wrong place.

If you decide it is, in fact, the wrong place, that’s where numaperf comes in. The quickstart takes about five minutes. The diagnostic recipe above takes about an afternoon. Together, they will tell you something honest about your workload — which is more than most performance work manages.

Found a mistake or want to argue about a number? Open an issue.