2026-05-21 · observability / numa / tooling

Reading numastat without lying to yourself

numastat will happily tell you a story that confirms whatever you wanted to believe. Here’s how to read it like a hostile witness.

numastat is the cheapest NUMA observability tool you have. It is also the easiest one to misread. The default output looks reassuring (big numbers in the numa_hit row, small numbers in numa_miss), and it’s tempting to glance at it, declare your service “mostly local”, and move on. Most of the time that conclusion is wrong, or at least incomplete. This is a short field guide to cross-examining it.

What the counters actually mean

When you run numastat with no arguments, you get per-node allocation counters for the whole system, counted in pages and cumulative since boot. Each node gets a column; the rows are:

numa_hit — pages allocated on this node where the policy intended this node. The thing you want to be big.
numa_miss — pages allocated on this node, but the policy intended a different node. You wanted node 1, you got node 0, because node 1 was out of memory.
numa_foreign — the dual of numa_miss: pages this node would have wanted, but got somewhere else.
interleave_hit — pages allocated under an interleave policy where the round-robin happened to land here. Mostly informational.
local_node — pages allocated on this node by a process running on this node.
other_node — pages allocated on this node by a process running on a different node. This is the one you usually care about.

The first lie people tell themselves with numastat is reading numa_miss as “remote access”. It isn’t. It’s “policy not honored at allocation time”. A process that allocates on node 0 because node 1 was full and then keeps running on node 1 is paying the cross-node cost on every access, but numa_miss only ticks once — at allocation. Your latency cost is paid in remote access, not in the counter.
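To make the allocation-versus-access distinction concrete, here is a minimal sketch that reads the same counters from sysfs (the kernel exposes them per node in /sys/devices/system/node/nodeN/numastat) and turns them into two ratios. The function names and the ratio names are mine, not part of numastat or numaperf, and both ratios describe where allocations landed, not where accesses go.

```python
from pathlib import Path


def parse_node_numastat(text):
    """Parse one node's numastat file: one '<counter> <pages>' pair per line."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def allocation_ratios(counters):
    """Ratios over allocations that landed on this node (not over accesses)."""
    placed = counters["numa_hit"] + counters["numa_miss"]
    by_cpu = counters["local_node"] + counters["other_node"]
    return {
        # Share of pages that ended up here although policy wanted another node.
        "miss_share": counters["numa_miss"] / placed if placed else 0.0,
        # Share of pages allocated here by a task running on another node.
        "other_node_share": counters["other_node"] / by_cpu if by_cpu else 0.0,
    }


def read_all_nodes(root="/sys/devices/system/node"):
    """Return {node_name: counters} for every node directory under root."""
    nodes = {}
    for path in sorted(Path(root).glob("node[0-9]*/numastat")):
        nodes[path.parent.name] = parse_node_numastat(path.read_text())
    return nodes
```

Calling allocation_ratios on each entry of read_all_nodes() gives a per-node view. A low miss_share still says nothing about where the threads that use those pages are running.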

The since-boot problem

The second lie is timescale. By default these counters are cumulative since boot. Your hot path may have flipped to “mostly remote” three hours ago, and the totals still look fine because the first six weeks of uptime were healthy. Always compare two samples; never trust a single snapshot.

The minimum-viable diagnostic loop:

$ numastat -p $(pgrep your-service) > before.txt
# induce load
$ numastat -p $(pgrep your-service) > after.txt
$ diff before.txt after.txt

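The -p form reports something different from the default counters: how many megabytes of the process’s memory are resident on each node right now. That is usually what you actually want, because it is scoped to your service, and the diff shows where the memory that the load brought in ended up. For the allocation counters themselves, apply the same before/after diff to plain numastat output. Either way, the delta is the truth; the absolute values are folklore.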
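The before/after habit carries over to the since-boot counters. The sketch below snapshots the per-node files under /sys/devices/system/node, waits, snapshots again, and reports only the growth. The helper names (read_counters, counter_deltas, sample_window) are illustrative, and the window length is whatever covers your induced load.

```python
import time
from pathlib import Path


def read_counters(root="/sys/devices/system/node"):
    """Return {node_name: {counter: pages}} from the per-node numastat files."""
    snapshot = {}
    for path in sorted(Path(root).glob("node[0-9]*/numastat")):
        counters = {}
        for line in path.read_text().splitlines():
            parts = line.split()
            if len(parts) == 2:
                counters[parts[0]] = int(parts[1])
        snapshot[path.parent.name] = counters
    return snapshot


def counter_deltas(before, after):
    """Per-node, per-counter growth between two snapshots."""
    return {
        node: {
            name: after[node][name] - before[node].get(name, 0)
            for name in after[node]
        }
        for node in after
        if node in before
    }


def sample_window(seconds, root="/sys/devices/system/node"):
    """Snapshot, sleep for the window, snapshot again, return the deltas."""
    before = read_counters(root)
    time.sleep(seconds)
    return counter_deltas(before, read_counters(root))
```

For example, sample_window(30) run while the load generator is active returns pages allocated per node during those 30 seconds, which is the rate the totals hide.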

The per-process view, and its limits

numastat -p PID shows you, for each NUMA node, how much memory of each kind (huge pages, heap, stack, private) the process has resident there, in megabytes. This is useful. It tells you where the working set is.

It does not tell you where the process is running. A process can have all its memory on node 0 and execute entirely on node 1; numastat will show a beautifully local-looking memory footprint while every access pays the cross-node tax. To catch this, pair it with taskset -a -p PID (to see each thread’s affinity mask) and ps -L -o tid,psr -p PID (to see which CPU each thread last ran on); numactl --hardware lists which CPUs belong to which node. If the CPUs are on a different node than the memory, you have your bug.
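The same check can be scripted straight from /proc and /sys. This is a rough sketch rather than a finished tool: it maps each thread’s last-run CPU (field 39 of /proc/PID/task/TID/stat) to a node using the per-node cpulist files. The function names are hypothetical, and “last ran on” is a point-in-time sample, so run it several times under load.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel CPU list such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # nodes without CPUs have an empty cpulist
        low, _, high = part.partition("-")
        cpus.update(range(int(low), int(high or low) + 1))
    return cpus


def last_cpu_from_stat(stat_line):
    """Return field 39 of a /proc stat line: the CPU the task last ran on."""
    # The command name (field 2) may contain spaces or ')', so split the
    # remainder after the last ')'. rest[0] is field 3, so field 39 is rest[36].
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    return int(rest[36])


def cpu_to_node(root="/sys/devices/system/node"):
    """Return {cpu_id: node_id} built from each node's cpulist file."""
    mapping = {}
    for path in Path(root).glob("node[0-9]*/cpulist"):
        node = int(path.parent.name[len("node"):])
        for cpu in parse_cpulist(path.read_text()):
            mapping[cpu] = node
    return mapping


def thread_nodes(pid, proc="/proc"):
    """Return {tid: node_id} for the CPU each thread of pid last ran on."""
    nodes = cpu_to_node()
    result = {}
    for stat in Path(proc, str(pid), "task").glob("*/stat"):
        cpu = last_cpu_from_stat(stat.read_text())
        result[int(stat.parent.name)] = nodes.get(cpu)
    return result
```

Compare the nodes returned by thread_nodes(pid) with the nodes where numastat -p says the memory lives. Any thread sitting on a node that holds none of the working set is a candidate for the cross-node tax.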

The huge-page trap

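If your service uses transparent huge pages or explicit hugetlbfs, the per-node accounting gets noisier. THP merges 4 KiB pages into 2 MiB pages (the sizes on x86-64); the placement of a 2 MiB page is decided once, and you don’t get a second chance to fix it short of explicit migration. numastat -m breaks out huge-page stats per node, with the explicit hugetlbfs pool in the HugePages_* rows and THP-backed memory under AnonHugePages. The pool rows look like this: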

                          Node 0     Node 1
HugePages_Total            1024       1024
HugePages_Free              512        512

If your HugePages_Total is wildly asymmetric across nodes, that’s a configuration issue at boot, not a runtime one. If HugePages_Free is wildly asymmetric, you have a placement problem and you’re going to start seeing allocation failures or transparent fallback to small pages, which will show up as latency variance.
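A quick way to put a number on “wildly asymmetric” is to read the per-node hugetlbfs pool files directly. The sketch below assumes the 2048 kB pool under /sys/devices/system/node/nodeN/hugepages/; the function names and the idea of a single “spread” figure are my own shorthand, and it covers explicit huge pages only, not THP.

```python
from pathlib import Path


def read_hugepage_pools(size_kb=2048, root="/sys/devices/system/node"):
    """Return {node_name: (total, free)} for one hugetlbfs page size."""
    pools = {}
    pattern = f"node[0-9]*/hugepages/hugepages-{size_kb}kB"
    for pool_dir in sorted(Path(root).glob(pattern)):
        total = int((pool_dir / "nr_hugepages").read_text())
        free = int((pool_dir / "free_hugepages").read_text())
        pools[pool_dir.parent.parent.name] = (total, free)
    return pools


def free_fraction_spread(pools):
    """Largest gap between per-node free fractions; 0.0 means symmetric."""
    fractions = [free / total for total, free in pools.values() if total]
    return max(fractions) - min(fractions) if fractions else 0.0
```

With the table above (1024 total and 512 free on both nodes) the spread is 0.0. A spread close to 1.0 means one node’s pool is nearly exhausted while another’s is nearly untouched.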

What numaperf gives you that numastat doesn’t

numastat stops at the process boundary: it gives you system-wide counters or, with -p, a per-process footprint. It cannot tell you which of your workers is suffering, which of your allocations are misplaced, or which of your hot data structures is taking the cross-node hit. The whole point of building observability into the application is to answer those questions without making you correlate PIDs and timestamps by hand.

numaperf’s perf surface (see numaperf-perf) is designed around the same questions, but at the granularity that matters to a service owner:

Locality ratio per region. Of the accesses to this NumaRegion, what fraction hit local memory? If you bound a region to node 0 with Prefault::Touch, you want this to be ~100%. If it isn’t, your policy didn’t take.
Cross-node traffic counters. Bytes moved between nodes by your workers, attributable to a specific shard or pool. This is the number that tracks tail latency.
Health reports. Aggregated views you can put on a dashboard. The README describes “track locality ratios, generate health reports, identify cross-node traffic” — those are the three operations you want, in roughly that order of value.

These are not magic. They are the same counters the kernel exposes, plus the context your application has and the kernel doesn’t: which worker, which region, which workload.

A few non-obvious checks

Some failure modes that don’t show up in a casual numastat reading:

First-touch surprise. A process that allocates and touches its memory on a startup thread, then hands it off to workers on a different node, will look tidy in numastat -p (everything resident on one node) but be entirely remote in practice. The README calls out first-touch as a fragile policy; this is what it means.
THP defrag mismatches. If khugepaged collapses small pages into a huge page on a node different from where the data is being used, latency can spike for reasons that have nothing to do with your code. Check /sys/kernel/mm/transparent_hugepage/defrag (compaction at fault time) and /sys/kernel/mm/transparent_hugepage/khugepaged/defrag (the background collapser), and consider tuning them.
Per-thread affinity that doesn’t match per-process affinity. taskset -p PID reports the mask of that one task, which for a process ID is the main thread; individual threads can have different masks, and taskset -a -p PID lists them all. If you set affinity in the runtime but not in third-party threads (the GC, the metrics scraper, the profiler), those threads can be the ones touching the wrong node.
NUMA balancing. Some kernels enable automatic NUMA balancing by default; it migrates pages over time based on access patterns. This can save you, or it can fight your explicit policy. Check /proc/sys/kernel/numa_balancing. If you’ve bound pages with MemPolicy::Bind, balancing should leave you alone, but it’s worth confirming.
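The kernel knobs mentioned in these checks are plain text files, so they are easy to collect in one place. This is a small sketch with made-up function names; the THP files use the “option [selected] option” format, and any file that is missing on a given kernel is reported as None rather than raising.

```python
from pathlib import Path

THP = Path("/sys/kernel/mm/transparent_hugepage")


def selected_option(text):
    """Return the bracketed choice from text like 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def read_text_or_none(path):
    """Read and strip a small sysfs or procfs file; None if unavailable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def numa_knobs():
    """Collect the THP and NUMA balancing settings discussed above."""
    def choice(path):
        text = read_text_or_none(path)
        return selected_option(text) if text is not None else None

    return {
        "thp_enabled": choice(THP / "enabled"),
        "thp_defrag": choice(THP / "defrag"),
        "khugepaged_defrag": read_text_or_none(THP / "khugepaged" / "defrag"),
        "numa_balancing": read_text_or_none("/proc/sys/kernel/numa_balancing"),
    }
```

Logging numa_knobs() at service start gives you a record to compare against when latency changes after a kernel or configuration update.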

A reading habit

When I’m staring at numastat output trying to figure out whether NUMA is the bug, the habit that helps me is to read every number with the question “what could make this number lie?”

High numa_hit? Sample window matters — what’s the rate, not the total?
Low numa_miss? Could still be remote — numa_miss is allocation, not access.
Looks local per process? Check where the threads are running, not just where the memory is.
Huge pages look fine? Check whether khugepaged collapse or fallback to small pages is fighting you.
Locality ratio improved after a fix? Confirm the change is post-restart, not just a quieter sample window.

numastat is a perfectly good tool. It just was not designed to answer “is my Rust service NUMA-local”. It was designed to report what the kernel knows. The gap between those two questions is where the bugs live.

That gap is what numaperf is for: enough observability inside your process to ask the right question, then enough control to fix the answer.

Found a mistake or want to argue about a number? Open an issue.