Pinning vs scheduling: where Rust runtimes leave perf on the table

Async runtimes are excellent at one thing and indifferent at another. Knowing which is which is the difference between a clean p99 and an angry pager.

If you ask a Rust async runtime what it is good at, the honest answer is “moving futures between worker threads”. Runtime authors have spent a decade getting brilliant at it. Work-stealing schedulers, low-overhead wakeups, careful cache prefetch on poll: the engineering is excellent. The problem is that “moving futures between worker threads” is precisely the wrong primitive when you care about NUMA. So you end up with a runtime that is fast at the thing that hurts you.

This post is about that gap, and what to do about it without rewriting your service.

Two different jobs

There are two related but distinct concerns hiding under the word “scheduling” in a typical Rust service.

Pinning is about where the work runs. It is a constraint: “this thread executes only on CPUs in node 0”, or “this future polls only on the worker bound to NIC 1”. Pinning is cheap to express, easy to verify, and entirely orthogonal to whether the work is sync or async.

Scheduling is about which work runs next. It is a policy: priorities, deadlines, queue disciplines, fairness, work stealing. Scheduling decides which of N runnable things gets the next CPU slice. Both kernel and userland make these decisions, often in conflict.

Most async runtimes implement scheduling beautifully and ignore pinning entirely. That is fine for a web service. It is not fine for a packet processor.

What general-purpose runtimes get right

The default model for something like Tokio is: a thread pool whose size is the number of CPUs, a global injector queue, and per-worker local queues with work stealing. When worker A is idle, it steals from worker B. When a task is woken from a worker thread, it usually lands on that worker’s local queue; when it is woken from outside the pool, it goes to the global queue, where any worker may pick it up. Either way, nothing ties the task to the core that last ran it. The result is excellent CPU utilisation and low overhead per task.
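To make the shape concrete, here is a toy, single-threaded simulation of that queue layout. It is an illustration only, not Tokio’s implementation: the `Scheduler` type, its method names, and the steal-half policy are assumptions chosen for brevity.

```rust
use std::collections::VecDeque;

/// Toy model: one global injector queue plus one local queue per worker.
/// Task ids stand in for futures. Hypothetical names, not Tokio internals.
struct Scheduler {
    injector: VecDeque<u32>,
    locals: Vec<VecDeque<u32>>,
}

impl Scheduler {
    fn new(workers: usize) -> Self {
        Scheduler {
            injector: VecDeque::new(),
            locals: vec![VecDeque::new(); workers],
        }
    }

    /// Work submitted from outside the pool lands on the global queue.
    fn inject(&mut self, task: u32) {
        self.injector.push_back(task);
    }

    /// Work woken by a worker lands on that worker's local queue.
    fn push_local(&mut self, worker: usize, task: u32) {
        self.locals[worker].push_back(task);
    }

    /// Pick the next task for `worker`: local queue, then injector, then steal.
    /// Returns the task and the worker it was stolen from, if any.
    fn next_task(&mut self, worker: usize) -> Option<(u32, Option<usize>)> {
        if let Some(t) = self.locals[worker].pop_front() {
            return Some((t, None));
        }
        if let Some(t) = self.injector.pop_front() {
            return Some((t, None));
        }
        // Idle: move half of the first non-empty victim queue to this worker.
        for victim in 0..self.locals.len() {
            if victim == worker || self.locals[victim].is_empty() {
                continue;
            }
            let take = (self.locals[victim].len() + 1) / 2;
            for _ in 0..take {
                let t = self.locals[victim].pop_front().unwrap();
                self.locals[worker].push_back(t);
            }
            let t = self.locals[worker].pop_front().unwrap();
            return Some((t, Some(victim)));
        }
        None
    }
}
```

Note what is missing: nothing in `next_task` knows which socket the victim sits on. The steal loop treats every worker as an equally good source of work, which is the property the rest of this post is about.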

For a workload where most tasks are IO-bound and short-lived, this is exactly the right design. The cost of an occasional task running on the “wrong” core is dwarfed by the cost of leaving cores idle.

Where it breaks

The problems start when:

  1. Tasks operate on a working set big enough to spill out of L2/L3.
  2. Memory bandwidth or remote-node latency starts mattering on the critical path.
  3. Tail-latency SLOs are tight enough that “the worker that picked up this task happened to be on a different socket from the data” is a sin you can’t afford.

Then the work-stealing scheduler — the thing that was a feature for throughput — actively works against you. A future that was carefully placed near its data on worker 0 gets stolen by an idle worker 8 on the other socket. Now every memory access on that task is paying the cross-node tax described in the numaperf README: 2–3x compared to local access. The runtime didn’t know the data was on node 0. From its point of view, all CPUs are interchangeable.
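A back-of-envelope model shows why this hurts the tail long before it shows up in the mean. The sketch below assumes memory-bound tasks of equal size, a fixed fraction of which run on the wrong node; the function names and the model itself are illustrative assumptions, not measurements.

```rust
/// Mean slowdown relative to all-local execution when `remote_fraction` of
/// tasks run on the wrong node and each of those costs `remote_penalty` times
/// as much. Assumes memory-bound tasks of equal size; a model, not a benchmark.
fn mean_slowdown(remote_fraction: f64, remote_penalty: f64) -> f64 {
    assert!((0.0..=1.0).contains(&remote_fraction));
    (1.0 - remote_fraction) + remote_fraction * remote_penalty
}

/// Slowdown seen at a given percentile (e.g. 0.99) under the same model.
/// The slowest `remote_fraction` of tasks are exactly the remote ones, so the
/// percentile pays the full penalty as soon as they reach into it.
fn percentile_slowdown(percentile: f64, remote_fraction: f64, remote_penalty: f64) -> f64 {
    if remote_fraction > 1.0 - percentile {
        remote_penalty
    } else {
        1.0
    }
}
```

Under these assumptions, with 2% of tasks stolen across sockets and a 3x penalty, `mean_slowdown(0.02, 3.0)` is about 1.04 while `percentile_slowdown(0.99, 0.02, 3.0)` is the full 3.0. The average barely moves; the p99 takes the whole hit.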

This is not a bug in Tokio. It’s a deliberate decision: optimise for the common case. The common case is not yours.

Pinning as a foundation

The first move, if you’re doing NUMA-sensitive work, is to take pinning back from the runtime. This is what ScopedPin exists for in numaperf: an RAII handle that pins the current thread to a set of CPUs and restores the previous affinity when it drops. The “scoped” part matters. You don’t want a global affinity that leaks across worker-pool generations; you want affinity that exists for the lifetime of the work you’re doing.

let topo = Topology::discover()?;
let node0 = topo.numa_nodes()[0].id();
let _pin = ScopedPin::to_node(&topo, node0)?;
// This thread now executes only on node 0's CPUs.
// When _pin drops, prior affinity is restored.

The honest caveat: if you call this from inside a generic worker pool that the runtime owns, you are pinning a worker that the runtime expected to be free to move. Some runtimes cope. Some get confused. In general, you want to either own the worker thread yourself, or use a runtime that exposes a “spawn on a specific worker” hook.
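For readers who have not met the pattern, the save-and-restore shape of a scoped pin can be sketched in a few lines. This is not numaperf’s implementation: `AffinityGuard`, `AffinityBackend`, and `FakeBackend` are hypothetical names, and the backend is an in-memory stand-in so the sketch runs anywhere. On Linux a real backend would wrap `sched_getaffinity` and `sched_setaffinity`.

```rust
use std::cell::RefCell;

/// Minimal view of "the current thread's CPU affinity".
/// Hypothetical trait; a Linux backend would call sched_getaffinity and
/// sched_setaffinity instead of touching an in-memory list.
trait AffinityBackend {
    fn current(&self) -> Vec<usize>;
    fn set(&self, cpus: &[usize]);
}

/// RAII guard: narrows affinity on creation, restores the old mask on drop.
struct AffinityGuard<'a, B: AffinityBackend> {
    backend: &'a B,
    previous: Vec<usize>,
}

impl<'a, B: AffinityBackend> AffinityGuard<'a, B> {
    fn pin(backend: &'a B, cpus: &[usize]) -> Self {
        // Remember the old mask first so drop can put it back.
        let previous = backend.current();
        backend.set(cpus);
        AffinityGuard { backend, previous }
    }
}

impl<'a, B: AffinityBackend> Drop for AffinityGuard<'a, B> {
    fn drop(&mut self) {
        self.backend.set(&self.previous);
    }
}

/// In-memory backend used only to demonstrate the guard.
struct FakeBackend {
    mask: RefCell<Vec<usize>>,
}

impl AffinityBackend for FakeBackend {
    fn current(&self) -> Vec<usize> {
        self.mask.borrow().clone()
    }
    fn set(&self, cpus: &[usize]) {
        *self.mask.borrow_mut() = cpus.to_vec();
    }
}
```

The design point is that restoration lives in `Drop`, so an early return or a `?` inside the pinned region cannot leave the thread stuck on the narrowed mask.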

Scheduling, when you have to do it yourself

Once threads are pinned and data is placed, you usually want a scheduler that respects both. This is the gap that numaperf’s NumaExecutor is designed to fill: per-node worker pools, with explicit submission of work to a particular node, and optional work stealing within a node only.

The model:

  • Each NUMA node has its own pool of workers, all pinned to that node’s CPUs.
  • Each pool has its own local queue.
  • Submitting work takes a node hint; the executor places it on a worker in that node’s pool.
  • Work stealing, if enabled, is constrained to within the same node. A node-1 worker never steals from a node-0 worker.
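The model above can be sketched with nothing but the standard library. This is a minimal illustration, not numaperf’s NumaExecutor: `PerNodeExecutor` and its methods are hypothetical, the workers are not actually pinned (the place where they would be is marked), and each node’s workers share one queue, so there is no stealing at all, within or across nodes.

```rust
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

/// A job receives the id of the node whose worker ran it.
type Job = Box<dyn FnOnce(usize) + Send + 'static>;

/// Sketch of a per-node executor: one queue and one worker pool per node.
/// Hypothetical type, not numaperf's NumaExecutor.
struct PerNodeExecutor {
    queues: Vec<Sender<Job>>,
    workers: Vec<JoinHandle<()>>,
}

impl PerNodeExecutor {
    fn new(nodes: usize, workers_per_node: usize) -> Self {
        let mut queues = Vec::new();
        let mut workers = Vec::new();
        for node in 0..nodes {
            let (tx, rx) = channel::<Job>();
            // Workers of one node share that node's queue and no other.
            let rx = Arc::new(Mutex::new(rx));
            for _ in 0..workers_per_node {
                let rx = Arc::clone(&rx);
                workers.push(thread::spawn(move || {
                    // A real worker would pin itself to `node`'s CPUs here.
                    loop {
                        // Hold the lock only to dequeue, not while running.
                        let job = rx.lock().unwrap().recv();
                        match job {
                            Ok(job) => job(node),
                            Err(_) => break, // queue closed: shut down
                        }
                    }
                }));
            }
            queues.push(tx);
        }
        PerNodeExecutor { queues, workers }
    }

    /// Submit a job with a node hint; only that node's workers can run it.
    fn submit<F: FnOnce(usize) + Send + 'static>(&self, node: usize, f: F) {
        self.queues[node].send(Box::new(f)).expect("pool is running");
    }

    /// Close all queues and wait for the workers to drain them.
    fn shutdown(self) {
        drop(self.queues);
        for w in self.workers {
            w.join().unwrap();
        }
    }
}
```

Because a job can only ever be dequeued by a worker holding its node’s receiver, the locality guarantee falls out of the data structure rather than out of a scheduling policy.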

The cost of this model is that you give up some throughput. A worker can be idle while another node’s queue is backed up, and the scheduler will not let it help. In return, you get a strong locality guarantee: a task submitted to node 0 will run on node 0’s CPUs, accessing node 0’s memory, end of story. For workloads where tail latency matters more than peak throughput, that’s a trade you’d take every time.

A pragmatic hybrid

You do not have to choose. The pattern that works in practice is to keep your general-purpose async runtime for most of the application — the parts that handle config, control planes, observability, RPC — and carve out a dedicated set of workers for the latency-critical work.

Concretely:

  1. Run Tokio (or your runtime of choice) with its default settings for the control plane.
  2. At startup, discover topology and spin up a numaperf-style executor for the hot path, with workers pinned to specific nodes.
  3. The control plane hands off work to the per-node executor via a bounded channel.
  4. The hot path never blocks on, or shares state with, the control plane’s threads.
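Step 3 is where the two regimes meet, so it is worth sketching. The example below uses the standard library’s bounded `sync_channel` as a stand-in for whatever channel your runtime provides; `Packet`, `hand_off`, and `hot_path` are hypothetical names, and there is one queue per node.

```rust
use std::sync::mpsc::{Receiver, SyncSender, TrySendError};

/// A unit of hot-path work tagged with its target node. Hypothetical type.
struct Packet {
    node: usize,
    payload: Vec<u8>,
}

/// Control-plane side: hand off to the target node's queue without blocking.
/// If that queue is full or closed, the packet comes back to the caller,
/// who can shed it or retry later.
fn hand_off(queues: &[SyncSender<Packet>], pkt: Packet) -> Result<(), Packet> {
    match queues[pkt.node].try_send(pkt) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(pkt)) | Err(TrySendError::Disconnected(pkt)) => Err(pkt),
    }
}

/// Hot-path side: drain one node's queue until the control plane hangs up.
/// Returns the number of payload bytes processed.
fn hot_path(rx: Receiver<Packet>) -> usize {
    rx.iter().map(|pkt| pkt.payload.len()).sum()
}
```

Using `try_send` rather than a blocking send keeps backpressure visible on the control-plane side: a saturated hot path shows up as rejected packets there instead of as stalled control-plane tasks.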

This costs you a bit of architecture: you now have two scheduling regimes inside one binary. What you get is the ergonomics of a general-purpose runtime where ergonomics matter, and a hard-real-time-ish locality guarantee where latency matters.

Symptoms that say “pinning is your fix”

Some heuristics, from the field:

  • Latency tracks load non-linearly. p99 holds flat for a while and then suddenly explodes once idle workers start stealing tasks whose data lives on another socket. This is the classic work-stealing-on-NUMA shape.
  • You see cores going idle on one socket while a queue piles up on the other. The runtime is balancing globally; your data is not.
  • perf stat -e node-loads,node-load-misses shows a high ratio of remote-node loads on threads that “should” be touching local memory. Your futures are migrating away from their data.
  • Adding more cores makes p99 worse. More workers, more opportunities to steal, more locality violations.

If you recognise any of these, the runtime is making a choice it does not have enough information to make. Give it that information.

What this does not solve

Pinning and per-node scheduling do not help if:

  • Your data structure is contested across nodes (one big global cache; a lock-free queue everyone hits). You need to shard the data first.
  • Your workload is single-threaded and bottlenecked on local memory latency, not bandwidth. Once the thread and its data share a node, NUMA placement has nothing left to give; you need a smaller working set.
  • You are running on a single-socket box. There is exactly one node. Pin all you want; it won’t matter.
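The last case is cheap to rule out before doing anything else. On Linux the kernel lists NUMA nodes as `node<N>` directories under `/sys/devices/system/node`; the helper names below are hypothetical, and the sketch assumes that sysfs path is present and readable.

```rust
use std::fs;
use std::io;

/// True for sysfs entry names of the form `node<N>`, e.g. `node0`.
fn is_node_dir(name: &str) -> bool {
    match name.strip_prefix("node") {
        Some(rest) => !rest.is_empty() && rest.bytes().all(|b| b.is_ascii_digit()),
        None => false,
    }
}

/// Count NUMA nodes given a list of directory entry names.
fn count_nodes<'a, I: IntoIterator<Item = &'a str>>(names: I) -> usize {
    names.into_iter().filter(|n| is_node_dir(n)).count()
}

/// Count NUMA nodes on a Linux host by listing /sys/devices/system/node.
/// Returns an error if the directory is missing or unreadable.
fn numa_node_count() -> io::Result<usize> {
    let mut names = Vec::new();
    for entry in fs::read_dir("/sys/devices/system/node")? {
        names.push(entry?.file_name().to_string_lossy().into_owned());
    }
    Ok(count_nodes(names.iter().map(|s| s.as_str())))
}
```

If `numa_node_count()` comes back as 1, per-node pinning has nothing to separate and the rest of this post does not apply to that machine.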

For everything else — multi-socket boxes, real tail-latency SLOs, data sets that won’t fit in one node’s cache — the move is the same. Take pinning back from the runtime, place memory deliberately, and schedule with a tool that knows about both. The runtime you’ve been using is excellent at what it does. It just was never asked to think about which socket your packets came in on.

