SteerPipe / Predictive Expert Offload

Start typing. The right expert is already in VRAM by the time you stop.

SteerPipe reads your prompt character by character, predicts which expert it will need with an n-gram and keyword classifier, and streams that 2 GB shard over PCIe on a separate CUDA stream while you keep typing. By the time you hit send, the expert is already in VRAM, so decoding starts with no fetch stall. Try it below.

Prompt input

0 chars · General / Chat Expert


N-gram classifier

softmax / top-1

Code Expert: 18.7%

Math Expert: 18.7%

Creative Expert: 18.7%

General / Chat Expert: 44.0%

Waiting for input. The classifier scores character n-grams and keyword tokens on every keystroke.
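To make the scoring concrete, here is a minimal Python sketch of a classifier of this kind. It is not SteerPipe's actual model: the keyword and n-gram tables, the two weights, and the prior on General are all assumptions made up for illustration. The prior of 0.857 is chosen only so that an empty prompt reproduces roughly the 44.0% / 18.7% split shown in the panel above.

```python
import math
from collections import Counter

EXPERTS = ["Code", "Math", "Creative", "General"]

# Hypothetical cue tables; a real classifier would learn these.
KEYWORDS = {
    "Code": {"def", "function", "bug", "python", "compile"},
    "Math": {"integral", "prove", "equation", "solve", "matrix"},
    "Creative": {"story", "poem", "imagine", "character"},
}
NGRAMS = {
    "Code": {"def", "():", "imp"},
    "Math": {"x^2", "sum", "sqr"},
    "Creative": {"onc", "poe", "sto"},
}
# Assumed prior: General leads until specialist evidence arrives.
PRIOR = {"Code": 0.0, "Math": 0.0, "Creative": 0.0, "General": 0.857}
KEYWORD_WEIGHT = 1.5  # assumed weight per keyword hit
NGRAM_WEIGHT = 0.5    # assumed weight per character trigram hit


def char_ngrams(text, n=3):
    """Count every overlapping character n-gram in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def classify(prompt):
    """Return a softmax distribution over the four expert categories."""
    text = prompt.lower()
    grams = char_ngrams(text)
    words = set(text.split())
    logits = {}
    for expert in EXPERTS:
        keyword_hits = len(words & KEYWORDS.get(expert, set()))
        ngram_hits = sum(grams[g] for g in NGRAMS.get(expert, ()))
        logits[expert] = (PRIOR[expert]
                          + KEYWORD_WEIGHT * keyword_hits
                          + NGRAM_WEIGHT * ngram_hits)
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {e: math.exp(v - peak) for e, v in logits.items()}
    total = sum(exps.values())
    return {e: x / total for e, x in exps.items()}


def top1(probs):
    """Pick the highest-scoring expert."""
    return max(probs, key=probs.get)


empty = classify("")
typed = classify("def parse(): fix the python bug")
```

Calling `classify` again after each keystroke gives the live bars: `empty` leaves General / Chat on top, and `top1(typed)` switches to Code once the code keywords appear.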

CUDA stream timeline

async / overlapped

Stream 0 · GPU compute · trunk + attention layers

idle

Stream 1 · PCIe DMA · no specialist queued

resident

Expert weights in VRAM · shared trunk only

2.00 / 2.00 GB · 31.5 GB/s · PCIe 4.0 x16 · ~63 ms

Stream 1 issues the transfer the instant the prediction firms up, so it overlaps entirely with the compute on Stream 0 and your typing. The weights land before you finish the prompt.
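The ~63 ms figure follows directly from the shard size and the link rate. The sketch below redoes that arithmetic in Python and compares it with typing speed. The 5 characters per second typing rate is an assumption, not a measured value, and 31.5 GB/s is the theoretical PCIe 4.0 x16 rate, so a real transfer is likely to take somewhat longer.

```python
def transfer_ms(shard_gb, bandwidth_gb_s):
    """Time to move a shard over the link, in milliseconds."""
    return shard_gb / bandwidth_gb_s * 1000.0


def keystrokes_spanned(transfer_time_ms, chars_per_second):
    """How many keystrokes go by while the transfer is in flight."""
    return transfer_time_ms * chars_per_second / 1000.0


# Figures from the timeline panel: 2 GB shard, 31.5 GB/s link.
shard_transfer = transfer_ms(2.0, 31.5)

# Assumed typing speed of 5 characters per second (hypothetical).
keys = keystrokes_spanned(shard_transfer, 5.0)

print(f"transfer: {shard_transfer:.1f} ms, spans {keys:.2f} keystrokes")
```

Under these assumptions the whole transfer fits inside a single gap between keystrokes, which is why the prefetch can finish well before you send.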

Decode phase

Reactive baseline tier

Type a prompt and run decode to compare SteerPipe against reactive offloading.

How it works

Prediction hides the transfer

Reactive offloading waits until decode needs an expert, then stalls the GPU for the fetch. SteerPipe moves that fetch earlier and runs it on a parallel stream, so the cost overlaps with work you were doing anyway.
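A simple latency model shows the difference. This Python sketch is an illustration under stated assumptions, not a measurement: `lead_ms` (how long before send the prefetch was issued) and `accuracy` (how often the prediction is right) are hypothetical parameters, and a wrong prediction is assumed to fall back to a full reactive fetch.

```python
def reactive_stall_ms(transfer_time_ms):
    """Reactive offloading: the GPU waits for the entire fetch."""
    return transfer_time_ms


def predictive_stall_ms(transfer_time_ms, lead_ms):
    """Prefetch issued lead_ms before send: only the remainder stalls."""
    return max(0.0, transfer_time_ms - lead_ms)


def expected_stall_ms(transfer_time_ms, lead_ms, accuracy):
    """Average stall when a misprediction costs a full reactive fetch."""
    hit = predictive_stall_ms(transfer_time_ms, lead_ms)
    miss = reactive_stall_ms(transfer_time_ms)
    return accuracy * hit + (1.0 - accuracy) * miss


FETCH_MS = 63.5  # roughly 2 GB at 31.5 GB/s

for lead in (0.0, 30.0, 500.0):
    stall = predictive_stall_ms(FETCH_MS, lead)
    print(f"lead {lead:6.1f} ms -> stall {stall:5.1f} ms "
          f"(reactive: {reactive_stall_ms(FETCH_MS):.1f} ms)")
```

With any lead time longer than the fetch itself, the predictive stall drops to zero. The `expected_stall_ms` helper shows how much of that benefit survives when some predictions miss.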

Read keystrokes

Every character feeds an n-gram + keyword classifier that scores four expert categories in real time.

Predict the expert

Once one category pulls ahead with enough margin, SteerPipe commits to prefetching that specialist.

Prefetch on Stream 1

A cudaMemcpyAsync call streams the 2 GB shard over PCIe on Stream 1 while Stream 0 keeps computing.

Decode with no stall

By send time the expert is resident, so prefill and decode start immediately instead of blocking on a fetch.
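The commit rule from the "Predict the expert" step can be sketched as a margin test on the classifier output. This Python example is an assumption-laden illustration: the 0.25 margin is a made-up threshold, and treating General / Chat as needing no prefetch is inferred from the "shared trunk only" and "no specialist queued" labels in the demo, not confirmed behavior.

```python
def should_commit(probs, margin=0.25, no_prefetch=("General",)):
    """Return the expert to prefetch, or None if the lead is too small.

    margin and no_prefetch are hypothetical parameters for illustration.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (top, p_top), (_, p_next) = ranked[0], ranked[1]
    if top in no_prefetch:
        return None  # assumed: General runs on the shared trunk
    return top if p_top - p_next >= margin else None


def first_commit(history, margin=0.25):
    """Find the first keystroke at which a specialist is committed."""
    for index, probs in enumerate(history):
        expert = should_commit(probs, margin)
        if expert is not None:
            return index, expert
    return None


# One distribution per keystroke (illustrative values, not real output).
history = [
    {"Code": 0.187, "Math": 0.187, "Creative": 0.187, "General": 0.439},
    {"Code": 0.40, "Math": 0.15, "Creative": 0.15, "General": 0.30},
    {"Code": 0.70, "Math": 0.08, "Creative": 0.07, "General": 0.15},
]

decision = first_commit(history)
```

Here `decision` is `(2, "Code")`: the second distribution already favors Code, but only the third clears the margin. A larger margin delays the prefetch and lowers the risk of fetching the wrong shard, while a smaller one does the opposite.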