Start typing. The right expert is already in VRAM by the time you stop.
SteerPipe reads your prompt character by character, uses an n-gram classifier to predict which expert the model will need, and streams that 2 GB shard over PCIe on a separate CUDA stream while you keep typing. By the time you hit send, the expert is already in VRAM, so decoding starts without a fetch stall. Try it below.
Prompt input
N-gram classifier
softmax / top-1
Waiting for input. The classifier scores character n-grams and keyword tokens on every keystroke.
CUDA stream timeline
async / overlapped
Stream 1 issues the transfer the instant the prediction firms up, so it overlaps entirely with the compute on Stream 0 and your typing. The weights land before you finish the prompt.
Decode phase
Prediction hides the transfer
Reactive offloading waits until decode needs an expert, then stalls the GPU for the fetch. SteerPipe moves that fetch earlier and runs it on a parallel stream, so the cost overlaps with work you were doing anyway.
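As a rough illustration of that trade, the sketch below models how much of the fetch decode still has to wait for. The function name and the example timings are assumptions made for illustration, not SteerPipe measurements. It also shows the condition behind the claim: the stall disappears only when the prefetch starts at least one transfer time before you hit send.

```python
def exposed_stall(transfer_s, lead_time_s=None):
    """Seconds the GPU still waits on the expert fetch.

    transfer_s  -- time to copy the shard over PCIe (assumed value)
    lead_time_s -- how long before send the prefetch was issued;
                   None models reactive offloading (fetch starts at decode)
    """
    if lead_time_s is None:
        # Reactive: the whole transfer sits on the critical path.
        return transfer_s
    # Predictive: only the part that typing did not cover is exposed.
    return max(0.0, transfer_s - lead_time_s)


# Illustrative numbers only: a 0.16 s transfer.
reactive = exposed_stall(0.16)                      # full cost exposed
early = exposed_stall(0.16, lead_time_s=2.0)        # committed 2 s before send
late = exposed_stall(0.16, lead_time_s=0.05)        # committed just before send
```

A late commit still helps, it just hides part of the transfer rather than all of it.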
Read keystrokes
Every character feeds an n-gram + keyword classifier that scores four expert categories in real time.
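A minimal sketch of what scoring on every keystroke could look like, assuming hand-set weights over character trigrams and whole-word keywords. The four category names, the cue weights, and the function names are hypothetical; the page says only that there are four expert categories and that the scores pass through a softmax.

```python
import math

N = 3  # character n-gram length (assumed)

# Hypothetical categories and hand-picked cue weights.
# Plain keys are character trigrams; "kw:" keys are whole-word keywords.
WEIGHTS = {
    "code":    {"def": 1.0, "imp": 0.8, "kw:python": 2.0, "kw:function": 2.0},
    "math":    {"equ": 0.8, "sum": 0.6, "kw:integral": 2.0, "kw:prove": 2.0},
    "writing": {"ssa": 0.8, "kw:essay": 2.0, "kw:rewrite": 2.0},
    "data":    {"sel": 0.8, "kw:sql": 2.0, "kw:table": 2.0},
}


def features(text):
    """Character n-grams plus keyword tokens for the text typed so far."""
    text = text.lower()
    grams = [text[i:i + N] for i in range(len(text) - N + 1)]
    words = ["kw:" + w for w in text.split()]
    return grams + words


def score(text):
    """Softmax over per-category sums of matched cue weights."""
    feats = features(text)
    logits = {c: sum(w.get(f, 0.0) for f in feats) for c, w in WEIGHTS.items()}
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(v - top) for c, v in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Re-score after each keystroke by feeding the growing prefix.
prompt = "write a python function"
history = [score(prompt[:i]) for i in range(1, len(prompt) + 1)]
```

With no matching cues every category gets the same probability, which is the "waiting for input" state; the distribution sharpens as keywords complete.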
Predict the expert
Once one category pulls ahead with enough margin, SteerPipe commits to prefetching that specialist.
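One way to express "pulls ahead with enough margin" is a gap test between the top two probabilities. The sketch below assumes that reading; the 0.3 threshold and both function names are placeholders, not SteerPipe's actual rule.

```python
def pick_expert(probs, margin=0.3):
    """Return the leading category once it beats the runner-up by `margin`, else None."""
    ranked = sorted(probs.values(), reverse=True)
    if len(ranked) > 1 and ranked[0] - ranked[1] < margin:
        return None  # too close to call; keep listening
    return max(probs, key=probs.get)


def first_commit(snapshots, margin=0.3):
    """Scan per-keystroke score snapshots; return (index, category) of the first commit."""
    for i, probs in enumerate(snapshots):
        choice = pick_expert(probs, margin)
        if choice is not None:
            return i, choice
    return None  # never confident enough to prefetch
```

A larger margin commits later but mispredicts less often; a smaller one starts the transfer sooner at the risk of fetching the wrong shard.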
Prefetch on Stream 1
A cudaMemcpyAsync call streams the 2 GB shard over PCIe while Stream 0 keeps computing.
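For a sense of scale, the arithmetic below estimates how long that copy takes. It assumes "2GB" means 2 × 10^9 bytes, uses the theoretical per-direction peak of an x16 link for each PCIe generation, and applies an 80% efficiency factor that is a guess rather than a measurement. Note that cudaMemcpyAsync only overlaps with compute when the host buffer is pinned (page-locked) and the copy is issued on a non-default stream.

```python
SHARD_BYTES = 2 * 10**9  # "2GB" read as decimal gigabytes (assumption)

# Theoretical peak bandwidth of an x16 link, per direction, in GB/s.
PCIE_X16_PEAK_GBPS = {"3.0": 15.75, "4.0": 31.5, "5.0": 63.0}


def transfer_seconds(nbytes, peak_gbps, efficiency=0.8):
    """Estimated host-to-device copy time at a fraction of peak bandwidth."""
    return nbytes / (peak_gbps * 1e9 * efficiency)


for gen, peak in PCIE_X16_PEAK_GBPS.items():
    ms = transfer_seconds(SHARD_BYTES, peak) * 1000
    print(f"PCIe {gen} x16: about {ms:.0f} ms")
```

Under these assumptions the copy takes on the order of tens to a couple of hundred milliseconds, which is short next to the time it takes to type a prompt.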
Decode with no stall
By send time the expert is resident, so prefill and decode run without blocking on a fetch.