
An Nvidia GPU Now Runs AI Inference on a Raspberry Pi 5 at 121 Tokens Per Second
Community patches enable Nvidia GPU compute on the Pi 5 via PCIe, running a 3B language model at 121 tok/s with llama.cpp and Vulkan acceleration.
Strapping a workstation-class GPU to a fifty-dollar single-board computer sounds like a stunt. The benchmark results suggest otherwise. Thanks to community-developed kernel patches, Nvidia GPU compute is now functional on the Raspberry Pi 5, and the performance numbers are genuinely impressive.
How It Works
Jeff Geerling, the prolific Pi tester and content creator, compiled the open-source Nvidia ARM64 kernel modules and successfully got an RTX A4000 recognized via the PCIe interface on a Raspberry Pi Compute Module 5. The CM5's single PCIe Gen 2 lane provides the physical connection, while the patched drivers handle the rest.
Display output is not yet functional — this is strictly a compute card configuration. But for AI inference workloads, that limitation is irrelevant. The GPU's CUDA cores and VRAM are fully accessible for number crunching.
The Benchmark That Matters
Using llama.cpp with Vulkan acceleration, the RTX A4000 running on the Pi 5 achieved 121 tokens per second with a 3-billion-parameter language model. For context, a Raspberry Pi 5 running the same model on its CPU alone manages roughly 5 to 8 tokens per second. That is a fifteen- to twenty-four-fold improvement from adding external GPU compute to a platform that costs under a hundred dollars.
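The arithmetic behind that claim is easy to check. A quick sketch using the figures reported in the article (121 tok/s on the GPU, 5 to 8 tok/s on the Pi 5's CPU):

```python
# Back-of-the-envelope speedup from the reported figures.
# All numbers are taken from the article's benchmarks, not measured here.
GPU_TOKENS_PER_SEC = 121.0
CPU_TOKENS_PER_SEC = (5.0, 8.0)  # reported CPU-only range on the Pi 5

# Dividing GPU throughput by each end of the CPU range bounds the speedup.
speedups = [GPU_TOKENS_PER_SEC / cpu for cpu in CPU_TOKENS_PER_SEC]
print(f"Speedup range: {speedups[1]:.1f}x to {speedups[0]:.1f}x")

# Per-token latency on the GPU: comfortably interactive.
print(f"GPU latency per token: {1000 / GPU_TOKENS_PER_SEC:.1f} ms")
```

At roughly 8 ms per token, the GPU-backed setup is well inside the range where text streams faster than most people read.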
The PCIe Gen 2 x1 bandwidth does create a bottleneck for larger models, but for small to medium language models, image classification, and other inference tasks, the throughput is more than adequate for interactive use.
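To see why the link is the limiting factor, it helps to put a number on it. The sketch below uses the published PCIe Gen 2 parameters (5 GT/s per lane with 8b/10b encoding) and an assumed weight size of about 2 GB for a 3B model at 4-bit quantization; the real file size depends on the quantization format:

```python
# Idealized estimate of the PCIe Gen 2 x1 bottleneck on the CM5.
LINE_RATE_TRANSFERS_PER_S = 5e9   # PCIe Gen 2 raw rate: 5 GT/s per lane
ENCODING_EFFICIENCY = 8 / 10      # Gen 2 uses 8b/10b line encoding
LANES = 1                         # the CM5 exposes a single lane

# One transfer carries one (encoded) bit per lane; divide by 8 for bytes.
usable_bytes_per_s = LINE_RATE_TRANSFERS_PER_S * ENCODING_EFFICIENCY * LANES / 8

# Assumed: ~2 GB of weights for a 3B-parameter model quantized to 4 bits.
model_bytes = 2.0e9
load_seconds = model_bytes / usable_bytes_per_s

print(f"Theoretical link throughput: {usable_bytes_per_s / 1e6:.0f} MB/s")
print(f"One-time model upload: ~{load_seconds:.0f} s")
```

The theoretical ceiling is about 500 MB/s, so uploading a small model's weights takes a few seconds. Once all layers are resident in VRAM, only prompts and tokens cross the link, which is why inference on small models stays interactive while larger or partially offloaded models feel the squeeze.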
Why This Opens New Doors
The practical applications extend beyond benchmark curiosity. A Pi 5 paired with a used workstation GPU creates a local AI inference node for a fraction of what a dedicated inference server costs. Small businesses, researchers, and hobbyists who need local AI processing — whether for privacy reasons, latency requirements, or simple cost management — now have a remarkably affordable option.
Cluster configurations are another intriguing possibility. Multiple Pi units with external GPUs could form a distributed inference mesh, with each node handling requests independently. The economics become compelling when used server GPUs can be found for a few hundred dollars on the secondary market.
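The mesh idea above can be sketched in a few lines. This is a hypothetical illustration, not anything the article's author has published: the node addresses are invented, and the transport is a stand-in for whatever HTTP API each node would expose (llama.cpp ships a server, for example). Each node serves requests independently, so a simple round-robin rotation is enough to spread load:

```python
from itertools import cycle

class InferenceMesh:
    """Round-robin dispatcher over independent Pi + GPU inference nodes.

    Hypothetical sketch: `nodes` are made-up addresses, and `send` is a
    caller-supplied callable (node, prompt) -> completion text, standing
    in for a real HTTP request to each node's inference endpoint.
    """

    def __init__(self, nodes, send):
        self._nodes = cycle(nodes)  # rotate through nodes in fixed order
        self._send = send

    def generate(self, prompt):
        node = next(self._nodes)    # pick the next node in the rotation
        return self._send(node, prompt)

# Usage with a stand-in transport; a real one would POST to each node.
mesh = InferenceMesh(
    nodes=["http://pi-node-1:8080", "http://pi-node-2:8080"],
    send=lambda node, prompt: f"[{node}] reply to: {prompt}",
)
print(mesh.generate("hello"))  # served by pi-node-1
print(mesh.generate("again"))  # served by pi-node-2
```

Round-robin works here precisely because the nodes share nothing: each holds a full copy of the model, so there is no cross-node traffic to coordinate, and a failed node can simply be dropped from the rotation.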
The Path Forward
Display output support is the next major milestone, which would transform this from a headless compute configuration into a full desktop GPU experience. The community kernel patches continue to evolve, and driver maturity improves with each iteration.
What started as an experiment has become a proof of concept that challenges assumptions about what affordable hardware can accomplish. When a fifty-dollar board can drive a professional GPU at 121 tokens per second, the definition of accessible AI infrastructure shifts meaningfully.
Sources: Jeff Geerling Blog, February 2026; Tom's Hardware, February 2026
