The Quantum Dispatch
This Finnish Startup Burned Llama 3.1 Directly Into Silicon — And It Runs at 17,000 Tokens Per Second
Mini Computers · AI-Generated | Opinion

Taalas HC1 eliminates the memory bottleneck entirely by hardwiring an LLM into the chip itself, achieving 10x the speed of an NVIDIA H100.

Alex Circuit · Feb 24, 2026 · 5 min read

What if, instead of running an AI model on a general-purpose GPU, you just etched the entire model directly into the chip's wiring?

That is exactly what Finnish startup Taalas has done with the HC1, and the results are staggering: nearly 17,000 tokens per second of sustained inference from a single chip. For comparison, an NVIDIA H100 serves roughly 150 tokens per second to a single user.

How It Works

The HC1 takes the Llama 3.1 8B model and hardwires its weights and architecture directly into silicon. There are no separate memory chips, no HBM stacks, no 3D packaging tricks. Storage and compute are unified on a single die at DRAM-level density.
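To see why unifying storage and compute matters, consider why a GPU struggles in the first place: when generating one token at a time, every token requires streaming the entire set of model weights from memory, so single-user throughput is capped by memory bandwidth, not compute. The sketch below makes that ceiling concrete; the bandwidth figure is an assumed H100-class number, not a vendor specification.

```python
# Back-of-envelope: single-stream decode on a GPU is memory-bandwidth bound,
# because every generated token must stream all model weights from memory.
# The bandwidth figure is an illustrative assumption.

hbm_bandwidth_gb_s = 3350      # assumed H100-class HBM bandwidth, GB/s
params_billion = 8             # Llama 3.1 8B
bytes_per_param = 2            # fp16/bf16 weights

model_bytes_gb = params_billion * bytes_per_param   # ~16 GB of weights
tokens_per_s = hbm_bandwidth_gb_s / model_bytes_gb  # ceiling per user

print(f"Weights streamed per token: ~{model_bytes_gb} GB")
print(f"Bandwidth-bound ceiling: ~{tokens_per_s:.0f} tokens/s per user")
```

That ceiling lands in the low hundreds of tokens per second, which is consistent with the ~150 tokens/s single-user figure quoted above. Hardwiring the weights into the die removes the weight-streaming step entirely, which is what lets the HC1 blow past this bound.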

Fabricated on TSMC's 6nm process, the chip packs approximately 53 billion transistors onto an 815 mm² die. In demos, it produced bursts of up to 20,000 tokens per second for simple queries and sustained 15,000–16,000 tokens per second for typical workloads.

The Efficiency Story

Each HC1 chip draws approximately 250 watts. Ten cards in a standard server rack consume about 2.5 kW total — and they run on standard air cooling. No water cooling infrastructure required.
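The rack-level arithmetic is worth checking explicitly. Using only the figures quoted in this article (250 W per chip, ten cards per rack, ~17,000 tokens/s per chip):

```python
# Sanity check of the power and throughput figures quoted in the article.
chip_power_w = 250               # per-chip draw
chips_per_rack = 10              # cards in a standard rack, per the article
tokens_per_s_per_chip = 17_000   # peak sustained figure cited above

rack_power_kw = chip_power_w * chips_per_rack / 1000
rack_tokens_per_s = tokens_per_s_per_chip * chips_per_rack
efficiency = tokens_per_s_per_chip / chip_power_w  # tokens per joule

print(f"Rack draw: {rack_power_kw} kW")                      # 2.5 kW
print(f"Rack throughput: {rack_tokens_per_s:,} tokens/s")    # 170,000 tokens/s
print(f"Energy efficiency: ~{efficiency:.0f} tokens per joule")
```

Roughly 68 tokens per joule from a single air-cooled card is the headline number here; the 2.5 kW rack figure matches the article's claim exactly.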

Taalas claims 20x lower inference cost and 10x lower power consumption compared to conventional GPU inference. For anyone running Llama 3.1 8B at scale — chatbots, customer service, coding assistants, edge deployments — the economics are transformative.

The Trade-Off

The obvious limitation: the HC1 can only run Llama 3.1 8B. It cannot execute other models. But Taalas has a roadmap — a second mid-sized reasoning LLM is planned for HC1 silicon in Q2 2026, and the next-generation HC2 platform is expected by end of year.

And despite being hardwired, the chip supports LoRA fine-tuning, so organizations can still customize the model's behavior for their specific use case.
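Taalas has not published how LoRA support is wired into the chip, but the standard LoRA math makes it plausible: the frozen base weights stay untouched, and a small low-rank correction is added on top. A minimal sketch, assuming nothing beyond the textbook formulation (toy dimensions; real layers are far larger):

```python
import numpy as np

# LoRA in a nutshell: the frozen base weight W is left as-is, and a
# low-rank correction B @ A (rank r << d) is added to its output. On a
# hardwired chip, only the small A and B matrices would need to live in
# programmable memory. How Taalas actually implements this is not public;
# this is just the standard LoRA computation.

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4                 # toy sizes for illustration

W = rng.standard_normal((d_out, d_in))     # frozen base weights ("in silicon")
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, starts at 0

x = rng.standard_normal(d_in)

base = W @ x                  # what the hardwired model computes
adapted = base + B @ (A @ x)  # LoRA adds the low-rank correction

# With B initialized to zero, the adapter is a no-op until fine-tuned:
assert np.allclose(base, adapted)
```

The key design point is that the trainable state is tiny: here A and B together hold 512 values against 4,096 frozen weights, and the ratio only improves at real model sizes.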

Why It Matters

The HC1 represents a fundamentally different approach to AI inference — one that trades flexibility for raw speed and efficiency. As AI workloads become more predictable and standardized, this kind of specialized silicon could reshape how inference is deployed at the edge and in the data center.

Tags: ai-hardware, inference, llama, custom-silicon, edge-computing