Your browser has a supercomputer in it. We put it to work.
Every privacy wallet needs to hash. A lot. Every note in the NixPool is a Poseidon2 hash. Every branch of the Merkle tree? Poseidon2. Every time your wallet syncs, it's cranking through thousands of these permutations to figure out which notes are yours.
On a single CPU thread in JavaScript, that's fine for one hash. But when you need to rebuild a Merkle tree with a million leaves, or scan a few thousand transactions for incoming notes, you're sitting there waiting. And waiting.
Here's the thing though: your GPU has thousands of cores, and each Poseidon2 hash is completely independent. No hash needs to know about any other hash. That's the dream workload for a GPU.
So we wrote Poseidon2 as a WebGPU compute shader and benchmarked it against the CPU. The results were kind of wild.
## What we actually built
We wrote two separate GPU implementations in WGSL (that's WebGPU's shader language):
BabyBear (31-bit field) - a small STARK-friendly prime used by newer proving systems like Plonky3 and SP1. It's the easy one. The prime fits in 31 bits, so Montgomery multiplication is basically just some 32-bit math with careful carry handling. Width-16 state, 8 external rounds, 13 internal rounds, x^7 S-box.
Grumpkin (254-bit field) - the one NixProtocol actually uses. This is the hard one. Each field element takes 256 bits, which means 8 separate 32-bit "limbs" on the GPU. Every single multiplication becomes this massive operation where you're juggling 64 partial products and tracking carries across all 8 limbs. Width-3 state, 8 external rounds, 56 internal rounds, x^5 S-box.
Both use Montgomery multiplication to keep things fast in the inner loop. And before any benchmarking happens, we run a correctness check: compute the same hashes on the CPU with JavaScript BigInt, compare every result against the GPU output, and only proceed if everything matches.
## How we measured
- GPU: WebGPU compute shaders, 64 threads per workgroup
- CPU: Plain single-threaded JavaScript with BigInt. No WASM, no tricks
- Timing: GPU time averaged over 3 runs. CPU time is a single run with performance.now()
- Inputs: Deterministic random field elements so results are reproducible
Why use unoptimized JavaScript as the baseline? Because that's what a browser wallet actually uses. We're not trying to prove GPUs beat optimized C++. We're answering a practical question: if your wallet is doing all this hashing in JavaScript right now, how much faster would it be if you offloaded it to the GPU that's just sitting there idle?
## The BabyBear numbers
| Batch Size | CPU (ms) | GPU (ms) | Speedup | GPU H/s |
|---|---|---|---|---|
| 256 | 15.40 | 0.67 | 22.9x | 384.0K |
| 1,024 | 15.88 | 0.97 | 16.4x | 1.1M |
| 4,096 | 108.00 | 8.33 | 13.0x | 12.2M |
| 16,384 | 119.00 | 3.23 | 51.6x | 7.8M |
| 65,536 | 134.00 | 8.41 | 15.9x | 7.8M |
7.8 million hashes per second. On a GPU. In a browser. For a ZK-friendly hash function over a prime field. That's not bad.
The 51x speedup at 16K hashes is the sweet spot - enough work to keep the GPU busy, not so much that we're hitting memory limits. BabyBear is basically designed for this kind of thing. Each Montgomery multiply is a handful of 32-bit operations. The GPU chews through it.
## The Grumpkin numbers (the one we actually care about)
| Batch Size | CPU (ms) | GPU (ms) | Speedup | GPU H/s |
|---|---|---|---|---|
| 64 | 10.60 | 3.00 | 3.5x | 21.4K |
| 256 | 24.80 | 7.00 | 3.1x | 31.4K |
| 1,024 | 93.00 | 9.47 | 9.6x | 105.9K |
| 4,096 | 409.00 | 12.67 | 32.3x | 323.3K |
OK, so Grumpkin is way heavier. Makes sense - you're doing 256-bit arithmetic on hardware that only speaks 32-bit. Every single field multiplication is ~128 partial products (compared to ~4 for BabyBear). It's like doing long multiplication by hand, except each number is 8 limbs wide and each limb is a 32-bit integer.
But even with all that overhead, at 4,096 hashes we're getting 323K hashes/sec and a 32x speedup. The CPU takes 409ms for that same batch. The GPU does it in under 13ms. The bigger the batch, the more the GPU flexes.
## Why this works so well
Every hash is independent. This is what GPU people call "embarrassingly parallel." Hash #47 doesn't need to know anything about hash #48. There's no shared state, no synchronization, no locks. You just fire off thousands of threads and each one does its own thing. GPUs were built for exactly this.
It's almost all math. Poseidon2's inner loop is field multiplications and additions. Barely any memory access. GPUs love compute-heavy workloads with minimal memory chatter.
Montgomery form pays for itself. You convert to Montgomery representation once at the start, do all your multiplications in Montgomery space (which is cheaper), then convert back at the end. One-time cost, dozens of rounds of savings.
## Where it falls short
Small batches aren't worth it. At 64 Grumpkin hashes, you only get 3.5x. The overhead of setting up the GPU dispatch, encoding commands, and waiting for synchronization eats most of the gains. You need enough work to make it worthwhile.
WGSL doesn't have 64-bit integers. This is the big one. Every 32x32 multiply has to be faked with four 16x16 multiplies to avoid overflow. If WGSL ever gets u64 support, Grumpkin performance would roughly double overnight. We're doing twice the work we should need to.
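To make that concrete, here's a plain-JavaScript sketch of the workaround. The real kernel does the same thing in WGSL with `u32` values; the function name is ours, not the shader's.

```javascript
// Sketch: a 32x32 -> 64-bit multiply built from four 16x16 -> 32-bit
// products, mirroring what the WGSL kernel must do without u64.
// mul32x32 is an illustrative name, not from the actual codebase.
function mul32x32(a, b) {
  const aLo = a & 0xffff, aHi = a >>> 16;
  const bLo = b & 0xffff, bHi = b >>> 16;
  const ll = aLo * bLo;                        // bits 0..31
  const lh = aLo * bHi;                        // bits 16..47
  const hl = aHi * bLo;                        // bits 16..47
  const hh = aHi * bHi;                        // bits 32..63
  // Fold the two middle products plus the carry out of the low product
  const mid = lh + hl + Math.floor(ll / 0x10000);
  const lo = ((mid % 0x10000) * 0x10000 + (ll % 0x10000)) >>> 0;
  const hi = (hh + Math.floor(mid / 0x10000)) >>> 0;
  return [hi, lo];                             // 64-bit result as (hi, lo)
}
```

With native u64 support this whole function would collapse into a single multiply, which is why that wishlist item matters so much.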
256 bits on 32-bit hardware is just hard. There's no way around it. A native CUDA or Metal kernel with proper 64-bit integer support would smoke these numbers. But we're in a browser, and WebGPU is what we've got.
## So what does this mean for your wallet?
This is where it gets practical. NixProtocol uses Poseidon2 over the Grumpkin field for everything: note commitments, Merkle trees, nullifier derivation.
Syncing your wallet. A depth-20 Merkle tree has up to ~1 million leaves. At 323K GPU hashes/sec, rebuilding it takes about 3 seconds. In plain JavaScript? Over 90 seconds. That's the difference between "fast" and "go make coffee."
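For anyone who wants to check that estimate, here's the back-of-envelope arithmetic, using the measured rates from the Grumpkin table above (323K H/s on the GPU; the CPU rate comes from the 4,096-hashes-in-409ms row):

```javascript
// Back-of-envelope for the wallet sync estimate above.
const leaves = 2 ** 20;                       // depth-20 tree, ~1M leaves
const hashes = leaves - 1;                    // internal nodes to rebuild
const gpuSeconds = hashes / 323_000;          // about 3.2 seconds
const cpuSeconds = hashes / (4096 / 0.409);   // about 105 seconds
console.log(gpuSeconds.toFixed(1), cpuSeconds.toFixed(0));
```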
Finding your notes. Your wallet needs to check incoming transactions to see if any notes belong to you. That involves Poseidon2 hashes for commitment verification. Batch those up, ship them to the GPU, get answers back fast.
Helping the prover. The actual UltraHonk proof runs in Barretenberg's WASM backend. But there's preprocessing work (computing commitment trees, deriving nullifiers) that's all Poseidon2 hashing. The GPU can crunch through that while the CPU prepares the prover.
## Under the hood: making multiplication work on a GPU
This is the nerdy part. Skip it if you don't care about carry propagation. But if you've ever wondered what it takes to do 256-bit modular arithmetic on a chip that maxes out at 32 bits, read on.
### BabyBear: the easy one
The prime is 2,013,265,921. Fits in 31 bits. Montgomery multiplication computes (a * b * R^-1) mod P where R = 2^32.
Split each 32-bit input into two 16-bit halves. Four partial products, all safely under 32 bits. Combine with carries to get a 64-bit result in (hi, lo) form. Then Montgomery reduction: multiply the low word by (-P^-1) mod 2^32, add the correction, shift right by 32. Maybe subtract P once at the end. About 20 instructions total. Clean.
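As a sketch, here are the same steps in JavaScript, with BigInt standing in for the shader's (hi, lo) word pairs. The helper names are ours; the one BabyBear-specific nicety is that -P^-1 mod 2^32 works out to exactly P - 2.

```javascript
// The BabyBear Montgomery multiply described above, sketched with BigInt.
// The real WGSL version builds the 64-bit product from 16-bit halves;
// here BigInt plays the role of the (hi, lo) pair. Names are illustrative.
const P = 2013265921n;        // BabyBear prime, 15 * 2^27 + 1
const MASK32 = 0xffffffffn;
const NEG_P_INV = P - 2n;     // -P^-1 mod 2^32 (for BabyBear this is P - 2)

function montMul(a, b) {      // a, b in Montgomery form, < P
  const t = a * b;                               // 64-bit product
  const m = ((t & MASK32) * NEG_P_INV) & MASK32; // low word * (-P^-1) mod 2^32
  const r = (t + m * P) >> 32n;                  // add correction, shift by 32
  return r >= P ? r - P : r;                     // maybe subtract P once
}

// Entering/leaving Montgomery form: multiply by R^2 mod P, or by 1
const R2 = (1n << 64n) % P;                      // R^2 mod P, with R = 2^32
const toMont = (x) => montMul(x % P, R2);
const fromMont = (x) => montMul(x, 1n);

console.log(fromMont(montMul(toMont(3n), toMont(5n)))); // 15n
```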
### Grumpkin: the hard one
A 256-bit number lives in 8 x 32-bit limbs. To multiply two of these, we use the CIOS algorithm (Coarsely Integrated Operand Scanning - the name sounds made up but it's real).
Loop 8 times (once per limb of the multiplier). Each iteration:
- Multiply and accumulate: 8 limb-by-limb products added into a running total
- Reduce: compute a correction factor, multiply it by the prime (another 8 limb products), add and shift
That's 16 limb multiplies per iteration, 8 iterations, and each limb multiply is actually 4 sub-products because we have to split into 16-bit pieces. So roughly 512 tiny multiplications per field multiply, plus a mountain of carry tracking.
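Here's the same loop as a plain-JavaScript sketch, with BigInts masked to 32 bits standing in for the shader's u32 limbs. The modulus shown is the BN254 scalar field, which Grumpkin uses as its base field; the helper names are ours, not the shader's.

```javascript
// CIOS Montgomery multiplication over 8 x 32-bit limbs, sketched in JS.
const P = 21888242871839275222246405745257275088548364400416034343698204186575808495617n;
const MASK = 0xffffffffn;
const toLimbs = (x) => Array.from({ length: 8 }, (_, i) => (x >> BigInt(32 * i)) & MASK);
const fromLimbs = (l) => l.reduce((acc, w, i) => acc + (w << BigInt(32 * i)), 0n);
const p = toLimbs(P);

// n0inv = -P^-1 mod 2^32, from the low limb via Newton's iteration
let inv = 1n;
for (let i = 0; i < 5; i++) inv = (inv * (2n - p[0] * inv)) & MASK;
const n0inv = (0x100000000n - inv) & MASK;

function ciosMul(a, b) {            // a, b: 8-limb arrays in Montgomery form
  const t = new Array(10).fill(0n);
  for (let i = 0; i < 8; i++) {
    // 1) multiply-accumulate: t += a * b[i]  (8 limb products)
    let c = 0n;
    for (let j = 0; j < 8; j++) {
      const s = t[j] + a[j] * b[i] + c;
      t[j] = s & MASK; c = s >> 32n;
    }
    let top = t[8] + c; t[8] = top & MASK; t[9] = top >> 32n;
    // 2) reduce: add m*P so the low limb cancels, then shift down one limb
    //    (another 8 limb products)
    const m = (t[0] * n0inv) & MASK;
    c = (t[0] + m * p[0]) >> 32n;
    for (let j = 1; j < 8; j++) {
      const s = t[j] + m * p[j] + c;
      t[j - 1] = s & MASK; c = s >> 32n;
    }
    top = t[8] + c; t[7] = top & MASK; t[8] = t[9] + (top >> 32n);
  }
  const r = fromLimbs(t.slice(0, 8));
  return toLimbs(r >= P ? r - P : r);   // at most one subtraction of P
}

// Round-trip demo: into Montgomery form, multiply, back out
const R2 = toLimbs((1n << 512n) % P);   // R^2 mod P, with R = 2^256
const ONE = toLimbs(1n);
const xm = ciosMul(toLimbs(1234567n), R2);
const ym = ciosMul(toLimbs(7654321n), R2);
console.log(fromLimbs(ciosMul(ciosMul(xm, ym), ONE)) === (1234567n * 7654321n) % P); // true
```

The sketch keeps the two 8-product phases of each iteration visible; the shader additionally splits every one of those limb products into the four 16-bit sub-products described above.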
The S-box computes x^5 in three multiplications: x^2 = x·x, x^4 = x^2·x^2, x^5 = x^4·x. In external rounds the S-box hits all 3 state elements (9 field muls per round). In internal rounds it only hits element 0 (3 field muls). That's 72 + 168 = 240 full 256-bit multiplications per hash just for S-boxes, plus additions for the linear layers. And the GPU still wins by 32x. Parallelism is a hell of a thing.
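That multiplication chain and the round-count arithmetic can be sketched like this, with plain BigInt modular multiplication standing in for the Montgomery multiply:

```javascript
// x^5 in exactly three field multiplications, plus the per-hash S-box
// multiply count quoted above. `mul` here is plain BigInt mod P; the
// shader uses the Montgomery multiply instead.
const P = 21888242871839275222246405745257275088548364400416034343698204186575808495617n;
const mul = (a, b) => (a * b) % P;

function pow5(x) {
  const x2 = mul(x, x);     // x^2
  const x4 = mul(x2, x2);   // x^4
  return mul(x4, x);        // x^5
}

console.log(pow5(123456789n) === 123456789n ** 5n % P); // true

// External rounds S-box all 3 elements; internal rounds only element 0
const sboxMuls = 8 * 3 * 3 + 56 * 1 * 3;  // 72 + 168 = 240 per hash
```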
### Side by side
| | BabyBear (31-bit) | Grumpkin (254-bit) |
|---|---|---|
| Peak GPU throughput | 7.8M H/s | 323K H/s |
| Best speedup vs. JS | 51.6x | 32.3x |
| Limbs per element | 1 | 8 |
| Work per field multiply | ~4 sub-products | ~128 sub-products |
| S-box | 4 field muls (x^7) | 3 field muls (x^5) |
| Rounds | 8 external + 13 internal | 8 external + 56 internal |
BabyBear is about 24x faster on the GPU. Grumpkin pays for more limbs per element (8 vs. 1), more rounds (64 vs. 21), and carry propagation across all 8 limbs on every operation - it all adds up.
## What we're looking at next
Better CPU baselines. Our CPU number is unoptimized JavaScript. A WASM build with SIMD would close the gap, but the GPU should still win for large batches.
More precise timing. Right now we're timing from JavaScript. WebGPU has timestamp query support that would give us exact GPU-side measurements, especially useful for small batches where dispatch overhead muddies things.
CPU + GPU together. The smart play for proof generation: let the GPU crunch Merkle trees and commitments while the CPU runs the UltraHonk prover. Pipeline them so nothing's sitting idle.
u64 support in WGSL. If this ever ships, Grumpkin performance roughly doubles. We'd go from faking 64-bit math with four 16-bit products to just... doing 64-bit math. It's the single biggest improvement waiting to happen.
## Try it yourself
We put the benchmark online so you can run it on your own hardware. Click the button, watch your GPU go to work, and see how your numbers compare.
Run the Poseidon2 WebGPU Benchmark
You'll need Chrome 113+, Edge 113+, or Safari 18+ (WebGPU required). Discrete GPUs will crush it. Integrated graphics will still beat the CPU. Apple Silicon does surprisingly well on Grumpkin thanks to unified memory.
## The bottom line
Your browser has a GPU. That GPU can hash. Fast.
BabyBear: 7.8M hashes/sec. Grumpkin (the field we actually use): 323K hashes/sec. Both blow away single-threaded JavaScript. Both scale beautifully with batch size.
For a privacy wallet, this is the difference between "syncing... please wait" and it just working. The hardware is already there. We just need to use it.