Your browser has a supercomputer in it. We put it to work.
Every privacy wallet needs to hash. A lot. Every note in the NixPool is a Poseidon2 hash. Every branch of the Merkle tree? Poseidon2. Every time your wallet syncs, it's cranking through thousands of these permutations to figure out which notes are yours.
On a single CPU thread in JavaScript, that's fine for one hash. But when you need to rebuild a Merkle tree with a million leaves, or scan a few thousand transactions for incoming notes, you're sitting there waiting. And waiting.
Here's the thing though: your GPU has thousands of cores, and each Poseidon2 hash is completely independent. No hash needs to know about any other hash. That's the dream workload for a GPU.
So we wrote Poseidon2 as a WebGPU compute shader and benchmarked it against the CPU. The results were kind of wild.
## What we actually built
We wrote two separate GPU implementations in WGSL (that's WebGPU's shader language):
BabyBear (31-bit field) - a small STARK-friendly prime used by newer proving systems like Plonky3 and SP1. It's the easy one. The prime fits in 31 bits, so Montgomery multiplication is basically just some 32-bit math with careful carry handling. Width-16 state, 8 external rounds, 13 internal rounds, x^7 S-box.
Grumpkin (254-bit field) - the one NixProtocol actually uses. This is the hard one. Each field element takes 256 bits, which means 8 separate 32-bit "limbs" on the GPU. Every single multiplication becomes this massive operation where you're juggling 64 partial products and tracking carries across all 8 limbs. Width-3 state, 8 external rounds, 56 internal rounds, x^5 S-box.
Both use Montgomery multiplication to keep things fast in the inner loop. And before any benchmarking happens, we run a correctness check: compute the same hashes on the CPU with JavaScript BigInt, compare every result against the GPU output, and only proceed if everything matches.
## How we measured
- GPU: WebGPU compute shaders, 64 threads per workgroup
- CPU: Plain single-threaded JavaScript with BigInt. No WASM, no tricks
- Timing: GPU time averaged over 3 runs. CPU time is a single run with performance.now()
- Inputs: Deterministic random field elements so results are reproducible
Why use unoptimized JavaScript as the baseline? Because that's what a browser wallet actually uses. We're not trying to prove GPUs beat optimized C++. We're answering a practical question: if your wallet is doing all this hashing in JavaScript right now, how much faster would it be if you offloaded it to the GPU that's just sitting there idle?
## The BabyBear numbers
| Batch Size | CPU (ms) | GPU (ms) | Speedup | GPU H/s |
|---|---|---|---|---|
| 256 | 15.40 | 0.67 | 22.9x | 384.0K |
| 1,024 | 15.88 | 0.97 | 16.4x | 1.1M |
| 4,096 | 108.00 | 8.33 | 13.0x | 12.2M |
| 16,384 | 119.00 | 3.23 | 51.6x | 7.8M |
| 65,536 | 134.00 | 8.41 | 15.9x | 7.8M |
7.8 million hashes per second. On a GPU. In a browser. For a ZK-friendly hash function over a prime field. That's not bad.
The 51x speedup at 16K hashes is the sweet spot - enough work to keep the GPU busy, not so much that we're hitting memory limits. BabyBear is basically designed for this kind of thing. Each Montgomery multiply is a handful of 32-bit operations. The GPU chews through it.
## The Grumpkin numbers (the one we actually care about)
| Batch Size | CPU (ms) | GPU (ms) | Speedup | GPU H/s |
|---|---|---|---|---|
| 64 | 10.60 | 3.00 | 3.5x | 21.4K |
| 256 | 24.80 | 7.00 | 3.1x | 31.4K |
| 1,024 | 93.00 | 9.47 | 9.6x | 105.9K |
| 4,096 | 409.00 | 12.67 | 32.3x | 323.3K |
OK, so Grumpkin is way heavier. Makes sense - you're doing 256-bit arithmetic on hardware that only speaks 32-bit. Every single field multiplication is ~128 partial products (compared to ~4 for BabyBear). It's like doing long multiplication by hand, except each number is 8 limbs wide and each limb is a 32-bit integer.
But even with all that overhead, at 4,096 hashes we're getting 323K hashes/sec and a 32x speedup. The CPU takes 409ms for that same batch. The GPU does it in under 13ms. The bigger the batch, the more the GPU flexes.
## Why this works so well
Every hash is independent. This is what GPU people call "embarrassingly parallel." Hash #47 doesn't need to know anything about hash #48. There's no shared state, no synchronization, no locks. You just fire off thousands of threads and each one does its own thing. GPUs were built for exactly this.
It's almost all math. Poseidon2's inner loop is field multiplications and additions. Barely any memory access. GPUs love compute-heavy workloads with minimal memory chatter.
Montgomery form pays for itself. You convert to Montgomery representation once at the start, do all your multiplications in Montgomery space (which is cheaper), then convert back at the end. One-time cost, dozens of rounds of savings.
## Where it falls short
Small batches aren't worth it. At 64 Grumpkin hashes, you only get 3.5x. The overhead of setting up the GPU dispatch, encoding commands, and waiting for synchronization eats most of the gains. You need enough work to make it worthwhile.
WGSL doesn't have 64-bit integers. This is the big one. Every 32x32 multiply has to be faked with four 16x16 multiplies to avoid overflow. If WGSL ever gets u64 support, Grumpkin performance would roughly double overnight. We're doing twice the work we should need to.
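To make that concrete, here's a plain-JavaScript sketch of the workaround. The real kernel does the same thing in WGSL with `u32` values; the function name is ours, not the shader's.

```javascript
// Sketch: a 32x32 -> 64-bit multiply built from four 16x16 -> 32-bit
// products, mirroring what the WGSL kernel must do without u64.
// mul32x32 is an illustrative name, not from the actual codebase.
function mul32x32(a, b) {
  const aLo = a & 0xffff, aHi = a >>> 16;
  const bLo = b & 0xffff, bHi = b >>> 16;
  const ll = aLo * bLo;                        // bits 0..31
  const lh = aLo * bHi;                        // bits 16..47
  const hl = aHi * bLo;                        // bits 16..47
  const hh = aHi * bHi;                        // bits 32..63
  // Fold the two middle products plus the carry out of the low product
  const mid = lh + hl + Math.floor(ll / 0x10000);
  const lo = ((mid % 0x10000) * 0x10000 + (ll % 0x10000)) >>> 0;
  const hi = (hh + Math.floor(mid / 0x10000)) >>> 0;
  return [hi, lo];                             // 64-bit result as (hi, lo)
}
```

With native u64 support this whole function would collapse into a single multiply, which is why that wishlist item matters so much.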
256 bits on 32-bit hardware is just hard. There's no way around it. A native CUDA or Metal kernel with proper 64-bit integer support would smoke these numbers. But we're in a browser, and WebGPU is what we've got.
## So what does this mean for your wallet?
This is where it gets practical. NixProtocol uses Poseidon2 over the Grumpkin field for everything: note commitments, Merkle trees, nullifier derivation.
Syncing your wallet. A depth-20 Merkle tree has up to ~1 million leaves. At 323K GPU hashes/sec, rebuilding it takes about 3 seconds. In plain JavaScript? Over 90 seconds. That's the difference between "fast" and "go make coffee."
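For anyone who wants to check that estimate, here's the back-of-envelope arithmetic, using the measured rates from the Grumpkin table above (323K H/s on the GPU; the CPU rate comes from the 4,096-hashes-in-409ms row):

```javascript
// Back-of-envelope for the wallet sync estimate above.
const leaves = 2 ** 20;                       // depth-20 tree, ~1M leaves
const hashes = leaves - 1;                    // internal nodes to rebuild
const gpuSeconds = hashes / 323_000;          // about 3.2 seconds
const cpuSeconds = hashes / (4096 / 0.409);   // about 105 seconds
console.log(gpuSeconds.toFixed(1), cpuSeconds.toFixed(0));
```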
Finding your notes. Your wallet needs to check incoming transactions to see if any notes belong to you. That involves Poseidon2 hashes for commitment verification. Batch those up, ship them to the GPU, get answers back fast.
Helping the prover. The actual UltraHonk proof runs in Barretenberg's WASM backend. But there's preprocessing work (computing commitment trees, deriving nullifiers) that's all Poseidon2 hashing. The GPU can crunch through that while the CPU prepares the prover.
## Under the hood: making multiplication work on a GPU
This is the nerdy part. Skip it if you don't care about carry propagation. But if you've ever wondered what it takes to do 256-bit modular arithmetic on a chip that maxes out at 32 bits, read on.
### BabyBear: the easy one
The prime is 2,013,265,921. Fits in 31 bits. Montgomery multiplication computes (a * b * R^-1) mod P where R = 2^32.
Split each 32-bit input into two 16-bit halves. Four partial products, all safely under 32 bits. Combine with carries to get a 64-bit result in (hi, lo) form. Then Montgomery reduction: multiply the low word by (-P^-1) mod 2^32, add the correction, shift right by 32. Maybe subtract P once at the end. About 20 instructions total. Clean.
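As a sketch, here are the same steps in JavaScript, with BigInt standing in for the shader's (hi, lo) word pairs. The helper names are ours; the one BabyBear-specific nicety is that -P^-1 mod 2^32 works out to exactly P - 2.

```javascript
// The BabyBear Montgomery multiply described above, sketched with BigInt.
// The real WGSL version builds the 64-bit product from 16-bit halves;
// here BigInt plays the role of the (hi, lo) pair. Names are illustrative.
const P = 2013265921n;        // BabyBear prime, 15 * 2^27 + 1
const MASK32 = 0xffffffffn;
const NEG_P_INV = P - 2n;     // -P^-1 mod 2^32 (for BabyBear this is P - 2)

function montMul(a, b) {      // a, b in Montgomery form, < P
  const t = a * b;                               // 64-bit product
  const m = ((t & MASK32) * NEG_P_INV) & MASK32; // low word * (-P^-1) mod 2^32
  const r = (t + m * P) >> 32n;                  // add correction, shift by 32
  return r >= P ? r - P : r;                     // maybe subtract P once
}

// Entering/leaving Montgomery form: multiply by R^2 mod P, or by 1
const R2 = (1n << 64n) % P;                      // R^2 mod P, with R = 2^32
const toMont = (x) => montMul(x % P, R2);
const fromMont = (x) => montMul(x, 1n);

console.log(fromMont(montMul(toMont(3n), toMont(5n)))); // 15n
```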
### Grumpkin: the hard one
A 256-bit number lives in 8 x 32-bit limbs. To multiply two of these, we use the CIOS algorithm (Coarsely Integrated Operand Scanning - the name sounds made up but it's real).
Loop 8 times (once per limb of the multiplier). Each iteration:
- Multiply and accumulate: 8 limb-by-limb products added into a running total
- Reduce: compute a correction factor, multiply it by the prime (another 8 limb products), add and shift
That's 16 limb multiplies per iteration, 8 iterations, and each limb multiply is actually 4 sub-products because we have to split into 16-bit pieces. So roughly 512 tiny multiplications per field multiply, plus a mountain of carry tracking.
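Here's the same loop as a plain-JavaScript sketch, with BigInts masked to 32 bits standing in for the shader's u32 limbs. The modulus shown is the BN254 scalar field, which Grumpkin uses as its base field; the helper names are ours, not the shader's.

```javascript
// CIOS Montgomery multiplication over 8 x 32-bit limbs, sketched in JS.
const P = 21888242871839275222246405745257275088548364400416034343698204186575808495617n;
const MASK = 0xffffffffn;
const toLimbs = (x) => Array.from({ length: 8 }, (_, i) => (x >> BigInt(32 * i)) & MASK);
const fromLimbs = (l) => l.reduce((acc, w, i) => acc + (w << BigInt(32 * i)), 0n);
const p = toLimbs(P);

// n0inv = -P^-1 mod 2^32, from the low limb via Newton's iteration
let inv = 1n;
for (let i = 0; i < 5; i++) inv = (inv * (2n - p[0] * inv)) & MASK;
const n0inv = (0x100000000n - inv) & MASK;

function ciosMul(a, b) {            // a, b: 8-limb arrays in Montgomery form
  const t = new Array(10).fill(0n);
  for (let i = 0; i < 8; i++) {
    // 1) multiply-accumulate: t += a * b[i]  (8 limb products)
    let c = 0n;
    for (let j = 0; j < 8; j++) {
      const s = t[j] + a[j] * b[i] + c;
      t[j] = s & MASK; c = s >> 32n;
    }
    let top = t[8] + c; t[8] = top & MASK; t[9] = top >> 32n;
    // 2) reduce: add m*P so the low limb cancels, then shift down one limb
    //    (another 8 limb products)
    const m = (t[0] * n0inv) & MASK;
    c = (t[0] + m * p[0]) >> 32n;
    for (let j = 1; j < 8; j++) {
      const s = t[j] + m * p[j] + c;
      t[j - 1] = s & MASK; c = s >> 32n;
    }
    top = t[8] + c; t[7] = top & MASK; t[8] = t[9] + (top >> 32n);
  }
  const r = fromLimbs(t.slice(0, 8));
  return toLimbs(r >= P ? r - P : r);   // at most one subtraction of P
}

// Round-trip demo: into Montgomery form, multiply, back out
const R2 = toLimbs((1n << 512n) % P);   // R^2 mod P, with R = 2^256
const ONE = toLimbs(1n);
const xm = ciosMul(toLimbs(1234567n), R2);
const ym = ciosMul(toLimbs(7654321n), R2);
console.log(fromLimbs(ciosMul(ciosMul(xm, ym), ONE)) === (1234567n * 7654321n) % P); // true
```

The sketch keeps the two 8-product phases of each iteration visible; the shader additionally splits every one of those limb products into the four 16-bit sub-products described above.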
The S-box computes x^5 in three multiplications: x^2 = x·x, x^4 = x^2·x^2, x^5 = x^4·x. In external rounds the S-box hits all 3 state elements (9 field muls per round). In internal rounds it only hits element 0 (3 field muls). That's 72 + 168 = 240 full 256-bit multiplications per hash just for S-boxes, plus additions for the linear layers. And the GPU still wins by 32x. Parallelism is a hell of a thing.
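That multiplication chain and the round-count arithmetic can be sketched like this, with plain BigInt modular multiplication standing in for the Montgomery multiply:

```javascript
// x^5 in exactly three field multiplications, plus the per-hash S-box
// multiply count quoted above. `mul` here is plain BigInt mod P; the
// shader uses the Montgomery multiply instead.
const P = 21888242871839275222246405745257275088548364400416034343698204186575808495617n;
const mul = (a, b) => (a * b) % P;

function pow5(x) {
  const x2 = mul(x, x);     // x^2
  const x4 = mul(x2, x2);   // x^4
  return mul(x4, x);        // x^5
}

console.log(pow5(123456789n) === 123456789n ** 5n % P); // true

// External rounds S-box all 3 elements; internal rounds only element 0
const sboxMuls = 8 * 3 * 3 + 56 * 1 * 3;  // 72 + 168 = 240 per hash
```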
### Side by side
| | BabyBear (31-bit) | Grumpkin (254-bit) |
|---|---|---|
| Peak GPU throughput | 7.8M H/s | 323K H/s |
| Best speedup vs. JS | 51.6x | 32.3x |
| Limbs per element | 1 | 8 |
| Work per field multiply | ~4 sub-products | ~128 sub-products |
| S-box | 4 field muls (x^7) | 3 field muls (x^5) |
| Rounds | 8 external + 13 internal | 8 external + 56 internal |
BabyBear is about 24x faster on the GPU. Grumpkin pays for more limbs per element (8 vs. 1), more rounds (64 vs. 21), and carry propagation across all 8 limbs on every operation - it all adds up.
## What we're looking at next
Better CPU baselines. Our CPU number is unoptimized JavaScript. A WASM build with SIMD would close the gap, but the GPU should still win for large batches.
More precise timing. Right now we're timing from JavaScript. WebGPU has timestamp query support that would give us exact GPU-side measurements, especially useful for small batches where dispatch overhead muddies things.
CPU + GPU together. The smart play for proof generation: let the GPU crunch Merkle trees and commitments while the CPU runs the UltraHonk prover. Pipeline them so nothing's sitting idle.
u64 support in WGSL. If this ever ships, Grumpkin performance roughly doubles. We'd go from faking 64-bit math with four 16-bit products to just... doing 64-bit math. It's the single biggest improvement waiting to happen.
## Try it yourself
We put the benchmark online so you can run it on your own hardware. Click the button, watch your GPU go to work, and see how your numbers compare.
Run the Poseidon2 WebGPU Benchmark
You'll need Chrome 113+, Edge 113+, or Safari 18+ (WebGPU required). Discrete GPUs will crush it. Integrated graphics will still beat the CPU. Apple Silicon does surprisingly well on Grumpkin thanks to unified memory.
## The bottom line
Your browser has a GPU. That GPU can hash. Fast.
BabyBear: 7.8M hashes/sec. Grumpkin (the field we actually use): 323K hashes/sec. Both blow away single-threaded JavaScript. Both scale beautifully with batch size.
For a privacy wallet, this is the difference between "syncing... please wait" and it just working. The hardware is already there. We just need to use it.