tl;dr: Notes on our current use of Groth16 for Aptos Keyless and how we might improve upon it. Should have applications to anonymous payments, confidential assets, zkVM proof wrapping etc.
Research roadmap
Recall the goals discussed before:
- $\le$ 1.4 ms single-threaded individual ZKP verification time
- make it trivial to upgrade the keyless NP relation
- trusted setups are cumbersome
- universal setups are less cumbersome
- transparent are amazing
- safely implement the NP relation so as to avoid relying on training wheels
- client-side proving
- or at least, minimize costs of running a prover service
- small proof sizes (1.5 KiB?)
- somewhat easily-upgradeable to post-quantum security
If we were to prioritize the problem we’d like solved first (our pain points) while accounting for likelihood of solving things.
Path 1: Circuit-based (less technical risk):
- Milestone 1: Remove circuit-specific trusted setup (tame complexity)
- Milestone 2: Prove obliviously via wrapping on a prover service (privacy) $\Rightarrow$ Spartan, wrapped WHIR, [wrapped] HyperPLONK
- Milestone 3: Formally-verify keyless relation DSL implementation (security)
Possible success paths: Optimal SNARK choice + FREpack/Zinc. Perhaps using the scheme above to wrap a transparent zkSNARK like WHIR.
Path 2: VM-based (higher technical risk, slower timelines):
- Milestone 1: Proving time, proof size and verification time similar to Groth16 (cost)
- Milestone 2: Client-side oblivious proving via wrapping with a lower-cost prover service (privacy)
- Milestone 3: Formally-verify zkVM implementation (security)
Possible success paths: Ligero + Groth16 wrapping. Jolt + Groth16 wrapping.
There are several directions 👇 for replacing the keyless ZKP. In fact, no matter which way we go, there may be some common work:
- Formal verification: formally-verify Keyless circuits, or zkVM implementation, or ZKP wrapping circuits
- Cryptography: MLE PCSs, zkSNARKs, efficient circuit representations, etc.
[Ultra]Groth16-based
- Faster delegated proving:
- Milestone 0: Upgrade to UltraGroth16 $\Rightarrow$ no more Fiat-Shamir
- Path 1: modify
ark-groth16intoark-ultra-groth16and rewrite circuit - Path 2: modify
circom
- Path 1: modify
- Milestone 1: Combine with FREpack/Zinc techniques
- Milestone 2: Deploy VKs for different SHA2-message lengths
- Inform optimal message lengths by building a histogram of
iss-specific JWT sizes
- Inform optimal message lengths by building a histogram of
- Milestone 0: Upgrade to UltraGroth16 $\Rightarrow$ no more Fiat-Shamir
- Moonshot client-side proving (dismissed):
- Milestone 0: UltraGroth + FREpack/Zinc such that the # of R1CS constraints for a wrapping circuit is 10,000 (e.g., a circuit that verifies a WHIR proof).
Other directions we can explore here:
- GPU-based Groth16 prover service
- We’d need to optimize the cloud costs
- Cheaper CPU instance + add one or more GPUs to it? Compare:
- 32-core CPU instance is $1.63 / hour
- vs. 4-core CPU instance is $0.20 / hour plus older NVIDIA T4 GPU is $0.35 / hour
- $\Rightarrow$ 3x cheaper, maybe
- Cheaper CPU instance + add one or more GPUs to it? Compare:
- Ingonyama has made great progress on this; repo here
- We’d need to optimize the cloud costs
- There may be a small $\le 10\%$ proving time reduction via a different way to make Groth16 ZK
Somewhat-dismissed directions
Client-side proving: Even if we reduce the # of constraints $n$ to 100,000, there will likely still be $m \approx n$ variables $\Rightarrow$ proving key size will be $\approx 2m \Gr_1 + m\Gr_2 + n\Gr_1 \approx 2m\Gr_1 + 2m\Gr_1 + m\Gr_1 = 5m\Gr_1 \Rightarrow$ 500,000 $\times$ 32 bytes $\approx$ 15 MiB.
If we ignore the proving key issues, there’d be a few ways to speed up client proving:
- Faster EC arithmetic in JavaScript/WASM
- How optimized is
snarkjs? I suspect very well-optimized.
- How optimized is
- Faster prover implementation via WebGPU $\Rightarrow$ prove $<5$ seconds
The only way we can succeed is with a moonshot combination of:
- Right inner SNARK (e.g., WHIR + Poseidon & blake3b + right arity + right rate + right conjecture)
- Super-efficient constraint system for the inner SNARK verification algorithm
Outsource 50% of the Groth16 proving to the client: In principle, the server could do any subset of the proving work while the client could do the rest. For example, the client could do FFTs and MSM for $h(X)$, server could do MSMs for $\one{A},\two{B},\one{C}$
The show stopper is, once again, that we need to ensure the proving key is small on the client because webapps would need to download it. So, this means the client can only do $\Gr_1$ MSMs of size $\min{(m,n)}$, where $m = $ # of R1CS variables and $n = $ # of constraints. So, for $2^{20}$ constraints, this gives $2^{20} \times$ 32 bytes = 32 MiB prover key. Too much communication.
We’d also have to be very careful; with the TW signature, which should only be computed over valid proofs. The client should not trick server into signing a bad proof.
Spartan-based (PQ)
- Research:
- ZK sumcheck
- Path 1: DeKART’s ZK sumcheck
- Dense MLE ZK PCS:
- Path 1: Survey ZK MLE PCS and pick one whose verification involves a constant-number of pairings and minimizes opening time over MLEs with small entries
- Sparse MLE PCS:
- Path 1: Implement and iterate over Cinder
- Path 2: KZH + GIPA
- Path 3: WHIR?
- Path 4: Leverage uniformity in R1CS matrices
- ZK sumcheck
- Client-side proving:
- Main blocker: the size-$(m-2)$ PCS SRS, where $m$ is the # of R1CS variables
- Assuming the worst case where the public statement $(\mathbf{x}, 1)$ is of size 2 and the witness $\mathbf{w}$ takes up the rest of the R1CS variables
- Milestone 1:
- can prove sumcheck and dense ZK MLE PCS opening client-side in < 5s
- can prove sparse MLE PCS opening server-side < 1s
- Milestone 2: can prove fully client-side < 5s
- Main blocker: the size-$(m-2)$ PCS SRS, where $m$ is the # of R1CS variables
PLONK-based
See some PLONK explorations below.
- Research:
- Evaluate prover times on circuit sizes and inputs representative of keyless
- Client-side proving:
- Main blocker: size-$n$ PCS SRS, where $n$ is the size of the circuit (# of additions + # of multiplications)
- Currently, $n \approx$ 6.4 million
- HyperPLONK has the same challenge
- Main blocker: size-$n$ PCS SRS, where $n$ is the size of the circuit (# of additions + # of multiplications)
HyperPLONK-based (PQ)
- Research:
- ZK sumcheck
- Dense (ZK?) MLE PCS (similar difficulty as in Spartan)
- Choice of custom gates to reduce prover time
- Wrap sumcheck
- Client-side proving:
- Same as in PLONK above, except maybe custom gates don’t have as big of an impact on prover time
- (Note: in univariate PLONK, custom gates affect quotient poylnomial degree but do not affect KZG SRS size because, apparently, PLONK splits up the quotient polynomial into size-$n$ chunks)
WHIR-based (PQ)
See some WHIR explorations below.
- Research:
- Understand what the $\Sigma$-IOP for (G)R1CS in the WHIR paper looks like.
- Add ZK
- Wrapping WHIR fast
- Client-side proving: < 5s
- Milestone 1:
- prove WHIR client-side in < 5s
- wrap server-side < 1s
- Milestone 2: can prove fully client-side < 5s
- Milestone 1:
zkVM-based (PQ)
Jolt, Ligero could be viable options very soon.
$\langle\langle$ new transparent SNARK with $\omega(\mathsf{polylog{n}})$ proof size and $o(n)$ verification time $\rangle\rangle$
Cannot do more than $2^{17}$ field multiplications
Bulletproofs
The large verification time will be a big issue. I think it can be addressed but it re-introduce the size-$n$ SRS problem.
GKR
Log-space uniform circuits.
Engineering roadmap
There’s several things that, if we spend effort on, they are likely to be fruitful no matter what research direction we take:
- Reducing circuit size from 1.5M to 1M
- Registration-based, monetarily-incentivized, curve-agnostic, on-chain powers-of-$\tau$ and Groth16 setups
Groth16
Circuit size
As of October 29th, 2025:
- 1,438,805 constraints
- 1,406,686 variables
Not relevant for Groth16, but for other zkSNARKs like Spartan:
- The matrix $A$ has 4,203,190 non-zero terms
- The matrix $B$ has 3,251,286 non-zero terms
- The matrix $C$ has 2,055,196 non-zero terms
- Total number of nonzero terms: 9,509,672
I think our task.sh script in the keyless-zk-proofs repo can be used to reproduce these “# of non-zero entries” numbers.
rapidsnark (modified) proving time
These are the times taken on a t2d-standard-4 VM for an older version of the circuit with 1.3M constraints and variables.
| Operation | Time (millis) |
|---|---|
| MSM $\one{A}$ | 276 |
| MSM $\one{B}$ | 248 |
| MSM $\two{B}$ | 885 |
| MSM $\one{C}$ | 366 |
| Total $A, B, C$ MSM | 1,775 |
| Calculating C time | 18 |
| iFFT A time | 242 |
| Shift A time | 11 |
| FFT A time | 237 |
| iFFT B time | 240 |
| Shift B time | 10 |
| FFT B time | 237 |
| iFFT C time | 239 |
| Shift C time | 11 |
| FFT C time | 238 |
| Total FFT time | 1,465 |
| ABC time | 21 |
| MSM $h(X)$ time | 2,785 |
| Total | 6,209 |
PLONK
All benchmarks were run on my 10-core Macbook Pro M1 Max.
snarkjs compiler
# of “PLONK constraints” for the keyless relation is 6,421,050 (addition + multiplication gates, I believe.)
How?
Set up a PLONK proving key using a downloaded BN254 12 GiB powers-of-tau file and used ran a snarkjs command:
node --max-old-space-size=$((8192*2)) $(which snarkjs) plonk setup main.r1cs ~/Downloads/powersOfTau28_hez_final_23.ptau plonk.zkey
EspressoSystems/jellyfish
For jellyfish, I modified the benchmarks to fix a bug, exclude batch verification benches and increase the circuit size to $2^{22}$.
(I tried using the exact 6.4M circuit size, but jellyfish borked. Maybe it only likes powers of two.)
The git diff shows:
diff --git a/plonk/benches/bench.rs b/plonk/benches/bench.rs
index 256f0d53..9b8aec21 100644
--- a/plonk/benches/bench.rs
+++ b/plonk/benches/bench.rs
@@ -12,6 +12,7 @@ use ark_bls12_377::{Bls12_377, Fr as Fr377};
use ark_bls12_381::{Bls12_381, Fr as Fr381};
use ark_bn254::{Bn254, Fr as Fr254};
use ark_bw6_761::{Fr as Fr761, BW6_761};
+use ark_serialize::{CanonicalSerialize, Compress};
use ark_ff::PrimeField;
use jf_plonk::{
errors::PlonkError,
@@ -22,8 +23,9 @@ use jf_plonk::{
use jf_relation::{Circuit, PlonkCircuit};
use std::time::Instant;
+const NUM_PROVE_REPETITIONS: usize = 1;
const NUM_REPETITIONS: usize = 10;
-const NUM_GATES_LARGE: usize = 32768;
+const NUM_GATES_LARGE: usize = 4_194_304;
const NUM_GATES_SMALL: usize = 8192;
fn gen_circuit_for_bench<F: PrimeField>(
@@ -58,31 +60,33 @@ macro_rules! plonk_prove_bench {
let start = Instant::now();
- for _ in 0..NUM_REPETITIONS {
+ for _ in 0..NUM_PROVE_REPETITIONS {
let _ = PlonkKzgSnark::<$bench_curve>::prove::<_, _, StandardTranscript>(
rng, &cs, &pk, None,
)
.unwrap();
}
+ let elapsed = start.elapsed();
+ println!(
+ "proving time total {}, {}: {} milliseconds",
+ stringify!($bench_curve),
+ stringify!($bench_plonk_type),
+ elapsed.as_millis() / NUM_PROVE_REPETITIONS as u128
+ );
+
println!(
"proving time for {}, {}: {} ns/gate",
stringify!($bench_curve),
stringify!($bench_plonk_type),
- start.elapsed().as_nanos() / NUM_REPETITIONS as u128 / $num_gates as u128
+ elapsed.as_nanos() / NUM_PROVE_REPETITIONS as u128 / $num_gates as u128
);
};
}
fn bench_prove() {
- plonk_prove_bench!(Bls12_381, Fr381, PlonkType::TurboPlonk, NUM_GATES_LARGE);
- plonk_prove_bench!(Bls12_377, Fr377, PlonkType::TurboPlonk, NUM_GATES_LARGE);
plonk_prove_bench!(Bn254, Fr254, PlonkType::TurboPlonk, NUM_GATES_LARGE);
- plonk_prove_bench!(BW6_761, Fr761, PlonkType::TurboPlonk, NUM_GATES_SMALL);
- plonk_prove_bench!(Bls12_381, Fr381, PlonkType::UltraPlonk, NUM_GATES_LARGE);
- plonk_prove_bench!(Bls12_377, Fr377, PlonkType::UltraPlonk, NUM_GATES_LARGE);
plonk_prove_bench!(Bn254, Fr254, PlonkType::UltraPlonk, NUM_GATES_LARGE);
- plonk_prove_bench!(BW6_761, Fr761, PlonkType::UltraPlonk, NUM_GATES_SMALL);
}
macro_rules! plonk_verify_bench {
@@ -91,7 +95,7 @@ macro_rules! plonk_verify_bench {
let cs = gen_circuit_for_bench::<$bench_field>($num_gates, $bench_plonk_type).unwrap();
let max_degree = $num_gates + 2;
- let srs = PlonkKzgSnark::<$bench_curve>::universal_setup(max_degree, rng).unwrap();
+ let srs = PlonkKzgSnark::<$bench_curve>::universal_setup_for_testing(max_degree, rng).unwrap();
let (pk, vk) = PlonkKzgSnark::<$bench_curve>::preprocess(&srs, &cs).unwrap();
@@ -99,6 +103,14 @@ macro_rules! plonk_verify_bench {
PlonkKzgSnark::<$bench_curve>::prove::<_, _, StandardTranscript>(rng, &cs, &pk, None)
.unwrap();
+ let mut bytes = Vec::with_capacity(proof.serialized_size(Compress::Yes));
+ proof.serialize_with_mode(&mut bytes, Compress::Yes).unwrap();
+
+ println!(
+ "proof size: {} bytes",
+ bytes.len()
+ );
+
let start = Instant::now();
for _ in 0..NUM_REPETITIONS {
@@ -117,14 +129,8 @@ macro_rules! plonk_verify_bench {
}
fn bench_verify() {
- plonk_verify_bench!(Bls12_381, Fr381, PlonkType::TurboPlonk, NUM_GATES_LARGE);
- plonk_verify_bench!(Bls12_377, Fr377, PlonkType::TurboPlonk, NUM_GATES_LARGE);
- plonk_verify_bench!(Bn254, Fr254, PlonkType::TurboPlonk, NUM_GATES_LARGE);
- plonk_verify_bench!(BW6_761, Fr761, PlonkType::TurboPlonk, NUM_GATES_SMALL);
- plonk_verify_bench!(Bls12_381, Fr381, PlonkType::UltraPlonk, NUM_GATES_LARGE);
- plonk_verify_bench!(Bls12_377, Fr377, PlonkType::UltraPlonk, NUM_GATES_LARGE);
- plonk_verify_bench!(Bn254, Fr254, PlonkType::UltraPlonk, NUM_GATES_LARGE);
- plonk_verify_bench!(BW6_761, Fr761, PlonkType::UltraPlonk, NUM_GATES_SMALL);
+ plonk_verify_bench!(Bn254, Fr254, PlonkType::TurboPlonk, NUM_GATES_SMALL);
+ plonk_verify_bench!(Bn254, Fr254, PlonkType::UltraPlonk, NUM_GATES_SMALL);
}
macro_rules! plonk_batch_verify_bench {
@@ -168,19 +174,7 @@ macro_rules! plonk_batch_verify_bench {
};
}
-fn bench_batch_verify() {
- plonk_batch_verify_bench!(Bls12_381, Fr381, PlonkType::TurboPlonk, 1000);
- plonk_batch_verify_bench!(Bls12_377, Fr377, PlonkType::TurboPlonk, 1000);
- plonk_batch_verify_bench!(Bn254, Fr254, PlonkType::TurboPlonk, 1000);
- plonk_batch_verify_bench!(BW6_761, Fr761, PlonkType::TurboPlonk, 1000);
- plonk_batch_verify_bench!(Bls12_381, Fr381, PlonkType::UltraPlonk, 1000);
- plonk_batch_verify_bench!(Bls12_377, Fr377, PlonkType::UltraPlonk, 1000);
- plonk_batch_verify_bench!(Bn254, Fr254, PlonkType::UltraPlonk, 1000);
- plonk_batch_verify_bench!(BW6_761, Fr761, PlonkType::UltraPlonk, 1000);
-}
-
fn main() {
- bench_prove();
+ // bench_prove();
bench_verify();
- bench_batch_verify();
}
Then, to run the benchmarks, in the root of the repo as per the README, I ran:
time cargo bench --bench plonk-benches --features=test-srs
To run single-threaded:
time RAYON_NUM_THREADS=1 cargo bench --bench plonk-benches --features=test-srs
Notes:
- It generates a circuit with just addition gates
- It just insert $n$ gates where gate $i$ simply adds 1 to gate $i-1$
- By default runs multithread on all cores (judging from
htopoutput) - Single-threaded verification is a little bit slower
- TurboPLONK has custom gates and UltraPLONK adds lookups
Uncertainties:
- What is the relation being proved? The public input in the code is the empty vector.
- What values do the gates get assigned? Otherwise, hard to understand the MSM work being done.
- It is unclear what gate 0 is initialized to: note that
cs.zero()does not mean the gate is given the value 0!
- It is unclear what gate 0 is initialized to: note that
- Will the UltraPLONK verifier costs always be 1.36 ms no matter what the circuit is?
- What overhead do custom gates add over vanilla PLONK?
| Library | Scheme | # gates | # threads | Prove | Verify | Proof |
|---|---|---|---|---|---|---|
| jellyfish | TurboPLONK | $2^{22}$ | default | 104.16 secs | 1.36 ms | 769 bytes |
| jellyfish | UltraPLONK | $2^{22}$ | default | 190.86 secs | 1.41 ms | 1,481 bytes |
| jellyfish | TurboPLONK | $2^{13}$ | 1 | don’t care | 1.55 ms | 769 bytes |
| jellyfish | UltraPLONK | $2^{13}$ | 1 | don’t care | 1.72 ms | 1,481 bytes |
Benchmark single-thread jellyfish proving times.
Play with the witness size: figure out how to initialize the first gate to 0 vs. $p/2$.
Benchmark the CAP library to get a sense if custom gates add more verifier time.
EspressoSystems/hyperplonk
Run some benchmarks here
dusk-network/plonk
Benchmark dusk-network/plonk PLONK.
WHIR
Include Yinuo’s numbers.
Ligetron
Notes from the ZKProof presentation1 below:
- Proving WASM executions.
- ZK by default.
- WebGPU acceleration.
- Hash-based + code-interleaving.
- Streaming prover (“witness and constraints can be streamed”).
- Proving key is allegedly small.
- Based on MPCitH. See (Simons?) Berkley talk for more depth, potentially.
- Prover is memory efficient: memory usage is proportional to what the WASM program uses.
- To get memory efficiency they sacrifice proof length (otherwise, it’s hard due to inherent time-space trade-offs in error-correcting codes based on the distance of the code).
- Encodes full witness as a $\sqrt{n}$ by $\sqrt{n}$ matrix
- Can “stream the matrix” into the IOP to ensure memory is $O(\sqrt{n})$
References
For cited works, see below 👇👇