tl;dr: Notes on our current use of Groth16 for Aptos Keyless and how we might improve upon it.
Research roadmap
Recall the goals discussed before:
- $\le$ 1.4 ms single-threaded individual ZKP verification time
- make it trivial to upgrade the keyless NP relation
  - trusted setups are cumbersome
  - universal setups are less cumbersome
  - transparent setups are amazing
- safely implement the NP relation so as to avoid relying on training wheels
- client-side proving
  - or at least, minimize costs of running a prover service
- small proof sizes (1.5 KiB?)
Our goals for keyless that would remove a lot of pain:
- Implement keyless relation safely in Rust, not as a circuit (security) $\Rightarrow$ zkVMs
- Remove circuit-specific trusted setup (tame complexity) $\Rightarrow$ WHIR, Spartan, [Hyper]PLONK
- Remove proving service (tame complexity, reduce costs)
- Prove obliviously via wrapping (privacy) $\Rightarrow$ Spartan, wrapped WHIR, [wrapped] HyperPLONK
- Reduce circuit size from 1.5M to 1M (efficiency)
There are several directions 👇 for replacing the keyless ZKP. In fact, no matter which way we go, there may be some common work:
- Security: formally verify Keyless circuits, or the zkVM implementation, or the ZKP wrapping circuits
- Upgradability: registration-based, monetarily-incentivized, curve-agnostic, on-chain universal or trusted setups
  - e.g., if Groth16 is involved, or if a KZG-like scheme is involved
- Research: MLE PCSs, zkSNARKs, efficient circuit representations, etc.
Groth16-based
- Temporary prover service scalability: deploy VKs for different SHA2-message lengths
  - Inform optimal message lengths by building a histogram of `iss`-specific JWT sizes
- Client-side only proving:
  - Milestone 0: Upgrade to UltraGroth16 $\Rightarrow$ no more Fiat-Shamir
    - Path 1: modify `ark-groth16` into `ark-ultra-groth16` and rewrite the circuit
    - Path 2: modify `circom`
  - Milestone 1: Combine with FREpack-like techniques
  - Milestone 2: faster EC arithmetic in JavaScript (How optimized is `snarkjs`? I suspect very well-optimized.) $\Rightarrow$ could prove in under 10 seconds
  - Milestone 3: faster prover implementation via WebGPU $\Rightarrow$ prove in $<5$ seconds
Spartan-based (PQ)
- Research:
  - ZK sumcheck
    - Path 1: DeKART’s ZK sumcheck
  - Dense MLE ZK PCS:
    - Path 1: Survey ZK MLE PCSs and pick one whose verification involves a constant number of pairings and minimizes opening time over MLEs with small entries
  - Sparse MLE PCS:
    - Path 1: Implement and iterate over Cinder
    - Path 2: KZH + GIPA
    - Path 3: WHIR?
    - Path 4: Leverage uniformity in R1CS matrices
- Client-side proving:
  - Milestone 1:
    - can prove sumcheck and dense ZK MLE PCS opening client-side in < 5s
    - can prove sparse MLE PCS opening server-side in < 1s
  - Milestone 2: can prove fully client-side in < 5s
PLONK-based
See some PLONK explorations below.
- Research:
  - Evaluate prover times on circuit sizes and inputs representative of keyless
HyperPLONK-based (PQ)
- Research:
  - ZK sumcheck
  - Dense (ZK?) MLE PCS (similar difficulty as in Spartan)
  - Choice of custom gates to reduce prover time
  - Wrap sumcheck
- Client-side proving: < 5s
WHIR-based (PQ)
See some WHIR explorations below.
- Research:
  - WHIR for R1CS
  - Add ZK
  - Wrapping WHIR fast
- Client-side proving:
  - Milestone 1:
    - prove WHIR client-side in < 5s
    - wrap server-side in < 1s
  - Milestone 2: can prove fully client-side in < 5s
zkVM-based (PQ)
Jolt and Ligero could become viable options very soon.
Groth16
Circuit size
As of October 29th, 2025:
- 1,438,805 constraints
- 1,406,686 variables
Not relevant for Groth16, but for other zkSNARKs like Spartan:
- The matrix $A$ has 4,203,190 non-zero terms
- The matrix $B$ has 3,251,286 non-zero terms
- The matrix $C$ has 2,055,196 non-zero terms
- Total number of nonzero terms: 9,509,672
I think our `task.sh` script in the keyless-zk-proofs repo can be used to reproduce these non-zero-entry counts.
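As a quick arithmetic sanity check, the per-matrix counts above do sum to the quoted total:

```python
# Per-matrix non-zero counts for the keyless R1CS, as quoted above.
nnz = {"A": 4_203_190, "B": 3_251_286, "C": 2_055_196}
total = sum(nnz.values())
print(total)  # 9509672, matching the quoted total
```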
rapidsnark (modified) proving time
These are the times taken on a t2d-standard-4 VM for an older version of the circuit with 1.3M constraints and variables.
| Operation | Time (millis) |
|---|---|
| MSM $\one{A}$ | 276 |
| MSM $\one{B}$ | 248 |
| MSM $\two{B}$ | 885 |
| MSM $\one{C}$ | 366 |
| Total $A, B, C$ MSM | 1,775 |
| Calculating C time | 18 |
| iFFT A time | 242 |
| Shift A time | 11 |
| FFT A time | 237 |
| iFFT B time | 240 |
| Shift B time | 10 |
| FFT B time | 237 |
| iFFT C time | 239 |
| Shift C time | 11 |
| FFT C time | 238 |
| Total FFT time | 1,465 |
| ABC time | 21 |
| MSM $h(X)$ time | 2,785 |
| Total | 6,209 |
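A quick cross-check of the table (illustrative Python): the itemized steps account for about 6,064 ms of the 6,209 ms total, so the remaining ~145 ms is presumably setup and glue work that is not broken out above.

```python
# Cross-check the rapidsnark timing table above (all times in ms).
msm_abc = 276 + 248 + 885 + 366                         # the four A/B/C MSMs
fft = 242 + 11 + 237 + 240 + 10 + 237 + 239 + 11 + 238  # all iFFT/shift/FFT steps
assert msm_abc == 1_775 and fft == 1_465                # match the table's subtotals
accounted = msm_abc + 18 + fft + 21 + 2_785             # + calc-C, ABC, h(X) MSM
print(accounted, 6_209 - accounted)  # 6064 145
```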
PLONK
All benchmarks were run on my 10-core MacBook Pro (M1 Max).
snarkjs compiler
The # of “PLONK constraints” for the keyless relation is 6,421,050 (addition + multiplication gates, I believe).
How?
Set up a PLONK proving key using a downloaded 12 GiB BN254 powers-of-tau file and ran a `snarkjs` command:
```bash
node --max-old-space-size=$((8192*2)) $(which snarkjs) plonk setup main.r1cs ~/Downloads/powersOfTau28_hez_final_23.ptau plonk.zkey
```
EspressoSystems/jellyfish
For jellyfish, I modified the benchmarks to fix a bug, exclude batch verification benches and increase the circuit size to $2^{22}$.
(I tried using the exact 6.4M circuit size, but jellyfish borked. Maybe it only likes powers of two.)
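If jellyfish indeed only accepts power-of-two sizes, a quick computation (illustrative Python) shows where the 6.4M constraint count lands: $2^{22}$ is the largest power of two below it, and rounding up would require $2^{23}$.

```python
n = 6_421_050                      # keyless "PLONK constraint" count
below = 1 << (n.bit_length() - 1)  # largest power of two <= n
above = 1 << n.bit_length()        # smallest power of two > n
print(below, above)  # 4194304 8388608, i.e. 2**22 and 2**23
```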
The git diff shows:
```diff
diff --git a/plonk/benches/bench.rs b/plonk/benches/bench.rs
index 256f0d53..9b8aec21 100644
--- a/plonk/benches/bench.rs
+++ b/plonk/benches/bench.rs
@@ -12,6 +12,7 @@ use ark_bls12_377::{Bls12_377, Fr as Fr377};
 use ark_bls12_381::{Bls12_381, Fr as Fr381};
 use ark_bn254::{Bn254, Fr as Fr254};
 use ark_bw6_761::{Fr as Fr761, BW6_761};
+use ark_serialize::{CanonicalSerialize, Compress};
 use ark_ff::PrimeField;
 use jf_plonk::{
     errors::PlonkError,
@@ -22,8 +23,9 @@ use jf_plonk::{
 use jf_relation::{Circuit, PlonkCircuit};
 use std::time::Instant;
 
+const NUM_PROVE_REPETITIONS: usize = 1;
 const NUM_REPETITIONS: usize = 10;
-const NUM_GATES_LARGE: usize = 32768;
+const NUM_GATES_LARGE: usize = 4_194_304;
 const NUM_GATES_SMALL: usize = 8192;
 
 fn gen_circuit_for_bench<F: PrimeField>(
@@ -58,31 +60,33 @@ macro_rules! plonk_prove_bench {
         let start = Instant::now();
 
-        for _ in 0..NUM_REPETITIONS {
+        for _ in 0..NUM_PROVE_REPETITIONS {
             let _ = PlonkKzgSnark::<$bench_curve>::prove::<_, _, StandardTranscript>(
                 rng, &cs, &pk, None,
             )
             .unwrap();
         }
 
+        let elapsed = start.elapsed();
+        println!(
+            "proving time total {}, {}: {} milliseconds",
+            stringify!($bench_curve),
+            stringify!($bench_plonk_type),
+            elapsed.as_millis() / NUM_PROVE_REPETITIONS as u128
+        );
+
         println!(
             "proving time for {}, {}: {} ns/gate",
             stringify!($bench_curve),
             stringify!($bench_plonk_type),
-            start.elapsed().as_nanos() / NUM_REPETITIONS as u128 / $num_gates as u128
+            elapsed.as_nanos() / NUM_PROVE_REPETITIONS as u128 / $num_gates as u128
         );
     };
 }
 
 fn bench_prove() {
-    plonk_prove_bench!(Bls12_381, Fr381, PlonkType::TurboPlonk, NUM_GATES_LARGE);
-    plonk_prove_bench!(Bls12_377, Fr377, PlonkType::TurboPlonk, NUM_GATES_LARGE);
     plonk_prove_bench!(Bn254, Fr254, PlonkType::TurboPlonk, NUM_GATES_LARGE);
-    plonk_prove_bench!(BW6_761, Fr761, PlonkType::TurboPlonk, NUM_GATES_SMALL);
-    plonk_prove_bench!(Bls12_381, Fr381, PlonkType::UltraPlonk, NUM_GATES_LARGE);
-    plonk_prove_bench!(Bls12_377, Fr377, PlonkType::UltraPlonk, NUM_GATES_LARGE);
     plonk_prove_bench!(Bn254, Fr254, PlonkType::UltraPlonk, NUM_GATES_LARGE);
-    plonk_prove_bench!(BW6_761, Fr761, PlonkType::UltraPlonk, NUM_GATES_SMALL);
 }
 
 macro_rules! plonk_verify_bench {
@@ -91,7 +95,7 @@ macro_rules! plonk_verify_bench {
         let cs = gen_circuit_for_bench::<$bench_field>($num_gates, $bench_plonk_type).unwrap();
 
         let max_degree = $num_gates + 2;
-        let srs = PlonkKzgSnark::<$bench_curve>::universal_setup(max_degree, rng).unwrap();
+        let srs = PlonkKzgSnark::<$bench_curve>::universal_setup_for_testing(max_degree, rng).unwrap();
 
         let (pk, vk) = PlonkKzgSnark::<$bench_curve>::preprocess(&srs, &cs).unwrap();
@@ -99,6 +103,14 @@ macro_rules! plonk_verify_bench {
             PlonkKzgSnark::<$bench_curve>::prove::<_, _, StandardTranscript>(rng, &cs, &pk, None)
                 .unwrap();
 
+        let mut bytes = Vec::with_capacity(proof.serialized_size(Compress::Yes));
+        proof.serialize_with_mode(&mut bytes, Compress::Yes).unwrap();
+
+        println!(
+            "proof size: {} bytes",
+            bytes.len()
+        );
+
         let start = Instant::now();
 
         for _ in 0..NUM_REPETITIONS {
@@ -117,14 +129,8 @@ macro_rules! plonk_verify_bench {
 }
 
 fn bench_verify() {
-    plonk_verify_bench!(Bls12_381, Fr381, PlonkType::TurboPlonk, NUM_GATES_LARGE);
-    plonk_verify_bench!(Bls12_377, Fr377, PlonkType::TurboPlonk, NUM_GATES_LARGE);
-    plonk_verify_bench!(Bn254, Fr254, PlonkType::TurboPlonk, NUM_GATES_LARGE);
-    plonk_verify_bench!(BW6_761, Fr761, PlonkType::TurboPlonk, NUM_GATES_SMALL);
-    plonk_verify_bench!(Bls12_381, Fr381, PlonkType::UltraPlonk, NUM_GATES_LARGE);
-    plonk_verify_bench!(Bls12_377, Fr377, PlonkType::UltraPlonk, NUM_GATES_LARGE);
-    plonk_verify_bench!(Bn254, Fr254, PlonkType::UltraPlonk, NUM_GATES_LARGE);
-    plonk_verify_bench!(BW6_761, Fr761, PlonkType::UltraPlonk, NUM_GATES_SMALL);
+    plonk_verify_bench!(Bn254, Fr254, PlonkType::TurboPlonk, NUM_GATES_SMALL);
+    plonk_verify_bench!(Bn254, Fr254, PlonkType::UltraPlonk, NUM_GATES_SMALL);
 }
 
 macro_rules! plonk_batch_verify_bench {
@@ -168,19 +174,7 @@ macro_rules! plonk_batch_verify_bench {
     };
 }
 
-fn bench_batch_verify() {
-    plonk_batch_verify_bench!(Bls12_381, Fr381, PlonkType::TurboPlonk, 1000);
-    plonk_batch_verify_bench!(Bls12_377, Fr377, PlonkType::TurboPlonk, 1000);
-    plonk_batch_verify_bench!(Bn254, Fr254, PlonkType::TurboPlonk, 1000);
-    plonk_batch_verify_bench!(BW6_761, Fr761, PlonkType::TurboPlonk, 1000);
-    plonk_batch_verify_bench!(Bls12_381, Fr381, PlonkType::UltraPlonk, 1000);
-    plonk_batch_verify_bench!(Bls12_377, Fr377, PlonkType::UltraPlonk, 1000);
-    plonk_batch_verify_bench!(Bn254, Fr254, PlonkType::UltraPlonk, 1000);
-    plonk_batch_verify_bench!(BW6_761, Fr761, PlonkType::UltraPlonk, 1000);
-}
-
 fn main() {
-    bench_prove();
+    // bench_prove();
     bench_verify();
-    bench_batch_verify();
 }
```
Then, to run the benchmarks, in the root of the repo as per the README, I ran:
```bash
time cargo bench --bench plonk-benches --features=test-srs
```
To run single-threaded:
```bash
time RAYON_NUM_THREADS=1 cargo bench --bench plonk-benches --features=test-srs
```
Notes:
- It generates a circuit with just addition gates
- It just inserts $n$ gates, where gate $i$ simply adds 1 to gate $i-1$
- By default, it runs multithreaded on all cores (judging from `htop` output)
- Single-threaded verification is a little bit slower
- TurboPLONK has custom gates and UltraPLONK adds lookups
Uncertainties:
- What is the relation being proved? The public input in the code is the empty vector.
- What values do the gates get assigned? Otherwise, it is hard to understand the MSM work being done.
- It is unclear what gate 0 is initialized to: note that `cs.zero()` does not mean the gate is given the value 0!
- Will the UltraPLONK verifier cost always be 1.36 ms, no matter what the circuit is?
- What overhead do custom gates add over vanilla PLONK?
| Library | Scheme | # gates | # threads | Prove | Verify | Proof |
|---|---|---|---|---|---|---|
| jellyfish | TurboPLONK | $2^{22}$ | default | 104.16 secs | 1.36 ms | 769 bytes |
| jellyfish | UltraPLONK | $2^{22}$ | default | 190.86 secs | 1.41 ms | 1,481 bytes |
| jellyfish | TurboPLONK | $2^{13}$ | 1 | don’t care | 1.55 ms | 769 bytes |
| jellyfish | UltraPLONK | $2^{13}$ | 1 | don’t care | 1.72 ms | 1,481 bytes |
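Naively extrapolating the $2^{22}$-gate TurboPLONK proving time linearly to the 6.4M keyless constraint count gives a rough floor (illustrative Python; the prover is superlinear due to its FFTs, so the real time would be somewhat higher):

```python
# Linear extrapolation of jellyfish TurboPLONK proving time (see table above).
bench_secs, bench_gates = 104.16, 1 << 22
keyless_gates = 6_421_050  # snarkjs "PLONK constraint" count
est = bench_secs * keyless_gates / bench_gates
print(round(est))  # ~159 seconds
```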
- Benchmark single-threaded jellyfish proving times.
- Play with the witness size: figure out how to initialize the first gate to 0 vs. $p/2$.
- Benchmark the CAP library to get a sense of whether custom gates add more verifier time.
EspressoSystems/hyperplonk
Run some benchmarks here
dusk-network/plonk
Benchmark dusk-network/plonk PLONK.
WHIR
Include Yinuo’s numbers.
Ligetron
Notes from the ZKProof presentation[^muthu-zkproof7] below:
- Proving WASM executions.
- ZK by default.
- WebGPU acceleration.
- Hash-based + code-interleaving.
- Streaming prover (“witness and constraints can be streamed”).
- Proving key is allegedly small.
- Based on MPCitH. See the (Simons?) Berkeley talk for more depth, potentially.
- Prover is memory efficient: memory usage is proportional to what the WASM program uses.
- To get memory efficiency they sacrifice proof length (otherwise, it’s hard due to inherent time-space trade-offs in error-correcting codes based on the distance of the code).
- Encodes the full witness as a $\sqrt{n} \times \sqrt{n}$ matrix
- Can “stream the matrix” into the IOP to ensure memory is $O(\sqrt{n})$
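A minimal sketch of the square-root encoding idea behind the last two bullets (illustrative Python; `witness_at` is a hypothetical entry-wise callback, and the Reed-Solomon encoding and hashing of each row that Ligetron actually performs are omitted):

```python
import math

def stream_matrix_rows(n, witness_at):
    """Yield the sqrt(n) x sqrt(n) encoding of a length-n witness one row at
    a time, so peak memory is O(sqrt(n)) rather than O(n)."""
    side = math.isqrt(n)
    assert side * side == n, "pad the witness to a perfect square first"
    for r in range(side):
        # Materialize only the current row; earlier rows can be discarded
        # once the IOP/commitment layer has consumed them.
        yield [witness_at(r * side + c) for c in range(side)]

# Toy usage: a 16-entry witness streams as four rows of four entries each.
rows = list(stream_matrix_rows(16, lambda i: i))
print(rows[0])  # [0, 1, 2, 3]
```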
References
For cited works, see below 👇👇