tl;dr: Running some secp256k1 MSM benchmarks.
Background
For ECDSA verification and pubkey recovery, the relevant MSM size is $n=2$. For (modified) ECDSA batch verification, we will be interested in $n \ge 4$.
GLV
secp256k1 has a GLV endomorphism $\phi(x,y)=(\beta x, y)$, where $\beta$ is a primitive cube root of 1 over the base field $\F_p$. This lets you split a 256-bit scalar into two ~128-bit halves, speeding up scalar multiplication arithmetic.
The four libraries benchmarked below differ in where they apply GLV:
| Library | Single-mul GLV | MSM GLV |
|---|---|---|
p256k1 (FFI to libsecp256k1) |
✅ | ✅ |
gnark-crypto |
✅ | ❌ |
ark-secp256k1 |
❌ | ❌ |
halo2curves |
❌ | ❌ |
For p256k1, both Strauss-WNAF and Pippenger-WNAF in libsecp256k1 call secp256k1_scalar_split_lambda (resp. secp256k1_ecmult_endo_split) to feed Pippenger $2n$ pairs of half-length scalars instead of $n$ pairs of full-length scalars.
For gnark-crypto, only the per-point mulGLV path uses the endomorphism; MultiExp runs Pippenger over the full 256-bit scalars (no glv/phi/endomorph references in multiexp*.go).
For ark-secp256k1, the GLVConfig trait exists in ark-ec, but no curve in arkworks-rs/algebra implements it (grep -rln "impl GLVConfig\|GLVConfig for" is empty). secp256k1 only implements SWCurveConfig, so both mul and VariableBaseMSM::msm use the generic double-and-add over 256-bit scalars.
For halo2curves, the CurveEndo trait exists and the endo! macro provides a decompose_scalar impl for curves like bn256 (endo!(G1, Fr, ENDO_PARAMS_BN);). But secp256k1/curve.rs never invokes the macro, so Secp256k1 doesn’t implement CurveEndo. Independently, msm.rs doesn’t reference CurveEndo::decompose_scalar at all — msm_serial/msm_parallel/msm_best run Pippenger (with Booth encoding) over full 256-bit scalars regardless of whether the curve has GLV.
p256k1
Bitcoin Core’s libsecp256k1 C library ships a Pippenger-WNAF multi-scalar multiplication (MSM) routine, but neither the Rust secp256k1 crate nor the pure-Rust libsecp256k1 port expose it. i
The p256k1 crate does.
Below I benchmark its MSM.
libsecp256k1 exposes secp256k1_ecmult for individual signature verication (and for pubkey recovery) and exposes secp256k1_ecmult_multi_var for $n$-element MSM, dispatched to either Strauss-WNAF (small $n$) or Pippenger-WNAF (larger $n$). The cutoff is ECMULT_PIPPENGER_THRESHOLD = 88 (line 55 of ecmult_impl.h): for $n < 88$ the library calls secp256k1_ecmult_strauss_wnaf, and for $n \ge 88$ it calls secp256k1_ecmult_pippenger_wnaf. So in the table below, the rows for $n \in \{2, 4, 8, 16, 32, 64\}$ measure Strauss-WNAF + GLV, and the rows for $n \in \{128, 256, 512, 1024\}$ measure Pippenger-WNAF + GLV.
Run the benchmark
I forked the p256k1 crate and added
- a Criterion benchmark that times
Point::multimult - a naive
for i in 0..n { p += scalars[i] * points[i] }loop for $n \in \{2, 4, \ldots, 1024\}$, - a
run-benches.shscript that runs these, parses Criterion’sestimates.jsonfiles and outputs a Markdown table.
git clone --branch msm-benchmark https://github.com/alinush/p256k1.git
cd p256k1
# The script needs `cargo`, `jq`, `awk`, and `bc`.
./run-benches.sh
Benchmarks shoud take ~1-2 minutes to run.
Results
Run on an Apple M4 Max, release build:
| $n$ | MSM (µs) | Naive (µs) | Speedup | µs per scalar mul (MSM) | µs per scalar mul (Naive) |
|---|---|---|---|---|---|
| 2 | 13.130 | 19.994 | 1.52x | 6.565 | 9.997 |
| 4 | 48.409 | 47.073 | 0.97x | 12.102 | 11.768 |
| 8 | 69.829 | 94.050 | 1.34x | 8.728 | 11.756 |
| 16 | 111.468 | 189.588 | 1.70x | 6.966 | 11.849 |
| 32 | 201.065 | 379.832 | 1.88x | 6.283 | 11.869 |
| 64 | 377.774 | 763.196 | 2.02x | 5.902 | 11.924 |
| 128 | 641.421 | 1521.647 | 2.37x | 5.011 | 11.887 |
| 256 | 1133.065 | 3059.156 | 2.69x | 4.426 | 11.949 |
| 512 | 1966.887 | 6151.611 | 3.12x | 3.841 | 12.014 |
| 1024 | 3626.145 | 12418.983 | 3.42x | 3.541 | 12.127 |
Note (size-2 row uses the dedicated path): The $n=2$ row measures the specialized size-2 multiexp that ECDSA verify and pubkey recovery actually call (secp256k1_ecmult in libsecp256k1, exposed in my fork as Point::ecmult).
This leverages precomputed odd-multiples table for the fixed generator $G$.
For reference, passing $n=2$ to the generic secp256k1_ecmult_multi_var instead, which treats $G$ as a random base, would give ≈24.6 µs, about 2× slower.
Note (batch-normalize patch in my fork): The numbers above are from a small patch to Point::multimult: a batch-inversion replacement of $n$ inversions that gives a clean ~30% speedup at $n=1024$ (from 4734 µs down to 3644 µs).
Future optimizations
For large $n$, blst and halo2curves use Pippenger with batched affine addition: they keep Pippenger’s bucket sums in affine coordinates and amortize many independent slope-inversions via Montgomery’s simultaneous-inversion trick.
To understand the potential speed-ups, I vibe-coded a small Rust proof-of-concept timing $K$ independent affine additions with (1) $K$ independent inversions vs. (2) one shared inversion.
(There’s a unit test that verifies both paths agree element-wise.)
Run from the same fork via:
git clone --branch msm-benchmark https://github.com/alinush/p256k1.git
cd p256k1
cargo bench --bench affine_add_bench
| $K$ | Naive (µs) | Batched (µs) | Speedup | Naive µs / add | Batched µs / add |
|---|---|---|---|---|---|
| 2 | 4.99 | 3.79 | 1.32x | 2.50 | 1.90 |
| 4 | 7.35 | 4.23 | 1.74x | 1.84 | 1.06 |
| 8 | 12.52 | 5.20 | 2.41x | 1.57 | 0.65 |
| 16 | 22.65 | 6.74 | 3.36x | 1.42 | 0.42 |
| 32 | 42.81 | 10.07 | 4.25x | 1.34 | 0.32 |
| 64 | 83.45 | 16.69 | 5.00x | 1.30 | 0.26 |
| 128 | 163.94 | 30.41 | 5.39x | 1.28 | 0.24 |
| 256 | 328.90 | 57.49 | 5.72x | 1.29 | 0.23 |
What this means for Pippenger MSMs: The relevant comparison is batched-affine (≈0.23 µs/add) vs. libsecp256k1’s current mixed Jacobian+affine add (secp256k1_gej_add_ge_var, no inversion, ≈0.5-0.6 µs/add) $\Rightarrow$ a ~2-2.5× per-add speedup.
Since bucket-fill is roughly half of Pippenger’s work, plumbing this into secp256k1_ecmult_pippenger_wnaf should give ~25-30% more on top of the batch-normalize patch.
Caveat: This only helps the Pippenger path ($n \ge 88$).
For $n < 88$ (i.e., ECDSA verify, recovery, and most small instances of batch-verify), libsecp256k1 uses Strauss-WNAF, whose serial-accumulator inner loop has no parallel adds to batch.
ark-secp256k1
The pure-Rust ark-secp256k1 crate (part of arkworks) exposes a generic Pippenger MSM via the VariableBaseMSM trait: Projective::msm(&bases, &scalars).
Unlike p256k1, this is pure Rust: no FFI, no hand-tuned assembly.
As a result, things are slower.
Run the benchmark
In the same fork, I added:
- a Criterion benchmark that times
<Projective as VariableBaseMSM>::msm(&bases, &scalars), - a naive
for i in 0..n { p += bases[i] * scalars[i] }loop for $n \in \{2, 4, \ldots, 1024\}$, - a
run-arkworks-benches.shscript that runs these and outputs a Markdown table.
git clone --branch msm-benchmark https://github.com/alinush/p256k1.git
cd p256k1
# The script needs `cargo`, `jq`, `awk`, and `bc`.
./run-arkworks-benches.sh
Benchmarks should take ~1-2 minutes to run.
Results
Run on the same Apple M4 Max, release build:
| $n$ | MSM (µs) | Naive (µs) | Speedup | µs per scalar mul (MSM) | µs per scalar mul (Naive) |
|---|---|---|---|---|---|
| 2 | 88.712 | 88.666 | 0.99x | 44.356 | 44.333 |
| 4 | 131.302 | 173.934 | 1.32x | 32.825 | 43.483 |
| 8 | 187.550 | 355.432 | 1.89x | 23.443 | 44.429 |
| 16 | 292.609 | 775.267 | 2.64x | 18.288 | 48.454 |
| 32 | 491.861 | 1799.569 | 3.65x | 15.370 | 56.236 |
| 64 | 842.343 | 3720.552 | 4.41x | 13.161 | 58.133 |
| 128 | 1373.081 | 7650.410 | 5.57x | 10.727 | 59.768 |
| 256 | 2531.880 | 15444.088 | 6.09x | 9.890 | 60.328 |
| 512 | 4649.651 | 31023.133 | 6.67x | 9.081 | 60.592 |
| 1024 | 8119.509 | 62420.595 | 7.68x | 7.929 | 60.957 |
gnark-crypto
gnark-crypto is Consensys’ Go cryptography library.
Its ecc/secp256k1 package is code-generated (per-curve specialization, including modulus-specific Montgomery arithmetic) and exposes MultiExp, implementing the Pippenger variant from eprint 2012/549.
For a fair comparison, I run MultiExp with MultiExpConfig{NbTasks: 1} (single goroutine), since p256k1 and ark-secp256k1 above are also single-threaded.
Run the benchmark
I forked gnark-crypto and added:
- a Go benchmark
BenchmarkMSMSizesinecc/secp256k1that times(*G1Affine).MultiExp(points, scalars, ecc.MultiExpConfig{NbTasks: 1}), - a naive
for i := 0; i < n; i++ { tmp.ScalarMultiplication(&p[i], s[i]); acc.AddAssign(&tmp) }loop for $n \in \{2, 4, \ldots, 1024\}$, - a
run-gnark-benches.shscript that runs these and parsesgo test -benchoutput into a Markdown table.
git clone --branch secp256k1-msm-benchmark https://github.com/alinush/gnark-crypto.git
cd gnark-crypto
# The script needs `go` and `awk`.
./run-gnark-benches.sh
Benchmarks should take ~1-2 minutes to run.
Results
Run on the same Apple M4 Max, release build (Go’s default; optimizations + inlining are on unless you explicitly pass -gcflags='all=-N -l'):
| $n$ | MSM (µs) | Naive (µs) | Speedup | µs per scalar mul (MSM) | µs per scalar mul (Naive) |
|---|---|---|---|---|---|
| 2 | 162.226 | 53.448 | 0.33x | 81.113 | 26.724 |
| 4 | 242.297 | 125.605 | 0.52x | 60.574 | 31.401 |
| 8 | 317.981 | 273.368 | 0.86x | 39.748 | 34.171 |
| 16 | 442.898 | 561.851 | 1.27x | 27.681 | 35.116 |
| 32 | 654.000 | 1229.385 | 1.88x | 20.438 | 38.418 |
| 64 | 975.187 | 2856.921 | 2.93x | 15.237 | 44.639 |
| 128 | 1683.287 | 5882.025 | 3.49x | 13.151 | 45.953 |
| 256 | 2877.386 | 12113.487 | 4.21x | 11.240 | 47.318 |
| 512 | 4718.920 | 24435.621 | 5.18x | 9.217 | 47.726 |
| 1024 | 8077.244 | 50120.264 | 6.21x | 7.888 | 48.946 |
From speaking with Youssef El Housni, it seems like their secp256k1 implementation is unoptimized; it’s just there for some witness generation when proving ECDSA signatures.
halo2curves
The msm module exposes:
msm_serial(single-threaded),msm_parallel(rayon-based),- simple “chunk the input across threads, run
msm_serialon each chunk, sum” pattern
- simple “chunk the input across threads, run
msm_best(picks based on input size):- for $\lceil \ln(n) \rceil < 10$ (i.e., $n \lesssim 22{,}000$, including all sizes in this post) it calls
msm_parallel, which is rayon-parallelized but runs the same Pippenger asmsm_serialper thread.- $\Rightarrow$ in our $n \leq 1024$ range,
msm_bestwould usemsm_parallel, which we do not want.
- $\Rightarrow$ in our $n \leq 1024$ range,
- for $\lceil \ln(n) \rceil \ge 10$ it switches to a window-parallelized variant with a different memory layout.
- this variant is not exposed as a
msm_*function, apparently
- this variant is not exposed as a
- for $\lceil \ln(n) \rceil < 10$ (i.e., $n \lesssim 22{,}000$, including all sizes in this post) it calls
The MSM is generic over CurveAffine and uses Pippenger with Booth signed-digit encoding.
As noted in the GLV table above, halo2curves’ secp256k1 module does not plug into the library’s CurveEndo trait, and msm.rs ignores CurveEndo regardless. So this is the slowest of the four implementations: pure-Rust, no GLV, and no curve-specific field arithmetic tuning for secp256k1’s pseudo-Mersenne prime $p = 2^{256} - 2^{32} - 977$.
We benchmark msm_serial (not msm_best) so the comparison is apples-to-apples with the other single-threaded benches in this post.
Run the benchmark
I forked halo2curves and added:
- a Criterion benchmark
benches/secp256k1_msm_sizes.rsthat timesmsm_serial(&scalars, &bases, &mut acc), - a naive
for i in 0..n { acc += bases[i] * scalars[i] }loop for $n \in \{2, 4, \ldots, 1024\}$, - a
run-halo2curves-benches.shscript that runs these and outputs a Markdown table.
git clone --branch secp256k1-msm-benchmark https://github.com/alinush/halo2curves.git
cd halo2curves
# The script needs `cargo`, `jq`, `awk`, and `bc`.
./run-halo2curves-benches.sh
Benchmarks should take ~1-2 minutes to run.
Results
Run on the same Apple M4 Max, release build:
| $n$ | MSM (µs) | Naive (µs) | Speedup | µs per scalar mul (MSM) | µs per scalar mul (Naive) |
|---|---|---|---|---|---|
| 2 | 238.247 | 266.017 | 1.11x | 119.123 | 133.008 |
| 4 | 272.565 | 531.791 | 1.95x | 68.141 | 132.947 |
| 8 | 377.622 | 1064.668 | 2.81x | 47.202 | 133.083 |
| 16 | 581.653 | 2131.086 | 3.66x | 36.353 | 133.192 |
| 32 | 883.423 | 4256.997 | 4.81x | 27.606 | 133.031 |
| 64 | 1412.361 | 8522.399 | 6.03x | 22.068 | 133.162 |
| 128 | 2436.448 | 17030.856 | 6.99x | 19.034 | 133.053 |
| 256 | 4065.745 | 34042.049 | 8.37x | 15.881 | 132.976 |
| 512 | 6999.182 | 68019.231 | 9.71x | 13.670 | 132.850 |
| 1024 | 12951.051 | 136053.208 | 10.50x | 12.647 | 132.864 |
References
For cited works, see below 👇👇