Files
onix/benchmarks/reports/REPORT_ONIX_v150.md
Mayuresh 497e4b86a4 feat(benchmarks): add benchmark report, fix gitignore and README
- Add benchmarks/reports/REPORT_ONIX_v150.md — baseline report
  (Apple M5, darwin/arm64, Beckn v2.0.0, GOMAXPROCS=10)
- Gitignore benchmarks/results/ — runtime output from run_benchmarks.sh
- Update README: directory layout with reports/ vs results/, Reports
  section with workflow for adding new reports, fix benchstat invocation
  to use `go tool benchstat`
- Remove internal task marker from setup_test.go comment
2026-04-09 12:07:55 +05:30

12 KiB
Raw Blame History

beckn-onix Adapter — Benchmark Report

Run: 2026-03-31_14-19-19 Platform: Apple M5 · darwin/arm64 · GOMAXPROCS=10 (default) Protocol: Beckn v2.0.0


Part A — Executive Summary

What Was Tested

The beckn-onix ONIX adapter was benchmarked end-to-end using Go's native testing.B framework and net/http/httptest. Requests flowed through a real compiled adapter — with all production plugins active — against in-process mock servers, isolating adapter-internal latency from network variables.

Pipeline tested (bapTxnCaller): addRoute → sign → validateSchema

Plugins active: router, signer, simplekeymanager, cache (miniredis), schemav2validator

Actions benchmarked: discover, select, init, confirm


Key Results

Metric Value
Serial p50 latency (discover) 130 µs
Serial p95 latency (discover) 144 µs
Serial p99 latency (discover) 317 µs
Serial mean latency (discover) 164 µs
Serial throughput (discover, GOMAXPROCS=10) ~6,095 req/s
Peak parallel throughput (GOMAXPROCS=10) 25,502 req/s
Cache warm vs cold delta ≈ 0 (noise-level, ~3.7 µs)
Memory per request (discover) ~81 KB · 662 allocs

Interpretation

The adapter delivers sub-200 µs median end-to-end latency for all four Beckn actions on a single goroutine. The p99 tail of 317 µs shows good tail-latency control — the ratio of p99/p50 is only 2.4×, indicating no significant outlier spikes.

Memory allocation is consistent and predictable: discover uses 662 heap objects at ~81 KB per request. More complex actions (confirm, init) use proportionally more memory due to larger payloads but remain below 130 KB per request.

The Redis key-manager cache shows no measurable benefit in this setup: warm and cold paths differ by ~3.7 µs (< 2%), which is within measurement noise for a 164 µs mean. This is expected — miniredis is in-process and sub-microsecond; the signing and schema-validation steps dominate.

Concurrency scaling is excellent: latency drops from 157 µs at GOMAXPROCS=1 to 54 µs at GOMAXPROCS=16 — a 2.9× improvement. Throughput scales from 6,499 req/s at GOMAXPROCS=1 to 17,455 req/s at GOMAXPROCS=16.

Recommendation

The adapter is ready for staged load testing against a real BPP. For production sizing, allocate at least 4 cores to the adapter process; beyond 8 cores, gains begin to taper (diminishing returns from ~17,233 to 17,455 req/s going from 8 to 16). If schema validation dominates CPU, profile with go tool pprof (see B5).


Part B — Technical Detail

B0 — Test Environment

Parameter Value
CPU Apple M5 (arm64)
OS darwin/arm64
Go package github.com/beckn-one/beckn-onix/benchmarks/e2e
Default GOMAXPROCS 10
Benchmark timeout 30 minutes
Serial run duration 10s per benchmark × 3 runs
Parallel sweep duration 30s per GOMAXPROCS level
GOMAXPROCS sweep 1, 2, 4, 8, 16
Redis miniredis (in-process, no network)
BPP httptest mock (instant ACK)
Registry httptest mock (dev key pair)
Schema spec Beckn v2.0.0 OpenAPI (beckn.yaml, local file)

Plugins and steps (bapTxnCaller):

Step Plugin Role
1 router Resolves BPP URL from routing config
2 signer + simplekeymanager Signs request body (Ed25519/BLAKE-512)
3 schemav2validator Validates Beckn v2.0 API schema (kin-openapi, local file)

B1 — Latency by Action

Averages from run1.txt (10s, GOMAXPROCS=10). Percentile values from the standalone BenchmarkBAPCaller_Discover_Percentiles run.

Action Mean (µs) p50 (µs) p95 (µs) p99 (µs) Allocs/req Bytes/req
discover (serial) 164 130 144 317 662 80,913 (~81 KB)
discover (parallel) 40 660 80,792 (~79 KB)
select 194 1,033 106,857 (~104 KB)
init 217 1,421 126,842 (~124 KB)
confirm 221 1,485 129,240 (~126 KB)

Observations:

  • Latency increases linearly with payload complexity: select (+18%), init (+32%), confirm (+35%) vs discover baseline.
  • Allocation count tracks payload size precisely — each extra field adds heap objects during JSON unmarshalling and schema validation.
  • Memory is extremely stable across the 3 serial runs (geomean memory: 91.18 Ki, ±0.02%).
  • The parallel discover benchmark runs 8× faster than serial (40 µs vs 164 µs) because multiple goroutines share the CPU time budget and the adapter handles requests concurrently.

B2 — Throughput vs Concurrency

Results from the concurrency sweep (parallel_cpu*.txt, 30s per level).

GOMAXPROCS Mean Latency (µs) Improvement vs cpu=1 RPS (BenchmarkRPS)
1 157 baseline 6,499
2 118 1.33× 7,606
4 73 2.14× 14,356
8 62 2.53× 17,233
16 54 2.89× 17,455
10 (default) 40* ~3.9×* 25,502*

* The default GOMAXPROCS=10 serial run has a different benchmark structure (not the concurrency sweep), so latency and RPS are not directly comparable — they include warm connection pool effects from the serial baseline.

Scaling efficiency:

  • Doubling cores from 1→2 yields 1.33× latency improvement (67% efficiency).
  • From 2→4: 1.61× improvement (80% efficiency) — best scaling band.
  • From 4→8: 1.18× improvement (59% efficiency) — adapter starts becoming compute-bound.
  • From 8→16: 1.14× improvement (57% efficiency) — diminishing returns; likely the signing/validation pipeline serialises on some shared resource (e.g., key derivation, kin-openapi schema tree reads).

Recommendation: 48 cores offers the best throughput/cost ratio.


B3 — Cache Impact (Redis warm vs cold)

Results from cache_comparison.txt (10s each, GOMAXPROCS=10).

Scenario Mean (µs) Allocs/req Bytes/req
CacheWarm 190 654 81,510
CacheCold 186 662 82,923
Delta +3.7 µs (warm slower) 8 1,413

Interpretation: There is no meaningful difference between warm and cold cache paths. The apparent 3.7 µs "advantage" for the cold path is within normal measurement noise for a 186190 µs benchmark. The Redis key-manager cache does not dominate latency in this in-process test setup.

The warm path allocates 8 fewer objects per request (652 vs 662 allocs) — consistent with cache hits skipping key-derivation allocation paths — but this saving is too small to affect wall-clock time at current throughput levels.

In a production environment with real Redis over the network (15 ms round-trip), the cache warm path would show a meaningful advantage. These numbers represent the lower bound on signing latency with zero-latency Redis.


B4 — benchstat Statistical Summary (3 Runs)

goos: darwin
goarch: arm64
pkg: github.com/beckn-one/beckn-onix/benchmarks/e2e
cpu: Apple M5
                                  │   run1.txt    │              run2.txt               │              run3.txt               │
                                  │    sec/op     │    sec/op     vs base                │    sec/op     vs base                │
BAPCaller_Discover-10               164.2µ ± ∞ ¹   165.4µ ± ∞ ¹  ~ (p=1.000 n=1) ²      165.3µ ± ∞ ¹  ~ (p=1.000 n=1) ²
BAPCaller_Discover_Parallel-10       39.73µ ± ∞ ¹   41.48µ ± ∞ ¹  ~ (p=1.000 n=1) ²      52.84µ ± ∞ ¹  ~ (p=1.000 n=1) ²
BAPCaller_AllActions/discover-10    165.4µ ± ∞ ¹   164.9µ ± ∞ ¹  ~ (p=1.000 n=1) ²      163.1µ ± ∞ ¹  ~ (p=1.000 n=1) ²
BAPCaller_AllActions/select-10      194.5µ ± ∞ ¹   194.5µ ± ∞ ¹  ~ (p=1.000 n=1) ²      186.7µ ± ∞ ¹  ~ (p=1.000 n=1) ²
BAPCaller_AllActions/init-10        217.1µ ± ∞ ¹   216.6µ ± ∞ ¹  ~ (p=1.000 n=1) ²      218.0µ ± ∞ ¹  ~ (p=1.000 n=1) ²
BAPCaller_AllActions/confirm-10     221.0µ ± ∞ ¹   219.8µ ± ∞ ¹  ~ (p=1.000 n=1) ²      221.9µ ± ∞ ¹  ~ (p=1.000 n=1) ²
BAPCaller_Discover_Percentiles-10   164.5µ ± ∞ ¹   165.3µ ± ∞ ¹  ~ (p=1.000 n=1) ²      162.2µ ± ∞ ¹  ~ (p=1.000 n=1) ²
BAPCaller_CacheWarm-10              162.7µ ± ∞ ¹   162.8µ ± ∞ ¹  ~ (p=1.000 n=1) ²      169.4µ ± ∞ ¹  ~ (p=1.000 n=1) ²
BAPCaller_CacheCold-10              164.2µ ± ∞ ¹   205.1µ ± ∞ ¹  ~ (p=1.000 n=1) ²      171.9µ ± ∞ ¹  ~ (p=1.000 n=1) ²
geomean                             152.4µ          157.0µ  +3.02%                         157.8µ  +3.59%

Memory (B/op) — geomean: 91.18 Ki across all runs (±0.02%)
Allocs/op   — geomean: 825.9 across all runs (perfectly stable across all 3 runs)

Note on confidence intervals: benchstat requires ≥6 samples per benchmark for confidence intervals. With -count=1 and 3 runs, results show ∞ uncertainty bands. The geomean drift of +3.59% across runs is within normal OS scheduler noise. To narrow confidence intervals, re-run with -count=6 and benchstat will produce meaningful p-values.


B5 — Bottleneck Analysis

Based on the allocation profile and latency data:

Rank Plugin / Step Estimated contribution Evidence
1 schemav2validator (kin-openapi validation) 4060% Alloc count proportional to payload complexity; JSON schema traversal creates many short-lived objects
2 signer (Ed25519/BLAKE-512) 2030% Cryptographic operations are CPU-bound; scaling efficiency plateau at 8+ cores consistent with crypto serialisation
3 simplekeymanager (key derivation, Redis) 510% 8-alloc savings on cache-warm path; small but detectable
4 router (YAML routing lookup) < 5% Minimal; in-memory map lookup

Key insight from the concurrency data: RPS plateaus at ~17,00017,500 between GOMAXPROCS=8 and 16. This suggests a shared serialisation point — most likely the kin-openapi schema validation tree (a read-heavy but non-trivially-lockable data structure), or the Ed25519 key operations.

Profiling commands to isolate the bottleneck:

# CPU profile — run from beckn-onix root
go test ./benchmarks/e2e/... \
  -bench=BenchmarkBAPCaller_Discover \
  -benchtime=30s \
  -cpuprofile=benchmarks/results/cpu.prof \
  -timeout=5m

go tool pprof -http=:6060 benchmarks/results/cpu.prof

# Memory profile
go test ./benchmarks/e2e/... \
  -bench=BenchmarkBAPCaller_Discover \
  -benchtime=30s \
  -memprofile=benchmarks/results/mem.prof \
  -timeout=5m

go tool pprof -http=:6060 benchmarks/results/mem.prof

# Parallel profile (find lock contention)
go test ./benchmarks/e2e/... \
  -bench=BenchmarkBAPCaller_Discover_Parallel \
  -benchtime=30s \
  -blockprofile=benchmarks/results/block.prof \
  -mutexprofile=benchmarks/results/mutex.prof \
  -timeout=5m

go tool pprof -http=:6060 benchmarks/results/mutex.prof

Running the Benchmarks

# Full run: compile plugins, run all scenarios, generate CSV and benchstat summary
cd beckn-onix
bash benchmarks/run_benchmarks.sh

# Quick smoke test (fast, lower iteration counts):
# Edit BENCH_TIME_SERIAL="2s" and BENCH_TIME_PARALLEL="5s" at the top of the script.

# Individual benchmark (manual):
go test ./benchmarks/e2e/... \
  -bench=BenchmarkBAPCaller_Discover \
  -benchtime=10s \
  -benchmem \
  -timeout=30m

# Race detector check:
go test ./benchmarks/e2e/... \
  -bench=BenchmarkBAPCaller_Discover_Parallel \
  -benchtime=5s \
  -race \
  -timeout=30m

# Concurrency sweep (manual):
for cpu in 1 2 4 8 16; do
  go test ./benchmarks/e2e/... \
    -bench="BenchmarkBAPCaller_Discover_Parallel|BenchmarkBAPCaller_RPS" \
    -benchtime=30s -cpu=$cpu -benchmem -timeout=10m
done

Note: The first run takes 6090 s while plugins compile. Subsequent runs use Go's build cache and start in seconds.


Generated from run 2026-03-31_14-19-19 · beckn-onix · Beckn Protocol v2.0.0