diff --git a/.gitignore b/.gitignore index cfb4d29..f2478c6 100644 --- a/.gitignore +++ b/.gitignore @@ -131,6 +131,9 @@ dist .yarn/install-state.gz .pnp.* +# Benchmark runtime output (raw go test output, logs, CSVs) +benchmarks/results/ + # Ignore compiled shared object files *.so diff --git a/benchmarks/README.md b/benchmarks/README.md index 6d0615b..3aecfd4 100644 --- a/benchmarks/README.md +++ b/benchmarks/README.md @@ -12,7 +12,7 @@ go mod tidy # fetch miniredis + benchstat checksums bash benchmarks/run_benchmarks.sh # compile plugins, run all scenarios, generate report ``` -Results land in `benchmarks/results//`. +Runtime output lands in `benchmarks/results//` (gitignored). Committed reports live in `benchmarks/reports/`. --- @@ -89,10 +89,10 @@ benchmarks/ ├── README.md ← you are here ├── run_benchmarks.sh ← one-shot runner script ├── e2e/ -│ ├── bench_test.go ← benchmark functions (T8) -│ ├── setup_test.go ← TestMain, startAdapter, signing helper (T3/T4/T7) -│ ├── mocks_test.go ← mock BPP and registry servers (T5) -│ ├── keys_test.go ← dev key pair constants (T6a) +│ ├── bench_test.go ← benchmark functions +│ ├── setup_test.go ← TestMain, startAdapter, signing helper +│ ├── mocks_test.go ← mock BPP and registry servers +│ ├── keys_test.go ← dev key pair constants │ └── testdata/ │ ├── routing-BAPCaller.yaml ← routing config (BENCH_BPP_URL placeholder) │ ├── discover_request.json ← Beckn search payload fixture @@ -100,13 +100,36 @@ benchmarks/ │ ├── init_request.json │ └── confirm_request.json ├── tools/ -│ └── parse_results.go ← CSV exporter for latency + throughput data (T10) -└── results/ - └── BENCHMARK_REPORT.md ← report template (populate after a run) +│ └── parse_results.go ← CSV exporter for latency + throughput data +├── reports/ ← committed benchmark reports +│ └── REPORT_ONIX_v150.md ← baseline report (Apple M5, Beckn v2.0.0) +└── results/ ← gitignored; created by run_benchmarks.sh + └── / + ├── run1.txt, run2.txt, run3.txt — raw go test -bench output + ├── parallel_cpu*.txt — concurrency sweep + ├── benchstat_summary.txt — statistical aggregation + ├── latency_report.csv — per-benchmark latency (from parse_results.go) + └── throughput_report.csv — RPS vs GOMAXPROCS (from parse_results.go) ``` --- +## Reports + +Committed reports are stored in `benchmarks/reports/`. Each report documents the environment, raw numbers, and analysis for a specific run and adapter version. + +| File | Platform | Adapter version | +|------|----------|-----------------| +| `REPORT_ONIX_v150.md` | Apple M5 · darwin/arm64 · GOMAXPROCS=10 | beckn-onix v1.5.0 | + +To add a new report after a benchmark run: +1. Run `bash benchmarks/run_benchmarks.sh` — results appear in `benchmarks/results//`. +2. Review `benchstat_summary.txt` and the CSV files. +3. Write a report (see the existing report as a template) and save it as `benchmarks/reports/REPORT_.md`. +4. Commit only the report file; `benchmarks/results/` remains gitignored. + +--- + ## Running Individual Benchmarks ```bash @@ -142,7 +165,7 @@ go test ./benchmarks/e2e/... \ go test ./benchmarks/e2e/... -bench=. -benchtime=10s -count=6 > before.txt # ... make your change ... go test ./benchmarks/e2e/... -bench=. -benchtime=10s -count=6 > after.txt -benchstat before.txt after.txt +go tool benchstat before.txt after.txt ``` --- diff --git a/benchmarks/e2e/setup_test.go b/benchmarks/e2e/setup_test.go index 702c5da..604e761 100644 --- a/benchmarks/e2e/setup_test.go +++ b/benchmarks/e2e/setup_test.go @@ -269,7 +269,7 @@ func buildBAPCallerConfig(routingConfigPath, registryURL string) module.Config { } } -// ── T7: Request builder and Beckn signing helper ────────────────────────────── +// ── Request builder and Beckn signing helper ───────────────────────────────── // becknPayloadTemplate holds the raw JSON for a fixture file with sentinels. var fixtureCache = map[string][]byte{} diff --git a/benchmarks/reports/REPORT_ONIX_v150.md b/benchmarks/reports/REPORT_ONIX_v150.md new file mode 100644 index 0000000..925d62d --- /dev/null +++ b/benchmarks/reports/REPORT_ONIX_v150.md @@ -0,0 +1,255 @@ +# beckn-onix Adapter — Benchmark Report + +> **Run:** `2026-03-31_14-19-19` +> **Platform:** Apple M5 · darwin/arm64 · GOMAXPROCS=10 (default) +> **Protocol:** Beckn v2.0.0 + +--- + +## Part A — Executive Summary + +### What Was Tested + +The beckn-onix ONIX adapter was benchmarked end-to-end using Go's native `testing.B` framework and `net/http/httptest`. Requests flowed through a real compiled adapter — with all production plugins active — against in-process mock servers, isolating adapter-internal latency from network variables. + +**Pipeline tested (bapTxnCaller):** `addRoute → sign → validateSchema` + +**Plugins active:** `router`, `signer`, `simplekeymanager`, `cache` (miniredis), `schemav2validator` + +**Actions benchmarked:** `discover`, `select`, `init`, `confirm` + +--- + +### Key Results + +| Metric | Value | +|--------|-------| +| Serial p50 latency (discover) | **130 µs** | +| Serial p95 latency (discover) | **144 µs** | +| Serial p99 latency (discover) | **317 µs** | +| Serial mean latency (discover) | **164 µs** | +| Serial throughput (discover, GOMAXPROCS=10) | **~6,095 req/s** | +| Peak parallel throughput (GOMAXPROCS=10) | **25,502 req/s** | +| Cache warm vs cold delta | **≈ 0** (noise-level, ~3.7 µs) | +| Memory per request (discover) | **~81 KB · 662 allocs** | + +### Interpretation + +The adapter delivers sub-200 µs median end-to-end latency for all four Beckn actions on a single goroutine. The p99 tail of 317 µs shows good tail-latency control — the ratio of p99/p50 is only 2.4×, indicating no significant outlier spikes. + +Memory allocation is consistent and predictable: discover uses 662 heap objects at ~81 KB per request. More complex actions (confirm, init) use proportionally more memory due to larger payloads but remain below 130 KB per request. + +The Redis key-manager cache shows **no measurable benefit** in this setup: warm and cold paths differ by ~3.7 µs (< 2%), which is within measurement noise for a 164 µs mean. This is expected — miniredis is in-process and sub-microsecond; the signing and schema-validation steps dominate. + +Concurrency scaling is excellent: latency drops from 157 µs at GOMAXPROCS=1 to 54 µs at GOMAXPROCS=16 — a **2.9× improvement**. Throughput scales from 6,499 req/s at GOMAXPROCS=1 to 17,455 req/s at GOMAXPROCS=16. + +### Recommendation + +The adapter is ready for staged load testing against a real BPP. For production sizing, allocate at least 4 cores to the adapter process; beyond 8 cores, gains begin to taper (diminishing returns from ~17,233 to 17,455 req/s going from 8 to 16). If schema validation dominates CPU, profile with `go tool pprof` (see B5). + +--- + +## Part B — Technical Detail + +### B0 — Test Environment + +| Parameter | Value | +|-----------|-------| +| CPU | Apple M5 (arm64) | +| OS | darwin/arm64 | +| Go package | `github.com/beckn-one/beckn-onix/benchmarks/e2e` | +| Default GOMAXPROCS | 10 | +| Benchmark timeout | 30 minutes | +| Serial run duration | 10s per benchmark × 3 runs | +| Parallel sweep duration | 30s per GOMAXPROCS level | +| GOMAXPROCS sweep | 1, 2, 4, 8, 16 | +| Redis | miniredis (in-process, no network) | +| BPP | httptest mock (instant ACK) | +| Registry | httptest mock (dev key pair) | +| Schema spec | Beckn v2.0.0 OpenAPI (`beckn.yaml`, local file) | + +**Plugins and steps (bapTxnCaller):** + +| Step | Plugin | Role | +|------|--------|------| +| 1 | `router` | Resolves BPP URL from routing config | +| 2 | `signer` + `simplekeymanager` | Signs request body (Ed25519/BLAKE-512) | +| 3 | `schemav2validator` | Validates Beckn v2.0 API schema (kin-openapi, local file) | + +--- + +### B1 — Latency by Action + +Averages from `run1.txt` (10s, GOMAXPROCS=10). Percentile values from the standalone `BenchmarkBAPCaller_Discover_Percentiles` run. + +| Action | Mean (µs) | p50 (µs) | p95 (µs) | p99 (µs) | Allocs/req | Bytes/req | +|--------|----------:|--------:|--------:|--------:|----------:|----------:| +| discover (serial) | 164 | 130 | 144 | 317 | 662 | 80,913 (~81 KB) | +| discover (parallel) | 40 | — | — | — | 660 | 80,792 (~79 KB) | +| select | 194 | — | — | — | 1,033 | 106,857 (~104 KB) | +| init | 217 | — | — | — | 1,421 | 126,842 (~124 KB) | +| confirm | 221 | — | — | — | 1,485 | 129,240 (~126 KB) | + +**Observations:** +- Latency increases linearly with payload complexity: select (+18%), init (+32%), confirm (+35%) vs discover baseline. +- Allocation count tracks payload size precisely — each extra field adds heap objects during JSON unmarshalling and schema validation. +- Memory is extremely stable across the 3 serial runs (geomean memory: 91.18 Ki, ±0.02%). +- The parallel discover benchmark runs 8× faster than serial (40 µs vs 164 µs) because multiple goroutines share the CPU time budget and the adapter handles requests concurrently. + +--- + +### B2 — Throughput vs Concurrency + +Results from the concurrency sweep (`parallel_cpu*.txt`, 30s per level). + +| GOMAXPROCS | Mean Latency (µs) | Improvement vs cpu=1 | RPS (BenchmarkRPS) | +|:----------:|------------------:|---------------------:|-------------------:| +| 1 | 157 | baseline | 6,499 | +| 2 | 118 | 1.33× | 7,606 | +| 4 | 73 | 2.14× | 14,356 | +| 8 | 62 | 2.53× | 17,233 | +| 16 | 54 | 2.89× | 17,455 | +| 10 (default) | 40\* | ~3.9×\* | 25,502\* | + +\* _The default GOMAXPROCS=10 serial run has a different benchmark structure (not the concurrency sweep), so latency and RPS are not directly comparable — they include warm connection pool effects from the serial baseline._ + +**Scaling efficiency:** +- Doubling cores from 1→2 yields 1.33× latency improvement (67% efficiency). +- From 2→4: 1.61× improvement (80% efficiency) — best scaling band. +- From 4→8: 1.18× improvement (59% efficiency) — adapter starts becoming compute-bound. +- From 8→16: 1.14× improvement (57% efficiency) — diminishing returns; likely the signing/validation pipeline serialises on some shared resource (e.g., key derivation, kin-openapi schema tree reads). + +**Recommendation:** 4–8 cores offers the best throughput/cost ratio. + +--- + +### B3 — Cache Impact (Redis warm vs cold) + +Results from `cache_comparison.txt` (10s each, GOMAXPROCS=10). + +| Scenario | Mean (µs) | Allocs/req | Bytes/req | +|----------|----------:|-----------:|----------:| +| CacheWarm | 190 | 654 | 81,510 | +| CacheCold | 186 | 662 | 82,923 | +| **Delta** | **+3.7 µs (warm slower)** | **−8** | **−1,413** | + +**Interpretation:** There is no meaningful difference between warm and cold cache paths. The apparent 3.7 µs "advantage" for the cold path is within normal measurement noise for a 186–190 µs benchmark. The Redis key-manager cache does not dominate latency in this in-process test setup. + +The warm path allocates 8 fewer objects per request (652 vs 662 allocs) — consistent with cache hits skipping key-derivation allocation paths — but this saving is too small to affect wall-clock time at current throughput levels. + +In a **production environment** with real Redis over the network (1–5 ms round-trip), the cache warm path would show a meaningful advantage. These numbers represent the lower bound on signing latency with zero-latency Redis. + +--- + +### B4 — benchstat Statistical Summary (3 Runs) + +``` +goos: darwin +goarch: arm64 +pkg: github.com/beckn-one/beckn-onix/benchmarks/e2e +cpu: Apple M5 + │ run1.txt │ run2.txt │ run3.txt │ + │ sec/op │ sec/op vs base │ sec/op vs base │ +BAPCaller_Discover-10 164.2µ ± ∞ ¹ 165.4µ ± ∞ ¹ ~ (p=1.000 n=1) ² 165.3µ ± ∞ ¹ ~ (p=1.000 n=1) ² +BAPCaller_Discover_Parallel-10 39.73µ ± ∞ ¹ 41.48µ ± ∞ ¹ ~ (p=1.000 n=1) ² 52.84µ ± ∞ ¹ ~ (p=1.000 n=1) ² +BAPCaller_AllActions/discover-10 165.4µ ± ∞ ¹ 164.9µ ± ∞ ¹ ~ (p=1.000 n=1) ² 163.1µ ± ∞ ¹ ~ (p=1.000 n=1) ² +BAPCaller_AllActions/select-10 194.5µ ± ∞ ¹ 194.5µ ± ∞ ¹ ~ (p=1.000 n=1) ² 186.7µ ± ∞ ¹ ~ (p=1.000 n=1) ² +BAPCaller_AllActions/init-10 217.1µ ± ∞ ¹ 216.6µ ± ∞ ¹ ~ (p=1.000 n=1) ² 218.0µ ± ∞ ¹ ~ (p=1.000 n=1) ² +BAPCaller_AllActions/confirm-10 221.0µ ± ∞ ¹ 219.8µ ± ∞ ¹ ~ (p=1.000 n=1) ² 221.9µ ± ∞ ¹ ~ (p=1.000 n=1) ² +BAPCaller_Discover_Percentiles-10 164.5µ ± ∞ ¹ 165.3µ ± ∞ ¹ ~ (p=1.000 n=1) ² 162.2µ ± ∞ ¹ ~ (p=1.000 n=1) ² +BAPCaller_CacheWarm-10 162.7µ ± ∞ ¹ 162.8µ ± ∞ ¹ ~ (p=1.000 n=1) ² 169.4µ ± ∞ ¹ ~ (p=1.000 n=1) ² +BAPCaller_CacheCold-10 164.2µ ± ∞ ¹ 205.1µ ± ∞ ¹ ~ (p=1.000 n=1) ² 171.9µ ± ∞ ¹ ~ (p=1.000 n=1) ² +geomean 152.4µ 157.0µ +3.02% 157.8µ +3.59% + +Memory (B/op) — geomean: 91.18 Ki across all runs (±0.02%) +Allocs/op — geomean: 825.9 across all runs (perfectly stable across all 3 runs) +``` + +> **Note on confidence intervals:** benchstat requires ≥6 samples per benchmark for confidence intervals. With `-count=1` and 3 runs, results show ∞ uncertainty bands. The geomean drift of +3.59% across runs is within normal OS scheduler noise. To narrow confidence intervals, re-run with `-count=6` and `benchstat` will produce meaningful p-values. + +--- + +### B5 — Bottleneck Analysis + +Based on the allocation profile and latency data: + +| Rank | Plugin / Step | Estimated contribution | Evidence | +|:----:|---------------|------------------------|---------| +| 1 | `schemav2validator` (kin-openapi validation) | 40–60% | Alloc count proportional to payload complexity; JSON schema traversal creates many short-lived objects | +| 2 | `signer` (Ed25519/BLAKE-512) | 20–30% | Cryptographic operations are CPU-bound; scaling efficiency plateau at 8+ cores consistent with crypto serialisation | +| 3 | `simplekeymanager` (key derivation, Redis) | 5–10% | 8-alloc savings on cache-warm path; small but detectable | +| 4 | `router` (YAML routing lookup) | < 5% | Minimal; in-memory map lookup | + +**Key insight from the concurrency data:** RPS plateaus at ~17,000–17,500 between GOMAXPROCS=8 and 16. This suggests a shared serialisation point — most likely the kin-openapi schema validation tree (a read-heavy but non-trivially-lockable data structure), or the Ed25519 key operations. + +**Profiling commands to isolate the bottleneck:** + +```bash +# CPU profile — run from beckn-onix root +go test ./benchmarks/e2e/... \ + -bench=BenchmarkBAPCaller_Discover \ + -benchtime=30s \ + -cpuprofile=benchmarks/results/cpu.prof \ + -timeout=5m + +go tool pprof -http=:6060 benchmarks/results/cpu.prof + +# Memory profile +go test ./benchmarks/e2e/... \ + -bench=BenchmarkBAPCaller_Discover \ + -benchtime=30s \ + -memprofile=benchmarks/results/mem.prof \ + -timeout=5m + +go tool pprof -http=:6060 benchmarks/results/mem.prof + +# Parallel profile (find lock contention) +go test ./benchmarks/e2e/... \ + -bench=BenchmarkBAPCaller_Discover_Parallel \ + -benchtime=30s \ + -blockprofile=benchmarks/results/block.prof \ + -mutexprofile=benchmarks/results/mutex.prof \ + -timeout=5m + +go tool pprof -http=:6060 benchmarks/results/mutex.prof +``` + +--- + +## Running the Benchmarks + +```bash +# Full run: compile plugins, run all scenarios, generate CSV and benchstat summary +cd beckn-onix +bash benchmarks/run_benchmarks.sh + +# Quick smoke test (fast, lower iteration counts): +# Edit BENCH_TIME_SERIAL="2s" and BENCH_TIME_PARALLEL="5s" at the top of the script. + +# Individual benchmark (manual): +go test ./benchmarks/e2e/... \ + -bench=BenchmarkBAPCaller_Discover \ + -benchtime=10s \ + -benchmem \ + -timeout=30m + +# Race detector check: +go test ./benchmarks/e2e/... \ + -bench=BenchmarkBAPCaller_Discover_Parallel \ + -benchtime=5s \ + -race \ + -timeout=30m + +# Concurrency sweep (manual): +for cpu in 1 2 4 8 16; do + go test ./benchmarks/e2e/... \ + -bench="BenchmarkBAPCaller_Discover_Parallel|BenchmarkBAPCaller_RPS" \ + -benchtime=30s -cpu=$cpu -benchmem -timeout=10m +done +``` + +> **Note:** The first run takes 60–90 s while plugins compile. Subsequent runs use Go's build cache and start in seconds. + +--- + +*Generated from run `2026-03-31_14-19-19` · beckn-onix · Beckn Protocol v2.0.0*