Benchmarks
The public measured benchmark is in progress. We are not citing headline numbers on this site until it lands.
Public measured run (in progress) · github.com/finsavvyai/clawpipe-booster-benchmark
Pre-registered methodology v1.0 was locked on 2026-05-18. The methodology was published before any results to prevent post-hoc selection of workloads or thresholds. The decision rule (commit / library / archive) was set before the run and is binding on the result.
What the measured run covers
- 4 baselines: raw provider · provider prompt caching · Cloudflare AI Gateway with caching · ClawPipe full pipeline.
- 3 workload buckets: agent / coding (SWE-bench Lite, Aider exercism, MBPP, SWE-Gym, HumanEval, synth) · chat / RAG (LMSYS-Chat-1M) · structured extraction (MMLU).
- 3 independent runs per bucket-baseline with 95% Wilson confidence intervals on the headline metric.
- Triple LLM-judge for quality regression (GPT-5 + Claude Opus 4.7 + Gemini 2.5 Pro must agree for a regression to count).
- Spend cap: $80 hard cap, kill at 25% over.
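The statistical and guard-rail rules in the list above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the code used in the measured run. It assumes the headline metric is a proportion (successes out of trials), that "kill at 25% over" means the run stops once spend reaches 1.25 × the cap, and that a regression counts only when every judge flags it. The function names (wilsonInterval, regressionCounts, shouldKill) are hypothetical.

```typescript
// Hypothetical sketch of the rules listed above; not the benchmark's actual code.

interface Interval {
  lower: number;
  upper: number;
}

// Wilson score interval for a binomial proportion.
// z = 1.96 corresponds to a two-sided 95% confidence level.
function wilsonInterval(successes: number, trials: number, z = 1.96): Interval {
  if (trials <= 0) {
    throw new Error("trials must be positive");
  }
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return { lower: center - half, upper: center + half };
}

// Assumed reading of the triple-judge rule: a regression counts only
// if all judges independently flag it.
function regressionCounts(judgeFlags: boolean[]): boolean {
  return judgeFlags.length > 0 && judgeFlags.every((flag) => flag);
}

// Assumed reading of the spend cap: kill once spend reaches
// cap * (1 + overage), i.e. $100 for an $80 cap and 25% overage.
function shouldKill(spentUsd: number, capUsd = 80, overage = 0.25): boolean {
  return spentUsd >= capUsd * (1 + overage);
}

const ci = wilsonInterval(50, 100);
console.log(ci.lower.toFixed(4), ci.upper.toFixed(4));
console.log(regressionCounts([true, true, false]));
console.log(shouldKill(99.5), shouldKill(100));
```

The Wilson interval is used here rather than the simpler normal approximation because it stays inside [0, 1] and behaves sensibly for small samples or proportions near 0 or 1.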
Read or comment
METHODOLOGY.md v1.0 · DECISION-RULE.md · Public review thread (closed 2026-05-18, methodology locked)
Prior synthetic benchmark (preserved for transparency)
Why this section exists. Before the public measured run, we published a synthetic in-house benchmark. The numbers below were generated against a mocked gateway on 200 unique prompts × 2 passes. They are not customer-measured savings and are not a defensible cost-reduction claim. We are preserving them here for transparency, not citing them on marketing surfaces.
- Dataset: 400 prompts (200 unique × 2 passes). Mix of boostable (math/JSON/dates), cacheable (repeats), and regular LLM workloads — hand-categorised to exercise each pipeline stage.
- Gateway: mocked at realistic provider latency (~1200ms p50). No real provider calls.
- Hardware: single Node.js 20 process, M-series laptop.
- Date: 2026-04-09.
- Raw artifacts: benchmarks/ in the main repo.
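To make the setup above concrete, here is a minimal sketch of what a two-pass dataset and a mocked gateway could look like. It is not the code in run-benchmark.ts. The Prompt type, expandPasses, mockGateway, and the placeholder prompt text are all hypothetical, and the fixed 1200 ms delay is a simplification of the "~1200ms p50" figure.

```typescript
// Hypothetical sketch; not the repo's actual runner or dataset.

type Prompt = { id: string; text: string };

// Repeat the unique prompts `passes` times, so every pass after the
// first consists of exact repeats (the "cacheable" part of the mix).
function expandPasses(unique: Prompt[], passes: number): Prompt[] {
  const out: Prompt[] = [];
  for (let pass = 0; pass < passes; pass++) {
    out.push(...unique);
  }
  return out;
}

// Mocked gateway: waits a fixed delay, then resolves with a canned reply.
// No provider is called and no network I/O happens.
function mockGateway(prompt: Prompt, latencyMs = 1200): Promise<string> {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`mock reply for ${prompt.id}`), latencyMs);
  });
}

// 200 placeholder prompts x 2 passes = 400 requests.
const uniquePrompts: Prompt[] = Array.from({ length: 200 }, (_, i) => ({
  id: `p${i}`,
  text: `placeholder prompt ${i}`,
}));
const dataset = expandPasses(uniquePrompts, 2);
console.log(dataset.length);
```

Because the mock returns after a timer rather than a real provider round trip, any latency or cost figure derived from such a setup reflects the mock's parameters, which is why the numbers are labelled synthetic.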
Reproduce the prior synthetic run
git clone https://github.com/finsavvyai/clawpipe-sdk
cd clawpipe-sdk/benchmarks
npm install
npx tsx run-benchmark.ts
open results/summary.md
The dataset (benchmarks/prompt-dataset.json), runner (run-benchmark.ts), and raw results (results/benchmark-results.json) are in the repo. Line 5 of the runner states explicitly that the gateway is mocked.