Benchmarks

The public measured benchmark is in progress. We will not cite headline numbers on this site until it is complete.

Public measured run (in progress) · github.com/finsavvyai/clawpipe-booster-benchmark
Pre-registered methodology v1.0 was locked on 2026-05-18. The methodology was published before any results were produced, to prevent post-hoc selection of workloads or thresholds. The decision rule (commit / library / archive) was set before the run and is binding on the result.

What the measured run covers

4 baselines: raw provider · provider prompt caching · Cloudflare AI Gateway with caching · ClawPipe full pipeline.
3 workload buckets: agent / coding (SWE-bench Lite, Aider exercism, MBPP, SWE-Gym, HumanEval, synth) · chat / RAG (LMSYS-Chat-1M) · structured extraction (MMLU).
3 independent runs per bucket-baseline pair, with 95% Wilson confidence intervals on the headline metric.
Triple LLM-judge for quality regression (GPT-5 + Claude Opus 4.7 + Gemini 2.5 Pro must agree for a regression to count).
Spend cap: $80 hard cap; the run is killed if spend reaches 25% over the cap ($100).
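
Two of the rules above can be sketched in a few lines. This is a minimal illustration, not the code in the benchmark repo: the names `wilsonInterval`, `countsAsRegression`, and `Verdict` are hypothetical, and it assumes the headline metric is a proportion (successes out of trials), which is the case the Wilson score interval is designed for. How the three runs are pooled is defined in METHODOLOGY.md, not here.

```typescript
// Hypothetical sketch; names and types are illustrative assumptions.

interface Interval {
  lower: number;
  upper: number;
}

// Wilson score interval for a binomial proportion.
// z defaults to the two-sided 95% normal quantile.
function wilsonInterval(successes: number, trials: number, z = 1.959964): Interval {
  if (trials <= 0) throw new Error("trials must be positive");
  if (successes < 0 || successes > trials) {
    throw new Error("successes must be between 0 and trials");
  }
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z / denom) * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials));
  // Clamp to [0, 1] to absorb floating-point error at the boundaries.
  return { lower: Math.max(0, center - half), upper: Math.min(1, center + half) };
}

type Verdict = "regression" | "no_regression";

// A regression counts only when all three judges report one.
function countsAsRegression(verdicts: Verdict[]): boolean {
  return verdicts.length === 3 && verdicts.every((v) => v === "regression");
}
```

For example, `wilsonInterval(50, 100)` gives roughly 0.404 to 0.596, and a single dissenting judge makes `countsAsRegression` return `false`. Requiring unanimity trades sensitivity for fewer false alarms from any one judge.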

Read or comment

METHODOLOGY.md v1.0 · DECISION-RULE.md · Public review thread (closed 2026-05-18, methodology locked)

Prior synthetic benchmark (preserved for transparency)

Why this section exists. Before the public measured run, we published a synthetic in-house benchmark. The numbers below were generated against a mocked gateway on 200 unique prompts × 2 passes. They are not customer-measured savings and are not a defensible cost-reduction claim. We are preserving them here for transparency, not citing them on marketing surfaces.

Dataset: 400 prompts (200 unique × 2 passes). Mix of boostable (math/JSON/dates), cacheable (repeats), and regular LLM workloads — hand-categorised to exercise each pipeline stage.
Gateway: mocked at realistic provider latency (~1200ms p50). No real provider calls.
Hardware: single Node.js 20 process, M-series laptop.
Date: 2026-04-09.
Raw artifacts: benchmarks/ in the main repo.
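
To make "mocked at realistic provider latency" concrete, here is a minimal sketch of what such a mock could look like. It is not the code in run-benchmark.ts: `mockGateway`, `sampleLatencyMs`, and `MockResponse` are hypothetical names, and the log-normal latency shape and the `sigma` value are assumptions. Only the ~1200ms median comes from the setup described above.

```typescript
// Hypothetical sketch of a mocked gateway; not the repo's implementation.

interface MockResponse {
  text: string;
  latencyMs: number;
}

// Draws a log-normal latency whose median is medianMs.
// u1 must be in (0, 1] and u2 in [0, 1); passing them in keeps this testable.
function sampleLatencyMs(u1: number, u2: number, medianMs = 1200, sigma = 0.3): number {
  // Box-Muller transform: two uniform draws to one standard normal draw.
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return medianMs * Math.exp(sigma * z);
}

// Waits for the sampled latency, then returns a canned completion.
// No network request is made.
async function mockGateway(
  prompt: string,
  random: () => number = Math.random,
): Promise<MockResponse> {
  // 1 - random() lies in (0, 1], which avoids Math.log(0).
  const latencyMs = sampleLatencyMs(1 - random(), random());
  await new Promise<void>((resolve) => setTimeout(resolve, latencyMs));
  return { text: `mock completion for: ${prompt}`, latencyMs };
}
```

A mock like this measures pipeline behaviour against a fixed latency model, which is why the resulting numbers say nothing about real provider cost or quality.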

Reproduce the prior synthetic run

git clone https://github.com/finsavvyai/clawpipe-sdk
cd clawpipe-sdk/benchmarks
npm install
npx tsx run-benchmark.ts
open results/summary.md

The dataset (benchmarks/prompt-dataset.json), runner (run-benchmark.ts), and raw results (results/benchmark-results.json) are in the repo. The runner file is explicit at line 5 that the gateway is mocked.
