Methodology · provable self-improvement

Does my self-improving AI actually improve? I measured capability, not activity.

Everyone says their AI agent is "self-improving." Almost no one measures whether it's actually getting better — or just busier. I built a small, deterministic ledger to settle that honestly. On its first run it caught a measurement confound that was faking a regression in my own numbers.

The tool is public and runs offline: github.com/Tharven-Security/policy-gate-benchpython capability_ledger.py --results example_results

1. Activity is not capability

My self-improvement loop tracked a tidy "progress" number built from counts: knowledge-base size, number of evolution cycles, number of mutated prompts. It crept upward and felt great.

But that metric measures activity, not capability. A loop can spin, a knowledge base can grow, and the system's real ability to block attacks can stay flat — or quietly regress — while the vanity number keeps climbing. A self-improving security tool that can't prove it's getting better is just an expensive way to feel productive.

The claim "my AI is self-improving" is worthless without a capability curve. So I built the curve.

2. The capability ledger

It reads the benchmark result files the loop already produces ({benchmark, timestamp, metrics}), builds a time series per metric across generations, fits a slope, and reports an honest verdict: compounding, flat, or regressing — with the right polarity (for obf_bypass_rate and fp_rate, lower is better).

Crucially, it compares like-for-like only: a "variant" is (benchmark + config + malicious-corpus SHA). You cannot compare a score on one corpus against a score on a different corpus and call the difference "improvement." That sounds obvious. It is also exactly the mistake the first version made on my own data.

3. The confound it caught (on my own numbers)

First run, the naive ledger flagged my input-time injection detector as regressing hard — block-rate falling from 0.635 to 0.145. Alarming. Also wrong.

Those two numbers were measured on different corpora: 0.635 on my in-domain CTI set (SHA 5b943c47…), 0.145 on a much harder external standard set (SHA cab89306…). Apples vs oranges. On the same corpus, block-rate was actually flat-to-improving (0.6347 → 0.6377). There was no regression — only a measurement bug in how I was comparing.

The ledger's first useful act was to catch a flaw in itself — and in how I read my own results. That is the whole point: the measurement polices the claim.

The fix: segment every series by corpus SHA + config, and mark false-positive metrics confounded-benign-changed whenever the benign control set changed. Now the curve is honest.

4. What the honest curve shows

On a frozen, like-for-like series, the execution-time policy gate is genuinely compounding:

MetricAcross generationsVerdict
clear_block_rate0.82 → 1.00improving
obf_bypass_rate0.95 → 0.62 (lower=better)improving
fp_rate0.00 → 0.00flat (held)
latency_ms_p950.50 → 0.67 msregressing (still sub-ms — flagged honestly)

No vanity. Where it improves, it says so; where latency crept up, it says that too. The bundled example_results demonstrates a clean COMPOUNDING verdict out of the box so you can see the shape before pointing it at your own data.

5. Why this matters

EU AI Act Article 15 asks high-risk AI systems to declare measured robustness. A one-off number is a snapshot; a capability curve with provable non-regression is the real thing — evidence that the system doesn't silently decay as it evolves. Pair it with a guardian that exits non-zero on a real capability regression, and self-improvement stops being a slogan and becomes a gate.

A measured "flat" beats an unmeasured "it's self-improving!" — that honesty is the proof.

Reproduce

git clone https://github.com/Tharven-Security/policy-gate-bench
cd policy-gate-bench
python capability_ledger.py --results example_results   # offline, zero deps

Point --results at your own folder of benchmark JSONs. Open an issue if a verdict looks wrong — the polarity and segmentation rules are meant to be challenged.