The idea that code can be read like a language, and that a project's own conventions can be learned and checked, is decades old in the literature. This page separates what the literature established from what doloop (a deterministic code reviewer that runs at the commit) measured on top of it. Every number here traces to a script or a corpus you can re-run.
None of the core lenses are ours to claim. The published work below set them up. doloop's contribution is narrower, and builds on it.
What doloop adds is the part the literature never shipped as a product: these lenses run deterministically, inferred from your own code, across the whole repository, fast enough to gate a single commit. Read code as language. Choose the reader by the question.
The gate ran against the release history of mature open-source libraries. In 19 cases it flagged a deviation in one release that the maintainers themselves deleted by the next, before anyone told them. Five are hand-verified.
The gate does not invent a rule. It finds the spot a project will fix on its own, at the commit, instead of weeks later in production. No proof on the page is cleaner or less confounded. The standard is the codebase's own, and the gate reads it back.
doloop carries no rulebook. It infers each codebase's conventions from how consistent that codebase is with itself. Across many codebases, it then asks which conventions are near-universal and which flip between one repo and the next.
A near-universal convention is one the gate can block on. A convention that flips, the gate can only warn about. The split is not a label anyone assigned; it falls out of the data. Re-measured as the corpus grew, the safety floor held at every scale.
| corpus size (codebases) | immutable-defaults (universal) | act-on-errors (universal) | type-hints |
|---|---|---|---|
| 88 | 100% | 96% | split |
| 450 | 100% | 96% | split |
| 7,637 | 99% | 91% | style |
| 236,701 | 98% | 86% | 97% do NOT |
immutable-defaults is the near-absolute one, holding at 98 to 100% at every scale. act-on-errors softens as the corpus broadens, because the long tail swallows more of its errors. At full scale, not type-hinting is a 97% consensus. That is why the gate compares your repo to the best codebases in its domain, not the average. The average would tell a beginner to skip what the best do.
Scope, stated plainly: these conventions are Python idioms, measured over the syntax tree. Per-language standards are on the roadmap, not a result yet.
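A minimal sketch of what "measured over the syntax tree" can look like for the immutable-defaults convention, using Python's standard `ast` module. The function name `immutable_default_rate`, the choice of list, dict, and set literals as the mutable cases, and the sample source are assumptions made for illustration, not doloop's actual implementation.

```python
import ast

# Assumption: a default counts as "mutable" if it is a list, dict, or set literal.
MUTABLE_NODES = (ast.List, ast.Dict, ast.Set)


def immutable_default_rate(source: str) -> float:
    """Fraction of default-carrying functions whose defaults are all immutable."""
    with_defaults = 0
    clean = 0
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Positional defaults plus keyword-only defaults (None marks "no default").
        defaults = list(node.args.defaults)
        defaults += [d for d in node.args.kw_defaults if d is not None]
        if not defaults:
            continue
        with_defaults += 1
        if not any(isinstance(d, MUTABLE_NODES) for d in defaults):
            clean += 1
    # A codebase with no defaults at all trivially keeps the convention.
    return clean / with_defaults if with_defaults else 1.0


# Hypothetical sample: three functions carry defaults, one of them mutable.
SAMPLE = '''
def a(x, items=None): ...
def b(x, flag=False): ...
def c(x, cache={}): ...
def d(x): ...
'''

rate = immutable_default_rate(SAMPLE)
print(f"immutable-defaults adherence: {rate:.0%}")
```

A rate computed this way per codebase is the kind of number the table above aggregates across the corpus.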
Take a function that is correct and ordinary in one codebase, and paste it unchanged into a sibling codebase that does that kind of thing differently. It still compiles and runs, so nothing is wrong with it except where it landed. doloop passes it in its home codebase and flags it in the foreign one, naming the convention that codebase keeps and how consistently it keeps it. The same code earns opposite verdicts, because doloop judges it against the codebase around it, not a universal rulebook.
| function | host: flask (sync, 100%) | host: quart (async, 82%) |
|---|---|---|
| flask dispatch_request | PASS | FLAG |
| quart dispatch_request | FLAG | PASS |
A fixed-rule linter, or a model shown the function on its own, returns one verdict everywhere; it carries one rule for the world. Only a reader that calibrates on the host can flip. That flip is the product in a single frame.
Scope, stated plainly: the flip holds across two unrelated convention classes. On request-dispatch shape, flask/quart and requests/httpx both flip (with two honest nulls: convention-identical siblings, and a pair sharing the convention's direction with nothing to oppose). On diagnostic style, print versus log, doit/nox flips: doit prints its status messages, nox logs them, and the verdict follows the house either way. Every convention is inferred from each host's own consistency, leave-one-out so a pass cannot be circular, deterministic across re-runs.
Three sibling pairs · two convention classes · leave-one-out on diagonals · deterministic across re-runs · anti-hardcode verified (incl. the nox_imposter folder test).
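The flip can be sketched with a toy reader that calibrates on the host. This is an illustration under stated assumptions, not doloop's code: it reduces "request-dispatch shape" to sync versus async function definitions, the two host snippets are invented stand-ins for flask-like and quart-like codebases, and `handler_styles` and `verdict` are hypothetical names.

```python
import ast


def handler_styles(source: str) -> dict:
    """Map each function name in the host to 'async' or 'sync'."""
    styles = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.AsyncFunctionDef):
            styles[node.name] = "async"
        elif isinstance(node, ast.FunctionDef):
            styles[node.name] = "sync"
    return styles


def verdict(host_source: str, name: str, style: str) -> tuple:
    """Judge one function against the host's own majority style.

    Leave-one-out: the host's own copy of the judged function is dropped
    before the convention is inferred, so it cannot vote for itself.
    """
    styles = handler_styles(host_source)
    styles.pop(name, None)
    if not styles:
        return ("PASS", "none", 0.0)  # nothing to calibrate on
    async_share = sum(s == "async" for s in styles.values()) / len(styles)
    house = "async" if async_share >= 0.5 else "sync"
    consistency = max(async_share, 1 - async_share)
    return ("PASS" if style == house else "FLAG", house, consistency)


# Invented hosts, standing in for a sync codebase and a mostly-async sibling.
SYNC_HOST = '''
def dispatch_request(req): return req
def full_dispatch(req): return req
def handle_exception(exc): return exc
'''
ASYNC_HOST = '''
async def dispatch_request(req): return req
async def full_dispatch(req): return req
async def handle_exception(exc): return exc
def make_response(value): return value
'''

# The same sync function, judged in each host.
print(verdict(SYNC_HOST, "dispatch_request", "sync"))
print(verdict(ASYNC_HOST, "dispatch_request", "sync"))
```

The function under review never changes between the two calls. Only the host it is measured against does, which is why the verdict can differ.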
doloop caches a few-kilobyte digest of a codebase and checks each commit against the digest, instead of re-reading millions of lines. The whole approach rests on the cached digest being byte-identical to the one a fresh full read would produce, both when it is first cached and after it is updated incrementally. Both halves are proven.
Cache equals full read. For flask, requests, rich, click, sqlalchemy, and httpx, two independent full reads produce the same canonical SHA-256, and the digest round-trips through serialize-and-reload to the same hash. The digest reproduces the full read's hash exactly.
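One way such a check can work, sketched under assumptions: the digest is taken to be a nested dictionary of counts, canonicalized as JSON with sorted keys before hashing. The digest layout, key names, and `canonical_hash` helper are hypothetical. Only the use of SHA-256 and the serialize-and-reload round trip come from the text.

```python
import hashlib
import json


def canonical_hash(digest: dict) -> str:
    """SHA-256 over a canonical serialization: sorted keys, fixed separators."""
    payload = json.dumps(digest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Two hypothetical "full reads" that found the same counts in a different order.
read_one = {
    "act_on_errors": {"yes": 43, "no": 7},
    "immutable_defaults": {"yes": 120, "no": 0},
}
read_two = {
    "immutable_defaults": {"no": 0, "yes": 120},
    "act_on_errors": {"no": 7, "yes": 43},
}

# Round trip: serialize the digest, reload it, and hash again.
reloaded = json.loads(json.dumps(read_one))

print(canonical_hash(read_one) == canonical_hash(read_two))
print(canonical_hash(read_one) == canonical_hash(reloaded))
```

Canonicalizing before hashing is what makes the comparison independent of traversal order, so equal counts always give equal hashes.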
Incremental equals full re-read. The digest's counts are sums over functions, so counts(whole) = merge(counts(partA), counts(partB)). Tested on flask, requests, and sqlalchemy, both split-and-merge and drop-a-file-then-re-add reproduce the full-read hash exactly. When code changes, the gate subtracts the changed files' old counts and adds the new ones, and stays byte-identical to re-reading everything. Freshness without losing determinism, at constant time per check.
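The merge identity and the subtract-then-add update can be sketched with `collections.Counter`. This assumes the digest is a flat bag of counts. The key names and the `merge` and `apply_change` helpers are hypothetical.

```python
from collections import Counter


def merge(*parts: Counter) -> Counter:
    """counts(whole) = merge(counts(partA), counts(partB)): plain addition."""
    total = Counter()
    for part in parts:
        total.update(part)
    return total


def apply_change(digest: Counter, old_file: Counter, new_file: Counter) -> Counter:
    """Subtract a changed file's old counts, then add its new ones."""
    updated = Counter(digest)
    updated.subtract(old_file)
    updated.update(new_file)
    # Drop zero entries so the result compares equal to a fresh full read.
    return Counter({key: n for key, n in updated.items() if n != 0})


# Hypothetical per-file counts.
file_a = Counter({"errors_acted": 10, "errors_swallowed": 2})
file_b_old = Counter({"errors_acted": 5, "errors_swallowed": 1})
file_b_new = Counter({"errors_acted": 6})

whole = merge(file_a, file_b_old)
incremental = apply_change(whole, file_b_old, file_b_new)
full_reread = merge(file_a, file_b_new)

print(incremental == full_reread)
```

Because the counts are sums, the update touches only the changed file's contribution, and the rest of the digest is never recomputed.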
doloop led early with a name-and-documentation-drift lens. It failed under doloop's own benchmarks, and was demoted. The failures are reported as plainly as the wins.
Against CoDocBench, with 4,573 real coupled code-and-doc changes, the structural lens told drifted from aligned docstrings at only 1.4x. Most real drift is prose, which a matcher cannot read.
The name-drift pattern fared no better. The validate_, ensure_, and verify_ prefixes over-match setters, decorators, and predicates, and rejection takes many forms that a pattern misses.
Where the name-and-documentation-drift lens lands: advisory only, via a caged 7B model at about 2.0x discrimination, never a block. The pattern this confirms is clean. A closed-form check wins on exact-answer questions, where a rate is the rate. It fails on linguistic ones, which are a model's job, under a deterministic harness.
So the blocking core is narrow on purpose. It covers two things: behavioral consistency, where this codebase acts on its errors 86% of the time, and paired-operation absence, the one writer that calls execute() but never commit(). That last class is one lexical linters miss.
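A minimal sketch of a paired-operation check over the syntax tree. It hard-codes the execute/commit pair as default arguments for brevity, whereas the text describes inferring conventions from the codebase itself. The `missing_pair` name and the sample functions are hypothetical.

```python
import ast


def missing_pair(source: str, opener: str = "execute", closer: str = "commit") -> list:
    """Names of functions that call .opener() but never .closer() in the same body."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Collect every method name invoked as obj.method(...) inside this function.
        called = {
            call.func.attr
            for call in ast.walk(node)
            if isinstance(call, ast.Call) and isinstance(call.func, ast.Attribute)
        }
        if opener in called and closer not in called:
            flagged.append(node.name)
    return flagged


# Hypothetical sample: the second writer never commits.
SAMPLE = '''
def save_user(conn, name):
    conn.execute("INSERT INTO users VALUES (?)", (name,))
    conn.commit()

def save_order(conn, item):
    conn.execute("INSERT INTO orders VALUES (?)", (item,))
'''

print(missing_pair(SAMPLE))
```

The check looks for an absence inside one function body. A line-by-line pattern match sees nothing wrong in `save_order`, because every line in it is individually valid.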
doloop retracts three claims, on the record. First, that the name-and-doc lens is the richest signal: that was a firing count, not a detection rate. Second, the use of a single construction-quality index as a quality score. Third, that models are blind to consistency bugs: a 12B model caught single-function drift 9 of 11 times.
The Codebase Polysemy Contract spans five polysemy dimensions plus the cross-cutting paired-operation absence. Each one stands as follows when run as a deterministic, closed-form gate: two validate, security splits, handoff is a caged-reader class, and two are still design targets. The misses are stated as plainly as the hits.
| Dimension | Verdict | Evidence |
|---|---|---|
| Functional / convention consistency | Validates, the gate | The workhorse, live. Heartbleed re-derived from OpenSSL's own 88% bounds-check adherence. |
| Paired-operation absence | Validates, the gate | An execute() with no commit() in the same function. A class lexical linters miss. |
| Security, bounds-local shape | Validates, precision only | A parameter-derived length reaching a copy with no bounds check. 0.7 to 3% base rate, low false-positive, near-zero recall on hardened code. |
| Security, guard-absence | Fails closed-form, demoted | The function-local pairing never forms (guards live upstream); lift 0.28. Needs inter-procedural taint, not AST counting. |
| Handoff (doc and name drift) | Fails closed-form, caged-reader | doc-drift about chance on CoDocBench; name-drift about 0 of 8 precision. Advisory via a pinned model, never a block. |
| Performance, structural | Design targets | Specified, not yet built or backtested. Named so they read as a plan, not a current capability. |
The product object is not a single number. It is two vector spaces.
Repository space: one point per codebase, built from convention rates, paired operations, language mix, and metadata. The near-universal conventions are the low-variance axes. Quality is distance from the peer cluster and from the convention consensus. A measured caveat: convention vectors alone are noisy, since django's nearest neighbors came back as numpy tutorials. Domain comes from the language mix, so numpy clusters with scikit-learn, and django with tornado.
Findings space: a point for each flagged case, for the bug taxonomy and anomaly detection.
doloop meters in loops the way an LLM meters in tokens. Every value claim ships as a public, self-verifiable calculator: you run it on your own repo and reproduce the number.
loopmath defines one loop as one function judged on one feature, frozen.
savemath is bugs caught, times hours, times rate, with your assumptions.
shipmath is velocity kept, the AI write-rate divided by the review-rate.
regret runs read-only on your git log: flask's history is 17% rework, one in six commits a redo.
No vanity counters. The formula and the assumptions are yours.
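Read literally, two of these calculators are one-line formulas. The sketch below follows the wording above. The parameter names, units, and example inputs are assumptions chosen for illustration and are not measured values.

```python
def savemath(bugs_caught: int, hours_per_bug: float, hourly_rate: float) -> float:
    """Bugs caught, times hours, times rate. All three inputs are the user's own."""
    return bugs_caught * hours_per_bug * hourly_rate


def shipmath(ai_write_rate: float, review_rate: float) -> float:
    """The AI write-rate divided by the review-rate (both in the same unit)."""
    if review_rate <= 0:
        raise ValueError("review_rate must be positive")
    return ai_write_rate / review_rate


# Placeholder inputs: 10 bugs, 1.5 hours each, 80 per hour.
saved = savemath(10, 1.5, 80.0)

# Placeholder inputs: 400 lines written per hour against 100 reviewed per hour.
ratio = shipmath(400.0, 100.0)

print(saved, ratio)
```

Keeping each calculator a pure function of stated inputs is what lets a reader swap in their own assumptions and reproduce the number.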
The method is one move: infer a codebase's requirements from its own consistency, then flag only where a change betrays them.
On an AI-generated repo with no rulebook, the gate inferred "handlers act on the error" from the codebase's own 85% consistency. Then it flagged the 31 places the code betrayed itself. The requirement inheres in the consistency. The bug is the unevenness.
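The one move can be sketched for the act-on-errors convention: measure how consistently handlers act on the error, and flag the exceptions only if the codebase is consistent enough to imply a requirement. In this sketch a handler "swallows" when its body is only `pass` or a bare constant. That definition, the 0.8 floor, and the function names are assumptions, not doloop's rules.

```python
import ast


def swallows(handler: ast.ExceptHandler) -> bool:
    """Assumed definition: the body is nothing but `pass` or a bare constant."""
    return all(
        isinstance(stmt, ast.Pass)
        or (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant))
        for stmt in handler.body
    )


def act_on_errors(source: str, floor: float = 0.8) -> tuple:
    """Return (adherence rate, line numbers of handlers that betray it)."""
    handlers = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.ExceptHandler)]
    if not handlers:
        return 1.0, []
    swallowed = [h.lineno for h in handlers if swallows(h)]
    rate = (len(handlers) - len(swallowed)) / len(handlers)
    # Flag only when the codebase itself is consistent enough to imply a rule.
    return rate, (swallowed if rate >= floor else [])


# Hypothetical sample: five handlers, one of which swallows its error.
SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except OSError as exc:
        raise RuntimeError(path) from exc

def parse(text):
    try:
        return int(text)
    except ValueError:
        return None

def fetch(client):
    try:
        return client.get()
    except TimeoutError as exc:
        log(exc)
        raise

def close(handle):
    try:
        handle.close()
    except OSError:
        pass

def send(sock, data):
    try:
        sock.send(data)
    except ConnectionError as exc:
        log(exc)
'''

rate, flags = act_on_errors(SAMPLE)
print(f"adherence {rate:.0%}, flagged lines: {flags}")
```

If the same sample swallowed most of its errors, the rate would fall below the floor and nothing would be flagged, since an inconsistent codebase implies no requirement to betray.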
The lenses come from the published work above, plainly credited. Naturalize learned a repo's own conventions and gated the commit at 94% accuracy, and named the universal-versus-flipping distinction. doloop's extension is to run those lenses deterministically and whole-repo, at commit speed, reaching from syntax into semantics. What matters is not whether a model appears. It is whether the verdict reproduces.
| quantity | value |
|---|---|
| codebases gathered | 236,701 |
| proven catches (project deleted the flag by the next release) | 19 (5 hand-verified) |
| cross-release resolution study | 53 repos, 470 flag-then-fix observations |
| immutable-defaults / act-on-errors (universal) | 98% / 86% |
| read rate | ~59,000 LOC/sec (median) |
| calibration time | 0.27 s median |
| cache == full read | byte-identical, 6 repos |
| incremental == full re-read | byte-identical, 3 repos |
| doc-drift, pattern match (CoDocBench, 4,573) | 1.4x (about chance) |
| name-drift, pattern match (38,217 funcs) | ~0/8 precision |
| name-and-doc-drift, caged 7B model | ~2.0x (advisory only) |
| rework rate (flask history) | 17% of commits |
| rework rate (GitHub corpus 2018-2022) | ~18-19% (stable) |
| speed vs an LLM reviewer | ~half a second vs minutes |
Every claim on this page is backed by code you can run yourself against your own repo. The scripts are available on request.