RESEARCH

Built on published work. Measured in the open.

The idea that code can be read like a language, and that a project's own conventions can be learned and checked, is decades old in the literature. This page separates what the literature established from what doloop (a deterministic code reviewer that runs at the commit) measured on top of it. Every number here traces to a script or a corpus you can re-run.

The literature this builds on

None of the core lenses are ours to claim. The published work below set them up. doloop's contribution is narrower, and builds on it.

What doloop adds is the part the literature never shipped as a product: these lenses run deterministically, inferred from your own code, across the whole repository, fast enough to gate a single commit. Read code as language. Choose the reader by the question.

It catches what a project later goes back and fixes

The gate ran against the release history of mature open-source libraries. In 19 cases it flagged a deviation in one release that the maintainers themselves deleted by the next, before anyone told them. Five are hand-verified.

The gate does not invent a rule. It finds the spot a project will fix on its own, at the commit, instead of weeks later in production. No proof on the page is cleaner or less confounded. The standard is the codebase's own, and the gate reads it back.

Conventions almost every codebase shares

doloop carries no rulebook. It infers each codebase's conventions from how consistent that codebase is with itself. Across many codebases, it then asks which conventions are near-universal and which flip between one repo and the next.

A near-universal convention is one the gate can block on. A convention that flips, the gate can only warn about. The split is not a label anyone assigned; it falls out of the data. Re-measured as the corpus grew, the safety floor held at every scale.

corpus            | immutable-defaults (universal) | act-on-errors (universal) | type-hints
88 codebases      | 100%                           | 96%                       | split
450 codebases     | 100%                           | 96%                       | split
7,637 codebases   | 99%                            | 91%                       | style
236,701 codebases | 98%                            | 86%                       | 97% do NOT

immutable-defaults is the near-absolute one, holding at 98 to 100% at every scale. act-on-errors softens as the corpus broadens, because the long tail swallows more of its errors. At full scale, not type-hinting is a 97% consensus. That is why the gate compares your repo to the best codebases in its domain, not the average. The average would tell a beginner to skip what the best do.

Scope, stated plainly: these conventions are Python idioms, measured over the syntax tree. Per-language standards are on the roadmap, not a result yet.
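
As a rough illustration of what "measured over the syntax tree" can mean, the sketch below counts how often a piece of Python keeps mutable literals out of its default arguments, using only the standard-library ast module. The names default_stats, adherence_rate, and SAMPLE are hypothetical, and treating a list, dict, or set literal or comprehension as the "mutable default" is an assumption of this example, not doloop's published rule.

```python
import ast

# Assumption: a default counts as mutable if it is one of these literal forms.
MUTABLE_NODES = (ast.List, ast.Dict, ast.Set,
                 ast.ListComp, ast.DictComp, ast.SetComp)


def default_stats(source):
    """Return (functions_with_defaults, functions_with_only_immutable_defaults)."""
    with_defaults = 0
    immutable = 0
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Positional defaults plus keyword-only defaults (None means "no default").
        defaults = list(node.args.defaults)
        defaults += [d for d in node.args.kw_defaults if d is not None]
        if not defaults:
            continue  # a function with no defaults says nothing about the convention
        with_defaults += 1
        if not any(isinstance(d, MUTABLE_NODES) for d in defaults):
            immutable += 1
    return with_defaults, immutable


def adherence_rate(source):
    """Share of default-bearing functions that keep every default immutable."""
    total, ok = default_stats(source)
    return ok / total if total else None


# Hypothetical sample: three functions carry defaults, one of them mutable.
SAMPLE = '''
def a(x, items=None): ...
def b(x, flag=False, *, name="n"): ...
def c(x, cache={}): ...
def d(x): ...
'''

print(default_stats(SAMPLE), adherence_rate(SAMPLE))
```

On SAMPLE, three functions declare defaults and two keep them immutable, so the rate is two in three. A real measurement would run the same count per file and sum across the repository.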

Same code, opposite verdicts: a controlled proof that the gate reads the house

Take a function that is correct and ordinary in one codebase, and paste it unchanged into a sibling codebase that does that kind of thing differently. It still compiles and runs, so nothing is wrong with it except where it landed. doloop passes it in its home codebase and flags it in the foreign one, naming the convention that codebase keeps and how consistently it keeps it. The same code earns opposite verdicts, because doloop judges it against the codebase around it, not a universal rulebook.

transplanted function  | host: flask (sync, 100%) | host: quart (async, 82%)
flask dispatch_request | PASS                     | FLAG
quart dispatch_request | FLAG                     | PASS

A fixed-rule linter, or a model shown the function on its own, returns one verdict everywhere; it carries one rule for the world. Only a reader that calibrates on the host can flip. That flip is the product in a single frame.

Scope, stated plainly: the flip holds across two unrelated convention classes. On request-dispatch shape, flask/quart and requests/httpx both flip (with two honest nulls: convention-identical siblings, and a pair sharing the convention's direction with nothing to oppose). On diagnostic style, print versus log, doit/nox flips: doit prints its status messages, nox logs them, and the verdict follows the house either way. Every convention is inferred from each host's own consistency, leave-one-out so a pass cannot be circular, deterministic across re-runs.
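
The leave-one-out step can be sketched in a few lines. This is a minimal illustration, not doloop's implementation: it assumes each function has already been reduced to one label ("print" or "log"), and it reads the doit and nox figures reported on this page ("88%/8", "82%/28") as 7 of 8 and 23 of 28 functions. The names house_convention, verdict, and leave_one_out are hypothetical.

```python
from collections import Counter


def house_convention(labels):
    """Return (majority_label, rate, sample_size) for a host's observed labels."""
    counts = Counter(labels)
    label, hits = counts.most_common(1)[0]
    return label, hits / len(labels), len(labels)


def verdict(candidate_label, host_labels):
    """PASS if the candidate matches the host's majority convention, else FLAG."""
    label, _rate, _size = house_convention(host_labels)
    return "PASS" if candidate_label == label else "FLAG"


def leave_one_out(labels, index):
    """Judge one of the host's own functions against the host minus that function."""
    rest = labels[:index] + labels[index + 1:]
    return verdict(labels[index], rest), house_convention(rest)


# Assumed reading of the reported figures: 7 of 8 print, 23 of 28 log.
doit = ["print"] * 7 + ["log"] * 1
nox = ["log"] * 23 + ["print"] * 5

home_doit = leave_one_out(doit, 0)  # doit code judged inside doit
home_nox = leave_one_out(nox, 0)    # nox code judged inside nox
away_doit = verdict("print", nox)   # doit code pasted into nox
away_nox = verdict("log", doit)     # nox code pasted into doit

print(home_doit, home_nox, away_doit, away_nox)
```

Under that reading, removing the judged function leaves 6 of 7 in doit and 22 of 27 in nox, which round to the 86%/7 and 81%/27 leave-one-out figures reported for the diagonals. The function being judged never votes on its own verdict, which is what keeps a pass from being circular.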

The transplant result table

flask / quart · dispatch shape · sync vs async
host: flask sync 100% · host: quart async 82%
Pass · flask code · in flask · LOO 100%/16
Flag · flask code · in quart · sync, house is async
Flag · quart code · in flask · async, house is sync
Pass · quart code · in quart · LOO 82%/17

requests / httpx · send shape · sync vs async
host: requests sync 100%/9 · host: httpx async 100%/10
Pass · requests.send · in requests
Flag · requests.send · in httpx
Flag · httpx.send · in requests
Pass · httpx.send · in httpx

doit / nox · emit shape · print vs log
host: doit print 88%/8 · host: nox log 82%/28
Pass · doit.execute · in doit · LOO 86%/7
Flag · doit.execute · in nox · prints, house logs 82%/28
Flag · nox.execute · in doit · logs, house prints
Pass · nox.execute · in nox · LOO 81%/27

Three sibling pairs · two convention classes · leave-one-out on diagonals · deterministic across re-runs · anti-hardcode verified (incl. the nox_imposter folder test).

Determinism, proven: the cached digest equals a full read

doloop caches a few-kilobyte digest of a codebase and checks each commit against the digest, instead of re-reading millions of lines. The whole approach rests on that digest being byte-identical to a full read. Both halves are proven.

Cache equals full read. For flask, requests, rich, click, sqlalchemy, and httpx, two independent full reads produce the same canonical SHA-256, and the digest round-trips through serialize-and-reload to the same hash. The digest reproduces the full read's hash exactly.
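
One way to picture a canonical hash that survives serialize-and-reload is sketched below. It assumes the digest is a plain mapping of integer counts and that "canonical" means JSON with sorted keys and fixed separators; both are assumptions of this example, and the digest contents are hypothetical. doloop's actual digest format is not described on this page.

```python
import hashlib
import json


def canonical_bytes(digest):
    """Serialize with sorted keys and fixed separators so equal data gives equal bytes."""
    return json.dumps(digest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def digest_hash(digest):
    """SHA-256 over the canonical serialization."""
    return hashlib.sha256(canonical_bytes(digest)).hexdigest()


# Hypothetical digests: two independent reads that saw the same counts
# but built their dictionaries in a different order.
read_a = {
    "immutable_defaults": {"hits": 98, "total": 100},
    "act_on_errors": {"hits": 86, "total": 100},
}
read_b = {
    "act_on_errors": {"total": 100, "hits": 86},
    "immutable_defaults": {"total": 100, "hits": 98},
}

# Round trip: serialize, reload, and hash again.
reloaded = json.loads(canonical_bytes(read_a).decode("utf-8"))

print(digest_hash(read_a) == digest_hash(read_b) == digest_hash(reloaded))
```

Storing integer counts rather than precomputed rates is a deliberate choice in this sketch: integers serialize the same way on every run, so the hash does not depend on how a float happens to print.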

Incremental equals full re-read. The digest's counts are sums over functions, so counts(whole) = merge(counts(partA), counts(partB)). Tested on flask, requests, and sqlalchemy, both split-and-merge and drop-a-file-then-re-add reproduce the full-read hash exactly. When code changes the gate subtracts the changed files' old counts and adds the new ones, and stays byte-identical to re-reading everything. Freshness without losing determinism, at constant time per check.
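
The merge identity above can be checked with nothing more than collections.Counter. The sketch assumes each file has been reduced to a list of (convention, held) observations; the names count_file, merge, and replace_file, and the sample observations, are hypothetical.

```python
from collections import Counter


def count_file(observations):
    """Sum per-function observations into (convention, 'total' | 'hits') counts."""
    counts = Counter()
    for convention, held in observations:
        counts[(convention, "total")] += 1
        if held:
            counts[(convention, "hits")] += 1
    return counts


def merge(*parts):
    """counts(whole) is the element-wise sum of the parts."""
    total = Counter()
    for part in parts:
        total.update(part)
    return total


def replace_file(whole, old, new):
    """Subtract a changed file's old counts and add its new ones."""
    updated = Counter(whole)
    updated.subtract(old)
    updated.update(new)
    # Drop zero entries so the result compares equal to a fresh full read.
    return Counter({key: value for key, value in updated.items() if value != 0})


# Hypothetical per-file observations.
file_a = [("act_on_errors", True), ("act_on_errors", False), ("immutable_defaults", True)]
file_b = [("act_on_errors", True), ("immutable_defaults", True)]
file_b_new = [("act_on_errors", False)]

whole = merge(count_file(file_a), count_file(file_b))
incremental = replace_file(whole, count_file(file_b), count_file(file_b_new))

print(whole == count_file(file_a + file_b))
print(incremental == count_file(file_a + file_b_new))
```

Because addition is associative and commutative, the order of the split does not matter, and the cost of an update depends on the changed files rather than on the size of the repository.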

What didn't validate: the honest nulls

doloop led early with a name-and-documentation-drift lens. It failed under doloop's own benchmarks, and was demoted. The failures are reported as plainly as the wins.

Documentation-drift, as a deterministic pattern match: 1.4x (about chance)

Against CoDocBench, with 4,573 real coupled code-and-doc changes, the structural lens separated drifted docstrings from aligned ones at only 1.4x. Most real drift is prose, which a matcher cannot read.

Name-drift, as a pattern match: about 0 of 8 precision over 38,217 functions

The validate_, ensure_, and verify_ prefixes over-match setters, decorators, and predicates. Rejection takes many forms a pattern misses.

Where the name-and-documentation-drift lens lands: advisory only, via a caged 7B model at about 2.0x discrimination, never a block. The pattern this confirms is clean. A closed-form check wins on exact-answer questions, where a rate is the rate. It fails on linguistic ones, which are a model's job, under a deterministic harness.

So the blocking core is narrow on purpose. It is behavioral consistency, where this codebase acts on its errors 86% of the time, and the paired-operation absence, the one writer that calls execute() but never commit(). That last class is one lexical linters miss.
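
A paired-operation check of this kind can be sketched over the syntax tree. The example below is an illustration under stated assumptions, not doloop's gate: it matches calls by bare name (execute, commit), looks only inside a single function, and uses the hypothetical helpers called_names and pairing_report on made-up sample code.

```python
import ast


def called_names(func):
    """Names of everything called inside a function: obj.method() or plain f()."""
    names = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Call):
            target = node.func
            if isinstance(target, ast.Attribute):
                names.add(target.attr)
            elif isinstance(target, ast.Name):
                names.add(target.id)
    return names


def pairing_report(source, opener="execute", closer="commit"):
    """Return (pairing_rate, names of functions that call opener without closer)."""
    opened = 0
    unpaired = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        names = called_names(node)
        if opener not in names:
            continue
        opened += 1
        if closer not in names:
            unpaired.append(node.name)
    rate = (opened - len(unpaired)) / opened if opened else None
    return rate, unpaired


# Hypothetical sample: two writers commit, one does not.
SAMPLE = '''
def save_user(db, user):
    db.execute("INSERT INTO users VALUES (?)", (user,))
    db.commit()

def save_order(db, order):
    db.execute("INSERT INTO orders VALUES (?)", (order,))
    db.commit()

def save_audit(db, event):
    db.execute("INSERT INTO audit VALUES (?)", (event,))
'''

print(pairing_report(SAMPLE))
```

The flag is on something missing rather than something present, which is why a check that scans for forbidden tokens has nothing to match. The pairing rate carries the other half of the idea: the absence only means something when the rest of the codebase pairs the calls consistently.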

doloop retracts three claims, on the record. First, that the name-and-doc lens is the richest signal: that was a firing count, not a detection rate. Second, the use of a single construction-quality index as a quality score. Third, that models are blind to consistency bugs: a 12B model caught single-function drift 9 of 11 times.

The floor

Fig. The inference floor, on a consistency axis from 50% to 100%. The floor shipped at 70%; below it is silence, no convention held. invoke, at 69%, was refused. doit at 88% and nox at 82% sit above the floor. Below the floor the gate owes silence. The threshold refused invoke before anyone checked: the floor shipped before the run, and the agreement was unplanned.
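
Read as code, the floor is a single threshold on a host's consistency rate. The sketch uses the three rates reported in this section; the function name gate_speaks and the choice to treat exactly 70% as clearing the floor are assumptions.

```python
FLOOR = 0.70  # the inference floor reported on this page


def gate_speaks(rate, floor=FLOOR):
    """True if the host is consistent enough for a convention to be inferred."""
    return rate >= floor  # assumption: a rate exactly at the floor clears it


# Consistency rates as reported in this section.
rates = {"invoke": 0.69, "doit": 0.88, "nox": 0.82}
decisions = {name: ("infer" if gate_speaks(rate) else "silent")
             for name, rate in rates.items()}

print(decisions)
```

Below the floor the sketch returns "silent" rather than a weak warning, matching the page's position that no convention is held there.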

The six dimensions, tested

The Codebase Polysemy Contract spans five polysemy dimensions plus the cross-cutting paired-operation absence. Each one stands as follows when run as a deterministic, closed-form gate: two validate, security splits, handoff is a caged-reader class, and two are still design targets. The misses are stated as plainly as the hits.

Dimension | Verdict | Evidence
Functional / convention consistency | Validates, the gate | The workhorse, live. Heartbleed re-derived from OpenSSL's own 88% bounds-check adherence.
Paired-operation absence | Validates, the gate | An execute() with no commit() in the same function. A class lexical linters miss.
Security, bounds-local shape | Validates, precision only | A parameter-derived length reaching a copy with no bounds check. 0.7 to 3% base rate, low false-positive, near-zero recall on hardened code.
Security, guard-absence | Fails closed-form, demoted | The function-local pairing never forms (guards live upstream); lift 0.28. Needs inter-procedural taint, not AST counting.
Handoff (doc and name drift) | Fails closed-form, caged-reader | Doc-drift about chance on CoDocBench; name-drift about 0 of 8 precision. Advisory via a pinned model, never a block.
Performance, structural | Design targets | Specified, not yet built or backtested. Named so they read as a plan, not a current capability.

The repository vector space

The product object is not a single number. It is two vector spaces.

Repository space: one point per codebase, built from convention rates, paired operations, language mix, and metadata. The near-universal conventions are the low-variance axes. Quality is distance from the peer cluster and from the convention consensus. A measured caveat: convention vectors alone are noisy, since django's nearest neighbors came back as numpy tutorials. Domain comes from the language mix, so numpy clusters with scikit-learn, and django with tornado.

Findings space: a point for each flagged case, for the bug taxonomy and anomaly detection.

One convention, three altitudes

Galaxy · 194,119 repos · typed, mean 0.025 · a law: almost nobody on the long tail annotates
Siblings · modern web frameworks · typed: bottle 0.0, django 0.0, fastapi 0.785, urllib3 · uvicorn · starlette 1.0 · the same convention, a vernacular: the houses split
House · yours · your rate · a rule: whatever your house holds
Fig. typed, on an axis from 0 to 1: law at galaxy, vernacular among siblings, rule at home. andromeda run · 6 AST dimensions · sibling points exact

The reproducible meters

doloop meters in loops the way an LLM meters in tokens. Every value claim ships as a public, self-verifiable calculator: you run it on your own repo and reproduce the number.

loopmath defines one loop as one function judged on one feature, frozen. savemath is bugs caught, times hours, times rate, with your assumptions. shipmath is velocity kept, the AI write-rate divided by the review-rate. regret runs read-only on your git log: flask's history is 17% rework, one in six commits a redo. No vanity counters. The formula and the assumptions are yours.
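
Two of the meters are stated as plain arithmetic, so they can be written down directly. The sketch below follows the wording above and nothing more: parameter names, units, and the sample inputs are hypothetical, and the published calculators may define their terms differently.

```python
def savemath(bugs_caught, hours_per_bug, hourly_rate):
    """Bugs caught, times hours, times rate. All three inputs are the reader's assumptions."""
    return bugs_caught * hours_per_bug * hourly_rate


def shipmath(write_rate, review_rate):
    """AI write-rate divided by review-rate, both in the same unit (for example lines per hour)."""
    if review_rate <= 0:
        raise ValueError("review_rate must be positive")
    return write_rate / review_rate


# Hypothetical inputs, chosen only to show the arithmetic.
saved = savemath(bugs_caught=31, hours_per_bug=2, hourly_rate=100)
ratio = shipmath(write_rate=1000, review_rate=250)

print(saved, ratio)
```

Keeping every assumption as an explicit argument is the point: a reader who disagrees with an input changes it and reruns the number.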

Method, and a lineage in plain sight

The method is one move: infer a codebase's requirements from its own consistency, then flag only where a change betrays them.

On an AI-generated repo with no rulebook, the gate inferred "handlers act on the error" from the codebase's own 85% consistency. Then it flagged the 31 places the code betrayed itself. The requirement inheres in the consistency. The bug is the unevenness.
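
A minimal sketch of how such a requirement might be read off the code: count the except handlers, work out the share that do something with the error, and list the ones that do not. Defining a handler that "acts" as one whose body holds anything beyond pass or a bare constant is an assumption of this example, as are the names is_silent and act_on_errors and the sample code.

```python
import ast


def is_silent(handler):
    """True if the handler body is only `pass`, `...`, or a bare constant."""
    for stmt in handler.body:
        if isinstance(stmt, ast.Pass):
            continue
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
            continue
        return False  # any other statement counts as acting on the error
    return True


def act_on_errors(source):
    """Return (share of handlers that act, line numbers of the silent ones)."""
    handlers = [node for node in ast.walk(ast.parse(source))
                if isinstance(node, ast.ExceptHandler)]
    silent = [handler.lineno for handler in handlers if is_silent(handler)]
    if not handlers:
        return None, []
    return (len(handlers) - len(silent)) / len(handlers), silent


# Hypothetical sample: two handlers act, one swallows the error.
SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except OSError as exc:
        raise RuntimeError(path) from exc

def parse(text):
    try:
        return int(text)
    except ValueError:
        return None

def ping(client):
    try:
        client.ping()
    except ConnectionError:
        pass
'''

rate, silent_lines = act_on_errors(SAMPLE)
print(rate, silent_lines)
```

The rate is the inferred requirement and the silent handlers are the places the code departs from it. On a real repository the same count would only be acted on where the rate is high enough to call it a convention.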

The lenses come from the published work above, plainly credited. Naturalize learned a repo's own conventions and gated the commit at 94% accuracy, and named the universal-versus-flipping distinction. doloop's extension is to run those lenses deterministically and whole-repo, at commit speed, reaching from syntax into semantics. What matters is not whether a model appears. It is whether the verdict reproduces.

The numbers, in one place

quantity | value
codebases gathered | 236,701
proven catches (project deleted the flag by the next release) | 19 (5 hand-verified)
cross-release resolution study | 53 repos, 470 flag-then-fix observations
immutable-defaults / act-on-errors (universal) | 98% / 86%
read rate | ~59,000 LOC/sec (median)
calibration time | 0.27 s median
cache == full read | byte-identical, 6 repos
incremental == full re-read | byte-identical, 3 repos
doc-drift, pattern match (CoDocBench, 4,573) | 1.4x (about chance)
name-drift, pattern match (38,217 funcs) | ~0/8 precision
name-and-doc-drift, caged 7B model | ~2.0x (advisory only)
rework rate (flask history) | 17% of commits
rework rate (GitHub corpus 2018-2022) | ~18-19% (stable)
speed vs an LLM reviewer | ~half a second vs minutes

Every claim on this page is backed by code you can run yourself against your own repo. The scripts are available on request.
