RESEARCH

Built on published work. Measured in the open.

The idea that code can be read like a language, and that a project's own conventions can be learned and checked, is decades old in the literature. This page separates what the literature established from what doloop (a deterministic code reviewer that runs at the commit) measured on top of it. Every number here traces to a script or a corpus you can re-run.

Review, not writing, is the bottleneck

The thesis doloop is built on is not our claim to make: the binding cost of AI code is fit rather than correctness, and review, not writing, is the bottleneck. The studies below measure it, each cited to its primary source and carrying its own honest scope. They converge from different methods. None of them validates doloop, and we do not claim any does.

Engineers are hand-building the guardrails doloop infers

In June 2026, Florian Buetow, an AI engineer at Xebia, spent forty minutes of a public podcast on the code-review bottleneck. He arrived at the same diagnosis, reached for the same shape of solution, and named two of doloop's own example rules by hand. He is not a doloop user and has never seen it.

Two viewers, neither of whom has seen doloop, stated the thesis in their own words. One: "the number of conventions in the codebase increases, each part is not wrong but the coherence is low, no amount of the model will get better will fix this." Another: "review needs to become evidence management, not just another human approval step, otherwise teams just trade review fatigue for scanner fatigue." That is the problem and the product, described by strangers.

The literature this builds on

None of the core lenses are ours to claim. The published work below set them up. doloop's contribution is narrower, and builds on it.

What doloop adds is the part the literature never shipped as a product: these lenses run deterministically, inferred from your own code, across the whole repository, fast enough to gate a single commit. Read code as language. Choose the reader by the question.

In the cases we checked, it flagged what a project later went back and fixed

The gate ran against the release history of mature open-source libraries. In 19 cases we checked, it flagged a deviation in one release that the maintainers themselves deleted by the next (pilot, n=19; 5 hand-verified). This is a retrospective read of cases we examined, not a prediction of future bugs.

The gate does not invent a rule. In the cases we checked, it found the spot a project later fixed on its own, at the commit, instead of weeks later in production. No proof on the page is cleaner or less confounded. The standard is the codebase's own, and the gate reads it back.

Conventions almost every codebase shares

doloop carries no rulebook. It infers each codebase's conventions from how consistent that codebase is with itself. Across many codebases, it then asks which conventions are near-universal and which flip between one repo and the next.

A near-universal convention is one the gate can block on. A convention that flips, the gate can only warn about. The split is not a label anyone assigned; it falls out of the data. Re-measured as the corpus grew, the safety floor held at every scale.

corpus | immutable-defaults (universal) | act-on-errors (universal) | type-hints
88 codebases | 100% | 96% | split
450 | 100% | 96% | split
7,637 | 99% | 91% | style
236,701 codebases | 98% | 86% | 97% do NOT

This consistency corpus (up to 236,701 codebases) is a separate, larger sample than the 194,119-repository convention catalog shown below. The two measure different things: this one tests how the safety floor holds as the sample grows; the catalog is the profiled crawl the laws and mores were drawn from.

immutable-defaults is the near-absolute one, holding at 98 to 100% at every scale. act-on-errors softens as the corpus broadens, because the long tail swallows more of its errors. At full scale, not type-hinting is a 97% consensus. That is why the gate compares your repo to the best codebases in its domain, not the average. The average would tell a beginner to skip what the best do.

Scope, stated plainly: these conventions are Python idioms, measured over the syntax tree. Per-language standards are on the roadmap, not a result yet.
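
What "measured over the syntax tree" can look like for one convention is sketched below. It is a minimal illustration, not doloop's implementation. It assumes immutable-defaults means "no list, dict, or set literal as a default argument", and the function name immutable_default_rate is hypothetical. Defaults built by calls such as list() are not covered.

```python
import ast

# Literal node types treated as mutable defaults (an assumption of this sketch).
MUTABLE_LITERALS = (ast.List, ast.Dict, ast.Set)


def immutable_default_rate(source: str) -> tuple:
    """Return (clean, total) over functions that declare at least one default."""
    clean = total = 0
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Positional defaults plus keyword-only defaults (None marks "no default").
        defaults = node.args.defaults + [d for d in node.args.kw_defaults if d is not None]
        if not defaults:
            continue
        total += 1
        if not any(isinstance(d, MUTABLE_LITERALS) for d in defaults):
            clean += 1
    return clean, total


SAMPLE = '''
def a(x, items=None): ...
def b(x, retries=3): ...
def c(x, seen=[]): ...
def d(x): ...
'''

clean, total = immutable_default_rate(SAMPLE)
print(f"{clean}/{total} functions keep immutable defaults")  # 2/3
```

The rate is a count over one codebase's own functions, which is why it needs no rulebook: the same walk yields a different rate, and a different standard, in every repository.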

Same code, opposite verdicts: a controlled proof that the gate reads the codebase

Take a function that is correct and ordinary in one codebase, and paste it unchanged into a sibling codebase that does that kind of thing differently. It still compiles and runs, so nothing is wrong with it except where it landed. doloop passes it in its home codebase and flags it in the foreign one, naming the convention that codebase keeps and how consistently it keeps it. The same code earns opposite verdicts, because doloop judges it against the codebase around it, not a universal rulebook.

code | host: flask (sync, 100%) | host: quart (async, 82%)
flask dispatch_request | PASS | FLAG
quart dispatch_request | FLAG | PASS

A fixed-rule linter, or a model shown the function on its own, returns one verdict everywhere; it carries one rule for the world. Only a reader that calibrates on the host can flip. That flip is the product in a single frame.

Scope, stated plainly (pilot, n=3 sibling pairs): the flip holds across two unrelated convention classes. On request-dispatch shape, flask/quart and requests/httpx both flip. Two honest nulls sit alongside them: convention-identical siblings, and a pair that shares the convention's direction with nothing to oppose. On diagnostic style, print versus log, doit/nox flips: doit prints its status messages, nox logs them, and the verdict follows the codebase either way. Every convention is inferred from each host's own consistency, leave-one-out so that a pass cannot be circular, and the result is deterministic across re-runs.

The transplant result table

flask / quart · dispatch shape · sync vs async
host: flask sync 100% · host: quart async 82%
Pass · flask code · in flask · LOO 100%/16
Flag · flask code · in quart · sync, codebase is async
Flag · quart code · in flask · async, codebase is sync
Pass · quart code · in quart · LOO 82%/17

requests / httpx · send shape · sync vs async
host: requests sync 100%/9 · host: httpx async 100%/10
Pass · requests.send · in requests
Flag · requests.send · in httpx
Flag · httpx.send · in requests
Pass · httpx.send · in httpx

doit / nox · emit shape · print vs log
host: doit print 88%/8 · host: nox log 82%/28
Pass · doit.execute · in doit · LOO 86%/7
Flag · doit.execute · in nox · prints, codebase logs 82%/28
Flag · nox.execute · in doit · logs, codebase prints
Pass · nox.execute · in nox · LOO 81%/27

Pilot, n=3 sibling pairs · two convention classes · leave-one-out on diagonals · deterministic across re-runs · anti-hardcode verified (incl. the nox_imposter folder test).
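
The leave-one-out figures in the doit / nox rows can be checked by hand. The sketch below assumes "88%/8" means 7 of 8 functions print and "82%/28" means 23 of 28 log. The names loo_rate and verdict are illustrative, not doloop's API.

```python
def loo_rate(adherents: int, total: int, candidate_adheres: bool) -> float:
    """Host convention rate with the candidate function removed from the count.

    Removing the candidate keeps a home-codebase pass from being circular:
    the function is judged against its neighbors, never against itself.
    """
    return (adherents - int(candidate_adheres)) / (total - 1)


def verdict(candidate_shape: str, host_shape: str) -> str:
    """PASS when the candidate follows the host's majority shape, else FLAG."""
    return "PASS" if candidate_shape == host_shape else "FLAG"


# doit.execute judged at home: 7 of 8 print; drop the candidate, 6 of 7 remain.
print(round(100 * loo_rate(7, 8, candidate_adheres=True)))    # 86
# nox.execute judged at home: 23 of 28 log; drop the candidate, 22 of 27 remain.
print(round(100 * loo_rate(23, 28, candidate_adheres=True)))  # 81

# The same printing function earns opposite verdicts in the two hosts.
print(verdict("print", host_shape="print"))  # PASS (in doit)
print(verdict("print", host_shape="log"))    # FLAG (in nox)
```

A transplanted function is not among the foreign host's own functions, so on the off-diagonal cells the host rate is used as measured and nothing needs to be left out.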

Determinism, proven: the cached digest equals a full read

doloop caches a few-kilobyte digest of a codebase and checks each commit against the digest, instead of re-reading millions of lines. The whole approach rests on that digest being byte-identical to a full read. Both halves are proven.

Cache equals full read. For flask, requests, rich, click, sqlalchemy, and httpx, two independent full reads produce the same canonical SHA-256, and the digest round-trips through serialize-and-reload to the same hash. The digest reproduces the full read's hash exactly.

Incremental equals full re-read. The digest's counts are sums over functions, so counts(whole) = merge(counts(partA), counts(partB)). Tested on flask, requests, and sqlalchemy, both split-and-merge and drop-a-file-then-re-add reproduce the full-read hash exactly. When code changes, the gate subtracts the changed files' old counts and adds the new ones, and stays byte-identical to re-reading everything. Freshness without losing determinism, at constant time per check.
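
Why additive counts make both properties hold can be shown in a few lines. This is a sketch under stated assumptions, not doloop's digest format: the digest is modeled as a Counter, the canonical form is sorted JSON with zero entries dropped, and the count names are hypothetical.

```python
import hashlib
import json
from collections import Counter


def canonical_hash(counts: Counter) -> str:
    """SHA-256 over a canonical form: sorted keys, zero entries dropped."""
    canonical = {key: value for key, value in counts.items() if value != 0}
    payload = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def merge(*parts: Counter) -> Counter:
    """Counts are sums over functions, so merging is plain addition."""
    total = Counter()
    for part in parts:
        total.update(part)
    return total


def apply_change(digest: Counter, old_file: Counter, new_file: Counter) -> Counter:
    """Subtract a changed file's old counts, then add its new ones."""
    updated = Counter(digest)
    updated.subtract(old_file)
    updated.update(new_file)
    return updated


# Hypothetical per-file counts.
part_a = Counter({"handlers": 4, "handlers_acting": 3})
part_b = Counter({"handlers": 2, "handlers_acting": 2})
whole = merge(part_a, part_b)

# Serialize-and-reload round trip keeps the hash.
reloaded = Counter(json.loads(json.dumps(whole)))
assert canonical_hash(reloaded) == canonical_hash(whole)

# Incremental update equals a full re-read after part_a changes.
part_a_new = Counter({"handlers": 5, "handlers_acting": 3})
incremental = apply_change(whole, part_a, part_a_new)
assert canonical_hash(incremental) == canonical_hash(merge(part_a_new, part_b))

# Drop-a-file-then-re-add returns to the original hash.
dropped = apply_change(whole, part_b, Counter())
assert canonical_hash(apply_change(dropped, Counter(), part_b)) == canonical_hash(whole)
```

The update touches only the changed file's counts, so its cost does not grow with the size of the repository.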

What didn't validate: the honest nulls

doloop led early with a name-and-documentation-drift lens. It failed under doloop's own benchmarks, and was demoted. The failures are reported as plainly as the wins.

Documentation-drift, as a deterministic pattern match: 1.4x (about chance)

Against CoDocBench, with 4,573 real coupled code-and-doc changes, the structural lens told drifted from aligned docstrings at only 1.4x. Most real drift is prose, which a matcher cannot read.

Name-drift, as a pattern match: about 0 of 8 precision over 38,217 functions

The validate_, ensure_, and verify_ prefixes over-match setters, decorators, and predicates. Rejection takes many forms a pattern misses.

Where the name-and-documentation-drift lens lands: advisory only, via a caged 7B model at about 2.0x discrimination, never a block. The pattern this confirms is clean. A closed-form check wins on exact-answer questions, where a rate is the rate. It fails on linguistic ones, which are a model's job, under a deterministic harness.

So the blocking core is narrow on purpose. It is behavioral consistency, where this codebase acts on its errors 86% of the time, and the paired-operation absence, the one writer that calls execute() but never commit(). That last class is one lexical linters miss.
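
The paired-operation absence can be sketched as a function-local count over the syntax tree. This is an illustration under assumptions, not doloop's gate: it matches attribute-style calls by method name only, and paired_operation_report is a hypothetical name.

```python
import ast


def attribute_calls(func: ast.AST) -> set:
    """Method names called inside one function, e.g. 'execute' for conn.execute(...)."""
    return {
        node.func.attr
        for node in ast.walk(func)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
    }


def paired_operation_report(source: str, opener: str = "execute", closer: str = "commit"):
    """Return (paired, total, outliers) over functions that call `opener`."""
    paired, outliers = 0, []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        calls = attribute_calls(node)
        if opener not in calls:
            continue  # not a writer, so the pairing does not apply
        if closer in calls:
            paired += 1
        else:
            outliers.append(node.name)
    return paired, paired + len(outliers), outliers


SAMPLE = '''
def save_user(conn, name):
    conn.execute("INSERT INTO users VALUES (?)", (name,))
    conn.commit()

def save_order(conn, item):
    conn.execute("INSERT INTO orders VALUES (?)", (item,))
    conn.commit()

def save_audit(conn, event):
    conn.execute("INSERT INTO audit VALUES (?)", (event,))
'''

print(paired_operation_report(SAMPLE))  # (2, 3, ['save_audit'])
```

No single line in save_audit is wrong, which is why a lexical linter has nothing to match. The finding exists only relative to the other writers in the same codebase.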

doloop retracts three claims, on the record. First, that the name-and-doc lens is the richest signal: that was a firing count, not a detection rate. Second, the use of a single construction-quality index as a quality score. Third, that models are blind to consistency bugs: a 12B model caught single-function drift 9 of 11 times.

The floor

Fig. The floor. Axis from 50% to 100%. Inference floor: 70%, shipped. Below the floor: silence, no convention held. invoke 69%: refused. doit 88% and nox 82%: above the floor.
Below the floor the gate owes silence. The threshold refused invoke before anyone checked. The floor shipped before the run; the agreement was unplanned.
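
The floor reduces to a single comparison. The sketch below uses the 70% value and the three rates from the figure. Whether the boundary itself is inclusive is an assumption here, and holds_convention is a hypothetical name.

```python
INFERENCE_FLOOR = 0.70  # from the figure; inclusive boundary is assumed


def holds_convention(rate: float) -> bool:
    """True when a codebase is consistent enough for a convention to be inferred."""
    return rate >= INFERENCE_FLOOR


for name, rate in [("invoke", 0.69), ("doit", 0.88), ("nox", 0.82)]:
    outcome = "convention inferred" if holds_convention(rate) else "silence"
    print(f"{name}: {outcome}")
# invoke: silence
# doit: convention inferred
# nox: convention inferred
```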

The six dimensions, tested

The Codebase Polysemy Contract spans five polysemy dimensions plus the cross-cutting paired-operation absence. Each one stands as follows when run as a deterministic, closed-form gate: two validate, security splits, handoff is a caged-reader class, and two are still design targets. The misses are stated as plainly as the hits.

Dimension | Verdict | Evidence
Functional / convention consistency | Validates, the gate | The workhorse, live. Heartbleed re-derived from OpenSSL's own 88% bounds-check adherence (pilot, n=1 worked case).
Paired-operation absence | Validates, the gate | An execute() with no commit() in the same function. A class lexical linters miss.
Security, bounds-local shape | Validates, precision only | A parameter-derived length reaching a copy with no bounds check. 0.7 to 3% base rate, low false-positive, near-zero recall on hardened code.
Security, guard-absence | Fails closed-form, demoted | The function-local pairing never forms (guards live upstream); lift 0.28. Needs inter-procedural taint, not AST counting.
Handoff (doc and name drift) | Fails closed-form, caged-reader | doc-drift about chance on CoDocBench; name-drift about 0 of 8 precision. Advisory via a pinned model, never a block.
Performance, structural | Design targets | Specified, not yet built or backtested. Named so they read as a plan, not a current capability.

The repository vector space

The product object is not a single number. It is two vector spaces.

Repository space: one point per codebase, built from convention rates, paired operations, language mix, and metadata. The near-universal conventions are the low-variance axes. Quality is distance from the peer cluster and from the convention consensus. A measured caveat: convention vectors alone are noisy, since django's nearest neighbors came back as numpy tutorials. Domain comes from the language mix, so numpy clusters with scikit-learn, and django with tornado.

Findings space: a point for each flagged case, for the bug taxonomy and anomaly detection.
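
Distance from the peer cluster can be sketched with plain Euclidean geometry. Everything specific below is an assumption for illustration: the three axes, the repository names, the rates, and the choice of centroid distance as the metric. The source does not state how doloop computes it.

```python
import math

# Hypothetical convention-rate vectors: (immutable-defaults, act-on-errors, typed).
PEERS = {
    "peer_a": (0.99, 0.90, 0.80),
    "peer_b": (1.00, 0.88, 0.75),
    "peer_c": (0.98, 0.92, 0.85),
}
CANDIDATE = (0.97, 0.60, 0.10)


def centroid(vectors):
    """Component-wise mean of equally long vectors."""
    vectors = list(vectors)
    return tuple(sum(axis) / len(vectors) for axis in zip(*vectors))


def distance_from_peers(point, peers):
    """Euclidean distance from one repository to the centroid of its peer cluster."""
    return math.dist(point, centroid(peers.values()))


candidate_distance = distance_from_peers(CANDIDATE, PEERS)
peer_distances = {name: distance_from_peers(vec, PEERS) for name, vec in PEERS.items()}

print(f"candidate: {candidate_distance:.2f}")
for name, dist in peer_distances.items():
    print(f"{name}: {dist:.2f}")
```

The caveat in the text applies to any sketch like this one: the peer set has to be chosen by domain first, or the nearest neighbors are noise.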

One convention, three altitudes

Galaxy · 194,119 repos · typed, axis 0 to 1 · mean 0.025 · a law: almost nobody on the long tail annotates
Siblings · modern web frameworks · typed, axis 0 to 1 · bottle 0.0 · django 0.0 · fastapi 0.785 · urllib3 · uvicorn · starlette 1.0 · the same convention · a vernacular: the codebases split
Codebase · yours · typed, axis 0 to 1 · your rate · a rule: whatever your codebase holds
Fig. typed · law at galaxy, vernacular among siblings, rule at home · 6 AST dimensions · sibling points exact

Method, and a lineage in plain sight

The method is one move: infer a codebase's requirements from its own consistency, then flag only where a change betrays them.

On one AI-generated repo with no rulebook (pilot, n=1), the gate inferred "handlers act on the error" from the codebase's own 85% consistency. Then it flagged the 31 places the code betrayed itself. The requirement inheres in the consistency. The bug is the unevenness.
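
The one move, infer then flag, can be sketched for the act-on-errors convention. This is an illustration, not the gate itself. It assumes a handler "acts" when its body is anything other than a bare pass or ellipsis, and the function names are hypothetical.

```python
import ast


def swallows(handler: ast.ExceptHandler) -> bool:
    """True when the handler body is only `pass` or `...` (assumed meaning of not acting)."""
    return all(
        isinstance(stmt, ast.Pass)
        or (
            isinstance(stmt, ast.Expr)
            and isinstance(stmt.value, ast.Constant)
            and stmt.value.value is Ellipsis
        )
        for stmt in handler.body
    )


def act_on_errors(source: str):
    """Return (acting, total, flagged_lines) over every except handler in the source."""
    handlers = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.ExceptHandler)]
    flagged = [h.lineno for h in handlers if swallows(h)]
    return len(handlers) - len(flagged), len(handlers), flagged


SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except OSError as exc:
        raise RuntimeError(path) from exc

def parse(text):
    try:
        return int(text)
    except ValueError:
        return None

def ping(client):
    try:
        client.ping()
    except Exception:
        pass
'''

acting, total, flagged = act_on_errors(SAMPLE)
print(f"{acting}/{total} handlers act on the error; {len(flagged)} flagged")
# 2/3 handlers act on the error; 1 flagged
```

The rate is the inferred requirement and the flagged handlers are where the code departs from it. Both come out of the same pass over the codebase's own handlers.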

The lenses come from the published work above, plainly credited. Naturalize learned a repo's own conventions and gated the commit at 94% accuracy, and named the universal-versus-flipping distinction. doloop's extension is to run those lenses deterministically and whole-repo, at commit speed, reaching from syntax into semantics. What matters is not whether a model appears. It is whether the verdict reproduces.

The numbers, in one place

quantity | value
codebases gathered | 236,701
cases we checked where the project deleted the flag by the next release | 19 (pilot, n=19; 5 hand-verified)
cross-release resolution study | pilot: 53 repos, 470 flag-then-fix observations
immutable-defaults / act-on-errors (universal) | 98% / 86%
read rate | ~59,000 LOC/sec (median)
calibration time | 0.27 s median
cache == full read | byte-identical, 6 repos
incremental == full re-read | byte-identical, 3 repos
doc-drift, pattern match (CoDocBench, 4,573) | 1.4x (about chance)
name-drift, pattern match (38,217 funcs) | ~0/8 precision
name-and-doc-drift, caged 7B model | ~2.0x (advisory only)
rework rate (flask history) | 17% of commits
rework rate (GitHub corpus 2018-2022) | ~18-19% (stable)
speed vs an LLM reviewer | ~half a second vs minutes

Every claim on this page is backed by code you can run yourself against your own repo. The scripts are available on request.
