The method & the evidence

Review, not writing, is the bottleneck

The thesis doloop is built on is not our claim to make: the binding cost of AI code is fit rather than correctness, and review, not writing, is the bottleneck. The studies below measure it, each cited to its primary source and carrying its own honest scope. They converge from different methods. None validates doloop, and we do not claim any of them does.

A randomized controlled trial found experienced developers 19% slower with AI in mature repositories they had maintained for years, against their own forecast of a 24% speedup. The authors attribute the cost to "implicit repository context," the undocumented conventions the model could not see. METR, 2025 (16 developers, a snapshot of early-2025 tools, not a universal law).
On a maintainer-graded benchmark of "would this actually be merged," the strongest model cleared only 13.4% of the hardest tasks, graded on style and adherence to the codebase's standards, not just whether tests pass. Cognition FrontierCode, 2026 (vendor-built and recent; striking because it isolates the gap between passing tests and fitting the codebase).
84% of developers use AI, but the number-one frustration, cited by 66%, is output that is "almost right, but not quite," and only 29% trust its accuracy. The exact gap doloop reads. Stack Overflow 2025 Developer Survey (49,000+ responses).
Across 211 million changed lines, refactored "moved" code fell from 24.1% to 9.5% over the AI-adoption years, while copy-pasted code rose: more duplication, less reuse, the opposite of consolidating code to fit what is already there. GitClear, 2025 (correlational and proprietary; a structural signature, not a controlled comparison).
A separate axis, not the convention-fit thesis: 45% of AI-generated code introduced a known security flaw, and the security pass rate stayed flat at 45 to 55% regardless of model size. Veracode 2025 (vendor-interested; security is its own failure mode, listed here only to show that scaling fixes syntax, not fit).

Engineers are hand-building the guardrails doloop infers

In June 2026, Florian Buetow, an AI engineer at Xebia, spent forty minutes on a public podcast on the code-review bottleneck. He arrives at the same diagnosis, reaches for the same shape of solution, and names two of doloop's own example rules by hand. He is not a doloop user and has never seen it.

He names the bottleneck doloop targets: "How do you scale the reviewing process? Because now that is blocking your senior engineers. It burns them out." He wants "deterministic guardrails that execute cheaply and quickly," run "on the developer's laptop, not GitHub." Deterministic, cheap, local, at the commit. He flags "the risky part, the non-deterministic behavior" of using a model as the guard.
He reaches, unprompted, for two of doloop's own catches: "I don't want any default values in any of my methods in Python," and "let's never swallow any errors, any error must always be propagated." The second is act-on-errors, the near-universal law at the top of the barometer above. A skeptic's strongest move is that these conventions are arbitrary; an independent expert naming the same two is evidence they are real and felt.
The difference is the wedge: he writes and maintains these rules by hand, with semantic-grep, iterating for months. doloop reads them off your code in under a second, host-relative, and the transplant test shows it following the codebase in a way a fixed ruleset cannot. He is the proof that the pain is real, not that he would switch.

Two viewers, neither of whom has seen doloop, stated the thesis in their own words. One: "the number of conventions in the codebase increases, each part is not wrong but the coherence is low, no amount of the model will get better will fix this." Another: "review needs to become evidence management, not just another human approval step, otherwise teams just trade review fatigue for scanner fatigue." That is the problem and the product, described by strangers.

The literature this builds on

None of the core lenses are ours to claim. The published work below set them up. doloop's contribution is narrower, and builds on it.

Hindle, Barr, Su, Gabel, and Devanbu, On the Naturalness of Software (ICSE 2012). The finding that source code is as repetitive and predictable as natural language, which is what makes a statistical or rule-based reader of code work at all.
Allamanis, Barr, Bird, and Sutton, Learning Natural Coding Conventions (FSE 2014), and the Naturalize tool. It learned a repository's own conventions from that repository and suggested fixes at the commit, with high accuracy. It also named the split this page leans on: conventions that nearly every codebase shares versus conventions that vary from project to project.
CoDocBench, a benchmark of real, coupled code-and-documentation changes. doloop's documentation-drift lens is tested against it, and it is where that lens fails honestly (below).
The result that larger models take more shortcuts as they scale (arXiv 2305.17256). This is why a separate, deterministic check earns its keep: the more capable the model writing the code, the more it has license to skip the consistency a checker can still verify.

What doloop adds is the part the literature never shipped as a product: these lenses run deterministically, inferred from your own code, across the whole repository, fast enough to gate a single commit. Read code as language. Choose the reader by the question.

Conventions almost every codebase shares

doloop carries no rulebook. It infers each codebase's conventions from how consistent that codebase is with itself. Across many codebases, it then asks which conventions are near-universal and which flip between one repo and the next.

A near-universal convention is one the gate can block on. A convention that flips, the gate can only warn about. The split is not a label anyone assigned; it falls out of the data. Re-measured as the corpus grew, the safety floor held at every scale.

corpus	immutable-defaults (universal)	act-on-errors (universal)	type-hints
88 codebases	100%	96%	split
450 codebases	100%	96%	split
7,637 codebases	99%	91%	style
236,701 codebases	98%	86%	97% do NOT

This consistency corpus (up to 236,701 codebases) is a separate, larger sample than the 194,119-repository convention catalog shown below. The two measure different things: this one tests how the safety floor holds as the sample grows; the catalog is the profiled crawl the laws and mores were drawn from.

immutable-defaults is the near-absolute one, holding at 98 to 100% at every scale. act-on-errors softens as the corpus broadens, because the long tail swallows more of its errors. At full scale, not type-hinting is a 97% consensus. That is why the gate compares your repo to the best codebases in its domain, not the average. The average would tell a beginner to skip what the best do.

Scope, stated plainly: these conventions are Python idioms, measured over the syntax tree. Per-language standards are on the roadmap, not a result yet.
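
As a rough illustration of what "measured over the syntax tree" can look like, the sketch below uses Python's standard `ast` module to count how many function defaults in a module are immutable. The function name `immutable_default_rate`, the sample source, and the choice of which node types count as mutable are assumptions made for this example, not doloop's implementation.

```python
import ast

# Node types treated as mutable literals in this sketch (an assumption;
# calls such as list() or dict() are not covered here).
MUTABLE_NODES = (ast.List, ast.Dict, ast.Set, ast.ListComp, ast.DictComp, ast.SetComp)


def immutable_default_rate(source: str) -> tuple[int, int]:
    """Return (immutable_defaults, total_defaults) for one module's source."""
    tree = ast.parse(source)
    immutable = total = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # kw_defaults holds None for keyword-only args without a default.
            defaults = node.args.defaults + [
                d for d in node.args.kw_defaults if d is not None
            ]
            for default in defaults:
                total += 1
                if not isinstance(default, MUTABLE_NODES):
                    immutable += 1
    return immutable, total


# Hypothetical sample: three immutable defaults and one mutable one.
SAMPLE = '''
def ok(a, b=None, c=0): ...
def risky(items=[]): ...
def also_ok(*, name="x"): ...
'''

immutable, total = immutable_default_rate(SAMPLE)
print(f"{immutable}/{total} defaults are immutable")
```

Summing these two counts over every file in a repository would give a per-codebase rate of the kind the table above reports.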

Same code, opposite verdicts: a controlled proof that the gate reads the codebase

Take a function that is correct and ordinary in one codebase, and paste it unchanged into a sibling codebase that does that kind of thing differently. It still compiles and runs, so nothing is wrong with it except where it landed. doloop passes it in its home codebase and flags it in the foreign one, naming the convention that codebase keeps and how consistently it keeps it. The same code earns opposite verdicts, because doloop judges it against the codebase around it, not a universal rulebook.

	host: flask (sync, 100%)	host: quart (async, 82%)
flask `dispatch_request`	PASS	FLAG
quart `dispatch_request`	FLAG	PASS

A fixed-rule linter, or a model shown the function on its own, returns one verdict everywhere; it carries one rule for the world. Only a reader that calibrates on the host can flip. That flip is the product in a single frame.

Scope, stated plainly (pilot, n=3 sibling pairs): the flip holds across two unrelated convention classes. On request-dispatch shape, flask/quart and requests/httpx both flip (with two honest nulls: convention-identical siblings, and a pair sharing the convention's direction with nothing to oppose). On diagnostic style, print versus log, doit/nox flips: doit prints its status messages, nox logs them, and the verdict follows the codebase either way. Every convention is inferred from each host's own consistency, leave-one-out so a pass cannot be circular, deterministic across re-runs.
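
The leave-one-out idea can be sketched in a few lines: the function under test is removed from the host before the host's convention is measured, so it cannot vote for itself. Everything named here is hypothetical, including `leave_one_out_verdict`, `transplant_verdict`, the `SILENT` outcome, and the 0.8 floor; the label lists are toy data loosely shaped like the table above.

```python
from collections import Counter


def _judge(labels: list[str], candidate: str, floor: float) -> tuple[str, float]:
    """Compare a candidate label with the majority label of the given functions."""
    majority, count = Counter(labels).most_common(1)[0]
    rate = count / len(labels)
    if rate < floor:
        # Too inconsistent to infer a convention: say nothing.
        return "SILENT", rate
    return ("PASS" if candidate == majority else "FLAG"), rate


def leave_one_out_verdict(host: list[str], index: int, floor: float = 0.8):
    """Judge a function already in the host against all the *other* functions."""
    rest = host[:index] + host[index + 1:]
    return _judge(rest, host[index], floor)


def transplant_verdict(host: list[str], candidate: str, floor: float = 0.8):
    """Judge a function pasted in from elsewhere against the whole host."""
    return _judge(host, candidate, floor)


# Toy hosts: one fully sync, one mostly async (assumed label lists).
FLASK_LIKE = ["sync"] * 17
QUART_LIKE = ["async"] * 15 + ["sync"] * 3

print(leave_one_out_verdict(FLASK_LIKE, 0))   # home code, judged without itself
print(transplant_verdict(QUART_LIKE, "sync"))  # same shape, foreign host
```

The same candidate label earns a different verdict depending only on which host list it is judged against, which is the flip the table records.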

The transplant result table

flask / quart · dispatch shape · sync vs async
host: flask sync 100% · host: quart async 82%
Pass · flask code · in flask · LOO 100%/16
Flag · flask code · in quart · sync, codebase is async
Flag · quart code · in flask · async, codebase is sync
Pass · quart code · in quart · LOO 82%/17

requests / httpx · send shape · sync vs async
host: requests sync 100%/9 · host: httpx async 100%/10
Pass · requests.send · in requests
Flag · requests.send · in httpx
Flag · httpx.send · in requests
Pass · httpx.send · in httpx

doit / nox · emit shape · print vs log
host: doit print 88%/8 · host: nox log 82%/28
Pass · doit.execute · in doit · LOO 86%/7
Flag · doit.execute · in nox · prints, codebase logs 82%/28
Flag · nox.execute · in doit · logs, codebase prints
Pass · nox.execute · in nox · LOO 81%/27

Pilot, n=3 sibling pairs · two convention classes · leave-one-out on diagonals · deterministic across re-runs · anti-hardcode verified (incl. the nox_imposter folder test).

Determinism, proven: the cached digest equals a full read

doloop caches a few-kilobyte digest of a codebase and checks each commit against the digest, instead of re-reading millions of lines. The whole approach rests on the cached digest being byte-identical to the digest a fresh full read would produce. Both halves are proven.

Cache equals full read. For flask, requests, rich, click, sqlalchemy, and httpx, two independent full reads produce the same canonical SHA-256, and the digest round-trips through serialize-and-reload to the same hash. The digest reproduces the full read's hash exactly.
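
One common way to get a hash that survives key ordering and a serialize-and-reload round trip is to hash a canonical encoding. The sketch below assumes a JSON digest with sorted keys; the digest layout, the counts in it, and the name `canonical_hash` are illustrative assumptions, since the source does not describe doloop's actual digest format.

```python
import hashlib
import json


def canonical_hash(digest: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    payload = json.dumps(digest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Hypothetical digest contents; the numbers are placeholders, not measurements.
digest_a = {
    "act_on_errors": {"hits": 43, "total": 50},
    "immutable_defaults": {"hits": 98, "total": 100},
}
# Same content built in a different key order, as a second independent read might.
digest_b = {
    "immutable_defaults": {"total": 100, "hits": 98},
    "act_on_errors": {"total": 50, "hits": 43},
}
# Serialize and reload, as a cache would.
reloaded = json.loads(json.dumps(digest_a))

assert canonical_hash(digest_a) == canonical_hash(digest_b) == canonical_hash(reloaded)
```

Canonicalizing before hashing is what lets two reads that agree on content also agree on bytes.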

Incremental equals full re-read. The digest's counts are sums over functions, so counts(whole) = merge(counts(partA), counts(partB)). Tested on flask, requests, and sqlalchemy, both split-and-merge and drop-a-file-then-re-add reproduce the full-read hash exactly. When code changes, the gate subtracts the changed files' old counts and adds the new ones, and stays byte-identical to re-reading everything. Freshness without losing determinism, at constant time per check.
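
Because the counts are plain sums, the merge and the subtract-then-add update can be sketched with `collections.Counter`. The key names, the numbers, and the helpers `merge` and `replace_file` are assumptions for illustration only.

```python
from collections import Counter


def merge(a: Counter, b: Counter) -> Counter:
    """Combine the counts of two disjoint parts of a codebase."""
    merged = Counter(a)
    merged.update(b)
    return merged


def replace_file(total: Counter, old: Counter, new: Counter) -> Counter:
    """Swap one file's contribution without re-reading anything else."""
    updated = Counter(total)
    updated.subtract(old)
    updated.update(new)
    # Drop zero entries so the result has one canonical representation.
    return Counter({key: value for key, value in updated.items() if value != 0})


# Hypothetical per-part counts.
part_a = Counter({"handlers_total": 10, "handlers_acting": 9})
part_b = Counter({"handlers_total": 5, "handlers_acting": 3})
whole = Counter({"handlers_total": 15, "handlers_acting": 12})

# counts(whole) = merge(counts(partA), counts(partB))
assert merge(part_a, part_b) == whole

# part_b is edited: the incremental update matches a merge from scratch.
new_b = Counter({"handlers_total": 6, "handlers_acting": 6})
assert replace_file(whole, part_b, new_b) == merge(part_a, new_b)
```

The update touches only the changed file's counts, so its cost does not grow with the size of the repository.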

What didn't validate: the honest nulls

doloop led early with a name-and-documentation-drift lens. It failed under doloop's own benchmarks, and was demoted. The failures are reported as plainly as the wins.

Documentation-drift, as a deterministic pattern match: 1.4x (about chance)

Against CoDocBench, with 4,573 real coupled code-and-doc changes, the structural lens told drifted from aligned docstrings at only 1.4x. Most real drift is prose, which a matcher cannot read.

Name-drift, as a pattern match: about 0 of 8 precision over 38,217 functions

The validate_, ensure_, and verify_ prefixes over-match setters, decorators, and predicates. Rejection takes many forms a pattern misses.

Where the name-and-documentation-drift lens lands: advisory only, via a caged 7B model at about 2.0x discrimination, never a block. The pattern this confirms is clean. A closed-form check wins on exact-answer questions, where a rate is the rate. It fails on linguistic ones, which are a model's job, under a deterministic harness.

So the blocking core is narrow on purpose. It is behavioral consistency, where this codebase acts on its errors 86% of the time, and the paired-operation absence, the one writer that calls execute() but never commit(). That last class is one lexical linters miss.
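
A paired-operation check of this kind can be approximated over the syntax tree: collect the names called inside each function and report functions that call the opener but never the closer. This is a minimal sketch under stated assumptions; `called_names`, `unpaired_functions`, and the sample source are hypothetical, and a real gate would first confirm that pairing is the codebase's own norm before flagging.

```python
import ast


def called_names(func: ast.AST) -> set[str]:
    """Names of everything called inside a function (nested functions included)."""
    names = set()
    for node in ast.walk(func):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Attribute):
                names.add(node.func.attr)   # conn.execute(...) -> "execute"
            elif isinstance(node.func, ast.Name):
                names.add(node.func.id)     # execute(...) -> "execute"
    return names


def unpaired_functions(source: str, opener: str = "execute", closer: str = "commit") -> list[str]:
    """Functions that call `opener` but never call `closer`."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names = called_names(node)
            if opener in names and closer not in names:
                flagged.append(node.name)
    return flagged


# Hypothetical sample: one writer commits, one forgets to.
SAMPLE = '''
def save(conn, row):
    conn.execute("INSERT INTO t VALUES (?)", row)
    conn.commit()

def save_forgetful(conn, row):
    conn.execute("INSERT INTO t VALUES (?)", row)
'''

print(unpaired_functions(SAMPLE))
```

A lexical rule sees two ordinary `execute` calls here; only a check that looks for the absent partner can tell the two functions apart.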

doloop retracts three claims, on the record. First, that the name-and-doc lens is the richest signal: that was a firing count, not a detection rate. Second, the use of a single construction-quality index as a quality score. Third, that models are blind to consistency bugs: a 12B model caught single-function drift 9 of 11 times.

The floor

[Figure: below the floor the gate owes silence · the threshold refused invoke before anyone checked · floor shipped before the run · agreement unplanned]

The six dimensions, tested

The Codebase Polysemy Contract spans five polysemy dimensions plus the cross-cutting paired-operation absence. Each one stands as follows when run as a deterministic, closed-form gate: two validate, security splits, handoff is a caged-reader class, and two are still design targets. The misses are stated as plainly as the hits.

Dimension	Verdict	Evidence
Functional / convention consistency	Validates, the gate	The workhorse, live. Heartbleed re-derived from OpenSSL's own 88% bounds-check adherence (pilot, n=1 worked case).
Paired-operation absence	Validates, the gate	An `execute()` with no `commit()` in the same function. A class lexical linters miss.
Security, bounds-local shape	Validates, precision only	A parameter-derived length reaching a copy with no bounds check. 0.7 to 3% base rate, low false-positive, near-zero recall on hardened code.
Security, guard-absence	Fails closed-form, demoted	The function-local pairing never forms (guards live upstream); lift 0.28. Needs inter-procedural taint, not AST counting.
Handoff (doc and name drift)	Fails closed-form, caged-reader	doc-drift about chance on CoDocBench; name-drift about 0 of 8 precision. Advisory via a pinned model, never a block.
Performance, structural	Design targets	Specified, not yet built or backtested. Named so they read as a plan, not a current capability.

The repository vector space

The product object is not a single number. It is two vector spaces.

Repository space: one point per codebase, built from convention rates, paired operations, language mix, and metadata. The near-universal conventions are the low-variance axes. Quality is distance from the peer cluster and from the convention consensus. A measured caveat: convention vectors alone are noisy, since django's nearest neighbors came back as numpy tutorials. Domain comes from the language mix, so numpy clusters with scikit-learn, and django with tornado.

Findings space: a point for each flagged case, for the bug taxonomy and anomaly detection.
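
To make "distance from the peer cluster" concrete, here is a minimal sketch that treats each codebase as a vector of convention rates and measures Euclidean distance from the centroid of its peers. The two-axis vectors, the values in them, and the helper names are hypothetical; the source does not specify the metric or the axes doloop uses.

```python
import math
from statistics import fmean


def centroid(points: list[list[float]]) -> list[float]:
    """Per-axis mean of a set of repository vectors."""
    return [fmean(axis) for axis in zip(*points)]


def distance_from_peers(repo: list[float], peers: list[list[float]]) -> float:
    """Euclidean distance from a repository to the centroid of its peer cluster."""
    return math.dist(repo, centroid(peers))


# Hypothetical axes: [immutable-defaults rate, act-on-errors rate].
PEERS = [[1.00, 0.90], [0.98, 0.86], [0.99, 0.88]]
NEAR = [0.99, 0.87]
FAR = [0.60, 0.40]

print(distance_from_peers(NEAR, PEERS), distance_from_peers(FAR, PEERS))
```

Choosing the peers by domain first, as the caveat above suggests, matters more than the choice of distance function.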

One convention, three altitudes

[Figure: typed · law at galaxy, vernacular among siblings, rule at home · 6 AST dimensions · sibling points exact]

Method, and a lineage in plain sight

The method is one move: infer a codebase's requirements from its own consistency, then flag only where a change betrays them.

On one AI-generated repo with no rulebook (pilot, n=1), the gate inferred "handlers act on the error" from the codebase's own 85% consistency. Then it flagged the 31 places the code betrayed itself. The requirement inheres in the consistency. The bug is the unevenness.
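
The one-move method can be sketched for the error-handling case: measure how often `except` handlers in a codebase do something with the error, and only if that rate is high enough, report the handlers that do not. The definition of "acts" used here, the 0.8 floor, and the names `handler_acts` and `betrayals` are assumptions for illustration, not doloop's rules.

```python
import ast


def handler_acts(handler: ast.ExceptHandler) -> bool:
    """Assume a handler swallows when its body is only pass, continue, or a bare constant."""
    for stmt in handler.body:
        if isinstance(stmt, (ast.Pass, ast.Continue)):
            continue
        if isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant):
            continue  # e.g. `...` or a stray string
        return True
    return False


def betrayals(source: str, floor: float = 0.8) -> tuple[float, list[int]]:
    """Return (acting rate, line numbers of handlers that break the inferred norm)."""
    handlers = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.ExceptHandler)]
    if not handlers:
        return 0.0, []
    rate = sum(handler_acts(h) for h in handlers) / len(handlers)
    if rate < floor:
        return rate, []  # no consistent norm to infer, so nothing is flagged
    return rate, [h.lineno for h in handlers if not handler_acts(h)]


# Hypothetical codebase: five handlers that re-raise, one that swallows.
ACTING = '''
def load_{i}(path):
    try:
        return open(path).read()
    except OSError as exc:
        raise RuntimeError(path) from exc
'''
SWALLOWING = '''
def load_quietly(path):
    try:
        return open(path).read()
    except OSError:
        pass
'''
SAMPLE = "".join(ACTING.format(i=i) for i in range(5)) + SWALLOWING

rate, flagged = betrayals(SAMPLE)
print(f"acting rate {rate:.0%}, flagged handler lines: {flagged}")
```

No rule about error handling is written anywhere in this sketch; the requirement comes from the sample's own majority, and the flag lands only on the handler that departs from it.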

The lenses come from the published work above, plainly credited. Naturalize learned a repo's own conventions and gated the commit at 94% accuracy, and named the universal-versus-flipping distinction. doloop's extension is to run those lenses deterministically and whole-repo, at commit speed, reaching from syntax into semantics. What matters is not whether a model appears. It is whether the verdict reproduces.

quantity	value
codebases gathered	236,701
cases we checked where the project deleted the flag by the next release	19 (pilot, n=19; 5 hand-verified)
cross-release resolution study	pilot: 53 repos, 470 flag-then-fix observations
immutable-defaults / act-on-errors (universal)	98% / 86%
read rate	~59,000 LOC/sec (median)
calibration time	0.27 s median
cache == full read	byte-identical, 6 repos
incremental == full re-read	byte-identical, 3 repos
doc-drift, pattern match (CoDocBench, 4,573)	1.4x (about chance)
name-drift, pattern match (38,217 funcs)	~0/8 precision
name-and-doc-drift, caged 7B model	~2.0x (advisory only)
rework rate (flask history)	17% of commits
rework rate (GitHub corpus 2018-2022)	~18-19% (stable)
speed vs an LLM reviewer	~half a second vs minutes


Built on published work. Measured in the open.