THE PAPER · EXPLAINER

A deterministic verification gate
for the age of AI-written code.

doloop reads a codebase, infers the unwritten conventions it actually keeps, and flags new code that breaks them, with the same verdict every run and the exact rule, line, and rate behind the call. The signature result is the transplant: one function, opposite verdicts in two sibling codebases, and the verdict follows the house.

TL;DR

doloop is a deterministic verification gate that reads a codebase, infers the unwritten conventions that codebase actually keeps, and flags new code that breaks them. It issues the same verdict every run and cites the exact rule, the exact line, and the observed consistency rate behind the call. Its signature, demonstrated result is "the transplant": the identical function earns opposite verdicts in two sibling codebases, and the verdict follows the house. That result is structurally impossible for a fixed-rule linter and for a non-deterministic LLM reviewer.

The problem is now acute and measured. Roughly a third of new code at the largest software companies is AI-generated, 84% of developers use or plan to use AI tools, and independent testing finds 45% of AI-generated code introduces a known security flaw. Yet trust in AI output has fallen, more developers now actively distrust it than trust it, and review capacity, not writing speed, is the true bottleneck.

Precision is in hand: roughly a 0.7% floored false-block rate, 1.9% before the floor, verified by hand. The convention catalog is independently confirmed by a live public instrument scanning tens of millions of code files. Coverage works across two layers: a deterministic gate catches the structural conventions, and a caged reader, its verdict pinned and cached so it reproduces, covers the judgments that turn on meaning.

Key Findings

  1. The market need is measured. AI now writes a large and growing share of production code, that code is fluent but flaw-prone, and the binding constraint has shifted from writing to reviewing. doloop targets exactly the slice of review that both linters and LLM reviewers handle worst: the conventions a specific codebase keeps but never wrote down.
  2. The core mechanism is novel and demonstrated. doloop infers conventions from a codebase's internal consistency rather than reading a configured rulebook. It sorts them into LAWS (near-universal, hard-block), MORES (house-specific vernacular, warn), and TASTES (per-tenant confirmed rules). Below a 70% consistency floor it stays silent. The transplant proof, opposite-but-stable verdicts on sibling repositories, has been demonstrated in Python and first-celled in TypeScript.
  3. The intellectual lineage is deep. doloop operationalizes a 2014 academic result, the Naturalize system's distinction between "laws" and "mores," which that system's own authors called inherently difficult to codify. It sits atop the "naturalness" and "localness" of software literature and, more distantly, Christopher Alexander's idea of an order makers feel but cannot fully write down.
  4. The competitive gap is genuine. No shipping commercial product combines all three of corpus-inferred, deterministic, and gating. Linters and static analysis are configured, not inferred. LLM reviewers are non-deterministic. IDE features that do infer, such as IntelliCode, infer only formatting and do not gate.
  5. Coverage spans two layers. A deterministic gate catches the structural conventions; a caged reader covers the judgments that turn on meaning; and the gate absorbs the reader's judgments that rest on a clean structural pattern.

Details

Scope tags

Each result below carries a scope tag:

  • demonstrated: proven by hand, reproduced, deterministic.
  • first-cell: shown once in a new setting, not yet reproduced at scale.
  • built: implemented and running, not yet validated against ground truth.
  • designed: specified, not yet built.
  • predicted: hypothesized, untested.

Market statistics and academic references are sourced to named, datable origins. The lineage in section 6 is marked as lineage, not proof.

1. The problem in human terms: a flood of plausible code and a review line that cannot keep up

Something changed in software between roughly 2023 and 2026, and the change is now large enough to measure.

In April 2025, speaking with Mark Zuckerberg at Meta's LlamaCon developer conference, Microsoft CEO Satya Nadella said that 20 to 30% of the code in Microsoft's repositories was, in his phrasing, "written by software," meaning generated by AI. On Alphabet's first-quarter 2025 earnings call earlier that same month, Google CEO Sundar Pichai said more than 30% of Google's new code was AI-generated. That updated a figure he had given six months earlier, on Alphabet's third-quarter 2024 call on October 29, 2024, in words that describe the exact workflow doloop sits in: "Today, more than a quarter of all new code at Google is generated by AI, then reviewed and accepted by engineers." Both executives noted the figure is hard to measure precisely and varies by language, with AI doing better in Python than in C++, so the numbers deserve a grain of salt. The direction is unmistakable, and the bottleneck Pichai names, "reviewed and accepted by engineers," is the whole point.

The broad developer population has moved the same way. Stack Overflow's 2025 Developer Survey, fielded from late May to late June 2025 with more than 49,000 responses from 177 countries, found that 84% of developers use or plan to use AI tools, up from 76% the year before, with 51% of professional developers using them daily.

D1 · The volume-versus-review-capacity gap

[Figure: the review bottleneck, 2021 to 2026, on a throughput axis. One curve shows code generated per developer per day (AI-assisted); the other shows human review capacity, set by attention.]
Fig. D1. Faster writing does not buy faster reading; the bottleneck moved from the keyboard to the review queue. Illustration, not a measured ratio.

The same survey found trust moving in the opposite direction from usage. Only about a third of 2025 respondents said they trust the accuracy of AI output, and just 3% "highly" trust it, while in Stack Overflow's own framing more developers now actively distrust the accuracy of AI tools than trust it, a sharp jump from the prior year. The single biggest frustration, cited by 66% of developers, is AI output that is "almost right, but not quite." The second, cited by 45%, is that debugging AI-generated code takes longer than expected. Usage is up, confidence is down, and the gap between them is the story of this moment.

The trust gap has an empirical basis. Veracode's 2025 GenAI Code Security Report tested more than 100 large language models across roughly 80 standardized coding tasks in Java, JavaScript, Python, and C#, and found that AI-generated code introduced a known OWASP Top 10 security flaw in 45% of cases. In the report's own words, "in 45% of the cases these models introduce a detectable OWASP Top 10 security vulnerability." Java was the worst language at a 72% failure rate. More striking than any single figure is the trend: over the period studied, models got dramatically better at writing code that compiles and runs, with syntax pass rates climbing past 95%, while their security pass rate stayed essentially flat in the 45 to 55% band, regardless of model size or release date. As Veracode's chief technology officer Jens Wessling put it, larger models do not perform significantly better than smaller ones, which suggests a systemic issue rather than a scaling problem. The lesson is that getting code to work and getting it to be right in context are different problems, and scaling the models solves the first without touching the second.

There is a quality dimension beyond security. GitClear's analysis of 211 million changed lines of code from 2020 to 2024 found that the share of copy-pasted lines rose from 8.3% to 12.3%, while "moved" lines, the signature of refactoring and reuse, fell from 24.1% to 9.5%. The firm tracked roughly an eightfold increase during 2024 in blocks of five or more lines that duplicate adjacent code. The picture is of more code, written faster, that fits its surroundings worse.

The hardest evidence that "passes the tests" is not the same as "fit to merge" comes from Cognition's FrontierCode benchmark, built with the maintainers of dozens of flagship open-source repositories to measure not whether AI code passes unit tests but whether a real maintainer would merge it. Even the strongest model evaluated scored a small fraction out of 100 on the hardest subset of tasks. The benchmark's framing is doloop's framing: mergeability includes adherence to a project's existing style and codebase standards, and today's best models clear that bar rarely.

The most rigorous measurement of this cost is a randomized controlled trial. In 2025 METR had sixteen experienced open-source developers work through 246 tasks in repositories they had maintained for an average of five years, randomly allowing or forbidding AI tools on each task. Allowing AI made them 19% slower, even though the same developers had forecast a 24% speedup and still believed, after finishing, that AI had sped them up. The study names the mechanism: AI capabilities are lower "in settings with very high quality standards, or with many implicit requirements (e.g. relating to documentation, testing coverage, or linting/formatting) that take humans substantial time to learn." The cost was not in writing the code. It was in making it fit a house whose standards the model could not see.

Underneath all of these figures sits a simple asymmetry, and it is the reason the problem compounds rather than resolves. Generation scaled and review did not. A model can emit thousands of lines in an hour; a human reviewer reads carefully at the pace of tens of lines, and that pace is set by human attention, not by tooling. So the more capable the model gets at producing plausible code, the more it widens the very gap that human review was supposed to close. Faster writing does not buy faster reading. The bottleneck moved from the keyboard to the review queue.

doloop is built to stand in that gap. Not "does the code run," which compilers and tests answer. Not "is it insecure in a universal way," which security scanners answer. The gap is whether this change fits the particular house it is trying to enter.

2. Pattern-deaf, not pattern-free: the model knows code in general, not your house

The central observation behind doloop is a claim about what AI-generated code is like. AI writes what you might call the universal register of a language: the conventions the entire global corpus shares, because it learned from that corpus. It writes idiomatic, plausible, textbook-fluent code. What it is structurally deaf to is the particular register: the conventions that your codebase keeps and the codebase down the street does not.

doloop's phrase for this is that AI is pattern-deaf, not pattern-free. The output is full of patterns. It simply cannot hear the local ones. A model trained on all of GitHub knows the thousand ways the world writes a request handler. It does not know, and cannot know from the prompt alone, that this project always writes them one specific way, that the single deviation in the file you are editing was itself a bug, and that a reviewer on this team would send your patch back for it.

There is a sharper way to say why. A model reviewing a diff never read the other ten thousand files in the repository. It reproduces the global average of how code is written, not the local norm of how this house writes it, because the global average is what it was trained on and the local norm lives in files it never saw. The failure is not a lack of intelligence. It is a lack of acquaintance.

D2 · The universal and the particular

[Figure: two registers. THE UNIVERSAL: conventions the whole world shares; AI writes these fluently. THE PARTICULAR: conventions only this house keeps, never written down. Labels: AI-generated code; doloop listens here.]
Fig. D2. Pattern-deaf, not pattern-free: a model reviewing a diff never read the other ten thousand files. Concept figure.

The metaphor doloop reaches for is auditory. It is the listener, or the meter, that catches what the writer cannot hear. A house has a way of speaking. New text arriving in that house, whether written by a human in a hurry or a model that has never seen this repository, either speaks the house dialect or it does not. doloop's job is to listen for the accent.

This reframing matters because it tells you what kind of tool is needed. If the problem were universal flaws, you would want a bigger universal rulebook. But the problem is local, unwritten convention, which no universal rulebook contains, and which the house itself by definition never wrote down. That points to a tool that learns the rules from the code instead of being told them.

3. How it works: infer, tier, and return a deterministic verdict

doloop's method has three parts.

Part one: infer, do not configure. doloop does not ship with a style guide, and it does not ask you to write one. It reads the codebase and infers the conventions from the code's own internal consistency. If a project writes its request handlers as asynchronous functions 97% of the time, that 97% is the convention. The house has, in effect, voted with its code. doloop learns the rules the house never wrote down, the way a careful new hire learns "how we do things here" by reading a few hundred files before touching anything. The conventions doloop enforces are the codebase's own, discovered from evidence, not an outside authority's idea of good taste. There is direct empirical support for not relying on a written-down rulebook. A 2026 study evaluating AGENTS.md, the practice of writing a repository's rules into a context file for a coding agent, found that these files add over 20% to inference cost while barely moving success: the agent-generated ones slightly hurt it, the developer-written ones only marginally helped, and the study concluded such files should describe only minimal requirements. The agents followed the instructions faithfully; the instructions were the problem. doloop's answer is to read the conventions from the code rather than ask anyone to write them down.
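As a rough illustration of reading a rate out of code instead of a rulebook, the sketch below counts how often functions playing a handler role are declared async. The `handle_` naming heuristic, the sample source, and the helper `async_rate` are assumptions made for this example; they are not doloop's actual role detection.

```python
import ast
from typing import Tuple

# Hypothetical sample file: four handlers, three of them async.
SOURCE = '''
async def handle_login(req): ...
async def handle_logout(req): ...
async def handle_profile(req): ...
def handle_health(req): ...
def helper(x): ...
'''

def async_rate(source: str, prefix: str = "handle_") -> Tuple[int, int]:
    """Return (async_count, total) for functions whose name marks them as handlers."""
    tree = ast.parse(source)
    async_count = 0
    total = 0
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name.startswith(prefix):
            total += 1
            if isinstance(node, ast.AsyncFunctionDef):
                async_count += 1
    return async_count, total

hits, total = async_rate(SOURCE)
print(f"handlers are async in {hits}/{total} sites ({hits / total:.0%})")
```

Here the sample house keeps the async form in 3 of 4 handler sites, so the observed rate is 75%. The rate comes entirely from the code; nothing was configured.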

Part two: tier the conventions. Not all conventions are equal, and treating them as equal is how review tools become noise machines. doloop sorts what it infers into three tiers.

LAWS are near-universal across the global corpus, the things essentially every codebase does, such as acting on a caught error rather than silently swallowing it. These are safe to hard-block. Violate a law, and the gate stops you.

MORES, pronounced "more-ays," a term of art borrowed deliberately as the lineage section will make clear, are the neighborhood vernacular. They are the conventions that discriminate between houses, and so they are the interesting ones, because they are the ones that differ. doloop infers them per house by a repo-vector nearest-neighbors method: it represents each repository as a vector, asks which other repositories are its near neighbors, and reads what conventions that neighborhood keeps. A violation of a more earns a warning, not a block, because reasonable houses differ.

TASTES are per-tenant confirmed rules, the things a specific team has explicitly told doloop it cares about.
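The nearest-neighbors step described for mores can be pictured with a small sketch. Everything concrete here is an assumption for illustration: the house names, the three vector components (each an observed convention rate), and the choice of cosine similarity. The text says only that repositories are represented as vectors and compared to their near neighbors.

```python
import math

# Hypothetical repo vectors. Each component is the observed rate of one convention:
# (async handlers, logging over print, arrow-function helpers).
REPOS = {
    "house_a": [0.97, 0.10, 0.90],
    "house_b": [0.95, 0.15, 0.88],
    "house_c": [0.05, 0.92, 0.20],
    "house_d": [0.08, 0.89, 0.25],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_neighbors(name, repos, k=1):
    """Return the k repositories most similar to `name`, in a stable order."""
    target = repos[name]
    scored = [(cosine(target, vec), other) for other, vec in repos.items() if other != name]
    # Sort by similarity descending, then by name, so ties break the same way every run.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]

print(nearest_neighbors("house_a", REPOS))
print(nearest_neighbors("house_c", REPOS))
```

The explicit tie-break on name is the design point: a neighborhood that changes between runs would undermine the replayable verdicts built on top of it.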

Governing all of this is a single hard rule, the 70% inference floor. Below 70% observed consistency, no convention is considered held, and the gate owes silence. It issues no verdict at all. If a codebase writes its handlers async only 60% of the time, the house has not actually settled the question, and doloop refuses to pretend it has. This silence is not a failure mode. It is the most important design decision in the system. A tool that always has an opinion is a tool you learn to ignore. doloop's refusal to judge what the house has not settled is what earns its verdicts the right to be taken seriously when it does speak.
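The tiers and the floor combine into a very small decision rule. The sketch below is one way to write it, assuming hypothetical names (`Tier`, `ACTIONS`, `action_for`) and assuming that a rate of exactly 70% counts as held, since the text says the gate is silent below 70%. Whether the floor applies to tastes in the same way is also an assumption.

```python
from enum import Enum
from typing import Optional

FLOOR_PERCENT = 70  # the inference floor from the text

class Tier(Enum):
    LAW = "law"
    MORE = "more"
    TASTE = "taste"

# Tier-to-action mapping as described in the text. The TASTE action depends on
# tenant policy, so it is represented here by a placeholder string.
ACTIONS = {Tier.LAW: "block", Tier.MORE: "warn", Tier.TASTE: "per-policy"}

def action_for(tier: Tier, conforming: int, total: int) -> Optional[str]:
    """Return the action for a violation, or None when the gate owes silence."""
    if total == 0:
        return None  # nothing observed, so nothing is held
    # Integer arithmetic avoids floating-point edge cases exactly at the floor.
    if conforming * 100 < FLOOR_PERCENT * total:
        return None
    return ACTIONS[tier]

print(action_for(Tier.MORE, 60, 100))  # house has not settled the question
print(action_for(Tier.MORE, 97, 100))
print(action_for(Tier.LAW, 97, 100))
```

Returning `None` rather than a weak warning is deliberate in this sketch: silence is a distinct outcome, not the lowest severity level.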

D3 · The convention hierarchy and the silence zone

[Figure: a consistency axis from 0% to 100% with the inference floor marked at 70%. Below the floor: SILENCE, no direction settled; the gate issues no verdict at all. Above the floor: LAW (near-universal), verdict hard-block; MORE (house vernacular), verdict warn; TASTE (per-tenant rule), verdict per policy.]
Fig. D3. Silence is a feature: a tool that always has an opinion is a tool you learn to ignore. Tiers and floor.

Part three: return a deterministic, replayable verdict. Given the same code and the same inferred convention set, doloop returns the same verdict, byte-identical on re-runs. Every finding cites three things: the rule it is enforcing, the line it fires on, and the rate, meaning the observed consistency such as "handlers are async in 97% of sites." That triple, rule and line and rate, is what makes a verdict auditable. You can check it. You can replay it. You can argue with it on the evidence. And because it is deterministic, the same disagreement resolves the same way every time, rather than evaporating on the next run.
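One way to picture the rule, line, and rate triple and the byte-identical promise is a finding record with a canonical serialization. This is a minimal sketch under stated assumptions: the `Finding` fields, the file paths, and the JSON encoding are hypothetical and are not doloop's actual output format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class Finding:
    rule: str        # the convention being enforced
    path: str        # file the finding fires in
    line: int        # line the finding fires on
    conforming: int  # sites that keep the convention
    total: int       # sites observed (rate = conforming / total)

def render(findings):
    """Serialize findings to canonical bytes: fixed order, sorted keys, fixed separators."""
    ordered = sorted(findings, key=lambda f: (f.path, f.line, f.rule))
    payload = [asdict(f) for f in ordered]
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")

findings = [
    Finding("handlers-are-async", "app/views.py", 42, 97, 100),
    Finding("diagnostics-use-logging", "app/cli.py", 7, 34, 34),
]
first = render(findings)
second = render(list(reversed(findings)))  # same findings, discovered in another order
print(first == second, hashlib.sha256(first).hexdigest()[:12])
```

Storing the rate as two integers rather than a float keeps the serialized bytes independent of floating-point formatting, and sorting before encoding means discovery order cannot leak into the output.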

The verdict is checked leave-one-out, so the convention is never defined by the very code under test. The host's conventions are computed from the rest of the codebase, and the touched site is judged against them. This prevents the gate from grading a change against itself.
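The leave-one-out check can be sketched in a few lines. The function name, the string labels, and the tie-break rule below are assumptions for illustration; the point is only that the judged site is excluded before the convention is computed.

```python
def leave_one_out_verdict(sites, index, floor_percent=70):
    """Judge sites[index] against the convention held by all the other sites.

    `sites` is a list of labels such as "async" or "sync". Returns a tuple
    (verdict, majority_label, conforming, total) with verdict in
    {"silent", "pass", "flag"}.
    """
    rest = sites[:index] + sites[index + 1:]  # the site under test is held out
    if not rest:
        return ("silent", None, 0, 0)
    counts = {}
    for label in rest:
        counts[label] = counts.get(label, 0) + 1
    # Highest count wins; ties break alphabetically so the result is stable.
    majority, conforming = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    total = len(rest)
    if conforming * 100 < floor_percent * total:
        return ("silent", majority, conforming, total)
    verdict = "pass" if sites[index] == majority else "flag"
    return (verdict, majority, conforming, total)

sites = ["async"] * 9 + ["sync"]
print(leave_one_out_verdict(sites, 9))  # the lone sync site, judged against the other nine
print(leave_one_out_verdict(sites, 0))  # an async site, judged against eight async and one sync
```

The first call judges the sync site against nine async sites (9 of 9), so it is flagged; the second judges an async site against a rest that is 8 of 9 async, so it passes. In neither case does the judged site contribute to its own convention.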

Is there an AI model inside doloop? The deterministic core needs none. The pattern inference, the tiering, the floor, and the verdict are computed mechanically from the code's structure, which is what makes them replayable. For the few sub-steps that genuinely require reading, in the sense of judging meaning rather than form, doloop uses a bring-your-own-model arrangement: the customer supplies the model, and that "caged reader" is held to a constrained, auditable role. The point of the architecture is that the part that gates deterministically and the part that reads semantically are separate layers, and only the first makes the deterministic promise.
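A pinned and cached reader could look roughly like the sketch below. It is an illustration only: the class name `CagedReader`, the `ask` callable, and the in-memory cache are assumptions, and a real deployment would need a persistent store for answers to replay across runs.

```python
import hashlib

class CagedReader:
    """Wrap a customer-supplied model call so a repeated question replays from cache."""

    def __init__(self, model_id, ask):
        self.model_id = model_id  # pinned model identifier (hypothetical)
        self._ask = ask           # callable: (question, code) -> answer string
        self._cache = {}

    def judge(self, question, code):
        # The key covers the model, the question, and the exact code being read.
        raw = "\x00".join((self.model_id, question, code)).encode("utf-8")
        key = hashlib.sha256(raw).hexdigest()
        if key not in self._cache:
            self._cache[key] = self._ask(question, code)
        return self._cache[key]

# Stand-in for a non-deterministic model: it answers differently on every call.
calls = []
def flaky_model(question, code):
    calls.append(question)
    return f"answer-{len(calls)}"

reader = CagedReader("example-model-v1", flaky_model)
a = reader.judge("Is this error swallowed?", "except Exception: pass")
b = reader.judge("Is this error swallowed?", "except Exception: pass")
print(a == b, len(calls))
```

The underlying model is called once; the second identical question is answered from the cache, which is what lets a semantic judgment reproduce even though the model itself would not.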

D4 · The inference-floor decision flow

[Figure: decision flow. A new code change arrives; consistency is measured leave-one-out; is consistency ≥ 70%? If no: SILENCE, issue no verdict. If yes: which tier? LAW leads to block, and the verdict carries rule, line, and rate. MORE leads to warn, cited and replayable. TASTE is handled per policy. All verdicts are byte-identical on replay.]
Fig. D4. The deterministic core needs no model, only the verdict's rule, line, and rate. Decision flow.

Determinism is the sharpest line between doloop and the LLM-based reviewers now entering the market. An LLM reviewer, asked the same question twice, can give two different answers. That is the nature of the technology. A deterministic gate cannot. The next section shows why that single property makes the signature proof possible.

4. The transplant: the same code, opposite verdicts in two houses

The result that separates doloop from every alternative is the transplant: one function that the gate passes in one codebase and blocks in another, with the verdict following the house, not the code.

Take one function. Drop it, unchanged, into two sibling codebases, meaning two projects that do the same kind of work but keep different house conventions. doloop returns opposite verdicts: the function is fine in one house and flagged in the other. Repeat the exercise with the other sibling's version of the same function, and the verdicts swap.

This has been demonstrated in Python, with the scope tag demonstrated, meaning proven by hand, leave-one-out calibrated, and deterministic, across three sibling pairs and two distinct classes of convention.

The first class is dispatch shape. In the pairs flask and quart, and requests and httpx, one sibling writes synchronous functions and the other writes asynchronous ones. The same handler is in dialect for one and out of dialect for the other.

The second class is emit shape. In the pair doit and nox, one sibling diagnoses by printing and the other by logging. The same diagnostic line is correct in one house and wrong in the other. That this is a second, unrelated convention class matters, because it shows the transplant is a property of the method, not a quirk of one kind of rule.

The result has been first-celled in TypeScript, with the scope tag first-cell, meaning shown once in a new setting and reproduced by hand, not yet at scale. The TypeScript more is arrow-function versus function-declaration at the exported-helper role. In the web framework hono, exported helpers settle on the arrow-function form, observed in 55 of 60 sites. In the query builder kysely, they settle on the function-declaration form, observed in 34 of 34 sites. The same exported helper flips verdicts between them, and the flip reproduces by hand.
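The arithmetic behind this flip is small enough to write out. The sketch below uses the site counts quoted above (55 of 60 and 34 of 34); the `judge` function, the label strings, and the `"other"` bucket for hono's five remaining sites are assumptions for illustration, since the text does not say what form those five sites take.

```python
FLOOR_PERCENT = 70

# Site counts quoted in the text for exported helpers in each house.
HONO = {"arrow": 55, "other": 5}       # 55 of 60 sites use the arrow form
KYSELY = {"function-declaration": 34}  # 34 of 34 sites use the declaration form

def judge(house, shape):
    """Return (verdict, convention, conforming, total) for one exported helper."""
    total = sum(house.values())
    if total == 0:
        return ("silent", None, 0, 0)
    # Highest count wins; ties break alphabetically so the result is stable.
    convention, conforming = min(house.items(), key=lambda kv: (-kv[1], kv[0]))
    if conforming * 100 < FLOOR_PERCENT * total:
        return ("silent", convention, conforming, total)
    verdict = "pass" if shape == convention else "warn"  # a more warns rather than blocks
    return (verdict, convention, conforming, total)

for shape in ("arrow", "function-declaration"):
    print(shape, judge(HONO, shape)[0], judge(KYSELY, shape)[0])
```

Both houses clear the floor (55 of 60 is about 91.7%, 34 of 34 is 100%), so both speak. An arrow-form helper passes against hono's counts and warns against kysely's, and a function-declaration helper does the reverse. The function being judged is the same in both calls; only the house counts change.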

This TypeScript case has a subtle and important property. It survives ESLint-subtraction. Both test houses set the relevant ESLint rule to off. In other words, the convention is real and consistently held, but it is deliberately unmechanized. The houses chose not to enforce it with their linter, yet they keep it anyway. It is pure vernacular, a rule the house keeps without writing down or wiring up. That a mechanical tool deliberately ignores it is exactly why an inferred gate is needed to see it.

The same flip reproduces in COBOL, a language two generations removed from the first two, which is the strongest evidence that the transplant is a property of the method rather than of any one ecosystem. The convention is scope-termination style. A modern house terminates an IF with the explicit END-IF keyword introduced in COBOL-85; carddemo, an AWS mainframe-modernization sample, does so on 98% of its 1,059 IF statements. A deep-legacy house terminates with a bare period in the pre-COBOL-85 dialect; DSF, the Norwegian national-insurance system, uses END-IF on zero of its 43 IF statements. Each house's law is inferred from its own code, then the same IF statement is judged against each: a real DSF statement passes in DSF and flags in carddemo, and a real carddemo statement does the reverse, a clean two-by-two flip with file-and-line provenance on both specimens.

The flip holds under controls. When each house's law is inferred from all but one of its programs and the held-out program is then judged, 525 of 525 held-out programs across seven houses conform to their own house. And the legacy side is not one cherry-picked system: two independent corpora hold the bare-period law at near 100%, DSF on 5 of 5 of its programs and the United States NIST COBOL-85 conformance suite from the 1980s on 417 of 417 of its programs, spanning 17,589 IF statements. A Norwegian government system and a US standards body, decades and an ocean apart with no shared authorship, settled on the same law, which is what the dialect explanation predicts: END-IF is a COBOL-85 construct that both period houses pre-date. The grid is a clean two-block partition: the five modern houses agree and flag both legacy houses, and both legacy houses flag all five and agree with each other. So the split is a cross-house property, not one file or one outlier system.

The objection a mainframe engineer would raise, whether a text-matching probe really sees COBOL structure, is closed for this law: the bare-period-versus-END-IF partition is confirmed three independent ways, by a column-aware tokenizer, by the GnuCOBOL compiler's own front-end across 100% of the programs including the CICS ones, and by a full ANTLR COBOL-85 syntax tree that inspects every IF statement's terminator. The three levels of authority agree. Along the way the compiler caught a predicted artifact, a data name containing the letters IF; it was corrected, and the partition held after the fix, so the result has teeth and is not propped up by a parsing bug. This confirms the scope-termination law specifically; it is not a claim that arbitrary COBOL is fully parsed.

The remaining honest gap is what the Python work has and the COBOL work does not yet have: a measured false-block rate. Measuring it needs real accept-or-reject review history, exactly what a modernization partner's own codebase supplies. What the COBOL result already shows is the failure mode the legacy-modernization market is made of: an AI trained on modern COBOL writes END-IF into a forty-year-old government codebase, the code compiles and passes its tests, and it breaks a house law no living maintainer ever wrote down, because the law exists only in the code, which is the one place doloop reads it.

D5 · The transplant: same code, opposite verdicts

[Figure: a two-by-two grid of verdicts for one function, execute(), in two houses.
  print-style execute(): PASS in doit (print house), in dialect; FLAG in nox (log house), it prints and this house logs.
  log-style execute(): FLAG in doit (print house), it logs and this house prints; PASS in nox (log house), in dialect.
  TypeScript first-cell: hono settles arrow 55/60; kysely settles function-declaration 34/34; survives ESLint-subtraction.
  COBOL: carddemo END-IF 98%; DSF and NIST-85 (two independent legacy corpora) period ~100%; 525/525 leave-one-out; triple-verified by tokenizer, GnuCOBOL, and ANTLR AST.]
Fig. D5. The verdict follows the house, not the function; impossible for a linter or an LLM. Python, TypeScript, COBOL; deterministic.

The transplant is structurally impossible for the two incumbent categories of tool.

A linter or static analyzer runs one configured ruleset everywhere. By construction it gives the same verdict in both siblings. That is what a fixed rule does. It cannot give opposite verdicts on the same code in two repositories unless someone hand-configures two different rulesets, which simply relocates the problem. Now a human has to know and write down the local convention, which is precisely the thing that was never written down. The whole premise fails.

An LLM reviewer fails from the other direction. It is non-deterministic. Run it twice on the same code and it may give different verdicts on the same repository, let alone across two. It cannot give opposite-but-stable verdicts, because it cannot give stable verdicts at all. Its answers are a distribution, not a function.

Only a tool that is both corpus-inferred, so the verdict can depend on the house, and deterministic, so the verdict is stable and replayable, can produce opposite-but-stable verdicts that follow the house. The transplant is therefore not just a neat demonstration. It is the experimental signature of the entire design: the empirical proof that doloop occupies a position neither incumbent can reach, and the artifact a skeptic should demand.

5. The site-class taxonomy: where a convention lives

One more piece of the engine matters, because it is where doloop's generality is both real and bounded. When doloop infers a convention, it also has to know where in the code that convention lives. Conventions inhabit a small number of site classes.

ROLE: the convention attaches to a code element's role. Example: request handlers are async. The rule is about what async-ness means for things playing the handler role.

SURFACE: the convention attaches to a surface the code presents. Example: all diagnostic emitters use logging, not print. The rule is about a whole class of output sites.

OBJECT-LIFETIME: the convention attaches to the lifecycle of a resource. Example: a resource acquired on an object must have a release path. The rule is about acquisition and release being paired.
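To make the OBJECT-LIFETIME class concrete, here is a minimal sketch of one possible lens: it flags classes that acquire a resource but define no release path. Treating a bare `open(...)` call as the acquisition and a `close` or `__exit__` method as the release is an assumption chosen for the example, as are the function name and the sample classes; a real lens would have to be scoped to where a given house actually keeps this convention.

```python
import ast

# Hypothetical sample: one class pairs acquisition with release, one does not.
SOURCE = '''
class Good:
    def __init__(self, path):
        self.fh = open(path)
    def close(self):
        self.fh.close()

class Leaky:
    def __init__(self, path):
        self.fh = open(path)
'''

def classes_missing_release(source, acquire="open", release_methods=("close", "__exit__")):
    """Return sorted (class_name, line) pairs for classes that acquire without a release path."""
    tree = ast.parse(source)
    missing = []
    for cls in (n for n in ast.walk(tree) if isinstance(n, ast.ClassDef)):
        # Acquisition: a direct call to the named function anywhere in the class body.
        acquires = any(
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == acquire
            for node in ast.walk(cls)
        )
        # Release path: any method of the class whose name is in release_methods.
        methods = {
            n.name for n in cls.body
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
        }
        if acquires and not methods & set(release_methods):
            missing.append((cls.name, cls.lineno))
    return sorted(missing)

print([name for name, _ in classes_missing_release(SOURCE)])
```

The check attaches to a class's resource lifecycle rather than to a role or an output surface, which is what distinguishes this site class from the other two.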

The engine ports across languages and even across domains: the same machinery that learned Python conventions learned TypeScript ones. But each new domain's lenses, the things doloop looks through to find conventions, must be re-scoped to where that domain keeps its conventions. doloop calls this the kysely lesson. You cannot assume TypeScript keeps its conventions in the same site classes Python does. Porting the engine is cheap. Porting the lenses is real work, and skipping it produces a false positive: a lens fitted to one house's idioms that fails on the next.

6. Why it works: the intellectual lineage

What follows is lineage, not proof, kept in a separate register from the mechanical claims above. But it matters, because doloop's design did not come from nowhere, and understanding the ancestry is the best way to understand why one should expect the approach to work at all.

The oldest root: svadharma, the law proper to one's own place

The host-relative idea is ancient. The Bhagavad Gita, traditionally placed around 3000 BCE though modern scholarship dates the text much later, turns on svadharma: the principle that one's own proper law, even imperfectly carried out, takes precedence over another's law performed well. The same act is right in one station and wrong in another, because each has a standard proper to itself. Translated to code, that is exactly the transplant. The same function is correct in one codebase and wrong in its sibling, not because either is better, but because each house keeps its own law. doloop takes no position on which law is good. It reads the law each codebase already keeps, and holds new code to that one. The lineage that follows, from Alexander to the naturalness literature, is the modern restatement of a very old observation.

The architect: Christopher Alexander and the order you can feel but not write

In the 1970s the architect Christopher Alexander argued, across "A Pattern Language" in 1977 and "The Timeless Way of Building" in 1979, that living structures, whether towns or buildings or rooms, possess a real, binding, recognizable order that their makers feel but cannot fully put into words. He called it the quality without a name: objective and precise, he insisted, yet impossible to name directly. A great old town has a rightness to it that was not designed by writing down rules. It emerged from countless local decisions that were individually small and collectively coherent.

Alexander's 1965 essay "A City is Not a Tree" sharpens the part most relevant to doloop. He contrasted two abstract structures. A tree is a strict hierarchy where every element belongs to exactly one parent. A semilattice is a richer structure where elements can overlap and belong to several overlapping sets at once. Planned, artificial cities, he argued, are trees. Living, natural cities are semilattices. The relevance is direct. A codebase's conventions overlap. A single line of code can simultaneously participate in the handlers-are-async convention, the diagnostics-use-logging convention, and the resources-get-released convention. They are overlapping sets, not a clean hierarchy. doloop's site-class taxonomy is, in effect, a way of reading the semilattice of a codebase's conventions rather than forcing it into a tree of configured rules.

There is a pleasing closing of a loop here. In October 1996, Alexander gave the keynote at OOPSLA '96, the major object-oriented programming conference, to a software community that had enthusiastically adopted his patterns idea in the form of the design-patterns movement. The talk, later published in IEEE Software in 1999, is candid about his surprise at being useful to computer science, and gently challenges the programmers. He could see they had borrowed the structure of patterns, but he had not yet seen evidence of the deeper thing he cared about, which he described as the capacity to produce a living, coherent whole. doloop's claim, in Alexander's terms, is modest and concrete: a codebase has a real, binding, measurable order that was never fully written as rules, and the job of the gate is to measure it and protect it. The philosophy is a source of intuition, not a substitute for the demonstrated results above.

The legal cousin: a standard that points at a corpus

The same tension Alexander describes, an order that is real and binding yet never fully written down, has a precise echo in law, and it sharpens the most important constraint on what doloop is allowed to enforce. In Anderson v. City of Issaquah (1993), a Washington appeals court struck down a municipal design code that required new buildings to be "interesting" and "harmonious" with their surroundings. The standard was binding in principle, but it was void for vagueness in practice: it pointed at a quality without telling anyone how to satisfy it, so no applicant could know in advance what would pass, and no reviewer could justify a rejection on anything firmer than taste. The lesson generalizes past zoning. A standard that gestures at a quality is unenforceable. A standard that points at a corpus is enforceable, because the corpus is checkable. "Be harmonious" fails. "Match the rate the rest of this codebase already keeps, stated as a number" survives, because every clause of it can be inspected. This is exactly the line doloop draws for itself. It never issues a verdict it cannot ground in an observed, cited rate over the house's own code. The 70% floor is in part a vagueness guard: below it, the corpus has not said anything clear enough to enforce, so the gate owes silence rather than an opinion dressed as a rule. doloop only issues standards that point at a corpus, which is the only kind a fair reviewer, in code or in court, is entitled to enforce.

The direct technical ancestors: naturalness, localness, and Naturalize

The line from intuition to engineering runs through a specific body of software-engineering research, and doloop's debt here is precise.

In 2012, Abram Hindle, Earl Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu published "On the Naturalness of Software" at the International Conference on Software Engineering, later republished in Communications of the ACM in 2016. Their finding was that code, like natural language, is natural, meaning highly repetitive and statistically predictable, because programmers under practical constraints are not very creative most of the time. Code can be modeled with the same statistical language models that power speech recognition, and is in fact more regular than English. That finding is the foundational permission slip for everything that follows. If code were random, there would be no conventions to infer.

In 2014, Zhaopeng Tu, Zhendong Su, and Premkumar Devanbu published "On the Localness of Software" at the Foundations of Software Engineering conference. They showed that the naturalness of software is not only global but local. Individual projects have their own regularities, their own repeated local patterns, that a global model misses. They introduced a cache language model to capture these project-local regularities and showed it substantially improved on the pure global model. This paper is the direct technical ancestor of doloop's mores: the formal demonstration that project-local convention is real, measurable, and distinct from universal convention.

Also in 2014, and most important, Miltiadis Allamanis, Earl Barr, Christian Bird, and Charles Sutton published "Learning Natural Coding Conventions" at the same conference, introducing a system called Naturalize that learns a codebase's style and suggests revisions to improve its consistency. Naturalize is doloop's closest ancestor in two ways. First, it is the source of the laws-versus-mores vocabulary that doloop uses, taken directly from here. Second, and tellingly, the paper explicitly identifies the mores, the local and emergent consensus conventions, as inherently difficult to codify. That sentence is the seam doloop set out to work. Naturalize suggested. It did not gate, it was not deterministic in the verdict-replay sense doloop requires, and it treated the mores as a known-hard frontier. doloop's contribution is to operationalize precisely that frontier, to turn the inherently-difficult-to-codify mores into deterministic, replayable, auditable gating verdicts. The transplant is the proof that this operationalization actually works.

D6 · Intellectual lineage, 1964 to 2026

[Figure D6: timeline. Bhagavad Gita, svadharma (~3000 BCE, trad.) · 1964 Mosteller & Wallace · 1965 Alexander, city/tree · 1977 Pattern Language · 1996 OOPSLA keynote · 2002 Burrows' Delta · 2012 naturalness · 2014 localness and Naturalize (laws & mores) · 2026 doloop adds deterministic + gating + the transplant proof]
Fig. D6. The host-relative idea is old: svadharma is one's own proper law. Measured here from the code lineage, antiquity to 2026.

The prose lineage: stylometry (designed, not demonstrated)

doloop's design extends to prose and documents, not just code, and there the ancestry is stylometry, the statistical study of writing style. The founding result is Frederick Mosteller and David Wallace's 1964 study resolving the disputed authorship of the Federalist Papers by counting function-word frequencies, the little words like "upon" and "while" that authors use unconsciously and consistently. In 2002, John Burrows formalized this into Burrows' Delta, a function-word-frequency distance that became a standard for authorship attribution. doloop's prose layer is conceived as house-distance descending from this lineage: a house has a measurable written voice, and a document either matches it or does not. This is explicitly a designed surface, not a demonstrated one. It is built on a real and well-validated lineage, but it is not yet shown to gate prose the way the transplant shows it gates code.
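
The house-distance idea can be pictured with a small Delta-style calculation. The sketch below is illustrative only and is not doloop's prose layer, which the text describes as designed rather than demonstrated. It assumes a tiny hand-picked function-word list, whitespace tokenization, and a distance measured from a candidate document to the mean profile of the house's documents; all three are simplifications chosen for the example.

```python
from collections import Counter
from statistics import mean, pstdev

# Illustrative function-word list; a real study would use a much longer one.
FUNCTION_WORDS = ["the", "of", "and", "to", "upon", "while"]


def relative_freqs(text, words=FUNCTION_WORDS):
    """Relative frequency of each function word, using naive whitespace tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in words]


def house_distance(candidate, house_docs, words=FUNCTION_WORDS):
    """Delta-style distance: mean absolute z-score of the candidate's
    function-word frequencies, standardized against the house documents."""
    profiles = [relative_freqs(doc, words) for doc in house_docs]
    cand = relative_freqs(candidate, words)
    z_scores = []
    for i in range(len(words)):
        column = [p[i] for p in profiles]
        mu, sigma = mean(column), pstdev(column)
        if sigma == 0:
            continue  # the house never varies on this word, so it is skipped
        z_scores.append(abs((cand[i] - mu) / sigma))
    return mean(z_scores) if z_scores else 0.0


house = [
    "the cat sat upon the mat and the dog",
    "the bird flew upon the tree and sang to the sky",
    "the fox ran to the hill and slept upon the grass",
]
near = "the owl sat upon the branch and looked to the moon"
far = "it rained while we walked while they slept"
print(house_distance(near, house) < house_distance(far, house))
```

A smaller distance means the candidate sits closer to the house's measured voice. Where a gate would draw its threshold is exactly the part the text marks as not yet demonstrated.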

7. The evidence: the barometer, the precision number, and the floor

A design is only as good as its measurements. doloop offers three classes of evidence.

The AI Code Barometer (a live, public instrument)

doloop runs a public instrument that scans tens of millions of code files across seven languages, counting what code drawn from the public training corpus fails on and measuring the cross-house consistency of each candidate convention. Its value is independent confirmation. It re-derives the law and mores catalog from a far larger sample than the transplant experiments used, and the catalog holds.

The underlying catalog was built by profiling 194,119 repositories drawn from a larger crawl, across six structural convention dimensions, and separating laws from mores by a variance measure normalized for how common each convention is. The barometer then re-confirmed the split on 28 fresh repositories. The laws show near-universal consistency: acts-on-errors with variance about 0.004, single-return about 0.006, immutable-defaults about 0.0001. Every house does these the same way, which is exactly what makes them safe to hard-block. The mores show high cross-house variance: typed with variance about 0.114 and spanning the full range from 0.00 to 1.00, logging about 0.095, has-docstring about 0.033.
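
To make the law-versus-more split concrete, here is a minimal sketch of separating conventions by cross-house variance. The text does not specify the normalization, so the one used here is an assumption: variance divided by p(1 − p), the largest variance a set of rates in [0, 1] can have at mean p. The cutoff value and the per-house rates are also hypothetical, picked only to illustrate the shape of the calculation.

```python
from statistics import mean, pvariance

# Hypothetical cutoff on normalized variance; not a value from the text.
MORE_THRESHOLD = 0.25


def normalized_variance(rates):
    """Cross-house variance scaled by its ceiling p * (1 - p).

    For rates bounded in [0, 1] with mean p, the variance cannot exceed
    p * (1 - p), so the ratio lands in [0, 1] however common the
    convention is overall.
    """
    p = mean(rates)
    ceiling = p * (1 - p)
    return pvariance(rates) / ceiling if ceiling > 0 else 0.0


def classify(rates, threshold=MORE_THRESHOLD):
    """Label a convention 'law' (houses agree) or 'more' (houses differ)."""
    return "more" if normalized_variance(rates) >= threshold else "law"


# Hypothetical per-house rates for two conventions.
law_like = [0.97, 0.99, 0.98, 1.00, 0.96]          # every house does it
more_like = [1.00, 0.98, 0.79, 0.60, 0.02, 0.00]   # houses split

print(classify(law_like), classify(more_like))
```

The point of normalizing is that a convention kept at 98% everywhere and one kept at 50% everywhere should both read as laws. Raw variance alone would not separate "rare but uniform" from "common but uniform" as cleanly.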

The cleanest example a reader can picture is the typed convention, whether a house annotates its code with explicit types. The web framework starlette types essentially all of the relevant sites, while django types essentially none, and both are internally consistent. Both houses have settled the question. They have simply settled it in opposite directions. A single global rule, whichever way it pointed, would be wrong for one of them, which is the whole reason this convention is a more and not a law.

D7 · One convention, opposite settled houses

[Figure D7: typed rate by house, in percent, on an axis from 0% to 100% with the floor marked at 70%]
starlette 100 · mypy 100 · requests 100 · pydantic 99 · aiohttp 98 · flask 93 · typer 90 · fastapi 79 · redis-py 77 · sqlalchemy 67 · pandas 60 · scikit-learn 12 · numpy 2 · django 0
Fig. D7. Green holds typed; slate holds the opposite, also settled; amber is below the floor, silence owed. Source: AI Code Barometer, live, rates measured.

A snapshot. The same split is rendered live, and grows as the scan widens, on the AI Code Barometer.

The barometer also makes doloop's two kinds of silence visible, which is itself a confirmation that the silence machinery behaves as designed. A convention can be displayed as amber, meaning there is enough data but the house shows no settled direction, with consistency below the floor. Or it can be displayed as a dot, meaning there are too few sites to read the house at all. Both are honest "we do not know" states.
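
The display states described above can be sketched as a small classifier. This is an illustration under stated assumptions, not the barometer's code: the 70% floor comes from the text, but the minimum site count is a hypothetical placeholder, and treating the floor symmetrically (a house that avoids the convention 88% of the time is as settled as one that keeps it 88% of the time) is read off the Fig. D7 caption rather than stated as a rule. Integer arithmetic is used so the boundary does not depend on floating-point rounding.

```python
FLOOR_PERCENT = 70   # inference floor, from the text
MIN_SITES = 10       # hypothetical: fewer sites than this is "too few to read"


def house_state(conforming, total):
    """Classify a house on one convention.

    Returns 'dot' (too few sites), 'amber' (enough data, no settled
    direction), 'holds' (settled for the convention), or
    'holds-opposite' (settled against it).
    """
    if total < MIN_SITES:
        return "dot"
    majority = max(conforming, total - conforming)
    if majority * 100 < FLOOR_PERCENT * total:
        return "amber"
    return "holds" if conforming * 2 >= total else "holds-opposite"


# Typed rates from Fig. D7, assuming 100 sites per house for illustration.
for name, rate in [("starlette", 100), ("redis-py", 77), ("sqlalchemy", 67),
                   ("pandas", 60), ("scikit-learn", 12), ("django", 0)]:
    print(name, house_state(rate, 100))
print("tiny-repo", house_state(3, 4))
```

Both silent states return before any direction is assigned, which is the property the paragraph above is pointing at: the absence of a verdict is a distinct, reportable outcome, not a default.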

Precision: the false-block rate (demonstrated, command-verified by hand)

The number that matters most to anyone who would actually run a gate in their pipeline is the false-block rate: how often does the gate stop work that should have been allowed? doloop's answer is approximately 0.7%, command-verified by hand. The verification ran across 578 accepted, meaning merged, commits drawn from nine repositories, on which the block lens fired 11 times. A naive reading of those raw numbers gives roughly 1.9%. Inspecting the fires by hand showed that most were real anti-patterns the maintainers had merged anyway. For example, in the tornado project the gate flagged genuine mutable-default arguments, a list and a dictionary used as default parameter values, confirmed by reading the source. The rate dropped to about 0.7% once host-relative flooring was applied, because tornado holds that particular convention below the 70% floor, so the product correctly stays silent there rather than flagging. The walk-down from 1.9% to 0.7% is itself an illustration of the floor doing its job. As separate corroboration from an earlier holdout, the gate produced a 0.4% false-block rate, 1 block in 258 accepted commits, across four Python repositories. The bottom line is that the gate stays silent on the vast majority of merged human work, which is the only way a gate earns the right to live in a pipeline. These are precision and specificity figures, in-sample where noted. The wild false-positive rate, on fully unseen repositories, still warrants a vocabulary-blind holdout to nail down.
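
The arithmetic behind these figures is short enough to check directly. The sketch below recomputes the raw and holdout rates from the counts given in the text. The floored count is not reported, so the figure of about four remaining fires is a back-of-envelope inference from 0.7% of 578, not a number taken from the source.

```python
from fractions import Fraction


def false_block_rate(blocks, accepted_commits):
    """Share of accepted (merged) commits on which the block lens fired."""
    return Fraction(blocks, accepted_commits)


raw = false_block_rate(11, 578)       # counts from the text
holdout = false_block_rate(1, 258)    # earlier holdout, from the text

# Inference, not a reported figure: how many fires 0.7% of 578 implies.
implied_floored_fires = round(0.007 * 578)
floored = false_block_rate(implied_floored_fires, 578)

print(f"raw:     {float(raw):.1%}")      # 1.9%
print(f"floored: {float(floored):.1%}")  # 0.7%
print(f"holdout: {float(holdout):.1%}")  # 0.4%
```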

The floor behaves (demonstrated, reproduced)

The inference floor is not a soft preference. It is a hard, tested threshold. Dialing a house's consistency down from 100%, the gate holds, meaning it flags the real deviation with no false flag, at consistency of 0.70 or above, and goes silent at 0.69. This behavior was reproduced three times, deterministically. A one-point difference in observed consistency flips the gate between held and silent, exactly as specified, every time. It is both a feature demonstration and a passed falsifier.
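
The floor test described above can be mimicked with a few lines. This is a sketch of the behavior, not doloop's implementation: the function name, the three verdict labels, and the use of 100 sites are assumptions made for the example. Counts are compared as integers so that the boundary at 70 is exact.

```python
def gate(conforming, total, new_site_conforms, floor_percent=70):
    """Verdict for one new site against one house convention.

    'silent' when the house is below the floor, otherwise 'pass' or
    'flag' depending on whether the new site follows the convention.
    """
    if conforming * 100 < floor_percent * total:
        return "silent"
    return "pass" if new_site_conforms else "flag"


def sweep():
    """Dial house consistency down from 100% to 68% with a deviating site."""
    return {pct: gate(pct, 100, new_site_conforms=False)
            for pct in range(100, 67, -1)}


runs = [sweep() for _ in range(3)]
print(runs[0][70], runs[0][69])
print(runs[0] == runs[1] == runs[2])
```

Because the function has no state and no randomness, the three sweeps are identical by construction, which is the property the reproduction in the text is testing for in the real gate.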

How much of the world this can reach

A market-shaped question follows naturally: what fraction of real repositories even have a readable convention for the gate to enforce? Measured on a sample of 210 repositories across just three convention axes, about 88% hold at least one readable convention, with an error band of roughly four points. That figure is a lower bound, because more axes can only raise it. The remaining 12% are data-poor or greenfield repositories, which are contingently addressable through a neighborhood prior, the same nearest-neighbor mechanism that infers mores, applied to lend a new house its street's conventions until it develops its own. That neighborhood-prior path is designed but not yet validated, so the 12% stays explicitly conditional. On the demonstrated mechanism, the gate's addressable market is most of the repositories it would meet, not a niche.
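
The error band on the 88% figure can be sanity-checked with a standard binomial approximation. The text does not say how its band was computed, so the 95% normal-approximation interval below is an assumption. It is shown only to confirm that a band of roughly four points is the right order of magnitude for a sample of 210.

```python
import math


def proportion_band(p, n, z=1.96):
    """Normal-approximation interval for a sample proportion.

    z = 1.96 corresponds to a 95% interval; this choice is an assumption,
    since the text does not state the confidence level.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    half_width = z * standard_error
    return p - half_width, p + half_width


low, high = proportion_band(0.88, 210)
half_width_points = (high - low) / 2 * 100
print(f"{low:.3f} to {high:.3f}, half-width {half_width_points:.1f} points")
```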

8. Two layers, one verdict: a deterministic core, a bounded reader, one auditable answer

doloop catches the structural conventions a codebase keeps, deterministically: same code in, same verdict out, byte-identical on replay, each finding grounded in a cited rate over the house's own code. No model runs in that lane.

Conventions that turn on meaning rather than form, the ones an AST has no handle on, are read by a caged reader: a model pinned to a version, handed one constrained question, its output schema locked and its verdict cached, so the same input returns the same verdict. Each reader conversion is checked against held-out host code, so a verdict that fires on a violation stays silent on the legitimate sites around it.
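
One way to picture the cage is as a thin wrapper around a model call. The sketch below is hypothetical throughout: the class, the model version string, the two-value schema, and the `ask_model` callable are all invented for illustration, and no real model API is used. It shows only the three properties the paragraph names: a pinned version, a locked output schema, and a cache that makes the same input return the same verdict.

```python
import hashlib
import json

MODEL_VERSION = "reader-model-pinned-v1"   # hypothetical pinned identifier
ALLOWED_VERDICTS = {"violation", "ok"}     # hypothetical locked schema


class CagedReader:
    def __init__(self, ask_model, model_version=MODEL_VERSION):
        self._ask = ask_model          # callable(version, question, snippet)
        self._version = model_version
        self._cache = {}
        self.model_calls = 0

    def _key(self, question, snippet):
        """Stable cache key over everything that could change the answer."""
        payload = json.dumps(
            {"model": self._version, "question": question, "snippet": snippet},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def verdict(self, question, snippet):
        key = self._key(question, snippet)
        if key not in self._cache:
            self.model_calls += 1
            answer = self._ask(self._version, question, snippet)
            if answer not in ALLOWED_VERDICTS:
                raise ValueError(f"answer outside locked schema: {answer!r}")
            self._cache[key] = answer
        return self._cache[key]


def flaky_stub(version, question, snippet):
    """Stand-in for a model whose raw answers vary from call to call."""
    flaky_stub.calls = getattr(flaky_stub, "calls", 0) + 1
    return "violation" if flaky_stub.calls % 2 else "ok"


reader = CagedReader(flaky_stub)
question = "Does this handler swallow the error?"
first = reader.verdict(question, "except Exception: pass")
second = reader.verdict(question, "except Exception: pass")
print(first == second, reader.model_calls)
```

Including the model version in the key means that re-pinning to a new version invalidates old verdicts rather than silently mixing them. That is a design choice of this sketch, not something the text specifies.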

The two layers compound. When a judgment the reader makes rests on a clean structural pattern, that pattern becomes a new deterministic rule and the verdict crosses into the byte-identical lane. The gate learns from the reader, and what it learns it keeps.

9. The competitive frame and the novelty argument

The market doloop enters is crowded, well-funded, and growing fast, but it is crowded along axes that leave doloop's specific position open. The cleanest way to see this is a three-prong test. doloop's claim is that no shipping commercial product combines all three of: corpus-inferred, meaning it learns conventions from the codebase itself with no configured rulebook; deterministic, meaning same code in and same verdict out, byte-identical on replay; and gating, meaning it actually blocks rather than merely suggesting or commenting.

Walk the field against that test, characterizing the nearest misses fairly.

Linters and static analysis, such as SonarQube, ESLint, and Semgrep, are mature, powerful, deterministic, and gating. SonarQube ships thousands of rules across dozens of languages with a mature quality gate that blocks pipelines, and is widely used. But every one of them is configured, not inferred. Someone has to write or select the rules. That is precisely the failure mode for unwritten conventions: if a human had to write down the house's local mores, they would not be unwritten, and the hardest ones are unwritten exactly because they are hard to articulate. Linters satisfy the deterministic and gating prongs and miss the inferred prong, which is the whole point.

AI code reviewers, such as CodeRabbit, Greptile, Qodo, and Cursor's BugBot, are the hot category. CodeRabbit reached a substantial revenue run-rate and a valuation in the hundreds of millions within about two years of founding, serving thousands of paying customers, evidence the market is paying. These tools build a code graph, run many existing linters and static-analysis tools, and overlay LLM reasoning to produce human-style review comments, and some can catch real cross-file bugs. But they are all LLM-based and therefore non-deterministic. This is not a marketing jab. A CodeRabbit representative said it plainly in a public forum, describing the product as a non-deterministic workflow that will have misses and does not claim total bug coverage. Non-determinism is precisely why one such tool runs multiple analysis passes with randomized ordering and a majority vote to reduce the noise that variance creates, and why the tools that chase higher catch rates tend to pay for it in false positives. These tools touch the inferred prong partially, since some learn from your pull-request history, and the gating prong, since some can block, but they structurally cannot satisfy the deterministic prong. Without it, the transplant is impossible and the verdicts are not replayable or auditable in doloop's sense.

IDE convention-inference, such as Microsoft's IntelliCode and JetBrains' detectors, is the nearest miss on the inferred prong, and deserves scrupulous treatment. IntelliCode infers conventions from your codebase: it can generate a configuration file that matches the conventions already in your code, a real shipping inference feature, motivated by the finding that a meaningful share of pull-request review comments concern conventions, style, and naming. JetBrains IDEs similarly detect things like indentation from existing code. But these infer only formatting, the shallow end of convention, and they do not gate. They emit suggestions at advisory severity, marks in the editor, not pipeline blocks. They satisfy the inferred prong narrowly and miss the deterministic-gating combination in the senses that matter. The conceptual precedent is real and worth acknowledging. The realized capability is a different and much smaller thing.

Stated as a matrix, the gap is immediate:

Category | Corpus-inferred | Deterministic | Gating
Linters / static analysis (SonarQube, ESLint, Semgrep) | no (configured) | yes | yes
AI reviewers (CodeRabbit, Greptile, Qodo, Cursor BugBot) | partial (from stated feedback) | no (LLM) | partial
IDE inference (IntelliCode, JetBrains) | partial (formatting only) | yes | no (advisory)
doloop | yes | yes | yes

Only doloop fills all three columns, and the transplant is the experimental proof that filling all three is real rather than asserted.

D8 · The three-prong bar

[Figure D8: capability matrix. Columns: corpus-inferred, deterministic, gating. Rows: linters / static analysis (SonarQube, ESLint, Semgrep); AI reviewers (CodeRabbit, Greptile, Qodo, BugBot); IDE inference (IntelliCode, JetBrains; formatting only, advisory); doloop. Filled = yes, half = partial, hollow = no.]
Fig. D8. Only doloop is corpus-inferred, deterministic, and gating at once.

Why now? The three-prong gap has existed for years. What makes this the moment is the demand side. When humans wrote most code, local-convention drift was a slow leak a human reviewer could absorb. With roughly a third of new code at the largest software companies now machine-generated, fluent in the universal register and deaf to the particular, the leak became a flood, and human review capacity did not scale with it. The deterministic, inferred, gating niche went from nice-to-have to the missing layer, which is also why even the LLM-reviewer incumbents now talk about quality gates. doloop's bet is that when the dust settles, the gate that teams will actually trust to block is the one whose verdicts they can replay and audit.

10. Where it stands, and where it is going

The whole product on the claim ladder, surface by surface:

Code in Python: the differentiator, the transplant, is demonstrated, with three sibling pairs and two convention classes, leave-one-out calibrated and deterministic. Precision, the roughly 0.7% floored false-block rate, is demonstrated, command-verified by hand. The inference floor is demonstrated, reproduced three times.

Code in TypeScript: the transplant is first-celled. The arrow-versus-function-declaration more flips by hand between hono and kysely, surviving ESLint-subtraction, but it is not yet reproduced across many pairs at scale, and it is a syntactic convention, which is thinner than Python's behavioral conventions.

The barometer: built and live, a public instrument over tens of millions of code files, independently re-confirming the law and mores catalog.

Coverage: a deterministic gate for the structural conventions and a caged reader for the judgments that need meaning, with the gate absorbing reader judgments that rest on a clean structural pattern.

Prose: built at the mechanical layer and designed above it. A deterministic citation-checksum lens, checking the internal validity of references such as ISBNs and DOIs, is built and produced zero false positives on more than three thousand real citations. The broader prose verdict is designed to come from house-distance, deviation from the publication's own measured style, the same host-relative thesis restated for words, and is not yet demonstrated at scale.
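
A checksum lens of this kind is deterministic by nature, and the ISBN case is easy to show. The sketch below validates ISBN-10 and ISBN-13 check digits using the standard published weightings. It is an illustration of the technique, not doloop's lens, and it covers ISBNs only; the function names are invented for the example.

```python
DIGITS = "0123456789"


def _strip(raw):
    """Remove hyphens and spaces and normalize a trailing x to X."""
    return raw.replace("-", "").replace(" ", "").upper()


def isbn10_valid(raw):
    """ISBN-10: weights 10 down to 1, sum divisible by 11; X means 10."""
    s = _strip(raw)
    if len(s) != 10:
        return False
    if any(c not in DIGITS for c in s[:9]) or s[9] not in DIGITS + "X":
        return False
    values = [int(c) for c in s[:9]] + [10 if s[9] == "X" else int(s[9])]
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def isbn13_valid(raw):
    """ISBN-13: alternating weights 1 and 3, sum divisible by 10."""
    s = _strip(raw)
    if len(s) != 13 or any(c not in DIGITS for c in s):
        return False
    return sum(int(c) * (1 if i % 2 == 0 else 3)
               for i, c in enumerate(s)) % 10 == 0


print(isbn13_valid("978-0-306-40615-7"), isbn13_valid("978-0-306-40615-8"))
print(isbn10_valid("0-306-40615-2"), isbn10_valid("0-8044-2957-X"))
```

A check like this can only say that a reference is internally well-formed. It cannot say the cited work exists or says what the author claims, which is why the text scopes the lens to internal validity.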

Documents: the extraction surface is live, and the convention-gate is designed. WYSIWYD pulls a table from a PDF deterministically and ties every cell to its place on the page, with zero errors across 2,332 audited cells; that ships today. Applying the convention-gate method to documents, judging a document against a publication's own conventions the way the code gate judges a commit, is designed, not yet built.

D9 · Live today, and the roadmap

[Figure D9: live today versus roadmap]
Live today: Code gate, Python (the transplant) · AI Code Barometer · Citation-checksum lens (prose) · The measured precision result
On the roadmap: Code, TypeScript depth · Prose, house-distance at scale · Documents · More languages
Fig. D9. Solid bars ship today; outlined bars are on the roadmap, not yet built.

The hardest technical risk, whether an inferred deterministic gate can produce house-following verdicts at all, is retired: the transplant answers yes, in two languages. The largest remaining surface is breadth, more languages, then prose, then documents, each inheriting the engine but requiring the kysely-lesson re-scoping of lenses to where that domain keeps its conventions.

Recommendations

These are for three readers: a technical buyer weighing the tool, an investor or advisor weighing the company, and a prospective hire weighing the team.

For the technical buyer or pilot lead. Run doloop in shadow mode, meaning warnings only and no blocking, on one or two repositories with strong, settled conventions for two to four weeks. The benchmark that should govern whether you turn on gating is the false-block rate on your own merged history. doloop reports roughly 0.7% floored on its internal corpus, and you want to confirm it stays well under about 1% on yours before letting it block. If it floods you with verdicts on a repository, that is a signal the repository is below the inference floor on the flagged conventions, which is information, not failure. Pair doloop with, rather than as a replacement for, your security scanner and your functional tests. Its lane is convention fit, a complement to those, not a substitute: the deterministic layer for the structural shapes, the reader layer for the semantic judgments.

For the investor or advisor. Treat the transplant as the de-risking event it is. It is the proof that the category position, inferred and deterministic and gating, is occupiable, and it is demonstrated, not promised. The LLM-reviewer comparables show the market is real and paying, and doloop's differentiation is determinism and auditability, which those comparables structurally lack. The highest-leverage work is the loop where the caged reader's judgments harden into deterministic rules, widening the lane whose verdicts replay and audit, alongside expanding the maintainer review-rejection corpus.

For the prospective hire. The clearest places to make a mark are the lens roadmap, growing the mechanical lane, and the kysely-lesson re-scoping for new languages and domains. Both are named, both are hard, and both are decisive.

Caveats

doloop's internal results, the transplant, the roughly 0.7% false-block rate, the floor reproductions, and the barometer figures, are reported at the stated scope tags and have not been independently reproduced by a third party. Treat them as internal until external replication exists.

doloop does not claim a coverage percentage. What fraction of a given codebase's reviewer-flagged issues fall to the deterministic gate versus the caged reader is not a number doloop has measured.

Some external figures are self-reported or estimated. The Microsoft and Google AI-code shares are executive statements on calls or at conferences, with measurement methods undisclosed, useful as direction rather than precision. Competitor revenue and valuation figures come from company statements and third-party estimates. Competitor catch-rate claims are vendor or comparison sources, are not standardized across tools, and where they are high they tend to come bundled with higher false-positive rates, so they should be read as a recall-versus-noise tradeoff rather than a clean win.

The lineage is intuition, not proof. The Alexander, naturalness, localness, and stylometry material explains why the approach should work and where its vocabulary comes from; the demonstrated results are the evidence that it does.

Designed surfaces may not pan out. Prose and document gating are extrapolations from a code result and a separate, venerable stylometric lineage. It is plausible that house-distance gates prose as cleanly as the transplant gates code. It is not demonstrated, and the analogy could break.

The differentiator is demonstrated: the transplant, in two languages. The market is measured and large: AI now writes roughly a third of new code at the biggest firms, trust has fallen, and review is the bottleneck. The instrument is live: the barometer, scanning tens of millions of code files, re-confirms the catalog. The novelty is scoped: inferred and deterministic and gating, a combination no shipping competitor fills. Precision is in hand: a roughly 0.7% floored false-block rate, verified by hand. And coverage works across two layers: a deterministic gate for the structural conventions and a caged reader, pinned and cached, for the rest.