Orient
What is doloop?
doloop is the deterministic verification gate for AI output. A model optimizes for plausible, not correct, and it cannot catch its own mistakes, because self-review runs the same engine with the same blind spots as generation. So we put the check outside the model: send any AI output (code, prose, a table, a conversation), and a roster of rule-based checks (we call them donkeys) flags the specific, located failures it can verify, with the same verdict on every run. It catches what a tired reviewer misses, at a scale no human can match, and hands the findings back so the model can fix them before they ship.
Think of it as a stubborn, tireless inspector that catches the subtle mistakes and made-up facts before they reach you. You bring your own model; we never touch it.
Go deeper: Same code, opposite verdicts → · It isn't about catching bugs →
Same code, opposite verdicts
The transplant is our signature result, and it shows doloop reads the dialect of your codebase rather than a universal rulebook. Take one function that is correct and idiomatic in its home repository (say flask), drop it unchanged into a sibling that does similar work but keeps different unwritten conventions (say quart), and doloop returns opposite verdicts: it passes at home and is flagged in the foreign house. The verdict follows the house, not the function. We have shown it in Python (sync vs async dispatch), TypeScript (function-declaration style), and COBOL (scope-termination dialects).
It is a capability two incumbent tools structurally cannot have.
Linters and static analyzers run one configured ruleset everywhere, so the same code must earn the same verdict wherever it lands. Producing opposite verdicts would need a human to already know and write down every project's local conventions, which is exactly the work doloop removes by reading them from the code.
LLM reviewers are non-deterministic: the same code can earn different verdicts across runs, so they cannot give an opposite-but-stable verdict you can audit.
The transplant is where both collapse. doloop is corpus-inferred, so it follows the house, and deterministic, so the verdict is stable and replayable.
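A minimal sketch of the transplant, assuming a toy representation in which each house is reduced to a list of dispatch styles observed at one role. The function names, the "sync"/"async" labels, and the counts are hypothetical illustrations, not doloop's API or measured data; the 0.70 threshold is the inference floor described in this document.

```python
from collections import Counter

FLOOR = 0.70  # inference floor described in this document


def infer_convention(observed_styles, floor=FLOOR):
    """Return the dominant style if it clears the floor, else None."""
    if not observed_styles:
        return None
    style, count = Counter(observed_styles).most_common(1)[0]
    return style if count / len(observed_styles) >= floor else None


def verdict(candidate_style, house_styles):
    """Judge one candidate against the convention its house already keeps."""
    convention = infer_convention(house_styles)
    if convention is None:
        return "silent"  # the house has not settled the question
    return "pass" if candidate_style == convention else "flag"


# Hypothetical counts: a sync-dispatch house and an async-dispatch sibling.
sync_house = ["sync"] * 48 + ["async"] * 2
async_house = ["async"] * 45 + ["sync"] * 5

candidate = "sync"  # the same function, dropped unchanged into each house
home = verdict(candidate, sync_house)
foreign = verdict(candidate, async_house)
print(home, foreign)
```

The candidate never changes between the two calls; only the corpus it is judged against does, which is why the verdict follows the house. Because the inference is plain counting, re-running either call returns the same answer.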
Go deeper: Does it hold beyond Python? → · When doloop stays silent →
When doloop stays silent
The 70% inference floor is the threshold a pattern has to clear before doloop will treat it as a rule. Because we infer conventions from a codebase's own consistency rather than a written rulebook, a behavior (async handlers, a logging style) must hold in at least 70% of the places it could apply before the gate will judge new code against it.
Below the floor the gate stays silent and issues no verdict at all. That happens in two cases:
- Unsettled direction. If the house follows a pattern only, say, 60% of the time, it has not settled the question, so we do not pretend it has.
- Too little evidence. If there are too few sites to read a direction at all, there is nothing to infer.
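The two silent cases can be sketched as a small decision function. This is an illustration under stated assumptions: the function name and the `MIN_SITES` value are hypothetical (the text gives no minimum-evidence number), and only the 70% threshold comes from the description above.

```python
from fractions import Fraction

FLOOR = Fraction(70, 100)  # the 70% floor from the text
MIN_SITES = 10             # hypothetical: the text states no minimum site count


def floor_decision(following, applicable, floor=FLOOR, min_sites=MIN_SITES):
    """Decide whether a pattern is settled enough to be enforced.

    following:  sites where the pattern holds
    applicable: sites where the pattern could apply
    """
    if applicable < min_sites:
        return "silent: too little evidence"
    if Fraction(following, applicable) >= floor:
        return "rule"
    return "silent: unsettled direction"


print(floor_decision(60, 100))  # a 60% pattern is not settled
print(floor_decision(70, 100))  # exactly at the floor counts as settled
print(floor_decision(3, 4))     # too few sites to read a direction
```

Exact rational arithmetic is used here so the boundary case at exactly 70% is unambiguous; that is a choice made for this sketch, not a statement about how doloop computes it.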
The silence is the most important design decision in the system.
- It earns trust. A tool that always has an opinion is one you learn to ignore. Staying quiet on unsettled questions is what makes the gate worth listening to when it does speak.
- It guards against vagueness. The floor stops the gate from enforcing opinions dressed as rules; it only enforces standards grounded in the corpus.
- It keeps the noise down. It is part of how the gate holds a roughly 0.7% false-block rate instead of flooding you with flags on patterns the team never committed to.
Go deeper: What it can't do yet → · Same code, opposite verdicts →
It isn't about catching bugs
Our core thesis: the binding cost of AI code is fit, not correctness. Models are now very good at writing code that compiles and runs. The bottleneck has moved to whether that code fits the implicit, unwritten conventions of the codebase it lands in. A compiler or test suite tells you the code does what it should; only a reviewer, human or an inferred gate, can tell you it belongs here.
The evidence, each at its true size:
- Veracode 2025 GenAI Code Security Report. As models improved, the share of generated code that compiles climbed past 95%, but the security pass rate stayed flat at 45 to 55%, regardless of model size. Getting code to work and getting it right in context are different problems, and scale only solves the first.
- METR 2025 randomized controlled trial. Sixteen experienced developers, working in repositories they had maintained for years, were 19% slower with AI. The study pinned the cost on implicit requirements (documentation, test coverage, unwritten formatting) that people take time to learn and a model cannot see. One small study on early-2025 tools, not a universal law.
- Cognition FrontierCode (June 2026). Graded on whether a real maintainer would merge the change, the strongest models cleared only 13.4% of the hardest tasks. A representative failure passed every test and was rejected for routing warnings through the wrong logging helpers.
- GitClear, 211 million lines (2020 to 2024). Copy-pasted lines rose while moved (refactored) lines fell from 24.1% to 9.5%, the signature of code that fits its surroundings worse.
- The transplant. The same function earns opposite verdicts in two sibling repositories, which makes the point directly: rightness is often relative to the house, not the code.
Go deeper: Does it hold beyond Python? → · Same code, opposite verdicts →
The proof
Does it hold beyond Python?
The transplant is not a Python trick. The method (infer the convention the house already keeps, then check new code against it) is language-agnostic, so it travels language by language. What we have shown so far, each at its honest scope:
Python (demonstrated). Three sibling pairs, two convention classes: request dispatch flips sync vs async (flask/quart, requests/httpx), and diagnostics flip print vs log (doit/nox). Proven by hand, reproduced, deterministic.
TypeScript (first-cell). A real house vernacular: arrow functions vs function-declaration helpers at the exported-helper role. hono settles on arrow, kysely on declarations. It survives ESLint-subtraction, so doloop catches the convention even where the project's linter is told to ignore it. Shown once, not yet reproduced at scale.
COBOL (verified three ways). A dialect split between legacy bare-period and modern END-IF scope termination, holding 525 of 525 leave-one-out across seven houses, including two independent 1980s corpora with no shared authorship: the Norwegian national-insurance system and the US NIST COBOL-85 suite. Confirmed three ways: a column-aware tokenizer, the GnuCOBOL front-end, and a full ANTLR COBOL-85 syntax tree.
A fourth language or a fifth slots in the same way: find a convention the house keeps, then show the same code passing in one sibling and flagged in another. The list grows; the claim does not.
Go deeper: What we found in COBOL → · What it can't do yet →
What we found in COBOL
In COBOL, what we found is a clean dialect split on scope termination: modern systems (the AWS CardDemo reference) close an IF with explicit END-IF; deep-legacy systems close it with a bare period.
Precisely:
- Leave-one-out: the convention held on 525 of 525 held-out programs across seven houses.
- Independent corpora: the bare-period rule held at near 100% across two unrelated 1980s corpora with no shared authorship, the Norwegian national-insurance government system (DSF) and the US NIST COBOL-85 conformance suite.
- Three verification altitudes: a column-aware tokenizer, the GnuCOBOL compiler front-end (all programs), and a full ANTLR COBOL-85 syntax tree inspecting every IF terminator.
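The leave-one-out check above can be sketched as follows, assuming each program has already been reduced to one dominant scope-termination style. That reduction, the helper names, and the toy data are all hypothetical; the numbers below are invented for illustration and are unrelated to the 525 of 525 result.

```python
from collections import Counter


def dominant(styles, floor=0.70):
    """Return the style that clears the floor among `styles`, else None."""
    if not styles:
        return None
    style, count = Counter(styles).most_common(1)[0]
    return style if count / len(styles) >= floor else None


def leave_one_out(houses):
    """Hold out each program in turn and infer the convention from the rest.

    Returns (held, total): how many held-out programs matched the
    convention inferred from the remaining programs of their own house.
    """
    held = total = 0
    for programs in houses.values():
        for i, style in enumerate(programs):
            rest = programs[:i] + programs[i + 1:]
            total += 1
            if dominant(rest) == style:
                held += 1
    return held, total


# Toy data: one uniform house, one house with a single deviating program.
houses = {
    "modern_house": ["END-IF"] * 6,
    "legacy_house": ["period"] * 5 + ["END-IF"],
}
held, total = leave_one_out(houses)
print(f"{held} of {total}")
```

The held-out program never contributes to the convention it is judged against, which is what makes the count a test of the inference and not a restatement of the input.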
What it does not claim: this is a coherence gate, not comprehension. Knowing a house uses periods tells you nothing about what an IF decides. It does not claim to fully parse arbitrary COBOL; the check is the scope-termination law specifically. And there is no measured false-block rate for COBOL yet, because public repositories lack the merge history to compute one.
Go deeper: One number we can't give you yet → · What it can't do yet →
What it can't do yet
We are deliberate about what doloop is not.
It does not generate. doloop does not write code, redraw slides, or compose conversation. It reads the output; it never writes it for you.
It checks fit, not meaning. It is a coherence gate, not a comprehension engine. It can tell you a change does not belong in this codebase without claiming to understand what the code means.
Only the mechanical lenses are byte-deterministic (regex, clustering, arithmetic). Where a lens has to read for meaning or look at a layout, a caged, advisory reader does the work: bounded and labeled, not byte-reproducible. We tell you which is which.
It does not stop hallucination at the source. It is external resistance: the model has to fix its output until it passes the gate.
Maturity is uneven, and we tag it. Python is demonstrated; TypeScript is a first-cell (shown once, not yet at scale); COBOL makes no claim to fully parse arbitrary COBOL and has no measured false-block rate yet. Prose and documents are designed surfaces, extrapolated from the code result, and the analogy could break.
Some probes failed, and we retired them. Doc-drift performed at chance (1.4x) because it cannot read prose; name-drift had near-zero precision over 38,000 functions; a closed-form security guard-absence check failed because it needs inter-procedural taint, not syntax counting. All were demoted to advisory or dropped.
We do not claim a coverage percentage for how many reviewer-flagged issues the deterministic gate catches versus the advisory reader. The honest shape is two layers, not one headline number.
Go deeper: One number we can't give you yet → · Where a smart skeptic pushes back →
One number we can't give you yet
The one number we do not have yet is the false-block rate on a real customer's system, especially a specialized one like COBOL. Inside our own Python corpus the false-block rate floors around 0.7%, but the number that decides whether this is real for you is the rate on your code.
Only a customer can produce it, because measuring it needs your real accept-and-reject review history (your merge history). The gate's flags have to be checked against the actual decisions your maintainers made. Public repositories, especially in COBOL, do not carry that history, so a private codebase with real review decisions is the only place the precision number can be computed. That is the core of the design-partner ask.
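To make the measurement concrete, here is a rough sketch of how such a rate could be computed from review history. It assumes one plausible definition, the share of maintainer-accepted changes that the gate would have blocked; the text does not spell out the formula, and the record type, field names, and sample history are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewedChange:
    change_id: str
    gate_flagged: bool  # did the gate block this change?
    merged: bool        # did the maintainers accept it?


def false_block_rate(history):
    """Share of maintainer-accepted changes that the gate blocked.

    Returns None when there are no accepted changes to measure against.
    """
    accepted = [c for c in history if c.merged]
    if not accepted:
        return None
    blocked = sum(1 for c in accepted if c.gate_flagged)
    return blocked / len(accepted)


# Invented history for illustration only.
history = [
    ReviewedChange("c1", gate_flagged=False, merged=True),
    ReviewedChange("c2", gate_flagged=True, merged=False),
    ReviewedChange("c3", gate_flagged=True, merged=True),  # a false block
    ReviewedChange("c4", gate_flagged=False, merged=True),
    ReviewedChange("c5", gate_flagged=False, merged=True),
]
rate = false_block_rate(history)
print(rate)  # 0.25
```

Without the `merged` column there is no denominator, which is why a repository lacking real accept-and-reject decisions cannot yield the number.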
Go deeper: Why we bet on determinism → · Where a smart skeptic pushes back →
Stress-test
Where a smart skeptic pushes back
The sharpest objection is two-in-one: won't a bigger model or a longer context window eventually solve fit, and isn't this a linter with extra steps? Three answers.
1. Once rounds to never. Scale teaches a model the conventions that recur across the whole world's code. A repository's own conventions appear roughly once in any training set, and once rounds to never in statistical training. Even with the full repo in context, the model still has to guess which repetitions are load-bearing invariants and which are accidents.
2. The incumbents can't sit where doloop sits. A linter is deterministic but configured: a human must already know and write down the local rules, which is the problem we remove. An LLM reviewer can infer but is non-deterministic: different verdicts on the same code across runs, so it cannot be a stable, replayable gate. doloop is the only seat that is both inferred and deterministic.
3. The transplant is the proof. It is as close to a formal proof as this gets: identical code, opposite-but-stable verdicts in two sibling repositories. In COBOL we triple-verified the dialect split (column-aware tokenizer, GnuCOBOL front-end, full ANTLR tree), which is what shows the gate has teeth and is not a text-matching trick.
Underneath it all: the bug is not a missing universal rule, it is the uneven application of a rule the house otherwise keeps. The 70% floor is what keeps us quiet until a convention is settled, which is how we avoid being the noise machine engineers learn to ignore.
Go deeper: Won't a better model do this? → · Just a linter? Just a wrapper? →
Won't a better model do this?
A better model does not make doloop obsolete, and a model checking its own work cannot replace it. Both for structural reasons.
Scale does not fix fit.
- Once rounds to never. Scale learns the conventions shared across the global corpus; it cannot learn the ones unique to your repository, because those appear about once in any training set, and once rounds to never.
- Universal vs particular. Models speak the universal register of a language fluently and stay pattern-deaf to the particular register of your house. Even with a big context window, a model cannot reliably tell a load-bearing invariant from an accidental repetition.
- Contextual rightness has plateaued. Syntax pass rates climbed past 95%, but being right-in-context and secure stayed flat across model sizes and releases. Scale solves syntax, not fit.
A model cannot be its own check.
- Self-review shares the blind spot. Reviewing its own output runs the same engine with the same continuation biases; LLM judges even score their own generations higher than human raters score them.
- Hallucination is a limit, not a bug. It is a formal property of these models, not something a better prompt or a bigger model removes.
- Non-determinism. If a vendor builds a reviewer from another model, it is still stochastic: the verdict moves across runs. doloop returns a byte-identical, replayable verdict, which is what auditability and regulation require.
The defensive logic runs the other way: a more capable model shortcuts more and produces more plausible slop, so an external, deterministic adversary that cannot be charmed or gamed gets more necessary as generation gets cheaper, not less.
Go deeper: Why we bet on determinism → · Same code, opposite verdicts →
Just a linter? Just a wrapper?
Two objections, answered by where doloop sits: outside the model, deterministic, inferred.
Isn't this just a wrapper on an LLM? No, because it does not generate. You bring your own model; doloop is the external check on that model's output. It sits outside the generator's loop, which matters because a model cannot reliably check its own work, and its core donkeys are byte-deterministic: same input, same verdict and citations, every run. A wrapper would still be stochastic. doloop is not.
Isn't this just a linter? A linter is configured; doloop is inferred. A linter needs a human to pick or write a rulebook. doloop reads your codebase and works out the conventions your team already keeps at or above the 70% floor, including the ones nobody wrote down. The transplant is the giveaway: identical code earns opposite verdicts in two sibling repositories, where a linter passes both because it runs one fixed rule everywhere. doloop gates on structural and behavioral conventions (how warnings are routed, how errors are handled, which operations come in pairs) that no linter holds because they were never codified.
Go deeper: Same code, opposite verdicts → · Why we bet on determinism →
Why we bet on determinism
Determinism is the moat because models are accurate but stochastic, and stochastic is not good enough for an engineering gate or a regulator. An AI reviewer gives an opinion that moves run to run; doloop gives a verdict that is byte-identical every time. Four pillars.
1. Stochastic review can't be a gate. A model judge might score the same code 77% one morning and 63% the next. You can never tell whether a fix passed because it was right or because the model was in a different mood. That cannot be a hard gate in a pipeline.
2. You can check the checker. Every verdict ties to an input_sha256: same input, same findings and citations, forever. That turns a check into a record a teammate, auditor, or regulator can re-run and reproduce exactly. Because the core rules are mechanical (regex, clustering, arithmetic), they cannot be charmed, bluffed, or talked into agreeing.
3. The regulatory fit. Under SR 26-2 (US Federal Reserve, OCC, FDIC, April 2026), deterministic rule-based processes sit outside the definition of a model. A deterministic gate is governable where a stochastic LLM-as-judge is not. We are honest about the boundary: this is non-binding guidance, not a safe harbor, and the model you bring is still in scope.
4. The transplant needs determinism. Showing the gate reads a house dialect means issuing opposite-but-stable verdicts on the same code in two environments. A linter cannot (one fixed rule everywhere); an LLM cannot (it cannot guarantee the stable re-runs that show the flip was intentional).
As models get larger they shortcut more, so what is missing is not a more accurate model but an external, deterministic adversary.
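A minimal sketch of the replay property from pillar 2, assuming a single hypothetical regex rule and a hypothetical record layout. Only the `input_sha256` field name comes from the text; the rule, the record shape, and the serialization are illustrative and are not doloop's actual format.

```python
import hashlib
import json
import re

# Hypothetical mechanical rule: flag bare print() calls, citing the line.
PRINT_CALL = re.compile(r"^\s*print\(")


def check(text):
    """Run the rule and tie the findings to a hash of the exact input."""
    findings = [
        {"rule": "print-vs-log", "line": n}
        for n, line in enumerate(text.splitlines(), start=1)
        if PRINT_CALL.search(line)
    ]
    return {
        "input_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "findings": findings,
    }


def serialize(record):
    """Canonical bytes: sorted keys and fixed separators."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


source = "import logging\nprint('starting')\nlogging.info('ok')\n"
first = serialize(check(source))
second = serialize(check(source))
print(first == second)  # True: same input, same bytes
```

Anyone holding the same input can recompute the hash, re-run the rule, and compare bytes, which is what makes the record something a teammate or auditor can re-run and reproduce exactly.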
Go deeper: Same code, opposite verdicts → · One number we can't give you yet →