Citations or It Didn't Happen

In a regulated bank, an AI answer without sources is wishful thinking. The single property that turns AI output from a chatbot guess into an auditable artifact is verifiable grounding - every claim, one click back to the line of code it came from.

Ohad Kotler · February 11, 2026 · 10 min

There is a question every senior engineer at a regulated bank asks the first time they sit down with an AI tool that claims to understand their legacy system. The wording varies. The substance does not.

"How do I know it's right?"

Most AI tools answer that question with confidence. "Accuracy: 94%." Or with a benchmark on a public dataset. Or with the names of customers who deployed it. Or, in the worst cases, with a smooth marketing line about "explainable AI."

None of these answers survive a regulated bank's compliance committee. None of them survive the engineer who has to put their name on the change packet. And none of them survive the moment - which always comes - when the AI confidently produces an answer that turns out to be wrong.

There is one answer that does survive that moment. It is the answer that converts AI output from a chatbot guess into something a bank can sign off on. It is also, today, the single property most legacy-AI tools do not have.

Every claim has to be anchored, one click away, to the exact lines of code it came from.

Citations. Verifiable, granular, automatic. Not as a feature. As the foundation.

"It's like references on Wikipedia"

The framing that has landed hardest with every banking buyer we have worked with - and the one we now lead with in technical conversations - is not architectural. It is a single sentence one of our engineers used with a customer demo audience.

"It's like references on Wikipedia."

You read a claim. You see a footnote. You click. You land on the source. You decide whether the source supports the claim. The verification economics flip from "I have to re-derive everything from scratch" to "I have to check the citations on the parts that matter."

This is what every banking buyer is actually asking for when they ask about accuracy. They are not asking for a number. They are asking for a mechanism by which they can build judgment - the same judgment they already use with Claude, with ChatGPT, with their team's senior engineers - about where to trust the output, where to verify it, and where it is plainly wrong.

The Wikipedia metaphor works in banking conversations for the same reason Wikipedia works on the open internet: it acknowledges that the model is fallible, names the mechanism by which fallibility is contained, and shifts the burden from "the source claims to be perfect" to "the source provides the evidence you need to verify yourself."

An engineering lead at a European banking ISV asked the same question more clinically, after watching a demo: "How do we confirm this? What is the exit criteria to say that we have what we want? Because we will be - this will be our base to make the conversion. If the base is not correct, we will not do a correct conversion."

The answer is not a number. The answer is the citation panel - and the discipline that produces it.

The threshold that gates adoption

A senior engineer at one of our earlier customers articulated the threshold with a precision that has stayed with us ever since. The conversation was in Hebrew. The translation reads:

"The question is whether I can rely on it to do complex things, and expect that it won't tangle me up on the logic side. Like, if I have to check it line by line afterwards, I haven't saved anything."

This is the actual decision criterion. Not "is it 100% accurate?" Banks have lived with humans for a hundred and fifty years, and humans are not 100% accurate. The criterion is "can I trust this enough that I don't have to re-verify everything?"

The criterion separates two categories of AI deployment, and they are not adjacent:

Shallow use. The engineer uses the AI for keyword lookup, navigation, generating boilerplate. Every output is mentally tagged as "probably true, but I'd better check." The AI saves time on the easy parts of the work. It does not change what the team is capable of.

Deep use. The engineer uses the AI for dependency analysis, business-rule extraction, impact assessment, change planning. The output is treated as a starting point that has been verified to a level the engineer can audit. The AI changes what the team is capable of - including teams that previously could not have made certain changes at all.

The first category does not require citations. The second is impossible without them.

The line between the two is where almost every AI-in-banking deployment lives or dies. Banks that buy AI for shallow use get shallow returns. Banks that buy AI for deep use without the citation infrastructure end up in an expensive shallow deployment that nobody trusts. The citation mechanism is the single property that makes deep use viable in a regulated environment.

The distinction is load-bearing. An AI with citations and 80% accuracy is more useful to a bank than an AI with no citations and 95% accuracy - because the bank can build intuition, audit specific outputs, and develop trust over time with the former. The latter is, structurally, a black box.

What the "tests will solve it" objection misses

When the citation framing lands with engineers, it often runs into resistance from procurement and governance. The objection is usually some variant of: "We don't need citations. We'll run tests."

The procurement-team intuition is that test coverage is the verification mechanism. AI generates the code. Tests verify the code. If the tests pass, the code is fine.

A senior modernization executive at a Tier-1 European bank closed this objection in a conversation with us, with a precision that we now use as a positioning anchor:

"It's the scientific principle of: you can never prove something works through experiment, you can only prove that something doesn't work. So it sort of doesn't matter how many tests you've run, you cannot prove to me that it works under every condition. And even if you used AI to generate the tests, theoretically you cannot prove to me through tests that it works."

This is Karl Popper's falsificationism applied to AI-assisted code change, and it closes the "more tests" objection at the philosophy level. Tests can falsify behavior. They cannot prove correctness. No quantity of additional QA changes that. When the same AI generates both the code and the tests, the tests inherit the AI's blind spots - the very class of error you most need the tests to catch is the class the tests are most likely to miss.

The right answer to "how do I know the AI's output is correct" is not "we ran tests." It is "every claim the AI made is back-anchored to the code it came from, and you can audit the trail."

Tests are downstream. Citations are upstream. Banks that conflate them buy the wrong audit.

Where citations fail - and why that's also a citation discipline

The citation framing has one critical failure mode, and naming it explicitly is part of the discipline.

A senior validator at a major bank tried to push the AI hard on a question - repeatedly insisting on a specific answer the AI couldn't ground. She wrote it up afterward: "I really tried to force it to answer the answer I wanted, and it didn't give it to me. It started to bullshit with confidence."

This is the failure mode every team deploying AI in banking has to design against: the agent capitulates under pressure. The model attends to social signals - "the user clearly wants X, I will find a way to provide X" - and generates plausible text without genuine grounding. The citation panel doesn't save the user here. The agent is just generating prose.

There are three responses to this, and only the third one works.

The wrong response is to claim it doesn't happen. It happens. The vendor who tells you their system doesn't is the vendor whose system you should trust the least.

The half-right response is to add a confidence score. Confidence scores are themselves model outputs, and inherit the same drift under coercion. A model that bullshits confidently will produce confident confidence scores.

The right response is to make the citation a hard structural requirement. The agent's claims have to be grounded in actual graph traversal, actual code reads, actual rule extractions - not in plausible text generation. If the agent cannot ground a claim, the right answer is "I don't know" - not "let me synthesize something that sounds plausible." This is a deeper architectural commitment than it sounds. It requires the agent to be operating against a deterministic substrate it can query, not a sea of embeddings it can pattern-match.
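A minimal sketch of what "hard structural requirement" can mean in code, under stated assumptions. The names `Citation`, `Claim`, `is_grounded`, and `answer` are hypothetical and are not taken from any real product. The sketch assumes the source files are already loaded as lists of lines, and it uses an exact text match as the simplest possible grounding check. A real system would build citations from a parsed model of the code rather than accept them from the model's prose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    path: str  # source file the claim points at
    line: int  # 1-indexed line number
    text: str  # the exact source line the claim relies on


@dataclass(frozen=True)
class Claim:
    statement: str
    citations: tuple[Citation, ...] = ()


def is_grounded(claim: Claim, sources: dict[str, list[str]]) -> bool:
    """A claim is grounded only if it has citations and every one matches the source."""
    if not claim.citations:
        return False
    for c in claim.citations:
        lines = sources.get(c.path)
        if lines is None or not (1 <= c.line <= len(lines)):
            return False
        # Assumption: exact match after trimming whitespace is the bar.
        if lines[c.line - 1].strip() != c.text.strip():
            return False
    return True


def answer(claims: list[Claim], sources: dict[str, list[str]]) -> str:
    """Emit only grounded claims; with none left, decline rather than invent."""
    grounded = [c for c in claims if is_grounded(c, sources)]
    if not grounded:
        return "I don't know."
    rendered = []
    for c in grounded:
        refs = ", ".join(f"{x.path}:{x.line}" for x in c.citations)
        rendered.append(f"{c.statement} [{refs}]")
    return "\n".join(rendered)
```

The design choice worth noticing is that the gate sits outside the model. A claim that arrives without a matching source line is dropped no matter how confident the prose around it sounds, and the fallback is "I don't know."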

When citation discipline is enforced architecturally, the failure mode doesn't disappear - but it becomes detectable. The senior validator pushes, the agent declines to invent, and the conversation ends honestly rather than with hallucinated detail. That is the threshold a regulated bank can sign off on. The previous threshold - confident answers with no way to verify - is the threshold that gets banks into regulatory trouble.

What this means for the regulator-facing artifact

The most consequential consumer of citations in a regulated bank is not the engineer. It is the regulator.

When DORA-aligned regulators ask a bank to demonstrate operational resilience on a legacy estate - and they are asking - the bank's defensible answer is a model of the system with full provenance. Every claim about what the system does, what depends on what, which business rules govern which paths, anchored back to source. Not a McKinsey deck. Not an architect's whiteboard diagram. The actual, queryable, auditable model - and the citations that prove every line of it.

The regulator's underlying ask, across every regulated jurisdiction, is the same: show your work. Citations are how a bank shows its work. Without them, the bank has assertions. With them, the bank has evidence.

When AI output without citations is used to make change decisions in a regulated environment, and the change later goes wrong, the bank's audit defense is: "the AI told us." This is not a defense. It is the precise scenario every regulator is now writing rules to prevent. The bank that depends on uncited AI output for material change decisions is building a known-bad audit posture into its operating model.

The competitive consequence

There is a tactical reason banks should ask about citations in every AI procurement, and a strategic one.

The tactical reason is the one in this post: without citations, AI output is unverifiable, untrustworthy, and unfit for deep use. Banks that buy uncited tools get shallow returns.

The strategic reason is sharper. Citation discipline is hard to retrofit. A platform that wasn't designed for citation provenance from the architecture up - embeddings-only retrieval, generative-only reasoning, no deterministic substrate - cannot bolt citations on later. It can produce citation-looking output - line-number references that are themselves model-generated and as fallible as the rest of the model. But that is not provenance. That is the appearance of provenance, which is worse.

When a bank evaluates AI tools for legacy systems, the citation question is not a feature checkbox. It is a structural test of whether the tool can survive a regulated deployment. Tools that pass it stay on the table. Tools that don't pass it should come off - not because they are not useful for something else, but because they are not useful for the use case the bank is buying for.

The right phrasing of the procurement question is not "do you have provenance?" It is "show me a claim your system made on our codebase, then show me how I get from the claim to the line of code that supports it, in one click - and prove to me that the line of code is the actual source, not a model-generated approximation of one."

Most AI tools cannot pass that test. The ones that can are the ones built around the principle that this post opens with.

Citations or it didn't happen.


Frequently Asked Questions

What's the difference between citations and the line-number references an LLM produces in its answer?

Everything. An LLM producing a line number in its answer is generating text that happens to look like a citation. The line number itself is a model output and is as fallible as the rest of the model - the LLM can hallucinate line numbers as confidently as it hallucinates anything else. A real citation comes from a deterministic graph traversal: the system knows the program is called by these three other programs because it parsed the CALL statements, not because an LLM read the code and inferred it. The citation panel in a real provenance system is built from the deterministic substrate, not from the model's prose.
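To make "because it parsed the CALL statements" concrete, here is a small sketch, not a real parser. It assumes fixed-format COBOL source, treats an asterisk or slash in column 7 as a comment line, and only recognizes static calls with a quoted program name. Dynamic calls through a data item are out of scope. The function name `build_caller_index` and the program names in the usage note are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: only static calls of the form CALL 'PROGNAME' or CALL "PROGNAME".
CALL_RE = re.compile(r"\bCALL\s+['\"]([A-Z0-9-]+)['\"]", re.IGNORECASE)


def build_caller_index(programs: dict[str, str]) -> dict[str, list[tuple[str, int, str]]]:
    """Map each callee to the (caller, line number, source line) entries that call it."""
    callers = defaultdict(list)
    for name, source in programs.items():
        for line_no, line in enumerate(source.splitlines(), start=1):
            # Skip fixed-format comment lines (indicator in column 7).
            if len(line) > 6 and line[6] in "*/":
                continue
            for match in CALL_RE.finditer(line):
                callee = match.group(1).upper()
                callers[callee].append((name, line_no, line.strip()))
    return dict(callers)
```

Given two hypothetical programs, `PAYROLL` and `BILLING`, that each contain `CALL 'INTCALC'`, the index entry for `INTCALC` lists both callers with the exact line each call sits on. Every entry in that list is a citation by construction: it exists only because a line of source matched, so there is nothing for a model to hallucinate.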

How do we test for real citation discipline during a vendor evaluation?

Pick an obscure business rule in your codebase that your team has manually verified. Ask the tool a question that should surface that rule. Then click through to the citation. If the line of code shown is the actual line that produced the rule, the tool is operating with real provenance. If the line shown is approximately right but doesn't actually contain the rule - or if the tool can't show a line at all - the tool is producing the appearance of provenance, not the substance.
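The click-through check above can be partly scripted. The sketch below is one hedged way to do it, assuming your team has recorded the verified rule text and that the tool's citation can be exported as a file path and line number. The function `grade_citation`, the three verdict strings, and the sample file name are all hypothetical.

```python
from typing import Optional


def normalize(text: str) -> str:
    """Collapse whitespace and case so layout differences do not cause false misses."""
    return " ".join(text.split()).upper()


def grade_citation(
    cited: Optional[tuple[str, int]],
    rule_text: str,
    sources: dict[str, list[str]],
) -> str:
    """Grade one tool citation against a rule the team has verified by hand."""
    if cited is None:
        return "no citation"
    path, line = cited
    lines = sources.get(path, [])
    if not (1 <= line <= len(lines)):
        return "appearance only"
    # The cited line must actually contain the rule, not merely sit near it.
    if normalize(rule_text) in normalize(lines[line - 1]):
        return "real provenance"
    return "appearance only"
```

For a hand-verified rule such as `IF DAYS-OVERDUE > 90` on line 2 of a hypothetical `LOANS.cbl`, a citation to line 2 grades as "real provenance", while a citation to the neighboring line 1 grades as "appearance only". A script like this does not replace clicking through yourself, but it lets you run the same check across dozens of verified rules.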

What about cases where the AI legitimately doesn't know?

That's the harder test, and the more important one. Push the AI on a question it can't ground. The right behavior is "I don't know" or "I have partial information, here is what I have." The wrong behavior is confident invention. Real citation discipline produces the first. The absence of citation discipline produces the second - and the second is the failure mode that gets banks into regulatory trouble.

Doesn't every legacy-AI tool offer some form of citation now?

Most offer something they call provenance. The question is whether it is deterministic provenance from a verified graph traversal, or model-generated provenance that looks like a citation but inherits the model's fallibility. The vendor evaluation question is architectural: where does your provenance come from? If the answer is "the LLM produces it," the provenance is as good as the LLM. If the answer is "a deterministic graph that the LLM queries but cannot override," the provenance is structural.

How does this fit with our existing AI governance framework?

Citations are the operational substrate that makes every other governance requirement satisfiable. Auditability requires evidence. Sign-off requires verification. Incident response requires traceability. All three reduce to: can you produce, on demand, the source-anchored chain of reasoning behind a specific AI claim? If yes, the governance framework has a foundation to operate on. If no, the framework is documentation theater. Most banks are closer to the second state than they'd like to be - and the gap is closing only with tools designed for citation discipline from the architecture up.


Tweezr is built around citation discipline as the architectural foundation, not the marketing surface. Every claim our system makes is back-anchored to the exact lines of code that produced it - and the provenance is built into the deterministic substrate, not bolted on. If you're evaluating AI tools for a legacy banking environment, see how the citation panel works or book a conversation about running a citation-stress test on your own code.
