How Proof Engine Works
Design Principles
Proof Engine is an AI agent skill — a set of instructions and bundled Python scripts that plug into LLM coding tools (Claude Desktop, Claude Cowork, Claude Code, Codex CLI, Cursor, Windsurf, Manus, ChatGPT, and others via the Agent Skills standard). When a user asks the LLM to verify a factual claim, the skill directs it to produce four artifacts: a re-runnable proof.py script, a structured proof.md proof report, a proof_audit.md with full verification details, and a proof_narrative.md reader-facing narrative summary.
This document describes the design ideas behind it — what problems it solves, what makes the approach unusual, and where it falls short.
The Core Idea
LLMs hallucinate facts and make reasoning errors. Instead of making the LLM more accurate, we make it prove its work in a form that doesn't require trusting the LLM at all.
Every proof is a Python script that imports the engine's bundled verification modules. Anyone can re-run it — if the math doesn't hold, the script errors out; if citations can't be fetched or quotes don't match, the proof typically degrades to an explicit "with unverified citations" verdict rather than silently passing. (For table-sourced data, if the prose quote fails but verify_data_values() confirms the numbers on the page and cross-checks hold, the proof can still reach full PROVED — the quote failure is a page-structure issue, not an accuracy issue.) The LLM's role is authoring the proof, not asserting the conclusion.
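The trust model can be sketched in a hypothetical miniature. This is not the engine's real API: a stub verifier stands in for the bundled modules so the snippet is self-contained, but the shape is the point, since a failed check crashes the script instead of silently passing.

```python
CLAIM = "2**31 - 1 is prime"   # a purely computed fact: no citation needed

def verify_fact(value, check):
    # Stub standing in for the engine's bundled verifiers: if the check
    # fails, the proof script errors out rather than asserting a conclusion.
    if not check(value):
        raise AssertionError(f"fact failed: {value!r}")

n = 2**31 - 1
# Trial division up to sqrt(n); 2**31 - 1 is a Mersenne prime, so this passes.
verify_fact(n, lambda x: all(x % d for d in range(2, int(x**0.5) + 1)))
VERDICT = "PROVED"   # reached only if every check above survived
```

Re-running the script re-runs the checks; the verdict is a consequence of the code, not a claim by the author.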
Three Types Of Facts, Three Verification Strategies
The system recognizes exactly three kinds of facts:
- Type A (computed): The computation is the verification. `sympy.isprime(n)` doesn't need a citation. The code is re-runnable and deterministic.
- Type B (empirical): Every empirical fact needs a source, a URL, and an exact quote. The proof script fetches the URL at runtime and confirms the quote appears on the page. For table-sourced data (where the interesting values are numbers in a table, not prose), `verify_data_values()` confirms each numeric value string appears on the source page — a different check than quote matching, but the same principle: the proof doesn't trust the LLM's transcription.
- Type S (search): For absence-of-evidence proofs, each database search is documented with a clickable `search_url`. The tool confirms the URL is accessible but cannot verify the result count — that's author-reported and reproducible by a human reviewer. This weaker trust boundary is reflected in the `SUPPORTED` verdict (never `PROVED`).
These are fact types, not claim types. A claim can be purely mathematical, purely empirical, or mixed — combining computation with cited evidence. "Has the US dollar lost more than 90% of its purchasing power since 1913?" is mixed: the CPI values are Type B (cited from BLS data), but the percentage-decline calculation is Type A (computed). The constraint is at the fact level: if an individual fact can't be computed or cited, it doesn't go in the proof.
Separation Of Concerns
The proof has four output files because it serves four audiences. proof.py is for anyone who wants to re-run the verification. proof.md is the structured proof report with verdict and key numbers. proof_audit.md is for a reviewer who wants the citation-by-citation evidence trail and hardening-rule checklist. proof_narrative.md is a plain-language narrative summary for general readers. Combining them into one artifact would force every reader through material meant for someone else.
The site build pipeline adds three machine-readable formats on top of these four: a Jupyter Notebook for interactive re-verification, a W3C PROV-JSON provenance trace for automated pipelines, and an RO-Crate 1.1 research object package for archival. These are generated, not authored — the source of truth remains the four committed files. The separation means the core proof artifacts stay portable (they work without the site), while the machine-readable formats add interoperability for systems that consume structured metadata.
A similar separation applies to the skill instructions: a short main file with gotchas and a reference index, and detailed rules/templates/checklists in separate files loaded on-demand at specific workflow steps.
Structured Verdicts, Not Confidence Scores
The output is one of eight verdicts: PROVED, DISPROVED, SUPPORTED, PARTIALLY VERIFIED, UNDETERMINED, and three "with unverified citations" variants. Not a probability. A "73% confidence" hides why 73% — the verdict system forces transparency by making each fact's status visible. The "with unverified citations" variants distinguish "the evidence contradicts the claim" from "the evidence couldn't be reached."
The 7 Hardening Rules
These aren't coding guidelines. Each one closes a specific, observed failure mode where proof code looks correct but is silently wrong.
Rule 1: Never hand-type extracted values. LLMs read a quote saying "May 14, 1948" and write date(1948, 5, 15). Nothing connects the quote string to the date constructor, so the error is invisible. The fix: parse values from the quote text programmatically. If the parse fails, the proof fails — which is the correct behavior.
Rule 2: Verify citations by fetching. LLMs fabricate plausible-sounding citations. They'll generate a .gov URL, a credible institution name, and a quote that sounds right. The only defense is fetching the URL and confirming the quote appears on the page. This is harder than it sounds — real web pages use en-dashes where you expect hyphens, curly quotes where you expect straight ones, and HTML tags inside the text. The verification code handles this.
Rule 3: Anchor to system time. If a proof needs today's date (e.g., "Israel is over 70 years old"), use date.today(). LLMs sometimes get the current date wrong, and a hard-coded date makes the proof non-reproducible after that date passes.
Rule 4: Explicit claim interpretation. "More than 90%" — is that strictly greater, or greater-than-or-equal? "Since 1913" — the beginning of 1913, or the end? These ambiguities are common in natural-language claims. The proof must state its interpretation in a CLAIM_FORMAL dict before computing anything, so reviewers can disagree with the interpretation even if the math is correct.
Rule 5: Independent adversarial check. Confirmation bias is structural, not psychological. If you only search for supporting evidence, you'll find it. The proof must document what counter-evidence was searched for and what was found. This happens during research, not during proof execution — the adversarial section is documentation of work done, not a runtime search.
Rule 6: Cross-checks must be truly independent. Two sources parsed from the same variable aren't independent. Two computations that share intermediate values aren't independent. The rule requires that cross-checks come from separate sources parsed separately, so that a single error can't contaminate both sides of a comparison. The validator can only heuristically check this (it counts distinct keys in empirical_facts), so the rule is primarily enforced by proof structure and review, not automation.
Rule 7: Never hard-code constants. An LLM might write 365.25 for the length of a Gregorian year. The correct value is 365.2425. The difference is small but matters for long time spans. Constants come from the bundled computations.py with sourced values.
A static analyzer (validate_proof.py) runs before execution to catch common structural problems — missing CLAIM_FORMAL, hardcoded verdicts, probable hand-typed values. It's heuristic, not exhaustive: a fast first pass, not a proof of rule compliance.
Prose Reference Verification
Every external scholarly reference in a proof — whether in meta.yaml depends_on, in v3 evidence[*].source.url, or in free prose inside any .md file — is resolved to canonical metadata from an authoritative registry (arXiv, DataCite, Crossref, Software Heritage, Handle.Net, OpenLibrary) and cached per-proof in depends_on_resolved.json.
Prose attributions are cross-checked against the resolved metadata by four passes:
- Pass 1 finds identifiers in literal prose and inside Markdown link targets.
- Pass 2 cross-checks every prose attribution within a 160-char window of each identifier (and every link-display-text for short-form link citations) against the resolved author list and title.
- Pass 3 advises (warns) when a declared identifier is never mentioned in prose.
- Pass 4 sweeps the whole file for dangling `Author, "Title"` and `Author (YYYY)` shapes that are not covered by any verification window, closing the launder-attack class where a correct linked citation elsewhere in the file would otherwise launder a hand-typed misattribution.
Authors use {{cite:<type>:<value>[:<style>]}} tokens in draft prose and run proof-site.py cite-expand to materialize them into canonical Markdown citations — committed to git, served by the HTML build, and archived verbatim to Zenodo on mint-doi. Both publish and mint-doi run a strictly-offline pre-flight gate (cite-expand --check + verify-prose) so a wrong attribution or unexpanded token cannot reach the archived artifact.
What A Proof Looks Like In Practice
The docs/examples/ directory has complete proofs. Here's an annotated walkthrough of one — the claim "The purchasing power of the US dollar has declined by more than 90% since the Federal Reserve was established in 1913" — showing what happens at each stage and why.
Step 1: Formalize the claim. The proof states exactly what it's testing in a CLAIM_FORMAL dict: "more than 90%" means strictly greater than 90.0% (operator >), purchasing power is operationalized via CPI-U, and "established in 1913" refers to the Federal Reserve Act signed December 23, 1913 with the CPI baseline using the 1913 annual average. The interpretation is documented with an operator_note explaining each choice. A reviewer can disagree with the interpretation even if the math is correct — that's the point.
Step 2: Fetch and verify sources. The proof cites two independent CPI sources (BLS data via rateinflation.com and inflationdata.com), each with a URL, a prose quote, and a data_values dict containing the actual CPI numbers. At runtime, the script fetches each URL and searches for the prose quote on the live page after normalizing Unicode and stripping HTML tags. A full match yields verified; a partial match (≥80% fragment) still passes but is flagged; no match yields not_found. If the LLM fabricated a quote, the check catches it — and if the match is only partial, the verdict downgrades to "with unverified citations" so the gap is visible. The two sources are independent (different websites republishing the same upstream BLS data) so a transcription error on one site doesn't contaminate the other.
Step 3: Verify data values, then extract. The CPI numbers live in data_values dicts alongside each source — string literals like "9.883" and "313.689" that the LLM transcribed from the source page. This is a checked transcription: verify_data_values() fetches the source page and confirms each value string actually appears in the page text. If the LLM transcribed a number wrong, it won't be found on the page and the check fails. Once verified, the numbers are parsed from those data_values strings into floats using parse_number_from_quote(). The trust model is: the LLM writes the values, but the verifier confirms they appear on the source page before the proof computes with them.
Step 4: Compute the answer. Decline = (1 − CPI_1913 / CPI_2024) × 100. Using Source A: (1 − 9.883 / 313.689) × 100 = 96.85%. This is Python arithmetic via explain_calc(), which uses AST introspection to print the symbolic expression, substituted values, and result in one line — so the computation describes itself. The engine uses compare(decline, ">", 90.0) from the bundled computations.py instead of letting the LLM write the comparison or use eval().
Step 5: Cross-check. The same computation runs independently on Source B (CPI 1913 = 9.9, CPI 2024 = 313.689). The two decline percentages must agree within tolerance. They do — the difference is ~0.005%, attributable to Source B rounding the 1913 CPI to one decimal place. This cross-check uses independently parsed values from separate sources, so a single error can't pass both sides.
Step 6: Adversarial check. Before concluding, the proof documents searches for counter-evidence: Does hedonic quality adjustment mean CPI overstates inflation enough to drop below 90%? Was the Fed established in 1913 (Act signed) or 1914 (Reserve Banks opened) — does the date choice matter? Could a different price index (PCE, GDP deflator) yield a decline below 90%? Each search is documented with what was found and whether it breaks the proof. None do — the margin is too large.
Step 7: Verdict. PROVED — the decline is 96.85%, which is 6.85 percentage points above the 90% threshold. Both sources agree. The full audit trail shows every citation fetch, every data value verification, every extraction, every cross-check.
The entire chain is in proof.py. Run it yourself: python proof.py. Every number traces back to a live source through code you can inspect and re-execute.
Source Independence And Conflicts Of Interest
Rule 6 requires independent cross-checks, but "independent" isn't binary. A news article and an advocacy report might both cite the same government intelligence dossier — technically different outlets, but not independent primary sources. And a source with a direct stake in the claim's outcome may confirm it for reasons other than truth.
The system addresses this with two mechanisms:
COI taxonomy. Each source in a proof's cross-checks can carry conflict-of-interest flags from six categories: financial, institutional, ideological, geographic, personal, and litigation. Each flag records the source key, COI category, direction (favorable or unfavorable to the claim's subject), and severity (direct or indirect). These are documented in the audit trail so reviewers can see which sources have potential biases.
Mechanical verdict override. If more than half of a sub-claim's confirmed sources have COI flags pointing in the same direction, the verdict is forced to UNDETERMINED regardless of the count. This prevents a proof from reaching PROVED (or DISPROVED) when the confirming evidence is dominated by interested parties. The threshold is deliberately aggressive — a single clean source among several biased ones is enough to prevent the override.
One exception: provenance sub-claims (SC1 in the contested qualifier pattern) bypass the COI gate entirely. A biased source can still reliably confirm that an allegation was made — COI doesn't undermine provenance, only epistemic claims about truth or verification.
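The override reduces to a majority test per COI direction. A sketch under assumed data shapes (the engine's actual flag records and key names may differ):

```python
def coi_override(sources, provenance_subclaim=False):
    """Force UNDETERMINED when >50% of confirmed sources share a COI direction."""
    if provenance_subclaim:              # SC1 bypasses the COI gate entirely
        return None
    confirmed = [s for s in sources if s["confirmed"]]
    for direction in ("favorable", "unfavorable"):
        flagged = [s for s in confirmed
                   if any(f["direction"] == direction for f in s.get("coi", []))]
        if len(flagged) * 2 > len(confirmed):
            return "UNDETERMINED"
    return None

sources = [
    {"confirmed": True, "coi": [{"category": "financial", "direction": "favorable"}]},
    {"confirmed": True, "coi": [{"category": "institutional", "direction": "favorable"}]},
    {"confirmed": True, "coi": []},
]
override = coi_override(sources)   # 2 of 3 share a direction: override fires
```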
Citation Verification Is Messier Than You'd Think
Verifying that a quote appears on a web page sounds simple. In practice, the verification code handles Unicode mismatches (en-dashes, curly quotes, non-standard degree symbols), inline HTML tags, pages that return 403 or render via JavaScript, and quotes that partially match due to page updates.
```mermaid
flowchart TD
    A[Fetch URL live] -->|success| B{Match quote on page}
    A -->|fail| C{Snapshot provided?}
    B -->|full match| V1[✓ verified · full_quote]
    B -->|≥80% fragment| V2[✓ verified · fragment]
    B -->|<80% fragment| V3[~ partial]
    B -->|no match| V4[✗ not_found]
    C -->|yes| D{Match quote against snapshot}
    C -->|no| E{Wayback opted in?}
    D -->|match| V5[✓ verified · snapshot]
    D -->|no match| V4
    E -->|yes| F{Match quote against archive}
    E -->|no| V6[? fetch_failed]
    F -->|match| V7[✓ verified · wayback]
    F -->|no match| V4
    style V1 fill:#d4edda,stroke:#28a745
    style V2 fill:#d4edda,stroke:#28a745
    style V5 fill:#d4edda,stroke:#28a745
    style V7 fill:#d4edda,stroke:#28a745
    style V3 fill:#fff3cd,stroke:#ffc107
    style V4 fill:#f8d7da,stroke:#dc3545
    style V6 fill:#f8d7da,stroke:#dc3545
```
For table-sourced data, verify_data_values() runs separately — it confirms each numeric value string appears on the page. If quote verification fails but data values are confirmed and cross-checked, the verdict can still be full PROVED.
Each verification mode (live, snapshot, wayback) is tracked in the audit trail.
An important caveat: citation verification confirms quote presence, not semantic entailment. A quote can appear on a page and still not mean what the proof claims it means — context might qualify it, or the quote might be cherry-picked. Rule 5 (adversarial checks) partially mitigates this, but the system cannot mechanically verify that a quote supports the conclusion drawn from it. That judgment stays with the human reviewer.
Asymmetry Between Proof And Disproof
Disproof is almost always easier. To prove "X is true" requires covering all relevant evidence and showing none contradicts it. To disprove "X is true" requires a single verified counterexample.
The system leans into this. For crisp factual claims, a single credible source that contradicts the claim, with a verified quote, is sufficient for DISPROVED. For consensus-style claims ("scientists agree that..."), the system requires multiple independent sources — the default threshold is 3 — because a single source isn't consensus. The threshold is documented in CLAIM_FORMAL so reviewers can see and dispute it.
This also means compound claims (X AND Y) often end up PARTIALLY VERIFIED — one sub-claim holds, another doesn't. The system decomposes compound claims and evaluates each part independently.
Contested qualifiers
A special case arises when a claim bundles a factual assertion with an epistemic qualifier: "X was verified," "Y was confirmed," "Z was proven." These need two different kinds of evidence — did someone make the assertion (provenance), and has anyone independently confirmed it (epistemic warrant)?
The compound template handles this via SC1/SC2 decomposition. SC1 checks provenance: did an identifiable source make the underlying claim? SC2 checks the qualifier: has any independent body confirmed it? Both must hold for PROVED. If SC1 holds but SC2 fails, the verdict is DISPROVED — the assertion exists, but the qualifier is false.
This matters because a naive decomposition would produce PARTIALLY VERIFIED (one sub-claim holds, one doesn't). But for contested qualifiers, "the assertion was made but not verified" is a clean disproof of the qualified claim, not a partial verification. The system auto-detects contested qualifier claims from the operator_note and routes them to the DISPROVED branch.
An expected consequence: SC2 often has zero empirical facts. When no independent body has confirmed a qualifier, there are simply no confirming sources to cite. Sources that reject the qualifier (an independent review finding "claims not substantiated") go in adversarial checks as counter-evidence, not in SC2's fact list — they support the disproof, but they aren't confirming sources for the qualifier.
Reproducibility
Every proof is designed to be re-runnable. Computation proofs are pure Python with no external dependencies beyond standard libraries and sympy. Empirical proofs default to live fetching (with optional Wayback Machine fallback), and can also embed snapshots — pre-fetched page text included in the proof script — for full offline reproducibility. The checked-in examples use live fetch, not snapshots.
The four output files (proof.py, proof.md, proof_audit.md, proof_narrative.md) form a complete record of the proof. When snapshots are embedded, the proof can verify against them without network access. When they aren't, the proof depends on the source URLs remaining available (or archived). The fallback chain — live → snapshot → Wayback — is tracked in the audit trail so you can see how each citation was resolved.
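The fallback chain can be sketched as follows. Function and mode names mirror the flowchart earlier in this document, but the signature is illustrative, not the bundled module's API:

```python
def resolve_citation(fetch_live, snapshot_text, fetch_wayback, quote):
    # Try each trust level in order and report which one resolved.
    page = fetch_live()
    if page is not None:
        return ("live", quote in page)
    if snapshot_text is not None:
        return ("snapshot", quote in snapshot_text)
    if fetch_wayback is not None:        # only when Wayback was opted in
        archived = fetch_wayback()
        if archived is not None:
            return ("wayback", quote in archived)
    return ("fetch_failed", False)

# Live fetch fails; an embedded snapshot carries the quote:
mode, matched = resolve_citation(lambda: None, "CPI was 9.883 in 1913", None, "9.883")
```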
For machine consumption, each published proof also includes three additional formats generated at build time: a Jupyter Notebook (proof.ipynb) for interactive re-verification in any notebook environment, a W3C PROV-JSON document (provenance.json) encoding the full provenance chain in the W3C Provenance standard, and an RO-Crate 1.1 metadata file (ro-crate-metadata.json) packaging all proof artifacts as a self-describing research object for archival and interoperability. These are derived from the four core files — they don't add new information, but they make the same information accessible to notebook environments, provenance-aware pipelines, and research data management systems.
Interactive re-verification
Every published proof page links to a one-click Binder launcher. The launcher repo yaniv-golan/proof-engine-binder pins the Python runtime, installs the dependencies proof.py scripts use, and clones the main proof-engine repo at the matching minor-release tag. A Jupyter Server extension in the launcher image intercepts every incoming request and captures one of two URL shapes that the proof page emits, writing the parsed identifier to a sentinel file under /tmp/. The launcher notebook reads the sentinel on cell execution, fetches proof.py, and executes it with PROOF_ENGINE_ROOT set to the cloned skill path.
The two URL shapes correspond to two trust anchors:
- Minted proofs (`?doi=<Zenodo DOI>`) — the DOI resolves via Zenodo's REST API to immutable bytes deposited at mint time. A proof verified this way resolves to the same `proof.py` a year from now.
- Unminted proofs (`?slug=<slug>&ref=<40-hex-sha>`) — the launcher fetches `https://raw.githubusercontent.com/yaniv-golan/proof-engine/<sha>/site/proofs/<slug>/proof.py`. The trust anchor is the commit SHA in the URL: the executed bytes are the same bytes the "View proof source" section on the page rendered at that commit. The site build embeds the deploying commit SHA into every unminted proof's Binder URL, so the page text and the executable code can never diverge.
The capture mechanism is necessary because Binder's redirect chain preserves the query string on the first request that reaches the user's Jupyter server, but JupyterLab's SPA router strips it from window.location.search before any client-side JavaScript runs.
Five properties of this design worth noting:
1. Immutable launcher reference. binder_url values in doi.json and rendered HTML both point at a specific launcher git tag (e.g. v1.22.0) — never a moving branch. The launcher tag is derived from the main repo's VERSION (vMAJOR.MINOR.0), so bump-version.sh propagates it automatically.
2. Two trust anchors, one mechanism. DOI mode is anchored on Zenodo; slug mode is anchored on a git commit SHA. Both are immutable. The launcher cell handling these is a single branch on MODE; nothing else differs between the paths.
3. Zero Zenodo mutations per launcher change. The launcher URL lives in this repo (in doi.json files for minted proofs, computed at build time for unminted), not in Zenodo metadata. Rotating a compromised tag or patching an infrastructure break is a repo commit, not a Zenodo republication.
4. One Binder cold build per launcher release. Each immutable launcher tag has its own Binder image cache; image reuse is keyed by git ref.
5. Forward-portable proof.py. Generated proofs read PROOF_ENGINE_ROOT from an env var with a hardcoded fallback. Local runs use the fallback; the launcher sets the env var. One file shape serves both environments.
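That lookup is a one-liner. A sketch with an illustrative fallback path (the real hardcoded fallback is whatever the generator writes):

```python
import os

def engine_root(env=None):
    # The launcher sets PROOF_ENGINE_ROOT; local runs fall through to the
    # hardcoded fallback. The fallback path here is illustrative.
    env = os.environ if env is None else env
    return env.get("PROOF_ENGINE_ROOT", "./proof-engine")

assert engine_root({"PROOF_ENGINE_ROOT": "/srv/skill"}) == "/srv/skill"
assert engine_root({}) == "./proof-engine"
```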
What It Can't Do
The system works well for claims that decompose into a finite set of extractable facts and a clear rule for what counts as proof or disproof. It struggles with:
- Causal inference: "The Roman Empire fell because of lead poisoning" involves competing interpretations of messy evidence. The engine can verify individual facts but can't weigh competing causal theories.
- Broad literature synthesis: "Coffee reduces the risk of type 2 diabetes" requires synthesizing dozens of studies with conflicting findings, varying methodologies, and different effect sizes. This is closer to a systematic review than a proof.
- Competing definitions: "Is a hot dog a sandwich?" depends on your definition of sandwich. The engine can verify facts about hot dogs and sandwiches, but the conclusion depends on a definitional choice, not evidence.
- Future predictions: "AI will surpass human intelligence by 2030" has no verifiable evidence. The engine will decline or return UNDETERMINED.
- Deep original mathematics: Beyond what sympy can verify. The engine isn't a theorem prover — it can check that a number is prime or that an equation holds, but it can't prove novel conjectures.
The engine is explicit about these limits. It will decline claims that are opinions, suggest reformulations for ambiguous claims, and return UNDETERMINED rather than guess.
Hardening Rules Reference
These rules close specific failure modes where LLM-generated proof code looks correct but is silently wrong. Each rule creates a verifiable link between the proof's internal representation and an external ground truth — so that when the LLM hallucinates, the error breaks visibly rather than hiding.
For proof templates, see proof-templates.md.
Rule Applicability By Proof Type
The validator checks every rule for every proof. Some rules auto-pass when the proof doesn't contain the patterns they target. This table shows typical validator behavior — the "Auto-pass when" column describes the heuristic.
| Rule | Date/Age | Numeric/Table | Qualitative | Absence | Pure Math | Auto-pass when |
|---|---|---|---|---|---|---|
| 1 | Checked | Checked | Auto-pass | Auto-pass | Auto-pass | No value-extraction patterns found |
| 2 | Checked | Checked | Checked | Checked* | Auto-pass | No empirical_facts, search_registry, or URLs |
| 3 | Checked | Checked | Auto-pass | Auto-pass | Auto-pass | No time-dependent keywords (today, age, etc.) |
| 4 | Checked | Checked | Checked | Checked | Checked | Never auto-passes |
| 5 | Checked | Checked | Checked | Checked | Checked | Never auto-passes |
| 6 | Checked | Checked | Checked | Checked* | Auto-pass | No empirical_facts or search_registry keys |
| 7 | Checked | Checked | Auto-pass | Auto-pass | Checked | No 365.2*/eval()/inline-age patterns |
| 8 | Auto-pass | Auto-pass | Checked (disproof) | Auto-pass | Auto-pass | No proof_direction: "disprove" found, or no empirical_facts quotes found |
*For absence proofs, Rule 2 checks the `verify_search_registry` import (plus `verify_all_citations` if corroborating `empirical_facts` are present); Rule 6 counts unique databases in `search_registry` by URL domain.
Note on Pure Math Rule 6: The validator auto-passes when no empirical sources are present. It does not inspect whether cross-checks use mathematically independent methods — that's a proof-writing discipline enforced by the template and self-critique checklist, not by static analysis.
Rule 1: Never Hand-Type Extracted Values
Failure mode: An LLM reads a quote like "On May 14, 1948, David Ben-Gurion proclaimed..." and then, in a separate line of code, types date(1948, 5, 15). The quote says the 14th; the code says the 15th. Nothing connects them — the quote sits in a string, the date sits in a constructor, and the proof runs without complaint. This happens because LLMs frequently make small transcription errors with numbers, dates, and quantities.
Bad — value disconnected from quote:
```python
fact = {
    "quote": "population reached 13,988,129 in 2023",
    "value": 13988129,  # LLM typed this separately — could be wrong
}
```
Good — value derived from quote text:
```python
from scripts.extract_values import parse_number_from_quote

fact = {"quote": "population reached 13,988,129 in 2023"}
fact["value"] = int(parse_number_from_quote(fact["quote"], r'reached ([\d,]+)', "fact_1"))
```
Bad — date disconnected from quote:
```python
empirical_facts = {
    "source_a": {
        "quote": "On May 14, 1948, David Ben-Gurion proclaimed...",
        "extracted_date": date(1948, 5, 14),  # hand-typed — could easily be wrong
    }
}
```
Good — date parsed from quote:
```python
from scripts.extract_values import parse_date_from_quote

founding_date = parse_date_from_quote(empirical_facts["source_a"]["quote"], "source_a")
```
When simple parsing fails — two-phase extraction:
Real-world quotes contain Unicode quirks (en-dashes, curly quotes, degree symbols) that break simple regex. When parse_number_from_quote() fails, write a custom extraction function using smart_extract.py utilities:
```python
import re
from scripts.smart_extract import normalize_unicode, verify_extraction

def extract_ghg_warming_low(quote):
    '''Extract GHG low-end warming from NOAA page.
    The page uses en-dashes (–) and ° symbols — normalize before extracting.'''
    normalized = normalize_unicode(quote)
    match = re.search(r'warming of ([\d.]+)', normalized)
    value = float(match.group(1))
    verify_extraction(value, quote, "ghg_warming_low")
    return value

ghg_low = extract_ghg_warming_low(empirical_facts["source_b"]["quote"])
```
This function is IN the proof script — auditable, re-runnable, no LLM needed at re-run time. Use diagnose_mismatch() to identify the specific character differences before writing the extractor.
How validate_proof.py catches it: Looks for date() literals and "value": N patterns near fact definitions. Flags them as potential hand-typed values.
Rule 2: Verify Citations By Fetching
Failure mode: LLMs hallucinate citations. They generate plausible-sounding quotes, attribute them to real institutions, and provide URLs that look right. The human sees "Source: U.S. National Archives" with a .gov URL and trusts it. But the quote might be fabricated, or the URL might not contain that text.
Bad — citation trusted without verification:
```python
empirical_facts = {
    "source_a": {
        "quote": "On May 14, 1948, David Ben-Gurion proclaimed...",
        "url": "https://history.state.gov/milestones/1945-1952/creation-israel",
        "source_name": "U.S. Department of State",
    }
}
# Just use the quote as-is, never check if it's real
```
Good — citation verified by fetching:
```python
from scripts.verify_citations import verify_all_citations

citation_results = verify_all_citations(empirical_facts)
unverified = [k for k, v in citation_results.items() if v["status"] != "verified"]
```
Critical normalization details: Real web pages contain several categories of mismatch that break naive string matching:
- HTML tags: Government websites use inline markup (e.g., `<span class="tei-persname">Ben-Gurion</span>`). The script strips tags without injecting spaces, removes spaces before punctuation, collapses whitespace, and lowercases — in that order.
- Unicode mismatches: NOAA, NASA, and IPCC pages use en-dashes (– U+2013) where the LLM transcribes hyphens (-), curly quotes (' U+2019) where the LLM uses straight quotes ('), ring-above (˚ U+02DA) where degree signs (° U+00B0) are expected, and non-breaking spaces (U+00A0) where normal spaces are expected. The script applies `normalize_unicode()` from `smart_extract.py` before all other normalization.
- Invisible Unicode characters: Pages embed zero-width spaces (U+200B), zero-width non-joiners/joiners (U+200C/U+200D), word joiners (U+2060), BiDi marks (U+200E/U+200F), soft hyphens (U+00AD), and variation selectors (U+FE00–U+FE0F). These are invisible but break string matching. The script strips all of them during normalization.
- Superscripts and subscripts: `<sup>` and `<sub>` tags are handled context-dependently. In running prose (footnote markers like "study¹"), they are stripped. In mathematical/scientific contexts (exponents like "10²" or "m²"), they are preserved as numeric characters.
- MathML markup: Scientific pages embed `<math>` tags with LaTeX in the `alttext` attribute. The script extracts `alttext` and converts LaTeX notation (fractions, Greek letters, operators) to readable text via `latex_text.py`.
The script uses two-pass matching: first exact match on the fully cleaned text, then substring search as fallback.
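A minimal sketch of that pipeline, assuming a reduced set of character mappings (the bundled script handles many more categories, per the list above):

```python
import re

# Minimal mapping tables; illustrative, not the full production set.
TRANSLATE = {0x2013: "-", 0x2014: "-", 0x2019: "'", 0x201C: '"',
             0x201D: '"', 0x00A0: " "}
STRIP = dict.fromkeys([0x200B, 0x200C, 0x200D, 0x2060,
                       0x200E, 0x200F, 0x00AD])  # ordinal -> None deletes

def normalize(text):
    text = re.sub(r"<[^>]+>", "", text)            # strip inline HTML tags
    text = text.translate(TRANSLATE).translate(STRIP)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return re.sub(r"\s+([.,;:])", r"\1", text)     # no space before punctuation

page = 'covers the <b>1945\u20131952</b> era , Ben\u2019Gurion proclaimed'
quote = "the 1945-1952 era, Ben'Gurion"
# Pass 1 is exact match on cleaned text; pass 2 falls back to substring search:
matched = normalize(quote) in normalize(page)
```

The en-dash, curly apostrophe, inline tag, and stray space before the comma all disappear under normalization, so the straight-quoted transcription matches.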
When a quote still fails after automatic normalization, use diagnose_mismatch() to identify the specific character differences, then write a custom extraction function using the two-phase extraction pattern (see SKILL.md).
Verbatim quoting requirement: The quote field must be a character-for-character substring of the source page text. A subtler variant of the hallucination problem: the LLM recalls the gist of what a source says and writes a plausible-sounding quote that captures the meaning but uses completely different wording. The URL is real, the source is real, but the quote is fabricated. This passes human review but fails machine verification because the exact words don't appear on the page.
Common violations:
- Paraphrasing: "Sixteen participants" when the page says "16 study participants"
- Splicing: Stitching the start of one sentence to the end of another, skipping the middle
- Synthesizing: Combining facts from different paragraphs into one "quote"
- Cleaning up: Removing parenthetical insertions like "(Gold)" that appear in the original
Bad — quote written from memory (paraphrased):
empirical_facts = {
    "source_a": {
        # LLM recalled the gist but used different words
        "quote": "Addition of a single irrelevant clause provokes catastrophic accuracy drops",
        "url": "https://arxiv.org/html/2410.05229v1",
        "source_name": "Mirzadeh et al., GSM-Symbolic (ICLR 2025)",
    }
}
# verify_all_citations() returns not_found — the page uses different wording
Good — quote copied from fetched page:
import requests
resp = requests.get("https://arxiv.org/html/2410.05229v1")
# Search resp.text for the relevant passage, then copy verbatim:
empirical_facts = {
    "source_a": {
        "quote": (
            "We add seemingly relevant but ultimately inconsequential statements "
            "to GSM-Symbolic templates. Since these statements carry no operational "
            "significance, we refer to them as No-Op"
        ),
        "url": "https://arxiv.org/html/2410.05229v1",
        "source_name": "Mirzadeh et al., GSM-Symbolic (ICLR 2025)",
    }
}
If the exact text is too long or unwieldy, use a shorter exact substring that still captures the key finding. A 20-word verbatim quote that verifies is better than a 50-word paraphrase that doesn't.
Pre-verification workflow: After writing empirical_facts, run verify_all_citations() immediately. If any citation returns partial or not_found, use closest_passage from the result to locate the actual wording on the page, then copy the visible rendered text. Fix the quote before proceeding.
Verdict impact: If any citation can't be verified, the proof verdict downgrades to "PROVED (with unverified citations)".
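A minimal sketch of that downgrade logic (illustrative only; the real verdict assembly lives in the bundled proof template and also handles UNDETERMINED and SUPPORTED verdicts):

```python
def assemble_verdict(claim_holds: bool, citation_status: dict) -> str:
    """citation_status maps source keys to "verified" / "partial" / "not_found".
    Simplified: a failed claim maps straight to DISPROVED here."""
    if not claim_holds:
        return "DISPROVED"
    unverified = sorted(k for k, s in citation_status.items() if s != "verified")
    if unverified:
        return f"PROVED (with unverified citations: {', '.join(unverified)})"
    return "PROVED"
```

The point is that an unverifiable citation can never be silently absorbed into a clean PROVED.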
How validate_proof.py catches it: Looks for imports of verify_citations / verify_all_citations, or presence of requests.get.
Rule 3: Anchor To System Time
Failure mode: LLMs are often wrong about the current date. The system prompt might say "today is March 25, 2026" but the LLM might internalize it as 2025 or get the month wrong. If the proof hardcodes a wrong date, the entire time-dependent computation is silently incorrect.
Bad — hardcoded date only:
today = date(2026, 3, 25) # What if the LLM got this wrong?
age = today.year - founding_date.year
Good — anchored to system clock:
from datetime import date
PROOF_GENERATION_DATE = date(2026, 3, 25) # For reproducibility
actual = date.today()
if actual == PROOF_GENERATION_DATE:
    today = PROOF_GENERATION_DATE
    date_note = "System date matches proof generation date"
else:
    today = actual
    date_note = f"Proof generated for {PROOF_GENERATION_DATE}, running on {actual}"
If run on the generation date, this confirms the LLM was right. If run later, the system clock takes over. If the LLM got the date wrong, the mismatch is visible.
How validate_proof.py catches it: Checks the CLAIM_FORMAL["is_time_sensitive"] declaration against actual date.today() usage. Declaring is_time_sensitive: True without date.today() is an issue; using date.today() without the declaration is a warning. When neither condition applies, a hard-coded date(YYYY, ...) without date.today() is flagged as an issue (the common PROOF_GENERATION_DATE = date(YYYY, M, D) pattern is fine when paired with date.today(), as in the example above).
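The check just described can be approximated with a simple source scan. This is a sketch of the heuristic, not the actual validate_proof.py implementation:

```python
import re

def check_time_anchoring(source_code: str, is_time_sensitive: bool) -> list:
    """Return Rule 3 findings for a proof script (illustrative heuristic)."""
    uses_today = "date.today()" in source_code
    hardcodes_date = re.search(r"\bdate\(\s*\d{4}\s*,", source_code) is not None
    findings = []
    if is_time_sensitive and not uses_today:
        findings.append("issue: is_time_sensitive declared but date.today() never called")
    if uses_today and not is_time_sensitive:
        findings.append("warning: date.today() used without is_time_sensitive declaration")
    if not is_time_sensitive and not uses_today and hardcodes_date:
        findings.append("issue: hard-coded date(...) without date.today() anchor")
    return findings
```

A hard-coded date alone is flagged; the same literal alongside date.today() passes, because the system clock still governs the run.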
Rule 4: Explicit Claim Interpretation
Failure mode: Natural language claims are ambiguous. "Over 70 years old" could mean > 70 or >= 70. "Founded" could mean proclaimed, recognized, or admitted to the UN. An LLM silently picks whichever interpretation makes the proof succeed. If the claim is borderline, the wrong choice flips the verdict.
Bad — implicit interpretation:
# "over 70" — is that > 70 or >= 70?
age = compute_age(founding_date, today)
if age >= 70: # Why >= and not >? No documentation.
verdict = "PROVED"
Good — explicit interpretation with rationale:
CLAIM_NATURAL = "The State of Israel is over 70 years old"
CLAIM_FORMAL = {
    "subject": "State of Israel",
    "property": "age in completed calendar years since founding",
    "operator": ">",
    "operator_note": (
        "'over 70' is interpreted as strictly greater than 70. "
        "If Israel were exactly 70 years and 0 days old, this claim would be FALSE. "
        "This is the more conservative interpretation — using >= would make it easier to prove."
    ),
    "threshold": 70,
    "founding_event": (
        "Proclamation of the Declaration of Independence on 14 May 1948. "
        "Alternative founding events (US recognition same day, UN admission 11 May 1949) "
        "are noted but the proclamation date is the standard reference."
    ),
}
The operator_note is critical — it forces the LLM to articulate WHY it chose one operator over another, making the decision auditable.
Epistemic qualifiers: If the claim contains words like "verified," "confirmed," "proven," "established," "debunked," or "disproven," these assert a specific evidentiary status — not just the underlying fact. Decompose using the Contested Qualifier pattern in the compound template (template-compound.md). The operator_note must identify the qualifier and explain why it creates a distinct sub-claim (SC1 for provenance, SC2 for the qualifier's warrant).
How validate_proof.py catches it: Looks for CLAIM_FORMAL dict and checks for operator_note.
Rule 5: Structurally Independent Adversarial Check
Failure mode: LLMs exhibit confirmation bias. When asked to prove a claim, they seek supporting evidence. The "adversarial check" in many LLM-generated proofs is theater — it recomputes the same values a different way, which catches nothing.
Bad — adversarial check that's just the proof restated:
# "Adversarial check": 70 years after 1948 is 2018. Today (2026) > 2018. Claim holds.
# This uses the SAME date (1948) and SAME logic. If 1948 is wrong, this is also wrong.
adversarial_age = 2026 - 1948 # not independent at all
Good — adversarial checks that search for independent counter-evidence:
adversarial_checks = [
    {
        "question": "Was there ever a gap or dissolution of Israel's statehood since 1948?",
        "verification_performed": "web search: 'Israel statehood continuity gap dissolution'",
        "finding": "No credible source documents any interruption of sovereignty.",
        "breaks_proof": False,
    },
    {
        "question": "Is there a credible alternative founding date that would make Israel younger than 70?",
        "verification_performed": "web search: 'Israel founding date dispute alternative'",
        "finding": "Even UN admission (May 1949) yields 76+ years. No date brings age below 70.",
        "breaks_proof": False,
    },
    {
        "question": "Could 'over 70' linguistically require 71+?",
        "verification_performed": "linguistic analysis",
        "finding": "Even under the strictest reading (71+), 77 > 71.",
        "breaks_proof": False,
    },
]
The verification_performed field describes what was done to investigate the question. For empirical proofs this is typically a web search; for pure-math proofs it is a computation or structural analysis. (The legacy field name search_performed is also accepted.)
These are structurally independent: they don't re-derive the founding date or recompute the age. They search for entirely different facts that, if found, would invalidate the proof's assumptions. Perform these via actual web searches BEFORE writing the proof code.
Search against your emerging conclusion, not just against the claim. "Counter-evidence" means evidence that contradicts your current verdict direction. If your proof is converging on PROVED, search for reasons the claim is false. If converging on UNDETERMINED or DISPROVED, search for evidence that supports the claim — that's the real adversarial check when your finding is negative. The goal is to stress-test whichever conclusion you're about to reach.
Tactics for effective adversarial search:
- Search for alternative definitions of key terms (e.g., "founding" = proclamation vs recognition vs UN admission)
- Search for later or earlier milestone dates that could shift the result past the threshold
- Search for contested terminology where the same word means different things in different sources
- Check whether the same institution uses inconsistent wording across pages (e.g., NASA climate page vs NASA FAQ)
- Look for source hierarchy conflicts (primary source disagrees with secondary summary)
- Search for edge cases where the operator choice matters (claim is exactly at the threshold boundary)
- Search for methodological disputes (different measurement approaches yield different numbers)
- Distinguish "not found verbatim" from "inconsistent with sources": public sources frequently round figures. A precise number not appearing verbatim is not counter-evidence if it falls within the range sources report. Before flagging a figure as fabricated or unlocatable, check whether it is consistent with reported ranges. The correct framing is "precise figure not independently sourced", not "appears fabricated." Reserve "fabricated" for figures that contradict or fall outside reported ranges.
breaks_proof must be justified when counter-evidence is found. For each adversarial check:
- If breaks_proof: True — the verdict is forced to UNDETERMINED. No further justification needed.
- If breaks_proof: False AND the check found counter-evidence (not just a reproducibility confirmation or null result) — the finding field must contain an explicit rebuttal: why the counter-evidence does not invalidate the conclusion. "Does not break the proof" or "the proof still holds" is insufficient.
- This rebuttal requirement does NOT apply to reproducibility checks, null-result checks, or edge-case checks where no counter-evidence was discovered.
- Red flag: If the finding text contains "no significant difference," "does not confirm," "contradicts," "insufficient evidence," or "RCTs show no effect" — and breaks_proof is False — the rebuttal must explain why this specific contradiction does not apply. If you cannot write a specific rebuttal, set breaks_proof: True.
- Red flag: If a finding says a number "appears to be fabricated" or "does not appear in any source" — verify whether the number is inconsistent with sources or merely more precise than sources. Rounding differences are not fabrication.
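A heuristic version of this red-flag audit can be sketched as follows. The phrase list and the crude "contains a because" test are assumptions for illustration, not the validator's actual logic:

```python
RED_FLAGS = (
    "no significant difference", "does not confirm", "contradicts",
    "insufficient evidence", "no effect", "appears to be fabricated",
)

def rebuttal_red_flags(adversarial_checks: list) -> list:
    """Return questions whose finding contains counter-evidence language,
    has breaks_proof False, and offers no specific rebuttal."""
    flagged = []
    for check in adversarial_checks:
        if check["breaks_proof"]:
            continue  # verdict is already forced to UNDETERMINED
        finding = check["finding"].lower()
        if any(phrase in finding for phrase in RED_FLAGS):
            # crude proxy for "contains an explicit rebuttal"
            if "because" not in finding and "however" not in finding:
                flagged.append(check["question"])
    return flagged
```

Any question this returns needs either a specific rebuttal written into its finding or breaks_proof set to True.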
How validate_proof.py catches it: Uses AST to verify adversarial_checks is defined as a non-empty list. An empty adversarial_checks = [] is an issue (vocabulary match alone is not sufficient). The absence of any adversarial_checks variable is also an issue. Non-list forms (e.g. adversarial_checks = build_checks()) cannot be AST-counted and are passed with a note that entry count could not be verified.
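The AST pass can be sketched like this (illustrative; the real validator likely handles more assignment forms):

```python
import ast

def adversarial_check_count(proof_source: str):
    """Return the entry count of `adversarial_checks`, None when the value
    is not a literal list (not statically countable), or raise when the
    variable is never assigned."""
    tree = ast.parse(proof_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == "adversarial_checks":
                    if isinstance(node.value, ast.List):
                        return len(node.value.elts)
                    return None  # e.g. adversarial_checks = build_checks()
    raise LookupError("adversarial_checks is never defined")
```

A return of 0 is an issue (empty list), None is passed with a note, and the LookupError case is also an issue.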
Rule 6: Cross-Checks Must Be Truly Independent
Failure mode: A "cross-check" computes the same value two ways — both reading from the same founding_date variable. If founding_date is wrong, BOTH computations are wrong, and the assertion passes because they're consistently wrong.
Bad — shared-variable "cross-check":
# Both methods read founding_date — not independent
age1 = today.year - founding_date.year
age2 = (today - founding_date).days / 365.2425
assert int(age2) == age1 # passes even if founding_date is wrong
Good — truly independent cross-check from separate sources:
from scripts.extract_values import parse_date_from_quote
# Source A: U.S. State Department (different organization, different page)
date_a = parse_date_from_quote(empirical_facts["source_a"]["quote"], "source_a")
# Source B: U.S. National Archives (independent source)
date_b = parse_date_from_quote(empirical_facts["source_b"]["quote"], "source_b")
# Cross-check: independently parsed values must agree
assert date_a == date_b, f"Sources disagree: source_a={date_a}, source_b={date_b}"
# Compute age from EACH source independently
age_a = compute_age(date_a, today)
age_b = compute_age(date_b, today)
assert age_a == age_b, f"Ages disagree: source_a→{age_a}, source_b→{age_b}"
Now if one source has a different date, the assertion catches it. The cross-check has truly independent inputs.
Interpreting "independent" for government statistics: For data published by a single authority (BLS for CPI, Census for population), truly independent measurements don't exist — all sources trace back to the same authority. In this context, "independent" means independent publication and presentation: two different websites that republish BLS data can catch transcription errors, display bugs, or rounding differences between them. This provides weaker assurance than independent measurements but still has value. Note the distinction in the audit doc: "independently published (same upstream authority)" vs "independently measured."
Interpreting "independent" for pure-math proofs: Multiple external sources don't apply. Instead, independence means mathematically distinct approaches — different algorithms, identities, or structural arguments that don't share intermediate computations with the primary method. Examples of genuinely independent cross-checks:
- Primary: direct summation → Cross-check: closed-form identity (e.g., sum of Fibonacci numbers = F(n+2) − 1)
- Primary: brute-force enumeration → Cross-check: algebraic proof or generating function
- Primary: numerical computation → Cross-check: modular/structural analysis (e.g., Pisano periodicity)
- Primary: symbolic algebra → Cross-check: numerical spot-check at specific values
Re-computing the same formula with different variable names, a different loop structure, or a trivially equivalent expression is NOT an independent cross-check. The test: if a bug in the primary method's mathematical reasoning would also affect the cross-check, they are not independent.
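The first bullet above (direct summation cross-checked against the closed-form Fibonacci identity) is directly runnable:

```python
def fib(n: int) -> int:
    """Iterative Fibonacci with F(0) = 0, F(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

n = 20
# Primary method: direct summation of F(1)..F(n)
primary = sum(fib(i) for i in range(1, n + 1))
# Independent cross-check: the identity sum F(1..n) = F(n+2) - 1.
# A bug in the summation loop cannot also corrupt this single evaluation.
cross_check = fib(n + 2) - 1
assert primary == cross_check, f"methods disagree: {primary} != {cross_check}"
```

The two computations share no intermediate results, so they satisfy the independence test stated above.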
Independence from the claim's subject (Conflict of Interest):
Sources must be independent not only of each other, but of the entity or claim being evaluated. COI types to check (drawn from IFCN/Cochrane/ICMJE frameworks):
- Organizational — source is part of the same org, parent org, or subsidiary as the claim subject
- Funding dependency — source receives material funding from the claim subject (or vice versa)
- Institutional co-benefit — source's mission or reputation benefits from a particular verdict
- Competitive antagonism — source is a direct competitor with incentive to discredit (inverse COI)
- Revolving door — key personnel moved between source and claim subject recently
- Advocacy/ideological — source exists to advance a position on the topic being evaluated
A COI does not disqualify a source — it reduces the independence credit. Document identified COIs in the coi_flags field of the relevant cross_checks entry (see output-specs.md).
Majority COI override (source-counting proofs only): For proofs where the verdict depends on how many independent sources confirm a finding (qualitative consensus, compound with source-counting sub-claims): if more than half of the confirmed sources have COI in the same direction (favorable_to_subject or unfavorable_to_subject), the verdict is forced to UNDETERMINED. Count unique source keys, not flag entries — a source with multiple COI types still counts as one source. This does NOT apply to date/age, numeric, or pure-math proofs where threshold represents a claim value rather than a source count. The COI majority check only applies when the verified source count meets or exceeds the sub-claim's threshold — below that, the source-count check already degrades the verdict, and applying COI to a reduced set produces misleading results.
For compound proofs, the COI check runs per sub-claim, not globally.
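The counting rule (unique sources, not flag entries) can be sketched as follows. The shape of the coi_flags entries here is an assumption for illustration; the canonical field layout is in output-specs.md:

```python
from collections import defaultdict

def coi_majority_triggered(confirmed_source_keys: set, coi_flags: list) -> bool:
    """coi_flags entries assumed to look like
    {"source": "source_a", "direction": "favorable_to_subject"}."""
    sources_by_direction = defaultdict(set)
    for flag in coi_flags:
        if flag["source"] in confirmed_source_keys:
            # a set deduplicates: multiple COI types on one source count once
            sources_by_direction[flag["direction"]].add(flag["source"])
    total = len(confirmed_source_keys)
    return any(2 * len(sources) > total for sources in sources_by_direction.values())
```

When this returns True for a source-counting sub-claim whose threshold is already met, the verdict is forced to UNDETERMINED.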
How validate_proof.py catches it: Counts distinct source references (source_a, source_b, etc.). Warns if only one source is found for an empirical proof.
Rule 7: Never Hard-Code Constants Or Formulas
Failure mode: LLMs reconstruct well-known constants and formulas from memory — and sometimes get them wrong. An LLM might type 365.25 (Julian year) instead of 365.2425 (Gregorian year), use eval() for comparisons (unsafe and easy to mis-format), or write an inline age calculation with a subtle off-by-one error. These aren't values extracted from citations (Rule 1 handles those) — they're formulas and constants the LLM "knows" but might misremember.
The deeper issue: Python's dynamic nature means these errors produce valid code that runs without errors. days / 365.25 is perfectly valid Python — it's just using the wrong divisor. The proof runs, produces a number, and nobody notices it's slightly off.
Bad — hard-coded constant and inline formula:
approx_years = total_days / 365.2425 # LLM typed from memory — could be wrong
age = today.year - founding_date.year # inline logic — might miss birthday adjustment
if (today.month, today.day) < (founding_date.month, founding_date.day):
    age -= 1
claim_holds = eval(f"{age} > {threshold}") # eval is unsafe and error-prone
Good — verified constants and tested functions from bundled script:
from scripts.computations import compute_age, compare, DAYS_PER_GREGORIAN_YEAR, days_to_years
age = compute_age(founding_date, today) # tested implementation with birthday adjustment
approx_years = days_to_years(total_days) # uses verified DAYS_PER_GREGORIAN_YEAR
claim_holds = compare(age, ">", 70) # type-safe, no eval()
The bundled scripts/computations.py provides:
- DAYS_PER_GREGORIAN_YEAR = 365.2425 (with derivation in docstring)
- compute_age(birth_date, reference_date) → int (completed calendar years, handles birthday edge case)
- compare(value, operator_string, threshold) → bool (replaces eval(), type-safe)
- days_to_years(days, calendar="gregorian") → float
- compute_elapsed_days(start, end) → int
Each constant includes its mathematical derivation in the docstring, so the proof is auditable.
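compare() can be implemented without eval() via the operator module. This is a sketch of the approach; the bundled scripts/computations.py version may differ in detail:

```python
import operator

_OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "==": operator.eq, "!=": operator.ne,
}

def compare(value, op: str, threshold) -> bool:
    """Type-safe comparison: no eval(), and an unknown operator string
    fails loudly instead of silently evaluating garbage."""
    if op not in _OPS:
        raise ValueError(f"unsupported operator: {op!r}")
    return _OPS[op](value, threshold)
```

Unlike eval(f"{age} > {threshold}"), this cannot execute arbitrary text and rejects malformed operators at the call site.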
Self-documenting output with explain_calc():
There's a subtler version of this problem: the code computes correctly, but the LLM writes a print() statement that describes the formula differently from what the code actually does. The description and the implementation are disconnected — same structural problem as Rule 1.
explain_calc() uses Python's ast module to introspect the actual expression at runtime. The code describes itself:
from scripts.computations import explain_calc, DAYS_PER_GREGORIAN_YEAR
# Bad — LLM writes description separately from computation:
approx_years = total_days / DAYS_PER_GREGORIAN_YEAR
print(f"Approximate age (365.2425 days/year): {approx_years:.2f}") # description could be wrong
# Good — AST walker generates description from the actual code:
approx_years = explain_calc("total_days / DAYS_PER_GREGORIAN_YEAR", locals())
# output: total_days / DAYS_PER_GREGORIAN_YEAR = 28439 / 365.2425 = 77.8633
The three-column output (symbolic → substituted → result) makes every step auditable. Use explain_calc() for any computation whose output the human needs to verify.
How validate_proof.py catches it: Flags hard-coded 365.24* or 365.25 literals, eval() calls, and inline year-subtraction age calculations when compute_age is not imported.
Rule 8: Evidence Relevance For Rejection Verdicts
Failure mode: A proof reaches DISPROVED by counting sources that don't directly study the claim's subject. Example: a macaque study counts as evidence against a human-neurogenesis claim; a review that "questions" findings counts equally with one that "finds no evidence."
Subject-match requirement: For claims about a specific species, organ, population, or domain, at least 2 of 3 rejection sources in the threshold set must directly study that specific subject. A source studying a different species, population, or domain may corroborate but does not count toward the rejection threshold.
Hedged-language downgrade: A source that "questions," "challenges," or "casts doubt on" a finding provides weaker evidence than one that "rejects," "finds no evidence for," or "refutes." If any source in the rejection threshold set uses hedged language, the verdict must be SUPPORTED (against) rather than DISPROVED, unless the hedged source is replaced by a direct-rejection source that meets the subject-match requirement.
How to apply:
1. For each rejection source, note whether it directly studies the claim's subject
2. For each rejection source, classify its language as "direct rejection" or "hedged challenge"
3. If fewer than 2 of 3 sources pass the subject-match test, downgrade to SUPPORTED (against)
4. If any threshold source uses hedged language and no replacement is available, downgrade to SUPPORTED (against)
Document the subject-match and language classification in the adversarial_checks finding field.
Structural enforcement via rejection_statement: Add a rejection_statement field to each empirical_facts entry in disproof proofs, containing the verbatim phrase from the quote that constitutes the rejection. This makes the evidentiary link between source and claim explicit and auditable without semantic analysis. validate_proof.py warns when the field is absent and raises an issue when it is present but not a verbatim substring of the quote.
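The substring check is deliberately trivial to implement; a sketch of the rule as stated:

```python
def audit_rejection_statements(empirical_facts: dict) -> list:
    """Warn when rejection_statement is absent; raise an issue when present
    but not a verbatim substring of the quote (mirrors the rule above)."""
    findings = []
    for key, fact in empirical_facts.items():
        statement = fact.get("rejection_statement")
        if statement is None:
            findings.append((key, "warning: rejection_statement absent"))
        elif statement not in fact["quote"]:
            findings.append((key, "issue: rejection_statement is not a verbatim substring of quote"))
    return findings
```

Because the statement must be copied out of the already verified quote, no semantic analysis is needed to audit the evidentiary link.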
Rule 9: Prose References Are Mechanically Resolvable
Closes: hand-typed author/title strings in narrative prose (the Cheng-vs-Odrzywołek class), and the artifact-gap where build-time template rendering would leave unresolved tokens in the archived file.
Enforced by: tools/lib/reference_resolver.py + tools/lib/prose_reference_scan.py + cite-expand pre-publish gate.
Bad (hand-typed, unverified):
R. Cheng, "The elementary function arithmetic" (arXiv:2603.21852)
Good (verified via {{cite:...}} + cite-expand):
{{cite:arxiv:2603.21852}}
After running proof-site.py cite-expand, the committed markdown becomes:
[A. Odrzywołek, "All elementary functions from a single binary operator"](https://arxiv.org/abs/2603.21852) (2026, arXiv preprint) <!-- cite-source: arxiv:2603.21852 -->
This is what mint-doi uploads to Zenodo, so the archived artifact carries
the resolved attribution, not a template placeholder or a hand-typed claim.
Authoring workflow:
proof-site.py resolve-deps --artifacts-dir <dir> --refresh
proof-site.py cite-expand --artifacts-dir <dir>
proof-site.py verify-prose --artifacts-dir <dir>
proof-site.py publish <dir>
Escape hatch (rare — use only for historical prose, never for actual citations):
<!-- not-a-citation-start --> Einstein (1905) famously... <!-- not-a-citation-end -->
The escape hatch suppresses only DANGLING_SHORT_PATTERN (bare author-year). Author + quoted title must always carry an identifier — there is no escape hatch for that form.
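A minimal version of the unresolved-token scan behind the pre-publish gate (the actual prose_reference_scan.py covers more patterns, such as bare author-year references and the escape-hatch markers):

```python
import re

CITE_TOKEN = re.compile(r"\{\{cite:[^}]+\}\}")

def unresolved_cite_tokens(markdown: str) -> list:
    """Any {{cite:...}} token still present after cite-expand means the
    archived artifact would carry a placeholder instead of a resolved,
    verified attribution."""
    return CITE_TOKEN.findall(markdown)
```

A publish step that refuses to proceed while this list is non-empty closes the artifact gap described above.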
How to Re-run a Proof
Every proof published on this site is independently re-verifiable. You have three options, from lowest to highest effort:
1. Inspect the source in-page (zero install)
Every proof's detail page has a collapsible "View proof source" section showing the exact proof.py that was deposited to Zenodo, syntax-highlighted. Every fact in the verdict banner traces to code visible in that section. If you just want to audit the logic, you never need to leave the page.
2. Re-execute in your browser via Binder (~60s, zero install)
Click "Re-execute in Binder" on any proof page. Binder spins up a temporary container, fetches the immutable Zenodo deposit of that proof, runs it, and prints a verdict. Every cell is visible — no hidden orchestration. First run takes longer while the image builds; subsequent runs are cached.
3. Run locally (persistent setup)
- Install proof-engine — follow the installation guide
- Download proof.py from the proof's detail page
- Run it: python proof.py
- Check the output — the script prints a JSON summary with the verdict, key results, and citation verification details
The proof is self-contained: it fetches sources, verifies citations, runs computations, and prints the result. If the verdict matches what's published here, the proof is independently confirmed. Some proofs cite paywalled sources via local snapshot files — if those files are absent when you re-run, the affected citations will show as unverified but the computation and remaining citations still run.