
The 0% ChatGPT Citation Trap: Why Your AI Visibility Tool Is Probably Wrong

Chudi Nnorukam
ai-citability · calibration · methodology · chatgpt-citation · generative-engine-optimization · llm-visibility-tools · audit-methodology · openai-responses-api

When citability.dev first audited freeCodeCamp.org against ChatGPT in early 2026, the result was 0/8 citations. We told the client "ChatGPT is structurally not citing your site." We were wrong. The number was a tool bug, not a real signal. ChatGPT cites freeCodeCamp constantly. Our own measurement pipeline was returning empty citation arrays from the OpenAI Responses API because we had not forced web_search invocation. The API was behaving exactly as documented, and our tool was producing materially false numbers. Every customer audit we had shipped that week was wrong on the ChatGPT axis.

This post documents the failure mode, the technical fix, the diagnostic test anyone can run on any AI visibility vendor in 60 seconds, and the calibration receipt format that prevents this class of silent measurement failure. The bug is not specific to citability.dev. It almost certainly affects most tools in the AI visibility category right now. The buyer cannot tell from the dashboard.

The Bug: A Default That Looks Like a Citation Floor

OpenAI's Responses API exposes a web_search_preview tool that lets the model search the web before answering. Citation metadata, including the url_citation array that AI visibility tools depend on, only populates when the model actually invokes that tool during the response. The default behavior is tool_choice: "auto", which lets the model decide whether to search. Across thousands of measurement calls we ran, the model decided NOT to search the overwhelming majority of the time. It answered from training data and returned an empty citation array. The API call succeeded. The data we received was technically correct: zero citations were emitted on that response. But the underlying question, "does ChatGPT cite this site," was not actually being measured.
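
To see the failure in isolation, here is a minimal sketch (the query string is illustrative) that attaches the tool but leaves tool_choice at its default and then counts url_citation annotations:

from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-4.1",
    input="what is freecodecamp.org and is it worth using",
    tools=[{"type": "web_search_preview"}],
    # tool_choice defaults to "auto": the model usually decides not to search
)
annotations = [a for item in response.output if item.type == "message"
               for block in item.content if block.type == "output_text"
               for a in getattr(block, "annotations", []) if a.type == "url_citation"]
print(len(annotations))  # usually 0: the answer came from training data, not the web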

Anthropic's Messages API has the same pattern with a different shape. The web_search tool exists. Without tool_choice: {"type": "tool", "name": "web_search"}, Claude answers from training data and the web_search_tool_result block (which carries the citation metadata) never appears in the response. The result looks identical to "Claude does not cite this site," and tools that grep for web_search_tool_result to count citations return zero.
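
A quick way to tell the two cases apart on the Anthropic side is to check whether any content block is a web_search_tool_result at all. A minimal sketch (query string illustrative, model id as used throughout this post):

import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[{"type": "web_search_20250305", "name": "web_search"}],
    # no tool_choice: Claude decides for itself and usually skips the search
    messages=[{"role": "user", "content": "what is freecodecamp.org and is it worth using"}],
)
searched = any(block.type == "web_search_tool_result" for block in response.content)
print(searched)  # False on most runs: no search, so no citation metadata to count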

Perplexity's sonar models are immune to the bug because the architecture forces a search on every query. There is no version of a sonar response without retrieved sources. This is why Perplexity citation rates in most AI visibility tool dashboards look healthy while ChatGPT and Claude rates look broken: the broken rates are usually the tool, not the engine.
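
For contrast, a sonar call never comes back without retrieved sources. A sketch against Perplexity's OpenAI-compatible endpoint (assuming the top-level citations list the API returns alongside the completion):

import os
import requests

resp = requests.post(
    "https://api.perplexity.ai/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['PERPLEXITY_API_KEY']}"},
    json={
        "model": "sonar",
        "messages": [{"role": "user", "content": "what is freecodecamp.org and is it worth using"}],
    },
)
sources = resp.json().get("citations", [])
print(len(sources))  # non-zero in practice: sonar does not answer without retrieving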

Three subtleties make the bug hard to catch without explicit testing. First, in OpenAI's Responses API specifically, only gpt-4.1 reliably emits url_citation annotations even when search is forced. The gpt-4o and gpt-4o-mini variants on this API surface stay silent more often than not. (The separate chat-completions search-preview models, gpt-4o-mini-search-preview and gpt-4o-search-preview, do emit url_citation annotations on a different code path; the rest of this section is about the Responses API specifically.) The Responses-API behavior is undocumented and we confirmed it across thousands of calls. Second, Anthropic's web_search tool is in beta and the response schema for citations is not stable across SDK versions. A tool built against an older SDK can silently lose citations after a transparent SDK update. Third, the system prompt influences how often gpt-4.1 emits citations even with forced search. A neutral prompt yields about one annotation per query. A nudged prompt ("when answering, search the web and cite multiple authoritative sources inline") yields eleven or more.

The 0% Distribution Is the Signature

If a tool's ChatGPT citation rate column shows zeros for a wide swath of customer sites while the Perplexity column shows non-zero rates on the same queries, the tool is almost certainly hit by this bug. The signature is the asymmetry. Real-world engine asymmetry exists (Gemini cites slightly more aggressively than ChatGPT on commercial queries; Perplexity cites smaller brands more often than the larger commercial models do). But that asymmetry shows up as a fifteen-to-thirty-point gap, not as a hundred-point gap.

We confirmed this pattern across multiple audited sites in early 2026, including Wikipedia, freeCodeCamp.org, chudi.dev, and citability.dev. Every site we re-audited had non-zero Perplexity citation rates on brand-recognition queries. Every site had near-zero ChatGPT rates until we forced web_search. After the fix, ChatGPT rates ranged from the mid-teens (the smallest, lowest-authority site) to the mid-seventies (Wikipedia). The tool bug was hiding the entire ChatGPT measurement axis.

The category implication is hard to overstate. Most AI visibility tools (Otterly, Profound, CrowdReply, Knowatoa, Peec AI, the various rebranded SEMrush wrappers, the new entrants showing up monthly) display a per-engine citation rate prominently in their dashboards. If their ChatGPT column is systematically depressed by the same bug we caught in our own tool, then customers are paying for measurement that is wrong on the engine that matters most for SaaS buyers. ChatGPT has 800 million weekly users. A wrong number on that axis is a wrong-investment signal across the entire remediation roadmap a tool then sells.

The diagnostic is cheap: run any tool against Wikipedia for a brand-recognition query, filter to ChatGPT, look at the citation rate. If it is zero, the tool is broken. If it is between sixty and one hundred percent, the tool is calibrated. There is no middle ground for Wikipedia on ChatGPT.
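
In script form the diagnostic is a loop over a handful of brand-recognition queries. A sketch, assuming a count_citations(query, domain) helper that wraps the forced-search call from the next section and returns how many url_citation annotations point at the domain (the extra queries here are illustrative):

queries = [
    "what is wikipedia and what does it do",
    "wikipedia history founding",
    "is wikipedia a reliable source",
    "who runs wikipedia",
    "how is wikipedia funded",
]
cited = sum(1 for q in queries if count_citations(q, "en.wikipedia.org") > 0)
print(f"ChatGPT citation rate for en.wikipedia.org: {cited / len(queries):.0%}")
# near 0% means the measurement pipeline is broken; 60-100% means it is calibrated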

The Fix: Force web_search Explicitly

The technical fix is two lines of code per engine. For OpenAI's Responses API, pass tool_choice as a dictionary specifying the web_search_preview type rather than the default string "auto":

from openai import OpenAI

client = OpenAI()
response = client.responses.create(
    model="gpt-4.1",
    input=query,
    tools=[{"type": "web_search_preview"}],
    tool_choice={"type": "web_search_preview"},  # force the search instead of the default "auto"
)
# The output list interleaves web_search_call items with the answer, so walk every
# message text block rather than assuming output[0] holds the annotations.
citations = [a for item in response.output if item.type == "message"
             for block in item.content if block.type == "output_text"
             for a in getattr(block, "annotations", []) if a.type == "url_citation"]

Without the explicit tool_choice dict, the model defaults to auto and skips search the majority of the time. With it, the model is forced to search and emit citation annotations.

For Anthropic's Messages API, the equivalent is a tool_choice block specifying the web_search tool by name:

import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[{"type": "web_search_20250305", "name": "web_search"}],
    tool_choice={"type": "tool", "name": "web_search"},  # force the search instead of letting Claude decide
    messages=[{"role": "user", "content": query}],
)
citations = []
for block in response.content:
    if block.type == "web_search_tool_result":
        citations.extend(block.content)  # each entry carries the retrieved source metadata

For Perplexity, no override is needed. The sonar models always search.

For OpenAI, one further nudge matters: add a system prompt that asks the model to search and cite multiple sources inline:

system_prompt = "When answering, search the web and cite multiple authoritative sources inline."

In our internal benchmarks, this single addition raised the average annotation count by roughly an order of magnitude on gpt-4.1, taking responses from one or two citations per query into the ten-to-fifteen range. The undocumented behavior is that gpt-4.1 treats system-prompt nudges as a citation-density hint, not just a topical hint. We have not seen this written up anywhere else.
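
In the Responses API, the nudge travels in the instructions field of the same forced-search call; a minimal sketch of the combined setup:

response = client.responses.create(
    model="gpt-4.1",
    instructions=system_prompt,  # the citation-density nudge rides alongside the forced search
    input=query,
    tools=[{"type": "web_search_preview"}],
    tool_choice={"type": "web_search_preview"},
)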

After both fixes land, citation rates on every site we have re-audited come back materially higher and the per-engine asymmetry collapses from zero-to-something into a normal fifteen-to-thirty-point gap.

The Calibration Receipt: Proof Before Numbers

The fix prevents the bug going forward. It does not prove to a buyer that the tool is measuring correctly today. That is what a calibration receipt is for.

A calibration receipt is a small self-test the audit tool runs before measuring the customer's site. It has two halves. The known-positive half asks the tool to measure a canonical AI-cited source, typically Wikipedia for general knowledge or GitHub for code, and confirms the result is in the expected range (60 to 100% citation on Perplexity, 30 to 80% on ChatGPT after the fix lands). The known-negative half asks the tool to measure an invented brand at a .invalid top-level domain. RFC 2606 reserves .invalid as a guaranteed-non-resolving TLD, so any citation to a .invalid URL is a fabrication signal: the engine made it up, or the tool's parsing is treating prose mentions as structured citations. Either way, the tool is broken.
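
On the known-negative side, the check reduces to flagging any citation that points at the reserved TLD; a minimal sketch, operating on whatever list of cited URLs the measurement step produced:

from urllib.parse import urlparse

def fabrication_signals(cited_urls):
    # RFC 2606 reserves .invalid as a never-resolving TLD, so any citation to it
    # was either fabricated by the engine or mis-parsed by the tool.
    return [u for u in cited_urls
            if (urlparse(u).hostname or "").endswith(".invalid")]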

The receipt is a JSON document that ships inside every audit report and looks like this:

{
  "calibration_run_at": "2026-05-01T14:22:08Z",
  "known_positive": {
    "target": "en.wikipedia.org",
    "queries": ["what is wikipedia and what does it do", "wikipedia history founding"],
    "engine_results": {
      "perplexity_sonar": {"cited": 5, "tested": 5, "rate": 1.00},
      "openai_gpt_4_1": {"cited": 4, "tested": 5, "rate": 0.80},
      "anthropic_claude_sonnet_4_6": {"cited": 5, "tested": 5, "rate": 1.00}
    },
    "expected_range": {"perplexity": [0.80, 1.00], "openai": [0.30, 1.00], "anthropic": [0.50, 1.00]},
    "verdict": "PASS"
  },
  "known_negative": {
    "target": "nonexistent-test-domain-47821-avr-calibration.invalid",
    "queries": ["nonexistent test domain ai visibility", "nonexistent brand 47821"],
    "engine_results": {
      "perplexity_sonar": {"cited": 0, "tested": 6},
      "openai_gpt_4_1": {"cited": 0, "tested": 6},
      "anthropic_claude_sonnet_4_6": {"cited": 0, "tested": 6}
    },
    "expected": "0/6 across all engines",
    "verdict": "PASS"
  },
  "overall_verdict": "PASS",
  "customer_audit_proceeds": true
}

The receipt has three properties that make it useful to a buyer instead of just to the tool builder. First, it is reproducible: the queries are listed verbatim and any reader can run them against the engines directly to confirm the numbers match within session variance. Second, it can fail. A receipt that always passes is probably not running. Today's run might fail because of an OpenAI quota exhaustion, an Anthropic SDK version drift, or a Perplexity rate limit, and when it does the customer audit refuses to ship. Third, it carries provenance. The timestamp tells the buyer when the tool was last verified working, not just when the tool was last marketed.
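
The "it can fail" property is just a gate in the audit pipeline. A minimal sketch, assuming the receipt has already been assembled as a dict in the shape shown above:

def evaluate_receipt(receipt):
    # Known positive: every engine's measured rate must land inside its expected range.
    results = receipt["known_positive"]["engine_results"]
    measured = {"perplexity": results["perplexity_sonar"]["rate"],
                "openai": results["openai_gpt_4_1"]["rate"],
                "anthropic": results["anthropic_claude_sonnet_4_6"]["rate"]}
    positive_ok = all(lo <= measured[engine] <= hi
                      for engine, (lo, hi) in receipt["known_positive"]["expected_range"].items())
    # Known negative: any citation of the .invalid target fails the run outright.
    negative_ok = all(r["cited"] == 0
                      for r in receipt["known_negative"]["engine_results"].values())
    receipt["overall_verdict"] = "PASS" if (positive_ok and negative_ok) else "FAIL"
    receipt["customer_audit_proceeds"] = positive_ok and negative_ok
    return receipt["customer_audit_proceeds"]

When the function returns False, the customer audit does not ship and the receipt records the failed run instead.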

For citability.dev specifically, every audit ships with this receipt as the first artifact a customer sees. If you want to verify any vendor's numbers (including ours), the same diagnostic works against any tool. The companion methodology that anchors the receipt against three reference sites and a published curve is documented in Three-Anchor Calibration. The framework that decomposes a single AI visibility number into three independent axes (so the calibrated rate is also actionable) is documented in V/R/C Separation.

Why This Matters for Category Buyers

If you are evaluating AI visibility tools right now, ask the vendor for their calibration receipt before signing anything. The full request goes like this: "Show me the JSON receipt your tool produced on the most recent audit. It should include a known-positive test against Wikipedia and a known-negative test against a .invalid TLD. If you do not have one, when can you produce one?"

Vendors that have one will send it within an hour. The receipt becomes a standard artifact in the sales motion. Vendors that do not have one will go quiet for a week, then reply with a vague description of "internal testing." That is the signal that their numbers are not reliable. The tool may produce a beautiful dashboard, but the numbers in the dashboard are not measuring what you are paying them to measure.

The downstream cost of the wrong numbers is the wrong remediation. If your vendor reports 0% ChatGPT citation, you will spend three months and significant content budget trying to "fix" your ChatGPT visibility when the actual problem is that nobody was measuring it correctly. The audit was wrong, the diagnosis based on the audit was wrong, the remediation based on the diagnosis was wrong, and the next audit (without the fix) will say the remediation did not work. The whole loop is silent failure.

A tool with a calibration receipt cannot quietly produce wrong numbers, because the receipt blocks the audit from shipping when the calibration fails. That is the difference between asking customers to trust the methodology and giving customers proof of the methodology. The category needs to make the second standard practice.

Reproducibility

The diagnostic test described above takes five minutes against any vendor's tool. The full calibration receipt format is open: the JSON schema, the RFC 2606 negative-target methodology, and the per-engine expected ranges all sit in the open-source AI Visibility Readiness Framework on GitHub at github.com/ChudiNnorukam/ai-visibility-readiness. If you build an AI visibility tool, fork the receipt format and ship it with your audits. If you buy one, run the diagnostic and demand the receipt.

For a deeper dive into per-engine citation behavior (why ChatGPT and Perplexity differ, and what the asymmetry tells you about your remediation roadmap), see Chudi's Perplexity vs ChatGPT Citation Rules on chudi.dev. To run the citability.dev free scan against your own site (which ships with the calibration receipt as the first artifact), start at citability.dev/assess. The full methodology lives at citability.dev/docs#methodology, including the AVR Score, VRC components, and the parallel Agent Readiness module.

The category is young. The terminology is forming this quarter. The tools that make calibration receipts standard are the ones buyers will trust six months from now. The ones that do not are the ones whose numbers will quietly stop matching reality, and customers will figure that out one expensive remediation cycle at a time.
