What AI Crawlers Actually See When They Visit Your Site
AI crawlers do not see your website the same way a browser does. They see raw HTML, structured data, and metadata. Everything that requires JavaScript rendering, user interaction, or visual layout is invisible to them.
We built citability.dev specifically to measure this gap. After running infrastructure scans on 50+ websites, patterns emerged that explain why some sites get cited by AI and others get ignored.
The First 3 Seconds of an AI Crawl
When GPTBot (OpenAI), ClaudeBot (Anthropic), or PerplexityBot visits your site, the sequence is predictable:
1. robots.txt - The crawler checks whether it has permission to access the page. If your robots.txt explicitly blocks the crawler's user agent, it stops here.
2. HTTP response - The crawler fetches the page HTML. It does not execute JavaScript. Whatever your server returns as the initial HTML response is all the crawler sees.
3. Structured data - The crawler parses JSON-LD blocks in the HTML head. These provide machine-readable context about the page type, author, dates, and content structure.
4. Content extraction - The crawler scans the body HTML for extractable content: headings, paragraphs, lists, and tables. It looks for clear, factual statements it can use as answer sources.
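Since the sequence starts with robots.txt, that file is the first thing worth getting right. A minimal sketch that explicitly allows the three crawlers named above (the user-agent tokens are the ones each vendor publishes; adapt the rules to your own site):

```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```

Explicit Allow groups also document intent: a crawler follows the most specific user-agent group that matches it, so a broader Disallow rule added later for other bots will not silently catch these crawlers.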
Everything else, including animations, interactive elements, images without alt text, and content loaded via API calls, is invisible.
What Fails the Scan Most Often
Across 50+ scans, three failure patterns account for 80% of infrastructure problems:
Client-Side Rendering Without SSR
Single-page applications that render content via JavaScript are the most common failure. The HTML the crawler receives contains a near-empty <div id="root"></div> with no extractable content.
This affects React (Create React App), Vue (client mode), and Angular applications that do not use server-side rendering. Next.js, Nuxt, and SvelteKit all support SSR by default, but developers often add "use client" directives that push content rendering back to the browser.
The fix: ensure your content pages render as server components. Interactive elements (animations, forms, state) should be isolated in client components while the content itself renders server-side.
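To see why client-side rendering fails the scan, it helps to simulate what a non-JavaScript crawler can extract. A minimal TypeScript sketch (the helper and both HTML snippets are illustrative, not a real crawler):

```typescript
// Rough approximation of crawler-visible text: drop scripts, strip tags.
function extractableText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "") // crawlers do not execute JS
    .replace(/<[^>]+>/g, " ")                   // strip remaining markup
    .replace(/\s+/g, " ")
    .trim();
}

// Client-side rendered app: the crawler receives an empty shell.
const csrHtml =
  '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>';

// Server-rendered page: the content is in the initial response.
const ssrHtml =
  '<html><body><article><h1>What AI Crawlers See</h1>' +
  '<p>AI crawlers read raw HTML.</p></article></body></html>';

console.log(extractableText(csrHtml)); // empty string: nothing to cite
console.log(extractableText(ssrHtml)); // the article text survives
```

Run the same idea against your own production HTML: if the helper returns almost nothing, neither GPTBot nor ClaudeBot has anything to extract.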
Missing or Incomplete Structured Data
Most websites have zero JSON-LD structured data. Of those that do, the majority only include Organization or WebSite schema at the layout level without per-page Article, FAQPage, or HowTo markup.
Per-page structured data matters because it gives AI crawlers explicit context about what each page contains. Without it, the crawler must parse ambiguous HTML and guess the content type, author, and freshness.
The minimum viable schema for content pages:
- Article (or TechArticle) with headline, datePublished, dateModified, and author
- FAQPage for pages with question-answer patterns
- HowTo for tutorial or step-by-step content
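As a sketch of that minimum, here is an Article block the way it might be generated and embedded. Every field value below is a placeholder (TypeScript shown, but the JSON-LD itself is framework-agnostic):

```typescript
// Minimal per-page Article schema; all values below are placeholders.
const articleSchema = {
  "@context": "https://schema.org",
  "@type": "Article",
  headline: "What AI Crawlers Actually See When They Visit Your Site",
  datePublished: "2025-01-15",
  dateModified: "2025-06-01",
  author: { "@type": "Person", name: "Jane Doe" },
};

// Serialized into the page head so crawlers can read it without running JS.
const jsonLdTag =
  '<script type="application/ld+json">' +
  JSON.stringify(articleSchema) +
  "</script>";
```

Because the tag is plain JSON in the initial HTML response, it survives the no-JavaScript crawl that strips everything client-rendered.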
Stale Content Without Date Signals
Pages without dateModified in their schema or visible date indicators get treated as potentially stale by AI systems. Research from Semrush shows that 95% of ChatGPT citations come from recently published or updated content. Pages that were last updated in 2023 are competing against pages updated this quarter.
The fix is not just adding dates. It requires genuine quarterly updates with substantive new content, current statistics, and fresh sources. AI systems are learning to detect fake freshness where the date changes but the content does not.
What Passes: The 8+ Check Pattern
Sites that pass 8 or more of our 10 infrastructure checks share common traits:
- Server-side rendered HTML with content visible in the initial response
- JSON-LD structured data at both the site level and page level
- dateModified schema backed by genuine content updates
- Answer-first content where the primary claim appears in the first 100 words
- Clean heading hierarchy (H1 > H2 > H3) with question-based headings
- robots.txt that explicitly allows AI crawlers (GPTBot, ClaudeBot)
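Put together, a page that passes these checks has a recognizable shape in its raw HTML. A purely illustrative skeleton (headline, dates, and text are placeholders):

```html
<!-- Illustrative only: server-rendered content plus per-page schema -->
<html>
  <head>
    <script type="application/ld+json">
      {"@context": "https://schema.org", "@type": "Article",
       "headline": "How Do AI Crawlers Read a Page?",
       "dateModified": "2025-06-01"}
    </script>
  </head>
  <body>
    <article>
      <h1>How Do AI Crawlers Read a Page?</h1>
      <p>The primary claim appears here, in the first 100 words.</p>
      <h2>Why does server-side rendering matter?</h2>
      <p>Supporting detail follows the answer-first opening.</p>
    </article>
  </body>
</html>
```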
These are not advanced optimizations. They are baseline infrastructure that most websites built after 2020 should already have. The problem is that many sites were built for human browsers and Google, not for AI crawlers that parse raw HTML.
How to See What AI Crawlers See
The simplest test: view your page source (curl -s https://yoursite.com | head -100). If the content is not in that HTML response, AI crawlers cannot see it.
For a comprehensive check, run the free scan. It tests all 10 infrastructure signals and shows exactly which ones pass and which ones fail, with documentation explaining why each signal matters.
The gap between what you see in a browser and what AI crawlers see is where AI visibility problems live. Closing that gap is the first step toward getting cited.