What AI Crawlers Actually See When They Visit Your Site
AI crawlers do not see your website the same way a browser does. They see raw HTML, structured data, and metadata. Everything that requires JavaScript rendering, user interaction, or visual layout is invisible to them.
We built citability.dev specifically to measure this gap. After running infrastructure scans on 50+ websites, patterns emerged that explain why some sites get cited by AI and others get ignored.
The First 3 Seconds of an AI Crawl
When GPTBot (OpenAI), ClaudeBot (Anthropic), or PerplexityBot visits your site, the sequence is predictable:
1. robots.txt - The crawler checks whether it has permission to access the page. If your robots.txt explicitly blocks the crawler's user agent, it stops here.
2. HTTP response - The crawler fetches the page HTML. It does not execute JavaScript. Whatever your server returns as the initial HTML response is all the crawler sees.
3. Structured data - The crawler parses JSON-LD blocks in the HTML head. These provide machine-readable context about the page type, author, dates, and content structure.
4. Content extraction - The crawler scans the body HTML for extractable content: headings, paragraphs, lists, and tables. It looks for clear, factual statements it can use as answer sources.
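Since the sequence starts with robots.txt, that file is the first thing worth getting right. A minimal sketch that explicitly allows the three crawlers named above (the user-agent tokens are the ones each vendor publishes; adapt the rules to your own site):

```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```

Explicit Allow groups also document intent: a crawler follows the most specific user-agent group that matches it, so a broader Disallow rule added later for other bots will not silently catch these crawlers.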
Everything else, including animations, interactive elements, images without alt text, and content loaded via API calls, is invisible.
What Fails the Scan Most Often
Across 50+ scans, three failure patterns account for 80% of infrastructure problems:
Client-Side Rendering Without SSR
Single-page applications that render content via JavaScript are the most common failure. The HTML the crawler receives contains a near-empty <div id="root"></div> with no extractable content.
This affects React (Create React App), Vue (client mode), and Angular applications that do not use server-side rendering. Next.js, Nuxt, and SvelteKit all support SSR by default, but developers often add "use client" directives that push content rendering back to the browser.
The fix: ensure your content pages render as server components. Interactive elements (animations, forms, state) should be isolated in client components while the content itself renders server-side.
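To see why client-side rendering fails the scan, it helps to simulate what a non-JavaScript crawler can extract. A minimal TypeScript sketch (the helper and both HTML snippets are illustrative, not a real crawler):

```typescript
// Rough approximation of crawler-visible text: drop scripts, strip tags.
function extractableText(html: string): string {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, "") // crawlers do not execute JS
    .replace(/<[^>]+>/g, " ")                   // strip remaining markup
    .replace(/\s+/g, " ")
    .trim();
}

// Client-side rendered app: the crawler receives an empty shell.
const csrHtml =
  '<html><body><div id="root"></div><script src="/bundle.js"></script></body></html>';

// Server-rendered page: the content is in the initial response.
const ssrHtml =
  '<html><body><article><h1>What AI Crawlers See</h1>' +
  '<p>AI crawlers read raw HTML.</p></article></body></html>';

console.log(extractableText(csrHtml)); // empty string: nothing to cite
console.log(extractableText(ssrHtml)); // the article text survives
```

Run the same idea against your own production HTML: if the helper returns almost nothing, neither GPTBot nor ClaudeBot has anything to extract.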
Missing or Incomplete Structured Data
Most websites have zero JSON-LD structured data. Of those that do, the majority only include Organization or WebSite schema at the layout level without per-page Article, FAQPage, or HowTo markup.
Per-page structured data matters because it gives AI crawlers explicit context about what each page contains. Without it, the crawler must parse ambiguous HTML and guess the content type, author, and freshness.
The minimum viable schema for content pages:
- Article (or TechArticle) with headline, datePublished, dateModified, and author
- FAQPage for pages with question-answer patterns
- HowTo for tutorial or step-by-step content
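As a sketch of that minimum, here is an Article block the way it might be generated and embedded. Every field value below is a placeholder (TypeScript shown, but the JSON-LD itself is framework-agnostic):

```typescript
// Minimal per-page Article schema; all values below are placeholders.
const articleSchema = {
  "@context": "https://schema.org",
  "@type": "Article",
  headline: "What AI Crawlers Actually See When They Visit Your Site",
  datePublished: "2025-01-15",
  dateModified: "2025-06-01",
  author: { "@type": "Person", name: "Jane Doe" },
};

// Serialized into the page head so crawlers can read it without running JS.
const jsonLdTag =
  '<script type="application/ld+json">' +
  JSON.stringify(articleSchema) +
  "</script>";
```

Because the tag is plain JSON in the initial HTML response, it survives the no-JavaScript crawl that strips everything client-rendered.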
Stale Content Without Date Signals
Pages without dateModified in their schema or visible date indicators get treated as potentially stale by AI systems. Research from Semrush shows that 95% of ChatGPT citations come from recently published or updated content. Pages that were last updated in 2023 are competing against pages updated this quarter.
The fix is not just adding dates. It requires genuine quarterly updates with substantive new content, current statistics, and fresh sources. AI systems are learning to detect fake freshness where the date changes but the content does not.
What Passes: The 8+ Check Pattern
Sites that pass 8 or more of our 10 infrastructure checks share common traits:
- Server-side rendered HTML with content visible in the initial response
- JSON-LD structured data at both the site level and page level
- dateModified schema backed by genuine content updates
- Answer-first content where the primary claim appears in the first 100 words
- Clean heading hierarchy (H1 > H2 > H3) with question-based headings
- robots.txt that explicitly allows AI crawlers (GPTBot, ClaudeBot)
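Put together, a page that passes these checks has a recognizable shape in its raw HTML. A purely illustrative skeleton (headline, dates, and text are placeholders):

```html
<!-- Illustrative only: server-rendered content plus per-page schema -->
<html>
  <head>
    <script type="application/ld+json">
      {"@context": "https://schema.org", "@type": "Article",
       "headline": "How Do AI Crawlers Read a Page?",
       "dateModified": "2025-06-01"}
    </script>
  </head>
  <body>
    <article>
      <h1>How Do AI Crawlers Read a Page?</h1>
      <p>The primary claim appears here, in the first 100 words.</p>
      <h2>Why does server-side rendering matter?</h2>
      <p>Supporting detail follows the answer-first opening.</p>
    </article>
  </body>
</html>
```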
These are not advanced optimizations. They are baseline infrastructure that most websites built after 2020 should already have. The problem is that many sites were built for human browsers and Google, not for AI crawlers that parse raw HTML.
How to See What AI Crawlers See
The simplest test: view your page source (curl -s https://yoursite.com | head -100). If the content is not in that HTML response, AI crawlers cannot see it.
For a comprehensive check, run the free scan. It tests all 10 infrastructure signals and shows exactly which ones pass and which ones fail, with documentation explaining why each signal matters.
The gap between what you see in a browser and what AI crawlers see is where AI visibility problems live. Closing that gap is the first step toward getting cited.