
AI-Ready Website Checklist: 47 Technical Checks for LLM Crawlability

By August Tange · May 20, 2026 · 10 min read

Quick Answer

“AI-ready” is not just a marketing term: it describes a specific technical configuration that determines whether AI training crawlers and real-time retrieval systems can access, understand, and use your content. This 47-point checklist covers the technical elements that affect LLM crawlability, from robots.txt bot permissions and schema markup priority to performance thresholds for real-time retrieval.

  • 47 technical checks across 5 categories
  • 5 AI crawlers to allow in robots.txt
  • 2.5s LCP target for AI retrieval compatibility
  • 7 schema types that improve AI visibility

Section 1: AI Crawler Access (robots.txt)

The first and most critical check: can AI systems actually reach your content? Many websites are inadvertently blocking AI crawlers through restrictive robots.txt rules — either a blanket Disallow: / for User-agent: *, or specific blocks added when AI traffic spiked. Either way, blocked crawlers mean your site is invisible to those AI systems.

robots.txt — AI crawler permissions

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: *
Disallow: /internal-seo-tracking/
Disallow: /admin/

☑ robots.txt Checklist

  • GPTBot — Allow: / (OpenAI training)
  • OAI-SearchBot — Allow: / (OpenAI real-time search)
  • ClaudeBot — Allow: / (Anthropic)
  • PerplexityBot — Allow: / (Perplexity AI)
  • Google-Extended — Allow: / (Google AI training)
  • No blanket Disallow: / under User-agent: *
  • robots.txt is accessible at /robots.txt with 200 status
  • XML sitemap URL declared in robots.txt (Sitemap: https://...)
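
The robots.txt checks above can be spot-checked with a short script. The following is a minimal sketch using Python's standard `urllib.robotparser`; the function name `audit_robots` and the `AI_AGENTS` list are illustrative names, not part of any tool, and the standard-library parser may differ in edge cases from how each crawler actually interprets a robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the checklist above.
AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def audit_robots(robots_txt: str, path: str = "/") -> dict[str, bool]:
    """Return, for each AI user agent, whether `path` is allowed by the rules."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in AI_AGENTS}


def blocked_agents(robots_txt: str, path: str = "/") -> list[str]:
    """List the AI user agents that may not fetch `path`."""
    return [agent for agent, ok in audit_robots(robots_txt, path).items() if not ok]
```

Passing the text of your own /robots.txt to `blocked_agents` returns the agents that would be turned away from the given path. An empty list means none of the five is blocked there.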

Section 2: Technical Foundation

These are the baseline technical requirements that every AI-visible website should have in place. Treat them as mandatory: missing any of them creates crawlability and trust gaps that can suppress citation rates.

☑ Technical Foundation Checklist

  • HTTPS on all pages — required for trust signals (check: no mixed content warnings)
  • XML sitemap exists and is submitted to Google Search Console
  • XML sitemap submitted to Bing Webmaster Tools
  • Canonical tags on all pages — self-referencing and correct for paginated content
  • No orphaned pages — every page linked from at least one other page
  • Clean URL structure — descriptive slugs, no unnecessary parameters, consistent format
  • No broken internal links (404s on linked pages)
  • Redirect chains resolved — no 301 chains longer than 2 hops
  • Mobile-responsive on all pages
  • Core content in HTML — not behind JavaScript rendering that blocks AI crawlers
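
The redirect-chain check in the list above can be automated once you have a map of redirects. The sketch below assumes you have already exported that map from a crawl as a plain dictionary of source path to destination path; `redirect_chain` and `exceeds_hop_limit` are hypothetical helper names, and the 2-hop limit comes from the checklist item.

```python
def redirect_chain(start: str, redirects: dict[str, str]) -> list[str]:
    """Follow `start` through the redirect map and return every URL visited."""
    chain = [start]
    seen = {start}
    while chain[-1] in redirects:
        nxt = redirects[chain[-1]]
        chain.append(nxt)
        if nxt in seen:
            # Redirect loop: stop instead of following it forever.
            break
        seen.add(nxt)
    return chain


def exceeds_hop_limit(start: str, redirects: dict[str, str], max_hops: int = 2) -> bool:
    """True when reaching the final URL from `start` takes more than `max_hops` redirects."""
    hops = len(redirect_chain(start, redirects)) - 1
    return hops > max_hops
```

Any URL for which `exceeds_hop_limit` returns True is a candidate for pointing directly at its final destination.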

Section 3: Schema Markup Priority

Schema markup is how you communicate structured information to AI systems in machine-readable format. Without schema, AI systems must infer your entity type, service offerings, author authority, and content structure from unstructured HTML. Schema eliminates that ambiguity and directly feeds the fields AI systems use when deciding whether to recommend or cite you.

Implement in this priority order:

| # | Schema Type | Where to Add | Priority |
|---|---|---|---|
| 1 | Organization | Homepage | Required |
| 2 | FAQPage | All Q&A and FAQ content | Required |
| 3 | Article / BlogPosting | All blog and content pages | High |
| 4 | BreadcrumbList | All pages with navigation hierarchy | High |
| 5 | LocalBusiness | Contact / About pages (local businesses) | High |
| 6 | Person | Author bio pages, About page | Recommended |
| 7 | AggregateRating | Homepage, service pages | Recommended |
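
As an illustration of the first row, here is one way the Organization markup might be generated. This is a sketch, not a required implementation: the helper names `organization_jsonld` and `as_script_tag` are made up for this example, and any values you pass in (company name, URLs) are your own. The property names (`name`, `url`, `logo`, `sameAs`) are standard schema.org Organization properties.

```python
import json


def organization_jsonld(name: str, url: str, logo: str, same_as: list[str]) -> dict:
    """Build a schema.org Organization object as a plain dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "logo": logo,
        # sameAs lists the organization's profiles on other sites.
        "sameAs": list(same_as),
    }


def as_script_tag(data: dict) -> str:
    """Serialize the object into a JSON-LD script tag for the page <head>."""
    payload = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{payload}\n</script>'
```

The same pattern extends to the other types in the table by changing `@type` and the properties.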

☑ Schema Markup Checklist

  • Organization schema on homepage with name, URL, logo, sameAs social profiles
  • FAQPage schema on every page with FAQ or Q&A section
  • Article schema on all blog posts with author, datePublished, dateModified
  • BreadcrumbList schema on all site pages
  • LocalBusiness schema if you serve a geographic area
  • Person schema on author bio pages
  • No schema validation errors in Google’s Rich Results Test
  • Schema uses JSON-LD format (not Microdata or RDFa)
  • sameAs links in Organization schema pointing to Google Business Profile, LinkedIn, Trustpilot
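
To see which schema types a page actually declares, you can extract its JSON-LD blocks and list their `@type` values. The following is a rough sketch using only the Python standard library; `JsonLdExtractor` and `schema_types` are illustrative names, and the sketch deliberately ignores `@graph` containers and nested objects, so treat it as a quick check alongside Google's Rich Results Test, not a replacement for it.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                # Invalid JSON is skipped here; a real audit should report it.
                pass


def schema_types(html: str) -> set[str]:
    """Return the top-level @type values declared in a page's JSON-LD."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    extractor.close()
    types: set[str] = set()
    for block in extractor.blocks:
        items = block if isinstance(block, list) else [block]
        for item in items:
            declared = item.get("@type") if isinstance(item, dict) else None
            if isinstance(declared, str):
                types.add(declared)
            elif isinstance(declared, list):
                types.update(declared)
    return types
```

Comparing the returned set against the types expected for a page template (for example, Article and BreadcrumbList on a blog post) shows which checklist items are still open.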

Section 4: Content Structure for AI Extraction

Even if AI crawlers can access your pages and parse your schema, they also need to extract and cite your content accurately. Content structure determines how well AI systems can identify, attribute, and use specific passages from your pages.

☑ Content Structure Checklist

  • One H1 per page — contains primary topic keyword
  • H2 for major sections, H3 for sub-sections — clear logical hierarchy
  • Introduction paragraph summarizes the page’s core answer in the first 2–3 sentences
  • Bulleted or numbered lists for enumerable information
  • Author name and bio section on all content pages
  • Publication date and last-updated date visible on page
  • Outbound citations to authoritative sources (at least 2–3 per article)
  • Tables for comparative data — AI systems extract tabular data effectively
  • No critical information in images without alt text or text alternatives
  • Meta description accurately summarizes page content (150–160 characters)
  • Internal links to related pages (minimum 3 per content page)
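
The two heading checks at the top of this list are easy to test mechanically. Below is a minimal sketch built on the standard-library HTML parser; `HeadingCollector` and `heading_issues` are hypothetical names, and the rule that a heading may go at most one level deeper than the previous one is this sketch's interpretation of "clear logical hierarchy".

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Record the level (1-6) of every heading tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.levels: list[int] = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self.levels.append(int(tag[1]))


def heading_issues(html: str) -> list[str]:
    """Return human-readable problems with the page's heading structure."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    issues = []
    h1_count = collector.levels.count(1)
    if h1_count != 1:
        issues.append(f"expected exactly one h1, found {h1_count}")
    # Flag any heading that skips a level, such as an h2 followed by an h4.
    for previous, current in zip(collector.levels, collector.levels[1:]):
        if current > previous + 1:
            issues.append(f"heading level jumps from h{previous} to h{current}")
    return issues
```

An empty list means the page has a single H1 and no skipped heading levels.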

Section 5: Performance Thresholds for AI Retrieval

Real-time AI retrieval systems — used by Perplexity, ChatGPT Search, and Google AI Mode — must access, load, and parse your pages within seconds during a live query response. Pages that are too slow may be skipped in favor of faster-loading sources. These are the performance targets that keep you in the retrieval window.

✓ Target Performance

  • LCP < 2.5 seconds
  • Server response time < 200ms
  • Mobile PageSpeed score > 85
  • INP < 200ms
  • CLS < 0.1

✗ Risk Thresholds

  • LCP > 4 seconds = deprioritized
  • Server response > 500ms = risk
  • PageSpeed < 60 = significant risk
  • JS-blocked critical content
  • Large uncompressed images

☑ Performance Checklist

  • LCP under 2.5 seconds on mobile — verified via PageSpeed Insights
  • Server response time under 200ms (TTFB)
  • Mobile PageSpeed score above 85
  • All images compressed and served in modern formats (WebP)
  • No critical content blocked by JavaScript rendering
  • CSS and JS minified and compressed
  • Browser caching enabled for static assets
  • CDN in use for global delivery speed
  • CLS under 0.1 — no layout shifting that interrupts content parsing
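
The target and risk thresholds in this section can be encoded once and applied to whatever numbers your measurement tool reports. This is a sketch under stated assumptions: the metric keys, the `Threshold` class, and the `classify` function are invented for this example, and the "needs work" label for values between target and risk is not a term from the checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    target: float                   # value that meets the target
    risk: float                     # value beyond which the page is at risk
    higher_is_better: bool = False


# Numbers copied from the target and risk lists in this section.
THRESHOLDS = {
    "lcp_s": Threshold(target=2.5, risk=4.0),
    "ttfb_ms": Threshold(target=200, risk=500),
    "pagespeed": Threshold(target=85, risk=60, higher_is_better=True),
}


def classify(metric: str, value: float) -> str:
    """Return 'target', 'risk', or 'needs work' for a measured value."""
    t = THRESHOLDS[metric]
    if t.higher_is_better:
        if value > t.target:
            return "target"
        if value < t.risk:
            return "risk"
    else:
        if value < t.target:
            return "target"
        if value > t.risk:
            return "risk"
    return "needs work"
```

For example, an LCP of 3.0 seconds is neither under the 2.5-second target nor over the 4-second risk line, so it lands in the middle band.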

Quick Monthly Audit Routine

You don’t need to run the full 47-point checklist every month. This 10-minute monthly routine catches the issues most likely to drift after initial implementation:

  1. Check robots.txt — confirm AI crawler permissions unchanged
  2. Validate new pages in Google Rich Results Test — any new content missing schema?
  3. Run PageSpeed Insights on 3 key pages — any score drops below 85?
  4. Check the Google Search Console Page indexing report (formerly Coverage) — any new 404s or excluded pages?
  5. Confirm XML sitemap reflects all published pages
  6. Verify 2 key articles have updated dateModified if refreshed
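
Step 5 of the routine above lends itself to a small script. The sketch below assumes you can supply two things yourself: the raw XML of a standard sitemap (a `urlset` file, not a sitemap index) and a list of published URLs from your CMS. The function names are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> set[str]:
    """Return every <loc> URL listed in a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def missing_from_sitemap(published: list[str], xml_text: str) -> list[str]:
    """Return published URLs that the sitemap does not list, sorted."""
    return sorted(set(published) - sitemap_urls(xml_text))
```

Any URL returned by `missing_from_sitemap` is a published page the sitemap has not caught up with yet.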

Frequently Asked Questions

What is an AI-ready website?
An AI-ready website is technically configured so that AI training crawlers and real-time retrieval systems can access, understand, and use its content. This means allowing AI bots in robots.txt, implementing schema markup, formatting content with a clear heading hierarchy, and meeting performance thresholds that allow real-time retrieval systems to access pages quickly.

Which AI crawlers should I allow in robots.txt?
Allow all major AI crawlers: GPTBot (OpenAI/ChatGPT training), OAI-SearchBot (OpenAI real-time search), ClaudeBot (Anthropic/Claude), PerplexityBot (Perplexity AI), and Google-Extended (a robots.txt token that controls whether Google may use your content for AI training; it is not a separate crawler). Blocking any of these keeps your content out of the corresponding training data or real-time retrieval system.

Which schema markup should I implement first?
Priority order: (1) Organization schema on homepage, (2) FAQPage schema on Q&A content, (3) Article/BlogPosting on content pages, (4) BreadcrumbList on all pages, (5) LocalBusiness schema if applicable, (6) Person schema for authors, (7) AggregateRating schema if you have reviews. Implement in that order if starting from scratch.

Does page speed affect AI visibility?
Yes. For real-time retrieval systems (Perplexity, ChatGPT Search, Google AI Mode), page speed affects whether your content can be fetched within the retrieval window. Pages with LCP over 4 seconds risk being skipped by retrieval systems that need to synthesize multiple sources quickly. Training crawlers are less sensitive to speed, but it remains a quality signal.

How do I check whether my site is blocking AI crawlers?
Check your robots.txt at yoursite.com/robots.txt. Look for Disallow: / under User-agent: * (which blocks every bot that does not have its own User-agent group, AI crawlers included), or specific disallow rules under User-agent: GPTBot, ClaudeBot, or PerplexityBot. If any have Disallow: / or Disallow: /blog/, those crawlers cannot access that content.

How often should I run this checklist?
Run the full checklist quarterly, and the 10-minute routine each month. The monthly check focuses on robots.txt, new page schema, performance scores, and sitemap currency. The quarterly full audit catches issues that accumulate over time, including redirect chains, orphaned pages, and schema drift.
