AI Crawler Directives, A 2026 Guide for Edmonton Businesses
GPTBot, ClaudeBot, PerplexityBot, Google-Extended — what each crawler does, how to configure robots.txt for them, and what to allow or block in 2026.
By mid-2026, dozens of AI crawlers fetch content from business websites daily — some for training, some for live retrieval, some for both. The default robots.txt most Edmonton businesses have is from 2018 and treats all of them the same way: it ignores them entirely.
Getting AI crawler directives right matters for two reasons: you want the right AI crawlers to see your site so you appear in answers, and you may want to block certain crawlers from training on your content. This post is the practical 2026 guide — what each crawler does, what to allow or block, and how to configure everything without breaking visibility.
GPTBot — OpenAI's training crawler. Fetches content for training future GPT models. If you block this, your content won't be in future GPT training data.
ChatGPT-User — Live fetch when ChatGPT browses for a user's query. Blocking this makes you invisible to ChatGPT answers in real-time.
OAI-SearchBot — OpenAI's search indexing crawler (for their search feature). Affects ChatGPT search results visibility.
ClaudeBot — Anthropic's main crawler. Fetches content for training and retrieval purposes.
Claude-Web — Claude's real-time browsing. Fetches when a Claude user queries about current information.
anthropic-ai — Older user agent, still seen in some deployments.
Google-Extended — Controls whether Google can use your content for training Gemini and future AI products. Separate from general Googlebot (which affects traditional search indexing).
Googlebot — Traditional Google search crawler. Still critical for traditional SEO.
Bingbot — Microsoft's main crawler. Powers Bing, Copilot, and ChatGPT browsing (since ChatGPT uses Bing for some queries).
Note that GPTBot is OpenAI's crawler, not Microsoft's. Copilot draws on the Bing index, so Bingbot is the Microsoft user agent to manage.
PerplexityBot — Perplexity's main crawler. Aggressive real-time crawling for live answers.
Perplexity-User — Sometimes seen for user-triggered fetches.
CCBot (Common Crawl) — Used by many AI companies for training data. Blocking this has broad effects.
YouBot — You.com's crawler.
cohere-ai — Cohere's crawler.
Amazonbot — Amazon's crawler, used for Alexa and internal products.
FacebookBot — Meta's crawler, used for Llama training and other products.
Bytespider — ByteDance's crawler. Sometimes considered aggressive/intrusive.
Claude-SearchBot — Newer Claude bot for search-specific fetching.
Applebot-Extended — Apple's AI training crawler (separate from regular Applebot for search).
Every Edmonton business should pick one of these strategies:
Strategy 1 (allow everything): Your content can be used by all AI crawlers for all purposes.
Right choice for: Service businesses where visibility is the priority. Edmonton SMBs selling services.
Strategy 2 (allow retrieval, block training): Allow crawlers that fetch for live answers, block those that train.
Right choice for: Businesses with proprietary content or unique IP they want to limit.
Strategy 3 (selective blocking): Block some crawlers entirely (often smaller ones or ones perceived as hostile).
Right choice for: Larger sites with bandwidth concerns, sites with aggressive scraper problems.
Strategy 4 (block AI crawlers): Block every identified AI crawler.
Right choice for: Paywalled content sites, proprietary data businesses, publishing companies fighting for monetization.
Strategy 1 — allow everything. Visibility in AI engines is more valuable than the marginal IP concern from training data inclusion.
Strategy 2 — allow retrieval, block training. Preserves visibility while defending IP.
Strategy 4 — block most AI crawlers. But note you'll lose visibility.
Protecting sensitive content is not a robots.txt issue. Use authentication and don't expose that content publicly in the first place.
Strategy 1 (allow everything):
User-agent: *
Allow: /
Sitemap: https://yoursite.ca/sitemap.xml
Strategy 2 (allow retrieval, block training):
# Allow all traditional crawlers
User-agent: *
Allow: /
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
# Explicitly allow retrieval crawlers
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
Sitemap: https://yoursite.ca/sitemap.xml
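One way to confirm that a policy like this behaves as intended is to parse it with Python's standard-library `urllib.robotparser` and ask, agent by agent, whether the home page may be fetched. The sketch below is illustrative only: `ROBOTS_TXT` is a shortened sample of such a policy, `check_agents` is a hypothetical helper name, and Python's parser may resolve edge cases differently from any individual crawler.

```python
from urllib.robotparser import RobotFileParser

# Shortened sample of an "allow retrieval, block training" policy.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def check_agents(robots_txt, agents, url="https://yoursite.ca/"):
    """Return {agent: bool} saying whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "Google-Extended", "ChatGPT-User", "PerplexityBot", "Googlebot"]
    for agent, allowed in check_agents(ROBOTS_TXT, agents).items():
        print(f"{agent:16} {'allowed' if allowed else 'blocked'}")
```

Agents without their own group, such as Googlebot in this sample, fall back to the `User-agent: *` group, which is why they stay allowed.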
Block specific problematic crawlers (example: Bytespider which has a reputation for aggressive crawling):
User-agent: *
Allow: /
User-agent: Bytespider
Disallow: /
Sitemap: https://yoursite.ca/sitemap.xml
Strategy 4 (block every identified AI crawler):
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Applebot-Extended
Disallow: /
Sitemap: https://yoursite.ca/sitemap.xml
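A long block list like this is easier to keep current when it is generated from a single list of user agents. The following is a minimal sketch, assuming you maintain that list yourself; `build_robots_txt` and `BLOCKED_AGENTS` are hypothetical names, and the agents shown are just a few of those mentioned in this guide.

```python
# Hypothetical list of crawlers to block; edit to match your own policy.
BLOCKED_AGENTS = ["GPTBot", "Google-Extended", "CCBot", "Bytespider"]


def build_robots_txt(blocked_agents, sitemap_url=None):
    """Build a robots.txt that allows everyone except blocked_agents."""
    lines = ["User-agent: *", "Allow: /", ""]
    for agent in blocked_agents:
        # One group per blocked crawler, separated by a blank line.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    if sitemap_url:
        lines.append(f"Sitemap: {sitemap_url}")
    # Normalise the ending to exactly one trailing newline.
    return "\n".join(lines).rstrip("\n") + "\n"


if __name__ == "__main__":
    print(build_robots_txt(BLOCKED_AGENTS, "https://yoursite.ca/sitemap.xml"), end="")
```

Adding or removing a crawler during a quarterly review then becomes a one-line change to the list.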
ai.txt is a proposed standard for declaring AI-specific preferences. The spec is at spawning.ai/ai-txt, but adoption remains low in 2026.
For now, robots.txt is the primary mechanism. Adding ai.txt doesn't hurt but isn't widely honored.
noai and noimageai meta tags
Some platforms respect these meta tags as additional AI directives:
<meta name="robots" content="noai, noimageai">
Adoption is inconsistent. Robots.txt remains the primary mechanism.
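To check whether a page actually ships these directives, you can parse its HTML and list what the robots meta tag declares. This is a rough sketch using Python's built-in `html.parser`; `RobotsMetaParser` and `robots_meta_directives` are hypothetical names, and the check says nothing about whether any crawler honors the directives.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_meta_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noai, noimageai"></head>'
    print(sorted(robots_meta_directives(page)))
```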
Some crawlers respect HTTP headers:
X-Robots-Tag: noai, noimageai
The same adoption caveat applies. Use it if your situation warrants a belt-and-suspenders approach.
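Most sites would set this header in their web server or CDN configuration. If your site is served by a Python WSGI application, a small wrapper can add it instead. The sketch below is an assumption-laden example: `add_x_robots_tag` and `demo_app` are hypothetical names, and the header value simply mirrors the one shown above.

```python
def add_x_robots_tag(app, value="noai, noimageai"):
    """Wrap a WSGI app so every response carries an X-Robots-Tag header."""

    def wrapped(environ, start_response):
        def start_with_header(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the header is not duplicated.
            headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", value))
            return start_response(status, headers, exc_info)

        return app(environ, start_with_header)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical stand-in for a real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


# Wrap the application once at startup.
application = add_x_robots_tag(demo_app)
```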
If your site has EU/UK visitors or you want to signal rights reservations under EU Copyright Directive Article 4:
<meta name="tdm-reservation" content="1">
<meta name="tdm-policy" content="https://yoursite.ca/tdm-policy">
This signals a machine-readable rights reservation. Adoption is growing in 2026, particularly for publisher-class sites.
Edmonton SMBs generally don't need this level of legal framework, but publishers and larger content sites might.
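Your web server logs every request. Filter for AI crawler user agents (Google-Extended is a robots.txt control token rather than a separate user agent, so it does not appear in logs):
grep -iE 'GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|Claude-Web|Claude-SearchBot|anthropic-ai|PerplexityBot|Perplexity-User|CCBot|YouBot|cohere-ai|Amazonbot|FacebookBot|Bytespider' access.log
The matching lines show which crawlers are visiting, which URLs they request, and when.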
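For a quick summary rather than raw log lines, the same filter can be turned into a per-crawler count. This is a minimal sketch, assuming a plain-text access log with the user agent somewhere on each line; `count_ai_hits` is a hypothetical helper, and the agent list only mirrors crawler names mentioned in this guide.

```python
import re
from collections import Counter

# Crawler names taken from this guide; extend the list as new ones appear.
AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-Web",
    "Claude-SearchBot", "anthropic-ai", "PerplexityBot", "Perplexity-User",
    "CCBot", "YouBot", "cohere-ai", "Amazonbot", "FacebookBot", "Bytespider",
]
AGENT_RE = re.compile("|".join(re.escape(a) for a in AI_AGENTS), re.IGNORECASE)
CANONICAL = {a.lower(): a for a in AI_AGENTS}


def count_ai_hits(lines):
    """Count log lines per AI crawler name (case-insensitive match).

    The name is matched anywhere in the line, so a URL that happens to
    contain a crawler name would also be counted.
    """
    counts = Counter()
    for line in lines:
        match = AGENT_RE.search(line)
        if match:
            counts[CANONICAL[match.group(0).lower()]] += 1
    return counts
```

To run it on a real file, pass an open file object, for example `count_ai_hits(open("access.log", encoding="utf-8", errors="replace"))`, and print `counts.most_common()`.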
The classic: "we blocked GPTBot a year ago because of an IP concern and now we don't appear in ChatGPT answers." Review and update quarterly.
User-agent: *
Disallow: /
This blocks every crawler, including search engines. It is a surprisingly common accidental misconfiguration on Edmonton business sites.
Robots.txt isn't access control. If content is publicly accessible, scrapers who don't respect robots.txt can still fetch it. Rely on authentication for actual content protection, not robots.txt.
AI crawlers appear and rename frequently. A 2024 robots.txt may not cover 2026's crawler landscape.
A common mistake is confusing Google-Extended (AI training) with Googlebot (regular search). Blocking Googlebot stops Google from crawling your site and removes you from regular search results.
# WRONG — this blocks all of Google
User-agent: Googlebot
Disallow: /
# RIGHT — this only blocks AI training
User-agent: Google-Extended
Disallow: /
Some WordPress and Webflow default configurations don't expose robots.txt properly. Test by fetching https://yoursite.ca/robots.txt directly.
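The two accidental blocks described above (a blanket `Disallow: /` and a blocked Googlebot) can be caught with a short script. The sketch below fetches a live robots.txt and runs it through Python's `urllib.robotparser`; `fetch_robots_txt` and `find_blocking_mistakes` are hypothetical helper names, and `ExampleBrowserBot` is a made-up agent used only to exercise the wildcard group.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def fetch_robots_txt(site):
    """Download robots.txt from a site root such as https://yoursite.ca."""
    with urlopen(site.rstrip("/") + "/robots.txt", timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def find_blocking_mistakes(robots_txt, site="https://yoursite.ca/"):
    """Return warnings for accidental site-wide blocks in robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    warnings = []
    if not parser.can_fetch("Googlebot", site):
        warnings.append("Googlebot is blocked from the home page")
    # "ExampleBrowserBot" is a made-up agent that only matches "User-agent: *".
    if not parser.can_fetch("ExampleBrowserBot", site):
        warnings.append("the wildcard group blocks the home page for everyone")
    return warnings
```

A typical call would be `find_blocking_mistakes(fetch_robots_txt("https://yoursite.ca"))`. An empty list means neither mistake was detected, although it does not prove the file is otherwise correct.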
Emerging in 2026: first-party consent for AI training (analogous to cookie consent for tracking).
Frameworks like TDM reservations (EU), Pomodoro Mark (proposed), and various industry-specific approaches exist. Adoption is still early.
For most Edmonton SMBs, staying with robots.txt is sufficient through 2026. Revisit in 2027-2028 as standards mature.
Patient info should never be crawlable by anyone. That's authentication, not robots.txt. Public content (service pages, team bios) should be allowed for visibility.
Published articles benefit from AI visibility. Confidential case documents should be on authenticated platforms, not publicly indexable.
Varies. Public course catalogs: allow. Student records: authenticate. Research data: depends on publication agreements.
Almost always: allow everything. Being cited by AI engines is a primary growth channel.
Most nuanced case. Consider Strategy 2 (allow retrieval, block training) or Strategy 3 (selective blocking).
Reputable crawlers (OpenAI, Anthropic, Google, Perplexity) respect robots.txt. Less reputable ones don't. Robots.txt is a signal, not access control.
Blocking GPTBot only partially removes you from ChatGPT. ChatGPT browsing uses ChatGPT-User, not GPTBot, for live fetches. But GPTBot is how your content gets into training data, so future ChatGPT models will know less about you if you block it. For most Edmonton businesses, allow GPTBot unless you have specific reasons not to.
Review your robots.txt quarterly. The AI crawler landscape changes fast.
A reasonable default is Strategy 1 (allow everything) with a clean robots.txt and a sitemap.xml reference. For most Edmonton service businesses, AI visibility is more valuable than IP defense.
Add User-agent: Googlebot-Image and other image crawlers with Disallow rules if you want to protect images from AI training while keeping text accessible. Adoption is inconsistent.
Blocking AI crawlers by IP address is technically possible (via Cloudflare or server rules), but rarely worth it. Most AI crawlers operate from cloud infrastructure and their IPs change frequently.
For crawlers that ignore robots.txt, you can block by IP at the CDN/firewall level. Cloudflare has bot management features. For small Edmonton businesses, this is usually overkill: disreputable crawlers account for a small fraction of traffic and aren't worth the maintenance burden.
For WordPress: plugins like Yoast can manage robots.txt. For Next.js: serve as public/robots.txt directly. For Webflow: custom code in site settings. Manual management is fine and gives you more control.
The priority of fine-tuning robots.txt is low for most Edmonton SMBs in 2026. A competent robots.txt is a hygiene signal. Getting it catastrophically wrong (blocking Googlebot by accident) has a massive downside. Getting it perfectly right has only marginal upside over the defaults.
Want your robots.txt audited for AI crawler directives? We'll check what you're allowing and blocking, recommend changes based on your business goals, and verify configuration is actually working. Book a free audit. See also our llms.txt complete guide and schema markup checklist.