AI Crawler Directives: A 2026 Guide for Edmonton Businesses
By mid-2026, dozens of AI crawlers fetch content from business websites daily — some for training, some for live retrieval, some for both. The default robots.txt most Edmonton businesses have is from 2018 and treats all of them the same way: it ignores them entirely.
Getting AI crawler directives right matters for two reasons: you want the right AI crawlers to see your site so you appear in answers, and you may want to block certain crawlers from training on your content. This post is the practical 2026 guide — what each crawler does, what to allow or block, and how to configure everything without breaking visibility.
The crawlers that matter in 2026
OpenAI crawlers
GPTBot — OpenAI's training crawler. Fetches content for training future GPT models. If you block this, your content won't be in future GPT training data.
ChatGPT-User — Live fetch when ChatGPT browses for a user's query. Blocking this makes you invisible to ChatGPT answers in real-time.
OAI-SearchBot — OpenAI's search indexing crawler (for their search feature). Affects ChatGPT search results visibility.
Anthropic crawlers
ClaudeBot — Anthropic's main crawler. Fetches content for training and retrieval purposes.
Claude-Web — Claude's real-time browsing. Fetches when a Claude user queries about current information.
anthropic-ai — Older user agent, still seen in some deployments.
Google crawlers
Google-Extended — Controls whether Google can use your content for training Gemini and future AI products. Separate from general Googlebot (which affects traditional search indexing).
Googlebot — Traditional Google search crawler. Still critical for traditional SEO.
Microsoft / Bing
Bingbot — Microsoft's main crawler. Powers Bing, Copilot, and ChatGPT browsing (since ChatGPT uses Bing for some queries).
Note: GPTBot (covered above) is OpenAI's crawler, not Microsoft's, but because Copilot is built on OpenAI models, some sites attribute Copilot-related fetches to it.
Perplexity
PerplexityBot — Perplexity's main crawler. Aggressive real-time crawling for live answers.
Perplexity-User — Sometimes seen for user-triggered fetches.
Other AI crawlers
CCBot (Common Crawl) — Used by many AI companies for training data. Blocking this has broad effects.
YouBot — You.com's crawler.
cohere-ai — Cohere's crawler.
Amazonbot — Amazon's crawler, used for Alexa and internal products.
FacebookBot — Meta's crawler, used for Llama training and other products.
Bytespider — ByteDance's crawler. Sometimes considered aggressive/intrusive.
Claude-SearchBot — Newer Claude bot for search-specific fetching.
Applebot-Extended — Apple's AI training crawler (separate from regular Applebot for search).
The four strategies
Every Edmonton business should pick one of these strategies:
Strategy 1: Allow everything (most common, default if you do nothing)
Your content can be used by all AI crawlers for all purposes.
- Pros: Maximum visibility in AI engines
- Cons: Your content feeds AI training (for free)
Right choice for: Service businesses where visibility is the priority. Edmonton SMBs selling services.
Strategy 2: Allow retrieval, block training
Allow crawlers that fetch for live answers, block those that train.
- Pros: Still visible in AI answers, content not used for training without consent
- Cons: Harder to get right, and the retrieval-vs-training distinction is newer and not yet universally supported
Right choice for: Businesses with proprietary content or unique IP they want to limit.
Strategy 3: Allow specific engines only
Block some crawlers entirely (often smaller ones or ones perceived as hostile).
- Pros: Reduces bandwidth, prevents content theft by less reputable engines
- Cons: More maintenance, risks blocking something you'd want
Right choice for: Larger sites with bandwidth concerns, sites with aggressive scraper problems.
Strategy 4: Block all AI crawlers
Block every identified AI crawler.
- Pros: Content protected from AI training and AI retrieval
- Cons: Invisible to AI-mediated queries, which are growing fast
- Note: Doesn't guarantee protection (not all crawlers respect robots.txt)
Right choice for: Paywalled content sites, proprietary data businesses, publishing companies fighting for monetization.
The recommended defaults for Edmonton businesses
For most service businesses (lawyers, dentists, agencies, trades, real estate):
Strategy 1 — allow everything. Visibility in AI engines is more valuable than the marginal IP concern from training data inclusion.
For content-heavy businesses (publishers, content sites, SaaS with proprietary research):
Strategy 2 — allow retrieval, block training. Preserves visibility while defending IP.
For paywalled businesses:
Strategy 4 — block most AI crawlers. But note you'll lose visibility.
For SMBs with sensitive client data:
Not a robots.txt issue. Use authentication and don't expose sensitive content publicly in the first place.
Implementing robots.txt for 2026
Base robots.txt (Strategy 1 — allow everything, standard)
User-agent: *
Allow: /
Sitemap: https://yoursite.ca/sitemap.xml
Strategy 2 — allow retrieval, block training
# Allow all traditional crawlers
User-agent: *
Allow: /
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
# Explicitly allow retrieval crawlers
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
Sitemap: https://yoursite.ca/sitemap.xml
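You can sanity-check a directive set like the one above with Python's standard-library robots.txt parser before deploying it. This sketch parses an abridged version of the Strategy 2 file and reports which crawlers may fetch a page (yoursite.ca is a placeholder):

```python
import urllib.robotparser

# Abridged Strategy 2 file: block training crawlers,
# allow retrieval crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

page = "https://yoursite.ca/services"
for agent in ("GPTBot", "Google-Extended", "ChatGPT-User", "Googlebot"):
    verdict = "allowed" if rp.can_fetch(agent, page) else "blocked"
    print(f"{agent}: {verdict}")
# GPTBot: blocked
# Google-Extended: blocked
# ChatGPT-User: allowed
# Googlebot: allowed
```

Note that a specific user-agent group overrides the catch-all `*` group, which is why GPTBot ends up blocked even though `*` allows everything.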
Strategy 3 — selective blocking
Block specific problematic crawlers (example: Bytespider, which has a reputation for aggressive crawling):
User-agent: *
Allow: /
User-agent: Bytespider
Disallow: /
Sitemap: https://yoursite.ca/sitemap.xml
Strategy 4 — block most AI crawlers
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Applebot-Extended
Disallow: /
Sitemap: https://yoursite.ca/sitemap.xml
What about ai.txt?
ai.txt is a proposed standard for declaring AI-specific preferences. The spec is published at spawning.ai/ai-txt, but adoption remains low in 2026.
For now, robots.txt is the primary mechanism. Adding ai.txt doesn't hurt but isn't widely honored.
Beyond robots.txt — meta tags
noai and noimageai meta tags
Some platforms respect these meta tags as additional AI directives:
<meta name="robots" content="noai, noimageai">
Adoption is inconsistent. Robots.txt remains the primary mechanism.
HTTP headers
Some crawlers respect HTTP headers:
X-Robots-Tag: noai, noimageai
Same adoption issue. Use if your situation warrants belt-and-suspenders.
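If you serve content through your own application, the header can be attached at response time. A minimal sketch using Python's standard http.server (in practice you would usually set this in your web server or CDN configuration instead):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoAIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>Public page</body></html>"
        self.send_response(200)
        # Advisory AI directive; honored inconsistently by crawlers.
        self.send_header("X-Robots-Tag", "noai, noimageai")
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# Bind to an OS-assigned port and serve in a background thread.
server = HTTPServer(("127.0.0.1", 0), NoAIHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

resp = urllib.request.urlopen(f"http://127.0.0.1:{port}/")
print(resp.headers["X-Robots-Tag"])  # noai, noimageai
server.shutdown()
```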
TDM reservations (European legal framework)
If your site has EU/UK visitors or you want to signal rights reservations under EU Copyright Directive Article 4:
<meta name="tdm-reservation" content="1">
<meta name="tdm-policy" content="https://yoursite.ca/tdm-policy">
This signals machine-readable rights reservation. Adoption growing in 2026, particularly for publisher-class sites.
Edmonton SMBs generally don't need this level of legal framework, but publishers and larger content sites might.
Monitoring — how to know what's crawling you
Log file analysis
Your web server logs every request. Filter for AI crawler user agents:
grep -iE 'GPTBot|ClaudeBot|Claude-Web|Claude-SearchBot|PerplexityBot|Perplexity-User|Google-Extended|OAI-SearchBot|anthropic-ai|CCBot|YouBot|cohere-ai|Amazonbot|FacebookBot|Bytespider|Applebot-Extended' access.log
Tells you:
- Which AI crawlers are actually visiting you
- How often
- Which pages they're reading
- Whether they're respecting your robots.txt
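The grep above can be turned into a small report. This is a hypothetical sketch (the log format and sample lines are assumptions) that counts requests per AI crawler in combined-format access log lines:

```python
import re
from collections import Counter

# User-agent tokens to look for, mirroring the grep filter.
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "Claude-Web", "PerplexityBot",
    "Google-Extended", "OAI-SearchBot", "anthropic-ai", "CCBot",
    "YouBot", "cohere-ai", "Amazonbot", "FacebookBot", "Bytespider",
]
pattern = re.compile("|".join(map(re.escape, AI_CRAWLERS)), re.IGNORECASE)

def crawler_hits(log_lines):
    """Return a Counter mapping crawler name -> request count."""
    hits = Counter()
    for line in log_lines:
        match = pattern.search(line)
        if match:
            # Normalize to the canonical casing from AI_CRAWLERS.
            name = next(c for c in AI_CRAWLERS
                        if c.lower() == match.group(0).lower())
            hits[name] += 1
    return hits

# Hypothetical sample lines in combined log format.
sample = [
    '1.2.3.4 - - [01/May/2026] "GET /services HTTP/1.1" 200 "-" "GPTBot/1.2"',
    '5.6.7.8 - - [01/May/2026] "GET /blog HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
    '5.6.7.8 - - [01/May/2026] "GET /about HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
]
print(crawler_hits(sample))
```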
Tools
- Screaming Frog Log File Analyzer — paid but reliable
- Cloudflare analytics — decent for Cloudflare-fronted sites
- Sawmill / GoAccess — open-source log analyzers
- Vercel analytics — if hosted on Vercel
What to look for
- Are the crawlers you want actually visiting you? (If not, your robots.txt may be over-blocking or your site may be broken)
- Are blocked crawlers still attempting? (Reputable crawlers typically honor a robots.txt change within 24-48 hours; persistent requests after that suggest a non-compliant bot)
- Are there unusual patterns? (Very high volume from one IP range may be a scraper masquerading as a legitimate crawler)
Common mistakes
Blocking a crawler you wanted to allow
The classic: "we blocked GPTBot a year ago because of an IP concern and now we don't appear in ChatGPT answers." Review and update quarterly.
Over-broad disallow patterns
User-agent: *
Disallow: /
Blocks everyone. Surprisingly common accidental misconfiguration on Edmonton business sites.
Blocking in robots.txt while also showing content publicly
Robots.txt isn't access control. If content is publicly accessible, scrapers who don't respect robots.txt can still fetch it. Rely on authentication for actual content protection, not robots.txt.
Forgetting to update when new crawlers emerge
AI crawlers appear and rename frequently. A 2024 robots.txt may not cover 2026's crawler landscape.
Blocking Googlebot accidentally
Confusing Google-Extended (AI training) with Googlebot (regular search). Blocking Googlebot deindexes you from Google.
# WRONG — this blocks all of Google
User-agent: Googlebot
Disallow: /
# RIGHT — this only blocks AI training
User-agent: Google-Extended
Disallow: /
Not serving robots.txt at all
Some WordPress and Webflow default configurations don't expose robots.txt properly. Test by fetching https://yoursite.ca/robots.txt directly.
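To test programmatically rather than by hand, a short script can fetch robots.txt and flag the most common failure modes: a missing file, an empty response, or the accidental catch-all Disallow described above. A minimal sketch (check_robots performs a live fetch, so run it against your own domain):

```python
import urllib.request
import urllib.error

def check_robots(site):
    """Fetch https://<site>/robots.txt; return (status, body)."""
    url = f"https://{site}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, ""

def looks_healthy(status, body):
    """Minimal health check: file exists, is non-empty, and does
    not accidentally Disallow everything for the * group."""
    if status != 200 or not body.strip():
        return False
    agents, blocking_all = [], False
    for line in body.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("user-agent:"):
            agents.append(line.split(":", 1)[1].strip())
        elif line.lower().startswith("disallow:"):
            if line.split(":", 1)[1].strip() == "/" and "*" in agents:
                blocking_all = True
        elif not line:
            agents = []  # blank line ends a group
    return not blocking_all

# The "over-broad disallow" misconfiguration vs. a sane default:
print(looks_healthy(200, "User-agent: *\nDisallow: /\n"))  # False
print(looks_healthy(200, "User-agent: *\nAllow: /\n"))     # True
```

Usage against a live site would be `looks_healthy(*check_robots("yoursite.ca"))`.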
What about user consent frameworks?
Emerging in 2026: first-party consent for AI training (analogous to cookie consent for tracking).
Frameworks like TDM reservations (EU), Pomodoro Mark (proposed), and various industry-specific approaches exist. Adoption is still early.
For most Edmonton SMBs, staying with robots.txt is sufficient through 2026. Revisit in 2027-2028 as standards mature.
Sector-specific considerations
Healthcare (Edmonton clinics, hospitals)
Patient info should never be crawlable by anyone. That's authentication, not robots.txt. Public content (service pages, team bios) should be allowed for visibility.
Legal (Edmonton law firms)
Published articles benefit from AI visibility. Confidential case documents should be on authenticated platforms, not publicly indexable.
Education
Varies. Public course catalogs: allow. Student records: authenticate. Research data: depends on publication agreements.
Professional services (agencies, consultants)
Almost always: allow everything. Being cited by AI engines is a primary growth channel.
Publishers / content businesses
Most nuanced case. Consider Strategy 2 (allow retrieval, block training) or Strategy 3 (selective blocking).
Frequently asked questions
Do crawlers actually respect robots.txt?
Reputable crawlers (OpenAI, Anthropic, Google, Perplexity) do. Less reputable ones don't. Robots.txt is a signal, not access control.
Will blocking GPTBot hurt my ChatGPT visibility?
Partially. ChatGPT browsing uses ChatGPT-User, not GPTBot, for live fetches. But GPTBot is how your content gets into training data — future ChatGPT models will know less about you if you block it. For most Edmonton businesses, allow GPTBot unless you have specific reasons not to.
How often should I review my robots.txt?
Quarterly. AI crawler landscape changes fast.
Is there a "one size fits all" robots.txt for Edmonton SMBs?
Strategy 1 (allow everything) with a clean robots.txt and sitemap.xml reference. For most Edmonton service businesses, AI visibility is more valuable than IP defense.
What about image-specific AI training?
Image-specific controls are less mature than text controls. The noimageai directive (via meta tag or X-Robots-Tag) signals that images shouldn't be used for AI training, but adoption is inconsistent. Be aware that blocking Googlebot-Image in robots.txt removes your images from Google Image Search entirely; it is not an AI-training-only opt-out.
Can I block by geography or crawler IP?
Technically yes (via Cloudflare or server rules), but rarely worth it. Most AI crawlers operate from cloud infrastructure and IPs change frequently.
What if a crawler ignores my robots.txt?
You can block by IP at the CDN/firewall level. Cloudflare has bot management features. For small Edmonton businesses, this is usually overkill; disreputable crawlers account for a small fraction of traffic and aren't worth the maintenance burden.
Should I use a plugin for this?
For WordPress: plugins like Yoast can manage robots.txt. For Next.js: serve as public/robots.txt directly. For Webflow: custom code in site settings. Manual management is fine and gives you more control.
What's the ROI of getting this right?
Low for most Edmonton SMBs in 2026. A competent robots.txt is a hygiene signal. Getting it catastrophically wrong (blocking Googlebot by accident) has massive downside. Getting it perfectly right has marginal upside over defaults.
Want your robots.txt audited for AI crawler directives? We'll check what you're allowing and blocking, recommend changes based on your business goals, and verify configuration is actually working. Book a free audit. See also our llms.txt complete guide and schema markup checklist.