AI Crawler Directives: A 2026 Guide for Edmonton Businesses
By mid-2026, dozens of AI crawlers fetch content from business websites daily — some for training, some for live retrieval, some for both. The default robots.txt most Edmonton businesses have is from 2018 and treats all of them the same way: it ignores them entirely.
Getting AI crawler directives right matters for two reasons: you want the right AI crawlers to see your site so you appear in answers, and you may want to block certain crawlers from training on your content. This post is the practical 2026 guide — what each crawler does, what to allow or block, and how to configure everything without breaking visibility.
The crawlers that matter in 2026
OpenAI crawlers
GPTBot — OpenAI's training crawler. Fetches content for training future GPT models. If you block this, your content won't be in future GPT training data.
ChatGPT-User — Live fetch when ChatGPT browses for a user's query. Blocking this makes you invisible to ChatGPT answers in real-time.
OAI-SearchBot — OpenAI's search indexing crawler (for their search feature). Affects ChatGPT search results visibility.
Anthropic crawlers
ClaudeBot — Anthropic's main crawler. Fetches content for training and retrieval purposes.
Claude-Web — Claude's real-time browsing. Fetches when a Claude user queries about current information.
anthropic-ai — Older user agent, still seen in some deployments.
Google crawlers
Google-Extended — Controls whether Google can use your content for training Gemini and future AI products. Separate from general Googlebot (which affects traditional search indexing).
Googlebot — Traditional Google search crawler. Still critical for traditional SEO.
Microsoft / Bing
Bingbot — Microsoft's main crawler. Powers Bing, Copilot, and ChatGPT browsing (since ChatGPT uses Bing for some queries).
Note: GPTBot (covered above) is OpenAI's crawler, not Microsoft's, but because Copilot is built on OpenAI models, some sites attribute Copilot-related fetches to it.
Perplexity
PerplexityBot — Perplexity's main crawler. Aggressive real-time crawling for live answers.
Perplexity-User — Sometimes seen for user-triggered fetches.
Other AI crawlers
CCBot (Common Crawl) — Used by many AI companies for training data. Blocking this has broad effects.
YouBot — You.com's crawler.
cohere-ai — Cohere's crawler.
Amazonbot — Amazon's crawler, used for Alexa and internal products.
FacebookBot — Meta's crawler, used for Llama training and other products.
Bytespider — ByteDance's crawler. Sometimes considered aggressive/intrusive.
Claude-SearchBot — Newer Claude bot for search-specific fetching.
Applebot-Extended — Apple's AI training crawler (separate from regular Applebot for search).
The four strategies
Every Edmonton business should pick one of these strategies:
Strategy 1: Allow everything (most common, default if you do nothing)
Your content can be used by all AI crawlers for all purposes.
- Pros: Maximum visibility in AI engines
- Cons: Your content feeds AI training (for free)
Right choice for: Service businesses where visibility is the priority. Edmonton SMBs selling services.
Strategy 2: Allow retrieval, block training
Allow crawlers that fetch for live answers, block those that train.
- Pros: Still visible in AI answers, content not used for training without consent
- Cons: Harder to get right, and the retrieval-vs-training distinction is newer and not yet universally supported
Right choice for: Businesses with proprietary content or unique IP they want to limit.
Strategy 3: Allow specific engines only
Block some crawlers entirely (often smaller ones or ones perceived as hostile).
- Pros: Reduces bandwidth, prevents content theft by less reputable engines
- Cons: More maintenance, risks blocking something you'd want
Right choice for: Larger sites with bandwidth concerns, sites with aggressive scraper problems.
Strategy 4: Block all AI crawlers
Block every identified AI crawler.
- Pros: Content protected from AI training and AI retrieval
- Cons: Invisible to AI-mediated queries, which are growing fast
- Note: Doesn't guarantee protection (not all crawlers respect robots.txt)
Right choice for: Paywalled content sites, proprietary data businesses, publishing companies fighting for monetization.
The recommended defaults for Edmonton businesses
For most service businesses (lawyers, dentists, agencies, trades, real estate):
Strategy 1 — allow everything. Visibility in AI engines is more valuable than the marginal IP concern from training data inclusion.
For content-heavy businesses (publishers, content sites, SaaS with proprietary research):
Strategy 2 — allow retrieval, block training. Preserves visibility while defending IP.
For paywalled businesses:
Strategy 4 — block most AI crawlers. But note you'll lose visibility.
For SMBs with sensitive client data:
Not a robots.txt issue. Use authentication and don't expose sensitive content publicly in the first place.
Implementing robots.txt for 2026
Base robots.txt (Strategy 1 — allow everything, standard)
User-agent: *
Allow: /
Sitemap: https://yoursite.ca/sitemap.xml
Strategy 2 — allow retrieval, block training
# Allow all traditional crawlers
User-agent: *
Allow: /
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
# Explicitly allow retrieval crawlers
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
Sitemap: https://yoursite.ca/sitemap.xml
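You can sanity-check a directive set like the one above with Python's standard-library robots.txt parser before deploying it. This sketch parses an abridged version of the Strategy 2 file and reports which crawlers may fetch a page (yoursite.ca is a placeholder):

```python
import urllib.robotparser

# Abridged Strategy 2 file: block training crawlers,
# allow retrieval crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

page = "https://yoursite.ca/services"
for agent in ("GPTBot", "Google-Extended", "ChatGPT-User", "Googlebot"):
    verdict = "allowed" if rp.can_fetch(agent, page) else "blocked"
    print(f"{agent}: {verdict}")
# GPTBot: blocked
# Google-Extended: blocked
# ChatGPT-User: allowed
# Googlebot: allowed
```

Note that a specific user-agent group overrides the catch-all `*` group, which is why GPTBot ends up blocked even though `*` allows everything.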
Strategy 3 — selective blocking
Block specific problematic crawlers (example: Bytespider, which has a reputation for aggressive crawling):
User-agent: *
Allow: /
User-agent: Bytespider
Disallow: /
Sitemap: https://yoursite.ca/sitemap.xml
Strategy 4 — block most AI crawlers
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Applebot-Extended
Disallow: /
Sitemap: https://yoursite.ca/sitemap.xml
What about ai.txt?
ai.txt is a proposed standard for declaring AI-specific preferences. The spec is published at spawning.ai/ai-txt, but adoption remains low in 2026.
For now, robots.txt is the primary mechanism. Adding ai.txt doesn't hurt but isn't widely honored.
Beyond robots.txt — meta tags
noai and noimageai meta tags
Some platforms respect these meta tags as additional AI directives:
<meta name="robots" content="noai, noimageai">
Adoption is inconsistent. Robots.txt remains the primary mechanism.
HTTP headers
Some crawlers respect HTTP headers:
X-Robots-Tag: noai, noimageai
Same adoption issue. Use if your situation warrants belt-and-suspenders.
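If you serve content through your own application, the header can be attached at response time. A minimal sketch using Python's standard http.server (in practice you would usually set this in your web server or CDN configuration instead):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoAIHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>Public page</body></html>"
        self.send_response(200)
        # Advisory AI directive; honored inconsistently by crawlers.
        self.send_header("X-Robots-Tag", "noai, noimageai")
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# Bind to an OS-assigned port and serve in a background thread.
server = HTTPServer(("127.0.0.1", 0), NoAIHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

resp = urllib.request.urlopen(f"http://127.0.0.1:{port}/")
print(resp.headers["X-Robots-Tag"])  # noai, noimageai
server.shutdown()
```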
TDM reservations (European legal framework)
If your site has EU/UK visitors or you want to signal rights reservations under EU Copyright Directive Article 4:
<meta name="tdm-reservation" content="1">
<meta name="tdm-policy" content="https://yoursite.ca/tdm-policy">
This signals machine-readable rights reservation. Adoption growing in 2026, particularly for publisher-class sites.
Edmonton SMBs generally don't need this level of legal framework, but publishers and larger content sites might.
Monitoring — how to know what's crawling you
Log file analysis
Your web server logs every request. Filter for AI crawler user agents:
grep -iE 'GPTBot|ClaudeBot|Claude-Web|Claude-SearchBot|PerplexityBot|Perplexity-User|Google-Extended|OAI-SearchBot|anthropic-ai|CCBot|YouBot|cohere-ai|Amazonbot|FacebookBot|Bytespider|Applebot-Extended' access.log
Tells you:
- Which AI crawlers are actually visiting you
- How often
- Which pages they're reading
- Whether they're respecting your robots.txt
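The grep above can be turned into a small report. This is a hypothetical sketch (the log format and sample lines are assumptions) that counts requests per AI crawler in combined-format access log lines:

```python
import re
from collections import Counter

# User-agent tokens to look for, mirroring the grep filter.
AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "Claude-Web", "PerplexityBot",
    "Google-Extended", "OAI-SearchBot", "anthropic-ai", "CCBot",
    "YouBot", "cohere-ai", "Amazonbot", "FacebookBot", "Bytespider",
]
pattern = re.compile("|".join(map(re.escape, AI_CRAWLERS)), re.IGNORECASE)

def crawler_hits(log_lines):
    """Return a Counter mapping crawler name -> request count."""
    hits = Counter()
    for line in log_lines:
        match = pattern.search(line)
        if match:
            # Normalize to the canonical casing from AI_CRAWLERS.
            name = next(c for c in AI_CRAWLERS
                        if c.lower() == match.group(0).lower())
            hits[name] += 1
    return hits

# Hypothetical sample lines in combined log format.
sample = [
    '1.2.3.4 - - [01/May/2026] "GET /services HTTP/1.1" 200 "-" "GPTBot/1.2"',
    '5.6.7.8 - - [01/May/2026] "GET /blog HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
    '5.6.7.8 - - [01/May/2026] "GET /about HTTP/1.1" 200 "-" "PerplexityBot/1.0"',
]
print(crawler_hits(sample))
```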
Tools
- Screaming Frog Log File Analyzer — paid but reliable
- Cloudflare analytics — decent for Cloudflare-fronted sites
- Sawmill / GoAccess — open-source log analyzers
- Vercel analytics — if hosted on Vercel
What to look for
- Are the crawlers you want actually visiting you? (If not, your robots.txt may be over-blocking or your site may be broken)
- Are blocked crawlers still attempting? (Reputable crawlers typically honor a robots.txt change within 24-48 hours; persistent requests after that suggest a non-compliant bot)
- Are there unusual patterns? (Very high volume from one IP range may be a scraper masquerading as a legitimate crawler)
Common mistakes
Blocking a crawler you wanted to allow
The classic: "we blocked GPTBot a year ago because of an IP concern and now we don't appear in ChatGPT answers." Review and update quarterly.
Over-broad disallow patterns
User-agent: *
Disallow: /
Blocks everyone. Surprisingly common accidental misconfiguration on Edmonton business sites.
Blocking in robots.txt while also showing content publicly
Robots.txt isn't access control. If content is publicly accessible, scrapers who don't respect robots.txt can still fetch it. Rely on authentication for actual content protection, not robots.txt.
Forgetting to update when new crawlers emerge
AI crawlers appear and rename frequently. A 2024 robots.txt may not cover 2026's crawler landscape.
Blocking Googlebot accidentally
Confusing Google-Extended (AI training) with Googlebot (regular search). Blocking Googlebot deindexes you from Google.
# WRONG — this blocks all of Google
User-agent: Googlebot
Disallow: /
# RIGHT — this only blocks AI training
User-agent: Google-Extended
Disallow: /
Not serving robots.txt at all
Some WordPress and Webflow default configurations don't expose robots.txt properly. Test by fetching https://yoursite.ca/robots.txt directly.
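To test programmatically rather than by hand, a short script can fetch robots.txt and flag the most common failure modes: a missing file, an empty response, or the accidental catch-all Disallow described above. A minimal sketch (check_robots performs a live fetch, so run it against your own domain):

```python
import urllib.request
import urllib.error

def check_robots(site):
    """Fetch https://<site>/robots.txt; return (status, body)."""
    url = f"https://{site}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, ""

def looks_healthy(status, body):
    """Minimal health check: file exists, is non-empty, and does
    not accidentally Disallow everything for the * group."""
    if status != 200 or not body.strip():
        return False
    agents, blocking_all = [], False
    for line in body.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("user-agent:"):
            agents.append(line.split(":", 1)[1].strip())
        elif line.lower().startswith("disallow:"):
            if line.split(":", 1)[1].strip() == "/" and "*" in agents:
                blocking_all = True
        elif not line:
            agents = []  # blank line ends a group
    return not blocking_all

# The "over-broad disallow" misconfiguration vs. a sane default:
print(looks_healthy(200, "User-agent: *\nDisallow: /\n"))  # False
print(looks_healthy(200, "User-agent: *\nAllow: /\n"))     # True
```

Usage against a live site would be `looks_healthy(*check_robots("yoursite.ca"))`.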
What about user consent frameworks?
Emerging in 2026: first-party consent for AI training (analogous to cookie consent for tracking).
Frameworks like TDM reservations (EU), Pomodoro Mark (proposed), and various industry-specific approaches exist. Adoption is still early.
For most Edmonton SMBs, staying with robots.txt is sufficient through 2026. Revisit in 2027-2028 as standards mature.
Sector-specific considerations
Healthcare (Edmonton clinics, hospitals)
Patient info should never be crawlable by anyone. That's authentication, not robots.txt. Public content (service pages, team bios) should be allowed for visibility.
Legal (Edmonton law firms)
Published articles benefit from AI visibility. Confidential case documents should be on authenticated platforms, not publicly indexable.
Education
Varies. Public course catalogs: allow. Student records: authenticate. Research data: depends on publication agreements.
Professional services (agencies, consultants)
Almost always: allow everything. Being cited by AI engines is a primary growth channel.
Publishers / content businesses
Most nuanced case. Consider Strategy 2 (allow retrieval, block training) or Strategy 3 (selective blocking).
Frequently asked questions
Do crawlers actually respect robots.txt?
Reputable crawlers (OpenAI, Anthropic, Google, Perplexity) do. Less reputable ones don't. Robots.txt is a signal, not access control.
Will blocking GPTBot hurt my ChatGPT visibility?
Partially. ChatGPT browsing uses ChatGPT-User, not GPTBot, for live fetches. But GPTBot is how your content gets into training data — future ChatGPT models will know less about you if you block it. For most Edmonton businesses, allow GPTBot unless you have specific reasons not to.
How often should I review my robots.txt?
Quarterly. AI crawler landscape changes fast.
Is there a "one size fits all" robots.txt for Edmonton SMBs?
Strategy 1 (allow everything) with a clean robots.txt and sitemap.xml reference. For most Edmonton service businesses, AI visibility is more valuable than IP defense.
What about image-specific AI training?
Image-specific controls are less mature than text controls. The noimageai directive (via meta tag or X-Robots-Tag) signals that images shouldn't be used for AI training, but adoption is inconsistent. Be aware that blocking Googlebot-Image in robots.txt removes your images from Google Image Search entirely; it is not an AI-training-only opt-out.
Can I block by geography or crawler IP?
Technically yes (via Cloudflare or server rules), but rarely worth it. Most AI crawlers operate from cloud infrastructure and IPs change frequently.
What if a crawler ignores my robots.txt?
You can block by IP at the CDN/firewall level. Cloudflare has bot management features. For small Edmonton businesses, this is usually overkill; disreputable crawlers account for a small fraction of traffic and aren't worth the maintenance burden.
Should I use a plugin for this?
For WordPress: plugins like Yoast can manage robots.txt. For Next.js: serve as public/robots.txt directly. For Webflow: custom code in site settings. Manual management is fine and gives you more control.
What's the ROI of getting this right?
Low for most Edmonton SMBs in 2026. A competent robots.txt is a hygiene signal. Getting it catastrophically wrong (blocking Googlebot by accident) has massive downside. Getting it perfectly right has marginal upside over defaults.
Want your robots.txt audited for AI crawler directives? We'll check what you're allowing and blocking, recommend changes based on your business goals, and verify configuration is actually working. Book a free audit. See also our llms.txt complete guide and schema markup checklist.