Block LLMs from scraping by allowing customization of robots.txt #113

I’m more than happy to have my site indexed by search engines such as Google, Bing, Kagi, etc. But I’d love to have a way to block LLMs such as ChatGPT from training on data from my site.

As far as I’m aware, the main viable option for this is robots.txt. Page-level <meta> nofollow tags would also turn away search engines, since there’s no way to target specific user-agents with them. Additionally, DNS-level anti-scraping products from companies such as Cloudflare only stop malicious scrapers (e.g. ones harvesting emails/phone numbers).

9 months ago

#1 Meta HTML Tags

(Unfortunately, there’s not a lot of documentation on other LLMs like Bard/Claude, and blocking GoogleBot will lower your search rankings. If anyone finds documentation for other LLM user-agents, feel free to share it here!)

#2 Cloudflare Web Application Firewall
https://developers.cloudflare.com/waf/tools/user-agent-blocking/#cloudflare-user-agent-blocking

Source: https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

Hope this is useful for other folks!

8 months ago

Currently the robots.txt does block ChatGPT. However, I want to make this configurable on a blog-by-blog basis, since the block also stops ChatGPT’s browsing feature from reading page content, which some people would like to allow.

6 months ago

Changed the status to Planned

6 months ago

I’ve added a robots.txt editor. In your dashboard, go to Settings > Advanced settings and add the following (along with any other rules you’d like):

User-agent: GPTBot
Disallow: /
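
If you want to sanity-check a rule like this before saving it, here is a minimal sketch using Python's standard-library urllib.robotparser. The helper name is_allowed and the example.com URL are just placeholders for illustration, and it only shows how a parser that follows the robots.txt rules would read the file, not what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# The rule from the post above, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`.

    Hypothetical helper: parses the text locally, no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/some-post"  # placeholder URL
    for agent in ("GPTBot", "Googlebot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, page) else "blocked"
        print(f"{agent}: {verdict}")
```

With only the GPTBot rule present, other user-agents have no matching entry and fall through to the default of being allowed, so search engines are unaffected.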

6 months ago

Changed the status to Completed

6 months ago

Thank you!

6 months ago
