I’m more than happy to have my site indexed by search engines such as Google, Bing, Kagi, etc. But I’d love to have a way to block LLMs such as ChatGPT from training on data from my site.
As far as I’m aware, the main viable option for this would be robots.txt. Page-level <meta> nofollow tags would turn away search engines as well, since as far as I know there’s no way to target specific user-agents. Additionally, DNS-level anti-scraping products from companies such as Cloudflare only prevent malicious scrapers (e.g. ones looking for emails/phone numbers).
#1 Meta HTML Tags
<meta name="ChatGPT-User" content="noindex,nofollow">
<meta name="CCBot" content="noindex,nofollow">
<meta name="GPTBot" content="noindex,nofollow">
(Unfortunately, there’s not a lot of documentation on other LLMs like Bard/Claude, and blocking Googlebot will lower your search rankings. If anyone finds documentation on other LLM user-agents, feel free to share here!)
#2 Cloudflare Web Application Firewall
https://developers.cloudflare.com/waf/tools/user-agent-blocking/#cloudflare-user-agent-blocking
Source: https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/
Hope this is useful for other folks!
Currently the robots.txt does block ChatGPT. However, I do want to make it more granular on a blog-by-blog basis, since this also blocks ChatGPT’s browsing feature from reading page content, which some people would like to keep.
I’ve added a robots.txt editor. In your dashboard go to Settings > Advanced settings and add the following (as well as any additions you’d like):
User-agent: GPTBot
Disallow: /
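If you want to sanity-check your rules before saving them, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt contents, the is_allowed helper, and the example.com URL are all made up for illustration; the user-agent names are just the ones mentioned in this thread. Keep in mind that robots.txt is advisory, so this only shows how a crawler that honors the file would interpret it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the training crawlers named in this thread
# and leave every other user-agent (search engines included) allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler honoring robots_txt may fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/my-post/"  # placeholder URL
    for agent in ("GPTBot", "CCBot", "ChatGPT-User", "Googlebot", "Bingbot"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

With these rules, GPTBot and CCBot are disallowed while ChatGPT-User and the search engine crawlers fall through to the wildcard entry. To block the browsing feature as well, you would add a ChatGPT-User entry of the same shape.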
Thank you!