Author Topic: AI crawlers problems (and solutions)  (Read 888 times)
boreal_cryptid
Full Member ⚓︎
they/them
« on: June 28, 2025 @611.37 »

recently AI crawlers have been getting even more aggressive. i decided to start a topic about it, and about possible solutions.

stories so far:

possible solutions:
  • Anubis - Anubis is a Web AI Firewall Utility that weighs the soul of your connection using one or more challenges in order to protect upstream resources from scraper bots.
  • ai.robots.txt - A list of AI agents and robots to block. There are limitations to this. The most obvious one is that many bots will simply not respect or honor your robots.txt. The ones from major companies supposedly do. Think of it as more of a polite request, rather than a "block" - there's a small sketch of the idea below.
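as a rough illustration (not the project's actual generated file - check their prepared files for the real, current list), a robots.txt group that disallows a few known AI agents looks like this:

    User-agent: GPTBot
    User-agent: ClaudeBot
    User-agent: Bytespider
    Disallow: /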
crazyroostereye
Full Member ⚓︎
I am most defiantly a Human
« Reply #1 on: June 28, 2025 @694.42 »

This is a difficult subject. Anubis is a bit of a nuclear response to scraper bots, as it can also block good bots like archivers and web indexers.

ai.robots.txt, on the other hand, is a good list of scraper User-Agents, but it relies on the User-Agent header, which can be easily manipulated. I have also heard stories that the big scrapers (Amazon and co.) use typical browser User-Agents and ISP IPs for their scrapers instead of their bot IPs, though I cannot confirm these stories.

But I can say both projects are good solutions; it's just sad that we have to use them.

TheFrugalGamer
Hero Member
« Reply #2 on: July 01, 2025 @616.18 »

There's also Glaze, which claims to scramble your images making them unusable by AI:

https://glaze.cs.uchicago.edu/

I have no idea how effective it is, but it's kind of hard to prove a negative, so it may just be that we have to wait around and see if any AI bots manage to scrape glazed artwork.

Julikins
Full Member
Ghost's favorite meat? BOO-loney!
« Reply #3 on: July 01, 2025 @906.45 »

I would second Glaze and Nightshade as well, but sometimes the issue is not having enough horsepower to get the program to run. That said, I understand a lot of folks are willing to help others and apply the glazing on their end. I could also suggest something like Artshield. I was going to put Sanative here as well, but lately their login link isn't working, so I opted not to add it.

I want to say another way to go about it is to overlay a light layer on an image that could confuse the AI crawlers, but I can't say for sure what specifically; I only remember bits and pieces of it (which is more or less what's already been noted)...


brisray
Sr. Member ⚓︎
« Reply #4 on: July 02, 2025 @997.47 »

I run my own web server, and this morning (1st July) I ran my monthly log rotation and saw that June's logs were 1.70 GB in size (7,245,677 requests, 4,385,798 pages) rather than the usual 300 MB (1,400,000 requests, 1,000,000 pages).

I used Log Parser to list the user agents, and here are some of the top requesters:

GPTBot - 3,807,710 requests
Scrapy - 2,593,094
Barkrowler - 52,715
Meta-ExternalAgent - 42,502
ClaudeBot - 29,145
PetalBot - 20,653
MJ12bot - 14,399
Bytespider - 12,866

What to do about them?

There are loads of others, such as the search engine and archiving bots, but I am not worried about those. What does bother me are the AI bots and those where you have to pay to get their insights into the scrapes - basically, they charge users for data they got for free.

Most say they respect robots.txt. If that doesn't work, then since I run my own Apache server I could use SetEnvIfNoCase or a rewrite rule to deny them access.
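For example, a minimal sketch of the SetEnvIfNoCase approach on Apache 2.4 (the bot names are examples taken from the list above; needs mod_setenvif, and the rules should be adapted to your own setup):

    SetEnvIfNoCase User-Agent "GPTBot" bad_bot
    SetEnvIfNoCase User-Agent "Bytespider" bad_bot
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>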

There are lists of bad bots but adding them all appears to me to be overkill.
crazyroostereye
Full Member ⚓︎
I am most defiantly a Human
« Reply #5 on: July 03, 2025 @320.01 »

Quote from: brisray on July 02, 2025 @997.47

Yeah, I noticed similar too, but I can recommend ai.robots.txt; it covers the worst-offending AI scrapers. Those have a tendency to come back more frequently and to make more processing-intensive calls if you run a non-static site. Implementing the list is also fairly easy, as the project already has prepared files for the most common web servers. Not perfect, but it's the least intrusive option I know.

boreal_cryptid
Full Member ⚓︎
they/them
« Reply #6 on: July 08, 2025 @18.24 »

 :mark:  new stuff: Cloudflare blocks AI crawlers by default and launches Pay Per Crawl for publishers
Quote
Cloudflare has rolled out a default policy to block known artificial intelligence web crawlers, aiming to prevent the unapproved collection and use of website content by AI companies. Under the new approach, domain owners setting up a site on Cloudflare are prompted to specify whether or not to permit AI crawler access, giving users immediate control over data scraping activities.
Loebas
Full Member ⚓︎
« Reply #7 on: July 08, 2025 @267.15 »

I have all known AI bots denied in my robots.txt. But in my eyes, robots.txt is just a passive "do not disturb" sign, since it is up to the bot itself to honor it or not. And experience has taught me that some AI bots don't.

Another option is to take the problematic ASN, convert it to an IP range, and deny those IPs access with Apache's .htaccess (sketch below). Apache will just serve up a 403 to these bots.
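A minimal sketch of what that could look like in .htaccess on Apache 2.4 (the CIDR range here is just a placeholder; substitute the ranges you derived from the ASN):

    <RequireAll>
        Require all granted
        Require not ip 203.0.113.0/24
    </RequireAll>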

If you run your own web server, you can also use its firewall to block those bots as well, which is more effective and saves more bandwidth.
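On a Linux server that could be as simple as an iptables drop rule, for instance (again, the range is a placeholder):

    # drop all traffic from the offending range before it reaches the web server
    iptables -A INPUT -s 203.0.113.0/24 -j DROP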

In my eyes, you cannot just take someone's hard work to be used in a random LLM, especially when the webmaster sees nothing in return. It's just stealing.
boreal_cryptid
Full Member ⚓︎
they/them
« Reply #8 on: August 17, 2025 @346.79 »

Cloudflare accuses Perplexity of evading AI crawler blocks on sites using stealth tactics, lol.
Quote
...Perplexity rotated user-agent strings and changed autonomous system networks to avoid detection on sites that explicitly blocked automated access via robots.txt files and similar methods. The activity reportedly spanned millions of daily requests across tens of thousands of domains.
Quote
Perplexity’s spokesperson denied the claims, calling the report a publicity stunt and asserting that the named bot wasn’t theirs and no content was accessed.

tbh i believe Cloudflare more than this shady AI startup :D
crazyroostereye
Full Member ⚓︎
I am most defiantly a Human
« Reply #9 on: August 17, 2025 @464.81 »

Quote from: boreal_cryptid on August 17, 2025 @346.79

Oh yeah, I have now heard from multiple places that AI bots seem to be changing their User-Agent to standard browser agents. The bigger companies usually have a set of IPs from which they run their bots (Amazon, for example), which makes them easy to block. But even worse, they have lately started using other IPs, including IPs from residential ranges, making it nearly impossible to block them effectively by traditional means.

Furbisms
Jr. Member ⚓︎
What's up party people?
« Reply #10 on: August 17, 2025 @479.69 »

Quote from: crazyroostereye on August 17, 2025 @464.81

This is utterly infuriating!! I've been meaning to add a robots.txt to my site for a while now, but hearing that it gets disrespected so much is upsetting. I don't want to do something as scorched-earth as using Anubis though, because I hear it can cause problems for legitimate users in certain circumstances.

I wish companies weren't so shady and awful. They shouldn't be allowed to do this stuff. It's so clearly a violation of what should be seen as moral and acceptable. It shouldn't be allowed. I know that's a slippery slope kind of mindset but it genuinely is upsetting they can do this. I don't know what else to say. I can't wrap my head around anyone being able to see this as okay. It's so shady!!

crazyroostereye
Full Member ⚓︎
I am most defiantly a Human
« Reply #11 on: August 17, 2025 @797.38 »

Quote from: Furbisms on August 17, 2025 @479.69

Yeah, the disrespecting of robots.txt isn't new; there have been plenty of cases of scrapers and companies forgetting that robots.txt is a thing. The bigger issue is the manipulation of the User-Agent, because the alternative is setting a server rule that rejects specific user agents instead of merely suggesting, via robots.txt, that they're not allowed. That's what I was referring to with ai.robots.txt, as they provide ready-made snippets for a variety of web servers - but those rules can't function without a proper User-Agent. That is already evil, but worse still is the use of residential IPs: even black-hat people, short of hacking somebody, have never had access to such IP pools. These big companies do. This is something black-hat people couldn't even dream of achieving.

Blue
Full Member ⚓︎
« Reply #12 on: August 19, 2025 @639.11 »

A bit ago I came across a post on Bluesky talking about an HTML zip bomb, including a link to it. I will admit, I am still not sure how this functions (not that tech-savvy yet, but we're getting there); maybe it could be worth looking through?
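From what I can gather, the usual trick is to pre-compress a huge file of repetitive data into a tiny gzip archive and serve it with a Content-Encoding: gzip header, so a bot that decompresses responses ends up trying to expand gigabytes in memory. A rough sketch of making one (sizes illustrative; not from the linked post):

    # ~10 GB of zeros shrinks to roughly 10 MB of gzip
    dd if=/dev/zero bs=1M count=10240 | gzip -9 > bomb.html.gz

Regular visitors are never sent that file; you would only serve it to clients you've already decided are bad bots.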


G-StarTheFirst
Newbie ⚓︎
« Reply #13 on: August 19, 2025 @885.47 »

SORRY I MADE COPILOT CRAWL THE EVERYONE SITE  :drat: