Long story short: my VPS, which forwards traffic to my home servers through Tailscale, got hammered by thousands of requests per minute from Anthropic’s Claude AI, all of them from different AWS IPs.
The VPS has a 1TB monthly cap, but it’s still kinda shitty to have huge spikes like the 13GB in just a couple of minutes today.
How do you deal with something like this?
I’m only really running a Caddy reverse proxy on the VPS, which forwards my home server’s services through Tailscale.
I’d really like to avoid solutions like Cloudflare, since they f over CGNAT users very frequently and all that. Don’t think a WAF would help with this at all(?), but rate limiting on the reverse proxy might work.
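For reference, Caddy doesn’t ship a rate limiter in core; a rough sketch assuming the third-party mholt/caddy-ratelimit module (requires a custom Caddy build; the zone name, limits, and upstream address are made up):

```caddyfile
{
	order rate_limit before reverse_proxy
}

example.com {
	rate_limit {
		zone per_client {
			key    {remote_host}
			events 100
			window 1m
		}
	}
	reverse_proxy 100.64.0.2:8080
}
```

This caps each client IP at 100 requests per minute before Caddy starts returning 429s, which blunts exactly the kind of burst described above.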
(VPS has fail2ban and I’m using /etc/hosts.deny for manual blocking. There’s a WIP website on my root domain with robots.txt that should be denying AWS bots as well…)
I’m still learning and would really appreciate any suggestions.
Honestly we need some sort of proof of work (PoW)
This is the most realistic solution. Adding a 0.5/1s PoW to hosted services isn’t gonna be a big deal for the end user, but offers a tiny bit of protection against bots, especially if the work factor is variable and escalates.
It also does its job against bots: negligible for one human, but it forces anyone hammering you with thousands of requests to pay for the resources they’d otherwise abuse.
There are a lot of cryptocurrencies that use an escalating PoW work factor to combat spam. Nano is one of them, so it’s pretty proven technology, too.
I’m not putting crypto on my website, though. But I think it would be feasible to do something like Argon2.
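As a toy illustration of the handshake (using SHA-256 instead of Argon2 so it’s dependency-free; the challenge string and difficulty are made up): the server hands out a challenge, the client grinds for a nonce, and the server verifies in a single hash.

```python
import hashlib
import itertools

def solve(challenge: str, difficulty: int) -> int:
    """Client side: find a nonce so the hash starts with `difficulty` zero hex digits."""
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce

def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash to check the client's work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve("session-abc123", 4)         # client burns ~65k hashes on average
assert verify("session-abc123", nonce, 4)  # server verifies with one hash
```

Escalating the work factor just means raising `difficulty` per offending IP: each extra hex digit multiplies the client’s average work by 16 while the server’s check stays constant.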
I guess sending tar bombs can be fun
Go on.
You first pick them up.
Then you throw them.
Classic!
instructions unclear, tsar bomba away
I’m struggling to find it, but there’s something like an “AI tarpit” that causes scrapers to get stuck. Something like that? I’m sure I saw it posted on Lemmy recently; hopefully someone can link it.
If you had read the OP, they don’t want the scrapers using up all their traffic.
Yes, I did read the OP.

Edit: I see this was downvoted without a response, but I’ll put this out there anyway.
If you host a public site, which you expect anyone can access, there is very little you can do to exclude an AI scraper specifically.
Hosting your own site for personal use? IP blocks etc will prevent scraping.
But how do you tell legitimate users from scrapers? It’s very difficult.

They will use up your traffic either way. Don’t want that? You could waste their time (tarpit), or take your hosting away from public access.

Downvoter: what’s your alternative?
I did find this GitHub link as the first search result; looks interesting. Thanks for letting me know the term “tar pit”.
If you’re looking to stop them from wasting your traffic, do not use a tarpit. The whole point of it is that it makes the scraper get stuck on your server forever. That means you pay for the traffic the scraper uses, and it will continually rack up those charges until the people running it wise up and ban your server. The question you gotta ask yourself is, who has more money, you or the massive AI corp?
Tarpits are the dumbest bit of anti-AI tech to come out yet.
There’s more than one style of tar pit. In this case you obviously wouldn’t want to use an endless maze style.
What you want to do in this case is send them through an HAProxy instance that routes on User-Agent: whenever they come in as Claude, you send them over to a box running behind a WANem process at modem speeds.
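Roughly the HAProxy half of that (the WANem box shaping traffic to modem speeds sits behind it; names and addresses are made up):

```haproxy
frontend fe_web
    bind *:80
    acl is_claude hdr_sub(User-Agent) -i claudebot
    use_backend be_tarpit if is_claude
    default_backend be_real

backend be_real
    server app 100.64.0.2:8080

backend be_tarpit
    server slowbox 10.0.0.9:80    # the WANem-throttled box
```

`hdr_sub` does a case-insensitive substring match on the header, so any Claude user-agent variant containing that token gets shunted to the slow backend.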
They’ll immediately realize they’ve got a hug of death going on and give up.
I don’t quite understand how this is deployed. Hosting this behind a dedicated subdomain or path kind of defeats the purpose, since the bots are still able to access the actual website no problem.
The trick is distinguishing them by behavior and switching what you serve them
How would I go about doing that? This seems to be the challenging part. You don’t want false positives and you also want replayability.
If you’ve already noticed the incoming traffic is weird, you look for what distinguishes the sources you don’t want. You write rules covering behaviors like user agent, order of requests, IP ranges, etc., put them in your web server, and tell it to check whether an incoming request matches the rules as a session starts.
Unless you’re a high value target for them, they won’t put endless resources into making their systems mimic regular clients. They might keep changing IP ranges, but that usually happens ~weekly and you can just check the logs and ban new ranges within minutes. Changing client behavior to blend in is harder at scale - bots simply won’t look for the same things as humans in the same ways, they’re too consistent, even when they try to be random they’re too consistently random.
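The “check the logs and ban new ranges” step can be semi-automated. A sketch (log format, path, and threshold are assumptions) that collapses offending client IPs into /16 blocks so whole ranges can be banned at once:

```python
import collections
import ipaddress
import re

# Matches a leading IPv4 address in common/combined log format lines.
IP_RE = re.compile(r"^(\d+\.\d+\.\d+\.\d+)")

def hot_ranges(log_lines, threshold=1000):
    """Return /16 networks that account for at least `threshold` requests."""
    counts = collections.Counter()
    for line in log_lines:
        m = IP_RE.match(line)
        if m:
            net = ipaddress.ip_network(f"{m.group(1)}/16", strict=False)
            counts[str(net)] += 1
    return [net for net, n in counts.most_common() if n >= threshold]
```

Feed it the lines of your access log and drop the resulting ranges into your firewall or hosts.deny.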
When enough rules match, you throw in either a redirect or an internal URL rewrite rule for that session to point them to something different.
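In Caddy terms, a user-agent rule plus a different handler could look roughly like this (matcher name, bot regex, and upstream address are made up; extend the rules with whatever your logs show):

```caddyfile
example.com {
	@aibots header_regexp User-Agent "(?i)(GPTBot|ClaudeBot|CCBot|Bytespider)"
	handle @aibots {
		# Cheap dead end instead of the real backend; could also be a
		# rewrite to a tarpit path.
		respond 403
	}
	handle {
		reverse_proxy 100.64.0.2:8080
	}
}
```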
there is also https://forge.hackers.town/hackers.town/nepenthes
Now I just want to host a web page and expose it with nepenthes…
First, because I’m a big fan of carnivorous plants.
Second, because it lets you poison LLMs and AIs and fuck with their data.
Lastly, because I can do my part and say F#CK Y0U to those privacy data hungry a$$holes !
I don’t even expose anything directly to the web (it’s always accessed through a tunnel like WireGuard) or have any important data to protect from AI or LLMs. But just getting the opportunity to fuck with them while they continuously harvest data from everyone is something I was already thinking of, but didn’t know how to do.
Thanks for the link !
Try crowdsec.
You can set it up with lists that are updated frequently, have it look at the Caddy proxy logs, and then it can easily block AI/bot-like traffic.

I have it blocking over 100k IPs at the moment.
Not gonna lie, the $3900/mo at the top of the /pricing page is pretty wild.
Searched “crowdsec docker” and they have docs and all that. Thank you very much! I’ve heard of CrowdSec before but never paid much attention; absolutely will check this out!

The paid plans get you the “premium” blocklists, which include one specially made to prevent AI scrapers, but a free account will still get you the actual software, the community blocklist, plus up to three “basic” lists.
You don’t have to pay to use it
What are you hosting and who are your users? Do you receive any legitimate traffic from AWS or other cloud provider IP addresses? There will always be edge cases like people hosting VPN exit nodes on a VPS, but if it’s a tiny portion of your legitimate traffic, I would consider blocking all incoming traffic from cloud providers and then whitelisting any that make sense, like search engine crawlers, if necessary.
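As a sketch of that blocking approach: AWS publishes its current address ranges as JSON at a well-known URL, so the blocklist can be generated automatically (the nftables-style rule text here is illustrative, not a ready-made ruleset):

```python
import ipaddress
import json
import urllib.request

# AWS really does publish this file with every prefix it owns.
AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

def aws_prefixes():
    """Fetch and deduplicate AWS's published IPv4 prefixes."""
    with urllib.request.urlopen(AWS_RANGES_URL) as resp:
        data = json.load(resp)
    return sorted({p["ip_prefix"] for p in data["prefixes"]})

def block_rules(prefixes):
    """Validate each prefix and emit one drop rule per network."""
    return [f"ip saddr {ipaddress.ip_network(p)} drop" for p in prefixes]
```

Other big providers (GCP, Azure, Oracle) publish similar range files, so the same trick covers most cloud-origin scrapers.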
Build tar pits.
this might not be what you meant, but the word “tar” made me think of tar.gz. Don’t most sites compress the HTTP response body with gzip? What’s to stop you from sending a zip bomb over the network?
Even if that were possible, I don’t want to crash innocent people’s browsers. My tar pits are deployed on live environments that normal users could find themselves navigating to. It’s also overkill: if you simply respond to 404 Not Found with 200 OK and serve 15MB on the “error” page, bots will stop going to your site, because you’re not important enough to deal with. It’s a low bar, but your data isn’t worth someone looking at your tactics and even thinking about circumventing them. They just stop attacking you.
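For what it’s worth, the asymmetry behind the zip-bomb idea is real: gzip collapses repetitive data to almost nothing, so the client that decompresses pays far more than the server that sends.

```python
import gzip

# 10 MB of zeros compresses to roughly 10 KB on the wire, about a 1000:1 ratio.
payload = b"\x00" * (10 * 1024 * 1024)
compressed = gzip.compress(payload)
print(len(payload), "->", len(compressed), "bytes")
```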
They want to reduce the bandwidth usage. Not increase it!
A good tar pit will reduce your bandwidth. Tarpits aren’t about shoving useless data at bots; they’re about responding as slow as possible to keep the bot connected for as long as possible while giving it nothing.
Endlessh accepts the connection and then… does nothing. It doesn’t even actually begin the SSH handshake. It just very… slowly… sends… an endless preamble, until the bot gives up.
As I write, my Internet-facing SSH tarpit currently has 27 clients trapped in it. A few of these have been connected for weeks. In one particular spike it had 1,378 clients trapped at once, lasting about 20 hours.
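A stripped-down sketch of that trick (not Endlessh itself; port, delay, and banner text are made up): never send the real SSH identification string, just drip legal pre-handshake banner lines forever.

```python
import socket
import threading
import time

def banner_line(i: int) -> bytes:
    # RFC 4253 lets a server send arbitrary CRLF-terminated lines before its
    # real "SSH-..." identification string, so clients keep waiting politely.
    return b"x-%d: still negotiating\r\n" % i

def trap(conn: socket.socket, delay: float) -> None:
    i = 0
    try:
        while True:            # drip one useless banner line, then stall
            conn.sendall(banner_line(i))
            i += 1
            time.sleep(delay)
    except OSError:
        conn.close()           # client finally gave up

def tarpit(port: int = 2222, delay: float = 10.0) -> None:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen(64)
    while True:
        conn, _ = srv.accept()
        # One cheap thread per trapped client; mostly sleeping, so CPU stays low.
        threading.Thread(target=trap, args=(conn, delay), daemon=True).start()
```

Run it on port 22 and move your real sshd elsewhere; the cost per trapped bot is a sleeping thread and a few bytes every `delay` seconds.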
Fair. But I haven’t seen any anti-ai-scraper tarpits that do that. The ones I’ve seen mostly just pipe 10MB of /dev/urandom out there.
Also I assume that the programmers working at ai companies are not literally mentally deficient. They certainly would add
.timeout(10)
or whatever to their scrapers. They probably have something more dynamic than that.

Ah, that’s where tuning comes in. Look at the logs, take the average timeout, and tune the tarpit to return a minimum payload: a minimal HTML page containing a single, slightly different URL back into the tar pit. Or, better yet, JavaScript that loads a page of tarpit URLs very slowly. Bots have to be able to run JS, or else they’re missing half the content on the web. I’m sure someone has created a JS forkbomb.
Variety is the spice of life. AI botnet blacklists are probably the better solution for web content; you can run ssh on a different port and run a tarpit on the standard port, and it will barely affect you. But for the web, if you’re running a web server you probably want visitors, and tarpits would be harder to set up to catch only bots.
I see your point, but I think you underestimate the skill of coders. You make sure your timeout includes JavaScript run time; maybe set a memory limit too. Imagine you wanted to scrape the internet: you could solve all these tarpits. Any capable coder could. Now imagine a team of 20 of the best coders money can buy, each paid 500.000€. They can certainly do the same.
Like I see the appeal of running a tar pit. But like I don’t see how they can “trap” anyone but script kiddies.
You still couldn’t solve tarpits completely. They may hold up the scrapers for less time, but they’ll still be held for the duration of the timeout.
Maybe not with just if statements. But with a heuristic system I bet any site that runs a tar pit will be caught out very quickly.
Nobody is paying software developers 500.000€. It might cost the company that much, but no developers are making that much. The highest software engineer salaries are still in the US, and the average is $120k. High-end salaries are $160k; you might creep up a little more than that, but that’s also location specific. Silicon Valley salaries might be higher, but then, it costs far more to live in that area.
In any case, the question is ROI. If you have to spend $500,000 to address some sites that are being clever about wasting your scrapers’ time, is that data worth it? Are you going to make your $500k back? And you have to keep spending it, because people keep changing tactics and putting in new mechanisms to ruin your business model. Really, the only time this sort of investment makes sense is when you’re breaking into a bank and are going to get a big pay-out in ransomware or outright theft. Getting the contents of my blog is never going to be worth the investment.
Your assumption is that slowly served content is considered not worth scraping. If that’s the case, then it’s easy enough for people to prevent their content from being scraped: put in sufficient delays. This is an actual method for addressing spam: add a delay to each interaction. Even relatively small delays add up and cost spammers money, especially if you run a large email service and do it at scale.
Make the web a little slower. Add a few seconds to each request, on every web site. Humans might notice, but probably not enough to be a big bother, but the impact on data harvesters will be huge.
If you think this isn’t the defense, consider how almost every Cloudflare interaction, and an increasingly large number of other sites, includes a time-wasting front page. They usually say something like “making sure you’re human” with a spinning disk, but really all they need to do is add 10 seconds to each request. If a scraper is trying to index only a million pages a day, and each page adds a 10s delay, that’s nearly 2,800 hours of wasted scraper computer time per day. And they’re trying to scrape far more than a million pages a day; it’s estimated (they don’t reveal the actual number) that Google indexes billions of pages every day.
This is good, though; I’m going to go change the rate limit on my web server; maybe those genius software developers will set a timeout such that they move on before they get any content from my site.
When I worked in the U.S. I was well above $160k.
When you look at leaks you can see $500k or more for principal engineers. Look at Valve’s lawsuit information. https://www.theverge.com/2024/7/13/24197477/valve-employs-few-hundred-people-payroll-redacted
Meta is paying $400k BASE for AI research engineers, with stock options on top, which in my experience is an additional 300% to 600%, vesting over 2 to 4 years. And this is for H1B workers, who traditionally are paid less.
Once you get to principal and staff level engineering positions compensation opens up a lot.
https://h1bdata.info/index.php?em=meta+platforms+inc&job=&city=&year=all+years
ROI does not matter when companies are telling investors that they might be first to AGI. Investors go crazy over this. At least they will until the AI bubble pops.
I support people resisting if they want by setting up tar pits. But it’s a hobby and isn’t really doing much.
The sheer amount of resources going into this is beyond what people think.
That and a competent engineer can probably write something on the BEAM VM that can handle a crap ton of parallel connections. Six figures, maybe? Being slow-walked means low CPU use, which means more green threads.
There’s one I saw that gave the bot a long circular form to fill out or something, I can’t exactly remember
Yeah, that’s a good one.
Bots will blacklist your IP if you make it hostile to bots
This will save you bandwidth
Cool, lots of information provided!
Too bad you can’t post a usage notice that anything scraped to train an AI will be charged and will owe $some-huge-money, then pepper the site with bogus facts, occasionally ask various AIs about the bogus facts, and use that to prove scraping and invoice the AI’s company.
It seems any somewhat easy-to-implement solution gets circumvented by them quickly. Some of the bots do respect robots.txt though, if you explicitly add their self-reported user-agent (but they change it from time to time). This repo has a regularly updated list: https://github.com/ai-robots-txt/ai.robots.txt/
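For the bots that do obey it, a robots.txt along these lines works (agent names taken from that list; it only helps against crawlers that choose to comply):

```text
# robots.txt - only deters crawlers that honor it
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
```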
In my experience, git forges are especially hit hard, and the only real solution I found is to put a login wall in front, which kinda sucks especially for open-source projects you want to self-host.
Oh and recently the mlmym (old reddit) frontend for Lemmy seems to have started attracting AI scraping as well. We had to turn it off on our instance because of that.
In my experience, git forges are especially hit hard
Is that why my Forgejo instance has been hit like crazy twice before…
Why can’t we have nice things. Thank you!

EDIT: Hopefully Photon doesn’t get in their sights as well. Though after using the official Lemmy web UI for a while, I do really like it a lot.
Yeah, Forgejo and Gitea. I think it is partially a problem of insufficient caching on the side of these git forges that makes it especially bad, but in the end that is victim blaming 🫠
Mlmym seems to be the target because it is mostly JavaScript-free and therefore easier to scrape, I think. But the other Lemmy frontends are also not well protected. Lemmy-ui doesn’t even allow you to easily add a custom robots.txt; you have to manually overwrite it in the reverse proxy.
Might be worth patching fail2ban to recognize the scrapers and block them in iptables.
Read access logs and 403 user agents or IPs
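A sketch of what that filter could look like (filter name, bot list, and log path are made up; this assumes the proxy logs in a common-log-style format rather than Caddy’s default JSON):

```ini
# /etc/fail2ban/filter.d/ai-scrapers.conf
[Definition]
failregex = ^<HOST> .*"(?:GPTBot|ClaudeBot|CCBot|Bytespider)
ignoreregex =
```

Then enable a jail in jail.local pointing `filter = ai-scrapers` at the proxy’s access log with `maxretry = 1`, and fail2ban bans on the first matching request.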
That would be extremely tedious. There are hundreds of thousands of scrapers out there.