Conversation

The fuck is this.

WHERE ARE MY BOTS?!

I am not used to 100req/sec total. With ai.robots.txt being the top one, asn and faked-browser barely even registering?

Huh.

@algernon
they're gathering up forces for a larger strike /s

@wolf480pl tbh, that is exactly what happened last time I had a lull.

@algernon now heading into part 2 of the Baba and the 'Wei saga, where they realize their chromes are bringing back the same garbage

@zaire

Baba: are you thinking what I'm thinking, 'Wei?
'Wei: that we should scrape the whole wide web? Yes, Baba!
Baba: No, you idiot. We only get garbage. We'll go back to the cage and Vibe!

@Ganneff A little!

For the past year, I never saw my incoming request rate below 100req/sec for longer than half a day (except when I firewalled half the internet off). It's at ~60req/sec for ~23 hours now.

I have not seen anything like this in a year. Never, actually, never since I started monitoring the bots.

I had lulls, yes, but... never this long, and never with Alibaba & Huawei almost completely disappearing. Never with the faked browsers barely registering.

Well, except when I firewalled them all off, but... that doesn't really count. They would've been here, if they could be. Nothing is stopping them now! They just... don't come.

Now if this stayed like this forever, that would be grand. If any AI company wants me to stop working on iocaine: this is how you do it. You stop visiting.

Preferably you take one hard look at my /robots.txt and fuck off forever. Or better yet, just close shop and do something useful, but that's probably too much to ask.

@algernon Conspiracy theory: Alibaba and Huawei actually operate from inside Iran.

@liw The worst part is that I can't immediately refute that.

(I will keep an eye on my metrics and their correlation with the internet situation in Iran...)

@algernon do you have metrics for the total number of requests that aren't blocked? what if they just aren't crawling the iocaine pages at all bc they're detecting iocaine & iocaine can't detect them, so they're still crawling your normal pages

@solonovamax I do, that's the green "default" line. They ain't coming through. Not in large numbers anyway.

@algernon the funniest explanation would be that we all first heard about the big crash/onset of the third winter from watching iocaine logs

@technomancy ROFL.

But... how do I sell the rights to Hollywood so they can make a film of my dashboards... hmm....

In half an hour, the "150" (request/sec) range will disappear off of the charts for the past 24 hours.

Feels so weird.

Is this what success looks like?

What if they just want to deny me data, so I can't do crawler research?

Well, if that's the case, they're in for a surprise. I have way more sources of data than my own sites.

If any of you scrapers are reading this:

  1. Fuck you.
  2. This doesn't mean I want you back.
We're soon entering "80 is the highest number on the Y axis" territory, and this is just unbelievable.

No, they're not getting past my defenses. The green line is the "default" ruleset that lets things through. It's no different from normal.

My firewall doesn't block them. The load on my servers is noticeably lower.

I forgot this feeling of calm.

Yes, ai.robots.txt is still blocking ~60 requests / sec, but that's such a tiny amount compared to what I normally get hit with.

I wouldn't mind that going away either, mind you.

@algernon I doubt it is, but it would be cool if this was some early warning to interesting news. (wistfully hoping to see they just shut it down) (not gonna happen, but i can wish)

...and I have the core idea for the next Baba and the 'Wei episode. Coming to a blog near you.... proooooobably tonight.

@algernon Maybe someone has finished gathering data for a training run?

@datarama they pretty much all disappeared. Not just a single distributed crawler, but... like, all of the disguising ones are gone.

@algernon The other obvious explanation, namely that someone put your site on an ignore-list because they realized they were getting served garbage, also doesn't make sense then.

@datarama i'm seeing the same thing on a canary domain that doesn't serve garbage, so... probs not an ignore list either.

But: others are still seeing regular crawler assault. So this retreat feels staggered at best.

@algernon Is your canary domain on the same IP as the wonderful garbage spout?

@datarama No. Different IP, different hoster, different country, different domain, and neither my name nor any reference to me appears anywhere near it. Entirely different software stack too (OpenBSD + httpd, no iocaine).

(It's not even registered in my name, I just control it.)

Oops! I just had a minor ASN blip! Baba and the 'Wei came back for half an hour.

MY PETS ARE ALIVE!

So... about that new Baba and the 'Wei story. Its title's gonna be: "Baba and the 'Wei do Hollywood".

There might be spicy scenes.

@algernon I still get a lot of requests (almost 700 reqs/m, if I calculate correctly). Maybe they specifically excluded your sites?

@jak2k Dunno. I'm seeing the same drop on a canary domain (different ip, different hoster, country, stack, no iocaine, etc), and I've heard others report a drop in activity too.

But I've also heard - and seen - other places still seeing the "normal" amount of absurdly high request rates.

...and it is probably not coming today, because I lost inspiration halfway through.

We are now in the 80s territory. If things stay this way, we'll be in the 60s territory by two in the morning.

Still feels weird... if I firewalled off about a hundred IP ranges, iocaine would be out of a job.

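That kind of range-based blocking can be sketched with Python's standard `ipaddress` module. This is only an illustration of the idea; the networks below are documentation placeholders, not the actual scraper ranges:

```python
import ipaddress

# Hypothetical example ranges -- stand-ins, not the real scraper networks.
BLOCKED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),   # TEST-NET-3, placeholder
    ipaddress.ip_network("198.51.100.0/24"),  # TEST-NET-2, placeholder
]

def is_blocked(addr: str) -> bool:
    """Return True if addr falls inside any blocked range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_RANGES)

print(is_blocked("203.0.113.42"))  # True
print(is_blocked("192.0.2.1"))     # False
```

A real firewall would do this in nftables/pf rather than userspace, but the membership test is the same idea.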
@algernon Can't live with them, can't live without them.

@Ganneff Oh I could definitely live without them. But they'd have to take the rest of the crap with them, that faint yellow area on my stats is still a nuisance!

@algernon I have seen this happen previously. They came back within a few months.

@aaron Yeah, I fully expect them to come back, but... this is the first big lull I'm seeing on my own infra. I've had 8-12 hour "outages" where the total request/sec fell below 100. But it's over 24 hours now.

Wouldn't mind if it stayed that way a little bit longer.

@algernon Ok, honestly, I would have been really surprised if the answer had been no.

@Ganneff That would have been half correct too. Originally, it was piss yellow purely by accident. I wanted to preserve that accident, so made it explicit.

I have enlisted the Bestest Detective I know to aid me in finding my bots. She's keenly observing the scene.

#DogsOfMastodon

@algernon Btw, how load-bearing is the ASN ruleset? Are there many scrapers that match exclusively on it? I've been massively procrastinating on setting up the maxmind db; wonder how important it is.

@KFears So-so. In most cases, the other rules catch them too, but I ordered ASN early, so I could highlight Alibaba & Huawei in stats easier. Usually, there are very few matches. But there are waves sometimes where they piggy-back on real Chromes (it's always Chrome), and even avoid poisoned URLs. Then the ASN ruleset catches them.

That happens rarely, though, so in the grand scheme of things, I would not consider it load bearing. But it is useful.

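The ordering trick described above - putting the ASN rule early so its matches get attributed to it in the stats, even when a later rule would also have caught the request - can be sketched as first-match classification. This is a minimal illustration, not iocaine's actual configuration or API; the rule names and predicates are made up for the example:

```python
# First-match rule chain: the ASN rule is placed early so matches get
# attributed to it in the stats, even though a later rule (e.g. a
# user-agent check) would often catch the same request.
RULES = [
    ("asn",           lambda req: req.get("asn") in {"alibaba", "huawei"}),
    ("ai.robots.txt", lambda req: "GPTBot" in req.get("ua", "")),
    ("faked-browser", lambda req: req.get("ua", "").startswith("Mozilla/")
                                  and not req.get("real_chrome", False)),
]

def classify(req: dict) -> str:
    """Return the name of the first matching rule, or 'default'."""
    for name, matches in RULES:
        if matches(req):
            return name
    return "default"

print(classify({"asn": "alibaba", "ua": "Mozilla/5.0"}))  # "asn"
print(classify({"ua": "GPTBot/1.0"}))                      # "ai.robots.txt"
print(classify({"ua": "curl/8.0"}))                        # "default"
```

Reordering the list changes only which rule gets credited, not what gets blocked - which is why the ASN rule isn't load-bearing but is still useful for the charts.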
A day later, and my bots are still gone. Baba and the 'Wei had a 40 minute spike, but they have barely crossed the 60 request/sec line.

Not going to lie, I'm enjoying this calmness.

@algernon I'm weirded out by the bots being gone.

I wonder what's up.

@datarama Me too, because they're only gone from some sites. Other people, and other sites of mine still see them.

Maybe a particular new model is done collecting, and now that they're filtering, they realized I only serve them garbage and they're not getting through, so they put me on a blocklist. We'll see if that's the case if they stay away. I fully expect them to come back, though.

This is something I will never know for sure, and even if I had a chance of figuring out, I wouldn't spend effort on it. Not worth it. My time is better spent enjoying the quiet!

@algernon @datarama it’s possible they thought the return on investment was not there

I suspect large codebases/forges will still get hammered until the bubble pops, because it’s worth it to try and bypass defenses like iocaine with real browsers

@Byte @datarama If that's the case, if they're really gone, that'd be fantastic. Then the method works!

(Luckily, one of the larger GitHub alternatives is also behind iocaine + Nam-Shub of Enki, so... hopefully this'll pan out similarly for them too in the longer run!)

@Byte @algernon Yes, but *all of them*? These goons don't strike me as the most cooperatively-inclined bunch.

But our gracious host understandably prefers to enjoy the peace and quiet, so I won't speculate further here.

@datarama @Byte Well, the Usual Suspects¹ that don't try to hide are still here. "Only" the worst ones that try to hide are gone - and they're not completely gone either, just fell from ~200 req/sec to ~3-4 req/sec.


  1. Anthropic, OpenAI, Meta, Google, etc ↩︎

@datarama @algernon do the usual suspects who don’t try to hide still ignore robots.txt?

@Byte @datarama Of course they do.

There are some, like Google, who do a bit of performance art to try and prove they respect it, but that's just that, performance art. For all intents and purposes they still ignore it.

(With that said, my data is half a year old. I serve garbage to them at /robots.txt too, because why bother telling them to fuck off when they're gonna ignore it anyway.)

@Byte "We'll respect robots.txt when directly crawling, but we'll crawl your site if anything links to it. Including our index, or your own site."

I mean, Google's various bots hit my sites with ~9k requests a day, even though I have x-robots-tag: noindex, nofollow, nosnippet, noimageindex, noarchive, nocache, notranslate in every response. They can't access my /robots.txt¹, but even if they could, they still hit me with the same amount of requests.

Now, 9k hits isn't much, but it's about 8.9k more than it should be. They also load resources found in the HTML, despite the x-robots-tag header, so they barely even respect that.


  1. This is my robots.txt ↩︎

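For reference, this is roughly how a well-behaved client would interpret that x-robots-tag header - a minimal sketch of directive parsing, not any real crawler's logic:

```python
def robots_directives(header_value: str) -> set:
    """Split an X-Robots-Tag header value into individual directives."""
    return {part.strip().lower() for part in header_value.split(",") if part.strip()}

def may_index(header_value: str) -> bool:
    """A polite crawler must not index when 'noindex' (or 'none') is present."""
    return not ({"noindex", "none"} & robots_directives(header_value))

hdr = "noindex, nofollow, nosnippet, noimageindex, noarchive, nocache, notranslate"
print(may_index(hdr))   # False
print(may_index("all")) # True
```

The point of the complaint above is that this check takes a handful of lines, and the bots still don't do it.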
Baba and the 'Wei are showing signs of life. They had a 6-hour scraping episode between ~20:00 and 02:00 my time, roughly 03:00-09:00 China Standard Time.

I wonder what they will do over the weekend. Will I have my playthings back?

I have some surprises waiting for them.

Also, for some fun facts! You see those green spikes?

That's when y'all boost and star my toots. ~75% of the green line, the requests that pass through, are from fedi software.

A couple of hours later, Baba and the 'Wei ain't back yet. But the weekend wave usually starts around 17 my time, so there's a few hours to go.

I wonder what will happen!

66% Baba and the 'Wei do what they do every weekend: try to scrape the whole wide world
33% This weekend's episode is cancelled.
1.5 hours, and we'll see! I'm camp "try to scrape the whole wide world".

Huh. This is not what I expected.

Baba and the 'Wei are back, but... they're at the rate they crawled at some 12 hours ago, not nearly at the level they crawled last weekend.

Now, I've seen them do this pattern before, coming at me every ~11.5-12.5 hours. It's not unusual.

But it's not their weekend pattern of late.

So it looks like this weekend's episode is cancelled. We're not going to be without any Baba and the 'Wei scraping, but instead of a weekend episode, we're gonna have some old re-runs.

I'm seeing a very tiny uptick from Baba, but it's staying below the ai.robots.txt line. As if it was reading my charts.

Maybe that could be another story in the "algernon presents Baba and the 'Wei" series?

Baba's gone. It did its ~6-hour scraping, like ~16 hours earlier, and then it left. Its crawling speed never exceeded that of ai.robots.txt.
