Rock is my traffic real?

Sadly, at the beginning of this month I was forced to start rate limiting and outright blocking parts of my private infrastructure. Let me explain:

I noticed slow response times on all of my personal projects' APIs. At first glance, the trigger looked like something you'd normally auto-scale in production: traffic was up over 200%. So you let the orchestrator add capacity, tune caches, maybe optimize a few endpoints.

But this wasn’t that.

When I dug into the logs, it became clear that this wasn't organic usage. On some services, like my personal Git server, roughly 94% of all requests were automated. And not the usual mix of search engine crawlers or occasional mass scanners. The bulk of it was scraping every single detail, activity, and other piece of information from my publicly accessible endpoints, with much of the traffic clearly tied to the infrastructure of the big AI companies.
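That 94% figure came from classifying requests by user agent. As a minimal sketch of the idea (the marker list and the Common-Log-Format parsing are assumptions, not my exact tooling, and real bots often spoof browser user agents, so this is a lower bound):

```python
import re
from collections import Counter

# Illustrative markers; GPTBot, CCBot, and ClaudeBot are real crawler UAs,
# but any such list is incomplete by construction.
BOT_MARKERS = ("bot", "crawler", "spider", "gptbot", "ccbot", "claudebot")

# Assumes combined log format: the last quoted field is the user agent.
UA_RE = re.compile(r'"[^"]*" "(?P<ua>[^"]*)"$')

def automated_share(log_lines):
    """Return the fraction of requests whose user agent looks automated."""
    counts = Counter()
    for line in log_lines:
        m = UA_RE.search(line)
        ua = (m.group("ua") if m else "").lower()
        counts["bot" if any(s in ua for s in BOT_MARKERS) else "human"] += 1
    total = sum(counts.values())
    return counts["bot"] / total if total else 0.0

sample = [
    '1.2.3.4 - - [01/05/2025] "GET /repo HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [01/05/2025] "GET /repo HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"',
]
print(automated_share(sample))  # 0.5 on this two-line sample
```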

At that point, you’re not serving many curious users... but you’re feeding AI pipelines.

As a result, I've had to change how I expose things. Without time to implement a verification system, I had to block the routes these scrapers were using.
All my shared code that used to be browsable via the web is no longer available that way. If you want to see it, you'll have to clone the repositories directly. Sorry!

I first tried denying crawling via robots.txt. That works for a little while and gets most of the performance back, but the not-so-nice crawlers simply ignore it, and within hours the hammering started again. Blocking user agents doesn't help much either: they get around it with distributed clients masquerading as generic Chrome instances. So the only real solution is to protect routes, either with third-party solutions or with per-session human verification.
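When user agents can't be trusted, rate limiting falls back on behavior. A common building block is a per-client token bucket; this is a minimal sketch (the key choice, IP vs. /24 prefix vs. session cookie, is exactly what distributed crawlers make hard):

```python
import time

class TokenBucket:
    """Per-client token bucket: refills `rate` tokens/second, bursts up to
    `capacity`. Each request costs one token; empty bucket means deny."""

    def __init__(self, rate=2.0, capacity=10.0):
        self.rate, self.capacity = rate, capacity
        self.state = {}  # client key -> (tokens, last timestamp)

    def allow(self, client, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.state.get(client, (self.capacity, now))
        # Refill proportionally to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        self.state[client] = (tokens - 1.0 if allowed else tokens, now)
        return allowed

# Demo with explicit timestamps: a burst of 5 against capacity 3.
b = TokenBucket(rate=1.0, capacity=3.0)
burst = [b.allow("203.0.113.7", now=0.0) for _ in range(5)]
later = b.allow("203.0.113.7", now=2.0)
print(burst)  # [True, True, True, False, False]
print(later)  # True: two tokens refilled over two seconds
```

In production you'd evict idle keys and share the state across workers (e.g. in Redis), but the accounting is the same.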

It is in line with how people use the internet now. A lot of queries that used to go through search engines now go through LLMs instead.

You can see it on Twitter/X: people increasingly ask AI systems to interpret content for them. Even obvious cases, like distinguishing between AI-generated visuals, (Arma 3) game footage, and real-world events, are now being outsourced to models.

None of this is particularly surprising.

And as the icing on the cake, I asked an LLM to compile a story by reading some sources, because I'm definitely not the only one noticing:


AI Crawlers Are Striking

Across the industry, there’s clear evidence that AI-driven scraping traffic has increased sharply over the past 1–2 years, both in volume and in how aggressively it interacts with systems.

At the edge, providers like Cloudflare report that AI and search crawler traffic grew 18% between May 2024 and May 2025, with peaks of 32% year-over-year growth. More telling is the relative growth of individual agents: GPTBot traffic increased by over 300% year-over-year, indicating that newer AI crawlers are scaling far faster than traditional indexing bots.

What’s different from classic search crawlers is the asymmetry between requests and value returned. Historically, crawling implied discoverability and referral traffic. With LLM crawlers, that relationship is breaking down. Cloudflare data shows that around 80% of AI crawling is for training purposes, not live retrieval. In practical terms, that means high request volume with minimal downstream traffic. In some measured cases, crawlers were fetching thousands to tens of thousands of pages per referral generated.

This pattern shows up clearly in origin infrastructure as well. Wikimedia reports that bandwidth consumption for media downloads increased by 50% since early 2024, attributing much of that growth to automated scraping for AI datasets. Notably, bots account for ~65% of their most resource-intensive traffic, even though they represent a minority of actual user interactions. In other words: the expensive traffic isn’t the human traffic anymore.

From a traffic-shaping perspective, the load characteristics are also different. Fastly observed AI fetchers reaching ~39,000 requests per minute against a single target, which aligns with the bursty, parallelized patterns many of us are now seeing in logs—wide IP distribution, shallow respect for crawl delay, and inconsistent adherence to robots.txt.

Security vendors are seeing the same trend at scale. DataDome reports that LLM crawler activity increased nearly 4× within 2025 alone, while publisher-focused measurements (like TollBit) show both rapid growth in AI bot share and a rising rate of robots.txt bypass attempts.

It’s important to contextualize this within broader automation trends: over 50% of web traffic is now automated according to Imperva. But the key shift is qualitative, not just quantitative. AI crawlers are:

  • More aggressive in parallelization
  • Less aligned with attribution (i.e. fewer referrals)
  • Increasingly resistant to traditional controls

That combination is what turns “background bot traffic” into something that directly impacts latency, bandwidth costs, cache efficiency, and origin stability.

The response from infrastructure providers reflects this shift. In 2025, Cloudflare moved to block unauthorized AI crawlers by default for new zones, effectively acknowledging that the previous equilibrium—crawl in exchange for traffic—no longer holds.

So if your servers suddenly feel “hotter” without a corresponding increase in users, you’re not imagining it. The traffic mix has changed—and it’s changing everywhere.


Sources


Generated Proost,