Personal Blog's LLM Crawler Countermeasures Spill Over to Feed Readers
The personal blog "Wandering Thoughts" blocked high-frequency crawlers that collect LLM training data, inadvertently affecting feed readers such as Inoreader and Feedly. The case highlights User-Agent spoofing and the difficulty of identifying crawlers.
Operators of personal websites are struggling to counter high-frequency crawlers that collect data for LLM (Large Language Model) training. The blog “Wandering Thoughts,” run by Chris Siebenmann, a system administrator at the University of Toronto in Canada, has drawn industry attention as a practical example of such countermeasures. The blog began blocking requests whose User-Agent strings claim to be old browsers, which has inadvertently affected some feed readers.
Background of the Crawler Issue
Since early 2025, Siebenmann has faced a surge in high-frequency crawlers collecting data for LLM training. These crawlers access the site with User-Agent strings spoofed to look like old browsers, most often older versions of Chrome. For a personally operated blog, the server load caused by such crawlers has become significant.
As a countermeasure, the blog introduced a mechanism that inspects the User-Agent of HTTP requests and denies access to those identified as old browsers. However, this simple blocking method came with unintended side effects.
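The kind of check described above can be sketched in a few lines of Python. This is only an illustration, not the blog's actual implementation: the cutoff `MIN_CHROME_MAJOR` and the function names are assumptions, since the article does not say which versions are treated as old.

```python
import re
from typing import Optional

# Hypothetical cutoff: Chrome major versions below this count as "old".
# The article does not state the threshold the blog really uses.
MIN_CHROME_MAJOR = 120

_CHROME_RE = re.compile(r"\bChrome/(\d+)\.")


def chrome_major(user_agent: str) -> Optional[int]:
    """Return the Chrome major version claimed by a User-Agent, or None."""
    match = _CHROME_RE.search(user_agent)
    return int(match.group(1)) if match else None


def should_block(user_agent: str, min_major: int = MIN_CHROME_MAJOR) -> bool:
    """Block only requests that claim to be a Chrome older than min_major."""
    major = chrome_major(user_agent)
    return major is not None and major < min_major
```

A check like this only looks at what the client claims to be, so it blocks any legitimate service that sends an old Chrome string and passes any crawler that sends a current one.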
Impact on Feed Readers
The most notable impact appeared on readers using RSS feed readers. Both Inoreader and Feedly failed to properly fetch the blog’s feed content, instead displaying a block page.
According to Siebenmann’s investigation, these feed readers send some of their HTTP requests with old browser User-Agent strings. When such a request is blocked, the reader caches the block page instead of the actual feed data and displays it to users.
Feedly was confirmed to access the site with a User-Agent different from the one it uses for normal feed fetches. Inoreader has a similar issue: its feed-fetching agent itself is not blocked, but its requests with a different User-Agent trigger the block.
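On the feed reader side, one way to avoid caching a block page is to validate a response before storing it. The following Python sketch assumes a hypothetical fetcher with a simple dictionary cache; the names `looks_like_feed` and `update_cache` are illustrative and do not describe how Feedly or Inoreader work internally.

```python
import xml.etree.ElementTree as ET

# Root element names of RSS 2.0, Atom, and RSS 1.0 (RDF) feeds.
FEED_ROOTS = {"rss", "feed", "RDF"}


def looks_like_feed(status: int, body: bytes) -> bool:
    """Return True only for a 200 response whose body parses as a feed."""
    if status != 200:
        return False
    try:
        # Note: ElementTree is not hardened against hostile XML; a real
        # fetcher would want a defensive parser here.
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    # Strip a namespace prefix such as "{http://www.w3.org/2005/Atom}".
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in FEED_ROOTS


def update_cache(cache: dict, url: str, status: int, body: bytes) -> bool:
    """Store the body only if it is a feed; otherwise keep the old entry."""
    if looks_like_feed(status, body):
        cache[url] = body
        return True
    return False
```

With this guard, an HTML block page or an error status leaves the previously cached feed in place instead of replacing it.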
Vivaldi Browser User-Agent Problem
This issue extends beyond feed readers. The Chromium-based browser Vivaldi presents itself as Google Chrome in its User-Agent by default. As a workaround while the crawler attack continues, Siebenmann recommends that Vivaldi users change the “User Agent Brand Masking” setting so that the browser identifies itself as Vivaldi.
Tricky Challenge with Archive Sites
Sites such as archive.today, archive.ph, and archive.is (the so-called “archive.*” sites) are also heavily affected by this blocking measure. These sites save snapshots of web pages, but their crawling methods are indistinguishable from those of malicious actors.
Specifically, archive.* crawlers use old Chrome User-Agent values and come from widely dispersed IP address ranges, making them hard to identify. Additionally, some of their IP addresses have reverse DNS entries that falsely claim to belong to Googlebot, a behavior typically seen only from malicious attackers.
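A spoofed reverse DNS entry can usually be detected with a forward-confirmed reverse DNS check: look up the hostname for the IP address, then confirm that the hostname resolves back to the same address. The Python sketch below shows the idea. The function name and the injectable `reverse` and `forward` parameters are assumptions made for illustration and testing, and the domain suffixes are the ones Google documents for Googlebot verification.

```python
import socket
from typing import Callable, List

# Hostname suffixes Google documents for genuine Googlebot addresses.
GOOGLEBOT_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    """PTR lookup: IP address -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> List[str]:
    """A-record lookup: hostname -> IPv4 addresses (IPv4 only)."""
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname.endswith(GOOGLEBOT_SUFFIXES):
        return False
    try:
        # A spoofed PTR record fails here: the real zone does not map
        # the hostname back to the attacker's IP address.
        return ip in forward(hostname)
    except OSError:
        return False
```

The forward lookup is what defeats the spoofing: whoever controls an IP range can set its reverse DNS to any name, but cannot make Google's own zone point that name back at their address.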
Siebenmann recommends archive.org (Wayback Machine) as a more appropriately behaved archiving crawler.
Dilemma of Personal Site Operators
This case highlights a universal dilemma faced by operators of personal or small-scale websites. The increase in crawlers targeting LLM training data imposes significant costs in terms of server load. However, User-Agent-based blocking is a crude method that affects legitimate users and services.
In particular, RSS feed readers are long-standing services used by many users, and blocking them risks losing an important readership for blog operators. The fact that services like Feedly and Inoreader have not properly managed the User-Agent they use when fetching feeds calls for industry-wide improvement.
Limitations of Technical Countermeasures
User-Agent-based blocking is easy to implement, but the User-Agent is just as easy for attackers to spoof. Completely eliminating the risk of accidentally blocking legitimate services, on the other hand, requires more advanced access controls (IP reputation, JavaScript challenges, CAPTCHA, etc.), and demanding such implementations from personally operated sites is not realistic.
Siebenmann has not taken measures such as relaxing the block or whitelisting. Given the load issues and the malicious nature of the crawlers, he finds it necessary to maintain the current blocking policy.
Editorial Opinion
In the short term, User-Agent blocking by personal blogs and small sites is expected to increase. Crawlers for LLM training data are likely to continue increasing, and operators’ countermeasures will also strengthen. However, as this case highlights the impact on feed readers, it may prompt feed reader companies to review their User-Agent management. In particular, Feedly and Inoreader may need to adopt a fixed User-Agent for feed fetching.
In the long term, the effectiveness of blocking methods reliant on User-Agent will likely diminish. LLM training crawlers evolve daily, using realistic browser User-Agents and mimicking more human-like behavior. Consequently, websites will need to introduce mechanisms for more granular content delivery policies (e.g., advanced use of robots.txt, access control via APIs, enhanced terms of service). Additionally, the industry as a whole must establish rules that harmonize LLM training data collection with the rights of web publishers.
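As a small illustration of the robots.txt approach mentioned above, the following Python sketch uses the standard library's `urllib.robotparser` to show how a compliant crawler would interpret a policy that excludes AI training crawlers. The policy itself is a hypothetical example. `GPTBot` and `CCBot` are the tokens published by OpenAI and Common Crawl, and the `/private/` path is made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: exclude two self-identifying AI/data crawlers,
# leave the site open to everyone else except one private path.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return whether a crawler that obeys robots.txt may fetch the path."""
    return parser.can_fetch(agent, path)
```

robots.txt is advisory and matches on the name a crawler gives itself, so it does nothing against crawlers that spoof browser User-Agent strings like the ones described in this article.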
The editorial team views this incident not merely as one personal blog’s struggle, but as a sign of structural tension between the interoperability of the internet and the data demands of the AI industry. The clash between feed readers and blocking is just the tip of the iceberg. How far AI companies are entitled to use public web data, and what measures site operators should take, are questions that will increasingly require industry-wide discussion. The issue is also an opportunity to reconsider the ethical and technical foundations of data collection amid large-scale AI deployments such as the full launch of Apple Intelligence and the Siri AI revamp in iOS 27.
References
- Understanding Embark in GNU Emacs (a bit) and some ‘stupid’ Embark tricks (original article, but inaccessible due to blocking) — Published June 9, 2026
- Comments by the Wandering Thoughts blog administrator (within the original article)
Frequently Asked Questions
- Why does blocking old browser User-Agents affect feed readers?
- Inoreader and Feedly may use old Chrome User-Agents when periodically fetching feeds. When these requests are blocked, they cache the block page content, preventing users from seeing the correct feed data.
- What is the most realistic measure a personal blog operator can take against LLM crawlers?
- Properly configuring robots.txt is the starting point. User-Agent-based blocking is simple but has significant side effects. More realistically, rate limiting or IP reputation filtering through a CDN or WAF can be effective. However, this incurs costs, so small sites may choose to take no measures if the load is acceptable.
- Why is there a difference between archive.org and archive.today?
- archive.org (Wayback Machine) uses clearly managed crawler User-Agents and IP ranges, making it easier to distinguish from malicious crawlers. In contrast, archive.today-type sites behave much like malicious actors, for example by using old User-Agents and spoofed reverse DNS entries, so they are more likely to be blocked under a safety-first approach.
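The rate limiting mentioned in the answers above is normally configured in a CDN or WAF, but the underlying idea can be sketched in Python as a per-IP sliding window. The class name, the limits, and the injectable `clock` parameter are assumptions for illustration only.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_requests per client IP in any window_seconds span."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        # Timestamps of recent accepted requests, keyed by client IP.
        self._hits = defaultdict(deque)

    def allow(self, client_ip: str) -> bool:
        """Record the request and return True if it is within the limit."""
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

For example, `SlidingWindowLimiter(60, 60.0)` would allow each address up to 60 requests per minute. Per-IP limits are less effective against crawlers whose addresses are widely dispersed, as described for the archive.* sites, which is one reason IP reputation data is often used alongside them.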