Over 340 Local News Outlets Block Internet Archive
More than 340 U.S. local news sites restrict Internet Archive access over concerns about AI training data, raising alarms for public archives and local journalism.
A Nieman Lab investigation has found that a rapidly growing number of U.S. local news outlets are blocking the Internet Archive over concerns that AI companies are scraping it for training data. As reported by Slashdot, more than 340 organizations have implemented such blocks, most of them owned by five major local newspaper groups.
Earlier this year, Nieman Lab reported that prominent publishers, including The New York Times, The Guardian, and USA Today Co., had begun blocking the Internet Archive, fearing that AI companies were scraping its repository to train their models. Blocking one of the last non-profit archives poses a significant challenge to the public interest. Four months later, the situation has escalated.
A new analysis by Nieman Lab has confirmed that more than 340 local news outlets across the U.S. are now restricting access to and storage of their articles on the Internet Archive. Many of these sites are owned by five of the seven largest local newspaper groups in the U.S.: USA Today Co., McClatchy, Advance Local, MediaNews Group, and Tribune Publishing. Notably, MediaNews Group and Tribune Publishing are subsidiaries of Alden Global Capital, often criticized as a “vulture hedge fund.”
Unintended Consequences of AI Fears
Publishers argue that the restrictions are necessary to prevent the unauthorized use of their content by AI companies. However, as reported by Techdirt, blocking the Internet Archive severely undermines the organization’s mission to preserve historical records. Local news outlets are often the sole chroniclers of regional history and events. If their content is locked behind paywalls and removed from the Internet Archive, access to past records for journalists, researchers, and the general public becomes increasingly difficult.
Local journalists are among the harshest critics of this trend. In a petition signed by over 200 journalists, B.J. Mendelson, editor of The Monroe Gazette, which covers New York’s Rockland, Sullivan, and Orange counties, voiced his concerns:
“I report from within a vast news desert. This makes me heavily reliant on archive data from defunct or zombified media outlets. Without the Internet Archive, my work becomes extremely challenging.”
A Crisis for Public Archives
This development highlights the complex relationship between publishers and archival institutions in the age of AI. While publishers prioritize protecting the value of their content and preventing exploitation by AI, the Internet Archive is dedicated to preserving human knowledge. The clash of these interests has put publicly accessible knowledge, along with the work of independent journalists and researchers, at risk.
In the U.K., the Competition and Markets Authority (CMA) recently imposed new regulations on Google Search to protect publishers. U.S. publishers, by contrast, are acting on their own, and in doing so are effectively shutting the door on historical records. Ironically, the AI scraping they aim to prevent could still occur through alternative means, even if the Internet Archive strengthens its bot-detection measures. For that reason, critics argue that these blocks are unlikely to be effective.
Internet Archive’s Response
The Internet Archive has acknowledged the concerns raised by local news media and is actively pursuing solutions. In December of last year, the organization partnered with the Poynter Institute and Investigative Reporters and Editors to launch training programs for 33 local and national news organizations. These programs, funded by the Press Forward initiative, aim to teach newsrooms how to develop and implement digital preservation strategies. By the end of 2027, the goal is to train staff at 300 newsrooms on how to use the Internet Archive’s services to preserve their content.
In essence, the Internet Archive seeks not to remain a passive target of these blocks but to foster a mutually beneficial relationship with news organizations by enhancing their own archiving capabilities. This approach prioritizes collaboration over confrontation.
Editorial Perspective
Short-Term Impact: Over the coming months, these blocks are expected to have a significant impact on local journalism. Independent journalists and smaller media outlets operating in so-called “news deserts” will face challenges as they lose access to past articles, an essential resource for investigative reporting. For our readership, which includes engineers and product managers, this may manifest as reduced coverage of the affected sites in the Internet Archive’s APIs and the Wayback Machine. The standoff between publishers and non-profit archives over AI training data may also pose risks to other digital archival institutions, such as Japan’s National Diet Library Web Archive.
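For engineers who want to gauge how a particular page is affected, one option is to ask the Wayback Machine whether it holds a snapshot of it. The following is a minimal sketch, not a definitive tool. It assumes the public availability endpoint at https://archive.org/wayback/available and the JSON shape that endpoint documents (an `archived_snapshots` object containing a `closest` entry). The function names `closest_snapshot` and `lookup` are our own illustrations and are not part of any Internet Archive library.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Public Wayback Machine availability endpoint (assumed reachable and unchanged).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the closest snapshot URL from an availability response, or None.

    The response is expected to look like:
    {"archived_snapshots": {"closest": {"available": true, "url": "...", ...}}}
    An empty "archived_snapshots" object means no snapshot was returned.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, timeout: float = 10.0) -> Optional[str]:
    """Query the availability endpoint for page_url (performs a network request)."""
    query = urllib.parse.urlencode({"url": page_url})
    request_url = f"{AVAILABILITY_ENDPOINT}?{query}"
    with urllib.request.urlopen(request_url, timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Calling `lookup("example.com")` performs a live request and returns either a snapshot URL or `None`. A `None` result only means that no snapshot was returned for that page; on its own it does not show that the publisher is blocking the Internet Archive.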
Long-Term Perspective: Looking ahead one to three years, this issue could serve as a catalyst for redefining the concept of “digital public goods.” While publishers’ desire to control the use of their content for AI training is understandable, the role of non-profit archives like the Internet Archive is critical to preserving humanity’s knowledge base. In the absence of clear legislation, self-regulatory solutions, such as the collaboration between the Internet Archive and the Poynter Institute, may become the norm. However, as the value of historical news articles as AI training data increases, publishers may continue to see their archives as proprietary assets worth protecting. This ongoing tension could prompt discussions on new licensing models or legal frameworks specifically designed to safeguard archival resources.
Editorial Query: Is blocking the Internet Archive truly an effective way for publishers to prevent AI companies from scraping their content? Or does limiting access to past records pose a greater risk of reinforcing biases in AI models and perpetuating historical amnesia in society? Readers, what rights and responsibilities do you think apply to public archives on the web? If Japanese newspapers and broadcasters accelerate efforts to withdraw past articles from digital archives, how should we respond? One potential solution might lie in the use of open-source archiving tools or decentralized storage networks like IPFS. We encourage you to share your thoughts on this important issue.
References
- Slashdot: 340 Local News Outlets Now Blocking the Internet Archive — Published June 5, 2026
- Techdirt: 340 Local News Outlets Now Blocking the Internet Archive — Published June 5, 2026
- Nieman Lab: 340 local news sites blocking Internet Archive — Published June 5, 2026
- U.K. CMA Imposes New Rules on Google Search to Protect Publishers — Archive article on this site
Frequently Asked Questions
- Why are news media outlets blocking the Internet Archive?
- The primary reason is concern that AI companies are scraping archived articles from the Internet Archive without authorization to train their models. Publishers aim to protect their intellectual property and prevent potential revenue loss due to AI exploitation. However, this blocking also restricts public access to archived materials.
- Who is most affected by this block?
- Independent journalists and small media outlets that rely on archived articles for investigative reporting are the most affected. Additionally, historians and the general public lose access to free, non-paywalled archives, raising concerns about the public's right to information—a cornerstone of democracy.
- How is the Internet Archive addressing the issue?
- The Internet Archive is engaging with local news organizations to address their concerns and offer solutions. It has partnered with the Poynter Institute and Investigative Reporters and Editors to train newsrooms on digital preservation strategies. This initiative aims to train 300 newsrooms by the end of 2027, fostering a collaborative rather than adversarial relationship.