The Atlantic Publishes Searchable Database of Music Used for AI Training — Over 12 Million Songs Included

The Atlantic has released a searchable database of four music datasets used to train AI models. The largest dataset contains up to 12 million songs, used without permission, raising copyright concerns.

Reviewed & edited by the SINGULISM Editorial Team

Alex Reisner, a journalist at The Atlantic, has identified four music datasets used to train AI models and published them as a publicly searchable database. According to a report by The Verge on June 20, 2026, two of the datasets are massive in scale, containing approximately 12 million and 9 million songs respectively, while the remaining two each exceed 100,000 songs.

The Reality of the Datasets

Reisner’s investigation revealed that these datasets have been downloaded thousands of times, and both Google and Stability AI have acknowledged in research papers that they used them. Some of the music in the datasets, such as tracks from the Free Music Archive, is free for personal use but requires a license for commercial use.

Notably, the way the audio itself is obtained is problematic. Three of the datasets are distributed as collections of links to songs on YouTube and Spotify, and AI developers use automated tools to download the actual audio. Reisner points out that “these tools bypass login screens, ads, and mechanisms that generate revenue or subscribers for creators, violating the platforms’ terms of service.”

The artists included in the database span a wide range, including Lady Gaga, Fred Again.., Radiohead, Aphex Twin, Wu-Tang Clan, Bruce Springsteen, and experimental musician Hainbach. The Atlantic’s dedicated site, “AI Watchdog,” allows users to search not only for songs but also for books and other media to see which AI models they have been used to train.

With the rapid proliferation of AI music generation services, the origin and rights clearance of training data have become serious problems. The music industry has long pushed back against unauthorized use of music by AI companies. In 2024, major record labels took legal action against AI music generation startups.

The publication of this database is significant because it makes the extent of copyright infringement concrete and visible. Even if datasets are “theoretically freely available on the internet,” actually using them means violating platform terms of service and raises rights clearance issues. As noted in a related article on this site about the explosive virtuous cycle of AI inference demand, sourcing and clearing training data is a challenge that affects the sustainability of the entire industry.

Transparency and Regulatory Moves

The European Union’s AI Act requires transparency in training data, mandating disclosure of dataset origins and copyright information. In Japan, the Agency for Cultural Affairs is also formulating guidelines on AI and copyright, with the legality of training data being a key focus.

The Atlantic’s initiative is commendable from a citizen journalism perspective. By providing a means for rights holders to check whether their works have been used to train particular AI models, it enhances transparency and provides a foundation for discussion. Meanwhile, the fact that the datasets themselves continue to be distributed illegally calls for legal measures and cooperation between platforms.

Impact on the Industry

It is interesting that Google and Stability AI acknowledge using these datasets for research. Both companies disclosed their use in research papers, so it is not a case of complete non-disclosure. However, if these datasets are incorporated into commercial models without compensating rights holders, the risk of litigation increases.

The music AI startup Suno reportedly raised an additional $400 million in funding recently. While investors remain bullish on the growth of the AI music market, the impact of copyright risks on corporate valuation cannot be ignored. As dataset transparency increases, AI companies may be forced to select training data carefully and secure licenses.

Editorial Opinion

In the short term, this database release is likely to push rights organizations and record labels to step up lawsuits against AI companies or demand licensing agreements from them. Retrospective rights clearance may become necessary, especially for existing AI models trained on large-scale datasets. The music industry has already taken a tough stance against AI-related copyright infringement, and it is inevitable that this database will be used as concrete evidence.

In the long term, this issue is expected to promote the formation of a licensing market for AI training data. The use of pirated datasets has become normalized, but intermediary services offering rights-cleared datasets, or models in which platforms legally provide data via APIs, may emerge. AI companies will have to decide whether to reduce their legal risk, even if it means higher costs in the short term. The editorial team hopes that this database release will not end as mere “naming and shaming” but will lead to constructive discussion of solutions.

Frequently Asked Questions

What exactly can be checked with this database?
Users can search by artist name or song title to see which AI models' training datasets include a given song. On The Atlantic's "AI Watchdog" site, users can also search for books and other media, not just songs.

Can anyone obtain these datasets?
The dataset links themselves are public, but downloading the actual audio files requires tools that violate YouTube's or Spotify's terms of service. This is the core of the copyright issue.

How will this problem affect the future of music AI?
Legal conflicts between rights holders and AI companies are likely to intensify, but the issue may also drive the formation of a licensing market and greater transparency. In the short term, litigation risks will increase; in the long term, this could lead to a sustainable framework for data use.

Source: The Verge
