Apple, Nvidia, and Anthropic used thousands of unauthorized YouTube videos to train their AI models
An investigation by Proof News has revealed that some of the wealthiest AI companies have been training their models on YouTube content without permission. Despite YouTube's rules against harvesting material from the platform, subtitles from 173,536 videos, siphoned from more than 48,000 channels, were used by Silicon Valley giants such as Anthropic, Nvidia, Apple, and Salesforce.
This dataset, known as YouTube Subtitles, includes video transcripts from educational and online learning channels like Khan Academy, MIT, and Harvard, as well as media giants such as The Wall Street Journal, NPR, and the BBC. Even popular entertainment shows like "The Late Show With Stephen Colbert," "Last Week Tonight With John Oliver," and "Jimmy Kimmel Live" were not spared. Additionally, the dataset encompasses content from YouTube megastars including MrBeast, Marques Brownlee, Jacksepticeye, and PewDiePie, with hundreds of their videos taken for training purposes.
David Pakman, host of "The David Pakman Show," expressed his frustration over the unauthorized use of his content. Pakman’s channel, which boasts over two million subscribers and more than two billion views, had nearly 160 videos included in the dataset. He highlighted the impact on his livelihood and the resources invested in creating his content, stressing that if AI companies are profiting from his work, he should be compensated. Dave Wiskus, CEO of Nebula, echoed these sentiments, calling the practice "theft" and "disrespectful," particularly since AI-generated content could potentially replace artists and creators.
Despite the controversy, EleutherAI, the creators of the dataset, have not commented on the allegations. Their website states their goal is to democratize access to cutting-edge AI technologies by training and releasing models. The YouTube Subtitles dataset, part of a larger compilation called the Pile, does not include video imagery but consists of the plain text of subtitles, often with translations. The Pile also includes material from sources like the European Parliament, English Wikipedia, and emails from Enron Corporation employees.
Notably, the Pile is publicly accessible, allowing anyone with sufficient resources to use it. This openness has made it available to academics and developers outside Big Tech, but it has also been used by major companies such as Apple, Nvidia, and Salesforce to train AI models. Salesforce released one such model for public use in 2022, and it has been downloaded at least 86,000 times. However, Salesforce acknowledged that the dataset contains biases and profanity, raising concerns about potential vulnerabilities and safety issues.
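Because the Pile is distributed as plain data files rather than through any gated API, working with it requires little more than a JSON parser. The sketch below is only an illustration of that openness, assuming the published shard layout (zstandard-compressed JSON-lines files in which each record labels its source subset under meta["pile_set_name"]) and a hypothetical shard filename; it shows how one could pull out just the YouTube Subtitles documents.

```python
# A minimal sketch, not any company's training pipeline: scan one Pile shard
# for YouTube Subtitles documents. Assumes the shard is a zstandard-compressed
# JSON-lines file and that the subset is labeled "YoutubeSubtitles" in each
# record's metadata, as in the published Pile format.
import io
import json

import zstandard  # pip install zstandard


def iter_youtube_subtitle_docs(shard_path: str):
    """Yield the plain-text subtitle documents found in a Pile shard."""
    with open(shard_path, "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            record = json.loads(line)
            # Each record carries its source subset in meta["pile_set_name"].
            if record.get("meta", {}).get("pile_set_name") == "YoutubeSubtitles":
                yield record["text"]


if __name__ == "__main__":
    # "00.jsonl.zst" is a placeholder shard name for illustration.
    for i, doc in enumerate(iter_youtube_subtitle_docs("00.jsonl.zst")):
        print(doc[:200])  # first 200 characters of each transcript
        if i >= 2:
            break
```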
Anthropic, another significant AI player, confirmed using the Pile in its generative AI assistant Claude. They emphasized that YouTube's terms cover direct use of its platform, not the Pile dataset, thus deferring the issue of potential violations to the Pile’s authors. Nvidia declined to comment, and representatives from Apple, Databricks, and Bloomberg did not respond to inquiries.
The use of YouTube content for AI training highlights a broader trend where AI companies compete by acquiring high-quality data, often keeping their sources secret. Earlier this year, The New York Times reported that Google, which owns YouTube, used platform videos to train its models under agreements with creators. Similarly, OpenAI was found to have used YouTube videos without authorization.
Dave Farina, host of "Professor Dave Explains," whose channel had 140 videos used, underscored the need for a conversation about compensation or regulation if AI-generated content could potentially replace human creators. This sentiment is echoed by creators and stakeholders concerned about the unauthorized use of their work.
The YouTube Subtitles dataset, published in 2020, includes subtitles from over 12,000 videos that have since been deleted, raising further ethical questions. The situation is reminiscent of the controversy surrounding another Pile dataset, Books3, which included works by prominent authors without permission, leading to lawsuits against AI companies for alleged copyright violations. These legal battles are ongoing, with questions about permission and payment remaining unresolved.
Creators are increasingly vigilant about unauthorized use of their content, regularly filing takedown notices. However, there is growing concern that AI could soon generate content similar to, or even indistinguishable from, what they produce. David Pakman recounted encountering a TikTok video labeled as a Tucker Carlson clip but realized it was a voice clone of Carlson reading Pakman’s script, highlighting the potential for AI-generated misinformation.
EleutherAI founder Sid Black explained on GitHub that he created YouTube Subtitles using a script to download subtitles from YouTube’s API, similar to how a viewer’s browser downloads them. Despite YouTube’s terms prohibiting automated access, the script has been bookmarked by over 2,000 GitHub users.
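For context on what such a script does, the sketch below shows one way a single video's transcript can be fetched as plain text. It is an illustrative assumption using the third-party youtube-transcript-api package and its classic get_transcript interface, not Black's published code, and running anything like it at scale raises the same terms-of-service questions described above.

```python
# A minimal sketch (not EleutherAI's actual scraper): fetch one video's
# subtitle track as plain text. Assumes the third-party youtube-transcript-api
# package (pip install youtube-transcript-api) and its classic
# get_transcript interface.
from youtube_transcript_api import YouTubeTranscriptApi


def subtitles_as_text(video_id: str, language: str = "en") -> str:
    """Return the subtitle track of one video as a single plain-text string."""
    # Each segment is a dict with "text", "start", and "duration" keys.
    segments = YouTubeTranscriptApi.get_transcript(video_id, languages=[language])
    return " ".join(segment["text"] for segment in segments)


if __name__ == "__main__":
    # "dQw4w9WgXcQ" is a placeholder video ID used purely for illustration.
    print(subtitles_as_text("dQw4w9WgXcQ")[:300])
```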
In conclusion, the unauthorized use of YouTube content by AI companies for training models raises significant ethical and legal issues. Creators like David Pakman and Dave Wiskus demand compensation and recognition for their work, while others call for clearer regulations and protections against such practices. As AI continues to evolve, the balance between innovation and respect for creators' rights remains a contentious and crucial debate.