AI companies rely on premium publishers for training data, new research finds

Cryptopolitan

Nov 9, 2024 8:30 PM

Major technology companies, including OpenAI, Google, Meta, and Anthropic, rely on high-quality, copyrighted material from prominent publishers to train their large language models (LLMs).

This is according to a study by Ziff Davis, the parent company of CNET, IGN, and Mashable, which highlights the essential role that high-quality content plays in training these AI models. The study finds that AI companies favor authoritative sources in their training datasets to improve model performance, yet the contribution of these sources often goes unacknowledged.

In the research, Ziff Davis’ AI attorney, George Wukoson, and Chief Technology Officer Joey Fortuna claimed that AI companies select training data by favoring authoritative websites with high search engine rankings. High-quality, popular websites were chosen because of their good reputation, a strategy that, according to the study, enables AI developers to fine-tune their language models.

Ziff Davis pointed out that content from top-tier providers such as Axel Springer, Future PLC, Hearst, News Corp, and The New York Times, among others, has gone into training datasets. In particular, the study found that 12.04% of OpenWebText2, which was used for the creation of OpenAI’s GPT-3, came from these trusted publishers.

Mark Zuckerberg also weighed in on the ongoing debate surrounding content use in AI training. In a recent interview with The Verge, Zuckerberg acknowledged that data scraping for AI is challenging but also suggested that the content of individual creators or publishers might not be that impactful. He stated, “I think individual creators or publishers tend to overestimate the value of their specific content in the grand scheme of this.”

Publishers file lawsuits against AI companies

The secrecy around training data sources has raised concerns among publishers and consumers alike. The New York Times and The Wall Street Journal recently filed lawsuits against AI companies, alleging that the companies violated copyright law by using the publications’ content.

While OpenAI has pursued content licensing deals with media organizations such as the Financial Times and Dotdash Meredith, several AI firms still operate without proper licensing. The report further states that “major LLM developers no longer disclose their training data as they once did.”

While the valuations of AI companies rise, the gap between technology titans and conventional media companies remains vast. Tech giants such as Google and Meta, which have estimated values of $2.2 trillion and $1.5 trillion, respectively, remain at the forefront of generative AI, while startups such as OpenAI and Anthropic are valued at $157 billion and $40 billion, respectively.

Publishers, on the other hand, are still dealing with layoffs and restructuring, evidence of the financial pressure of adjusting to an environment increasingly defined by AI. Facing competition from user-generated and AI-generated content, numerous publishers are under pressure to cut costs and staff.
