AI companies and rights holders need to find a way to coexist. Does one exist?

23 August 2024
By Luca Bertuzzi

Generative AI systems such as OpenAI’s ChatGPT and Google’s Gemini have been developed in a legal gray area regarding the use of copyright-protected content for model training, and that gray area will eventually come to an end.

As more and more high-quality data is required to train foundation models that are ever more powerful, AI companies will have to come to terms with rights holders and the limits of web scraping for content. Generative AI is already undermining the business models of some rights holders, and others will need to strike agreements sooner rather than later.

“Both parties are motivated to reach licensing deals. This initial period of mass scraping will have to come to an end sooner or later,” Paul Joseph, an intellectual property lawyer at Linklaters, told MLex.

Converging interests

AI companies might be interested in reaching licensing agreements because they need access to high-quality data to build cutting-edge models. In contrast, rights holders are becoming extremely careful about protecting their content.

For instance, a team of researchers at Imperial College London created a technique dubbed “copyright traps,” which consists of hiding pieces of text in content in order to identify whether that content was used to train an AI model.

Put differently, to remain competitive in the race to build increasingly large and sophisticated AI models, companies such as OpenAI and Anthropic will need to access up-to-date data in an easily digestible way.

Joseph suggests that what we are likely to see is similar to what happened with the music industry in the early days of the Internet. Initially, piracy threatened to undermine the creators’ business model. Then Spotify and other streaming services came along, providing a seamless service that respected copyright with an arguably reasonable subscription model.

For some, however, the music industry might also be a good indicator of how this sort of arrangement can go wrong, since there is a longstanding grievance that record companies monopolize most of the revenues while creators are left with peanuts.

Indeed, AI companies might need to reach licensing agreements with a few major rights holders and then arrange collective agreements with the rest, even though small website publishers probably do not have a business model for licensing their content.

“There are hundreds of actors in the media sector that are too small to have their voice heard and in a weak position to negotiate,” said Marie-Avril Roux-Steinkühler, a lawyer at law firm Mars-IP who is working with press publishers in Germany and France.

According to her, rights holders should organize themselves in collective management societies to better negotiate with AI companies. Still, the legal framework in the EU favors this fragmentation since copyright law is a national regime, and EU countries even have different definitions of copyright-protected work.

Various interests at stake

While AI companies might be able to divide and conquer the rights holders, it is also true that striking a few major deals might cement their market position, creating a significant entry barrier for anyone seeking to challenge the incumbent players.

In turn, reaching a licensing agreement might not just mean extra revenues for rights holders but also a way to gain a competitive advantage over their rivals, because featuring in ChatGPT’s replies might become as important as featuring in Google’s search results.

But the rights holders' different approaches relate not only to their relative weight at the negotiation table but also to whether generative AI is an existential threat. These various stances are well expressed in courtrooms.

Cases such as The New York Times vs. OpenAI serve precisely as leverage to get a better licensing deal. By contrast, some lawsuits are motivated by generative AI completely undermining the plaintiff’s business model, as is the case for Getty Images vs. Stability AI.

These landmark cases contribute to an environment whereby AI companies are incentivized to reach licensing deals since there is a shared understanding that this legal gray area cannot last forever.

“Whatever judgments come out in front of the court, the rulings will not allow all the bulk scraping activities we have seen until now,” Joseph said, adding there are many examples of rights holders quietly reaching licensing agreements.

Public and private enforcement

Discussions on the relationship between generative AI and copyright are ongoing in several forums, but the EU has the most advanced legal framework and the highest potential to influence other jurisdictions.

Earlier this month, EU countries kicked off a reflection on how rights holders could express their reservation rights and whether the current copyright rules are still fit for purpose in the age of generative AI.

Cultural and historical differences exist across the bloc regarding copyright. For example, in most of Southern Europe, the issues are dealt with by the culture ministries, whereas in Northern Europe, economic ministries usually take the lead.

The rights holders’ camp might have lost its most prominent advocate, as France, a country traditionally sensitive to copyright-related issues, is entangled in domestic politics due to parliamentary deadlock stemming from recent elections.

Meanwhile, the European Commission has clarified that the text and data mining exception of the Copyright Directive applies to AI model training, and it is working toward establishing a functioning licensing market.

Initially, AI model providers did not care about the EU copyright regime because copyright law is territorial in scope, and most models are trained outside of Europe. They started paying attention, though, after the AI Act was finalized earlier this year.

The AI law mandates that to make an AI model available on the EU market, model providers must “put in place a policy to comply with Union copyright law, in particular, to identify and comply with the reservation of rights [under the Copyright Directive].”

This legal expedient of making compliance with copyright law a market entry requirement has prompted dismay among IP lawyers, though this approach is standard in EU product safety legislation, on which the AI Act is based.

The EU’s AI rulebook might also help rights holders bring their cases before non-EU courts, since it requires AI companies to publish a sufficiently detailed summary of the data used to train their models.

This summary is meant to help rights holders enforce their rights because it is extremely difficult to know who is scraping one’s content. Thus, bringing transparency to training datasets might also fuel litigation in other jurisdictions.

“Until we get court decisions that are strong enough to make a reference, it's going to be a cautious approach from [the tech companies'] side — they are not just waiting for Europe. They are also waiting to see how the US court cases are developing,” said Marie Sellier, vice president at media giant Vivendi.

In other words, public enforcement in Europe and private litigation in the US are bound to push tech companies and rights holders to find ways to collaborate.

Forced co-existence

While the EU's legislative framework is relatively advanced, technical solutions to operationalize the legal requirements, starting with how rights holders are supposed to express their reservation rights, are still lacking.

Only one generally recognized tool exists at this time: robots.txt, a file based on an exclusion protocol that tells crawlers which URLs they can access on one’s website. This system implies, though, that publishers must opt out of each crawler individually.

While some crawlers are well known, such as OpenAI’s GPTBot, most are not. To date, neither a command to opt out of all AI crawlers nor a repository of crawlers exists. In addition, such a system raises the question of who sets the policy when, for example, an image from a photo agency such as Shutterstock is used in a news article.
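To make the per-crawler limitation concrete, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `is_allowed` helper, the example URL and the crawler name `UnlistedAIBot` are all hypothetical and chosen only for illustration; GPTBot is the one crawler name taken from the text above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a publisher: it opts out of one named
# AI crawler (GPTBot) and leaves the site open to everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/news/story"  # illustrative URL
    # The named crawler is excluded.
    print("GPTBot:", is_allowed(ROBOTS_TXT, "GPTBot", article))
    # A crawler the publisher has never heard of is not excluded.
    print("UnlistedAIBot:", is_allowed(ROBOTS_TXT, "UnlistedAIBot", article))
```

In this sketch, a crawler the publisher does not know by name falls through to the general rule and stays allowed. Blocking it with the wildcard rule instead would also shut out ordinary search engine crawlers, which is why a dedicated opt-out for AI crawlers is being discussed.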

An alternative way might be to develop a hashing system similar to the one used to track child pornography that would tell the AI company whether the content is copyright-protected and requires a license.
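As a rough illustration of the idea, the sketch below registers fingerprints of protected texts and lets a crawler look content up before using it. Everything here is an assumption made for the example: the `RightsRegistry` class, the `fingerprint` helper and the choice of SHA-256 over normalized text are not drawn from any existing scheme. An exact hash of this kind only matches identical copies, so a real system would probably need fingerprints that survive edits, cropping or reformatting.

```python
import hashlib
from typing import Dict, Optional


def fingerprint(text: str) -> str:
    """Hash a text after collapsing whitespace and lowercasing it."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class RightsRegistry:
    """Hypothetical registry mapping content fingerprints to rights holders."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def register(self, text: str, rights_holder: str) -> None:
        """Record that rights_holder reserves rights over text."""
        self._entries[fingerprint(text)] = rights_holder

    def lookup(self, text: str) -> Optional[str]:
        """Return the rights holder if text is registered, else None."""
        return self._entries.get(fingerprint(text))


if __name__ == "__main__":
    registry = RightsRegistry()
    registry.register("An example  protected\nparagraph.", "Example Agency")

    # Same words with different spacing and case still match.
    print(registry.lookup("an example protected paragraph."))
    # Any change to the wording produces a different hash and no match.
    print(registry.lookup("An example protected paragraph, edited."))
```

A lookup of this kind would attach the rights information to the content itself rather than to the website hosting it, which could help with cases such as an agency photo embedded in a news article.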

Rights holders and tech companies are already working on technical solutions, for instance, in the context of the Coalition for Content Provenance and Authenticity, which aims to develop a common approach to content authentication and management.

Technology is only part of the problem, though. “A prerequisite for a functioning licensing market is that AI companies start turning out a profit instead of bleeding money,” Paul Keller, director of policy at the Open Future Foundation, pointed out.

Even the world’s most popular chatbot, ChatGPT, continues to burn money despite its subscription revenues, and its maker OpenAI is reportedly losing $5 billion this year.

If AI companies don’t figure out a sustainable business model, they won’t have any appetite to pay for licensing copyrighted content. As Vivendi’s Sellier put it: “This market will not exist if AI companies do not want to participate.”

Meanwhile, the window for the rights holders to strike licensing agreements might close soon because, in a few years, millions of pieces of AI-generated content might be reused to train future models, with repercussions that are anyone’s guess at this stage.

It seems inevitable, however, that AI companies and rights holders will have to find an uncomfortable way to coexist, as they need each other much more than they would openly admit. As often happens, it is just a matter of time.
