How AI tools are trained is alarming creators. Is copyright law enough of a roadblock to stop them?

The OpenAI logo is seen on a mobile phone in front of a computer screen displaying output from ChatGPT. ©Michael Dwyer/AP Photo, File

When German photographer Robert Kneschke used a website a year ago to see if his copyrighted photographs had been used to train artificial intelligence (AI) tools, he was “stunned and shocked” by how many he found.

At least 20 watermarked images from his portfolio were collected in datasets provided by the German non-profit AI organisation LAION to train machine learning systems like Stable Diffusion.

But when the photographer asked them to remove the photos, lawyers for LAION argued that the database provided access to images publicly available on the Internet and did not save copies of them.

With a trial date set for April 25 in Hamburg, Kneschke’s case will be one of the first trials of its kind in Europe as artists, authors, and companies bring lawsuits against AI companies that trained on content scraped from the Internet.

Experts say, however, that whether these tools may use copyrighted works for training is far from a foregone conclusion, and that courts are likely to decide the question case by case as more lawsuits come forward.

Copyright infringement will be ‘hard to prove’

Generative AI models such as ChatGPT are trained on vast amounts of material and information to learn how to produce human-like content.

“To train a model that generates content is very much like me reading a bunch of John Steinbeck novels and then writing a novel to try to imitate John Steinbeck's style,” said Marc Kaufman, a partner at Rimon Law in Washington DC.

If he writes a novel that tells the same story as one of Steinbeck’s novels, then it might be a copyright infringement, whereas if he imitates the style of writing it likely wouldn’t count as such.

“Imitating a style is really not a copyright violation. Copying a story is a copyright violation,” he said.

German photographer Robert Kneschke, who found his copyrighted photographs in the LAION database. Credit: Mareen Fischinger

Google and OpenAI provided similar arguments for machine learning training late last year in separate responses to the US Copyright Office, as part of its inquiry into AI and copyright.

Google compared training AI tools to a previous court case, stating that “the act of reading a book and learning the facts and ideas within it” would not count as infringement.

OpenAI highlighted that the systems do not save copies and are supposed to generate new content.

“When asked for a response, the model uses its weights to write a new response each time it is asked,” OpenAI’s general counsel argued.

“It does not copy its response from its pre-training data, or access it via a database. Much like a person who has read a book and sets it down, our models do not have access to training information after they have learned from it”.

Kaufman says that copyright infringement should be based “on the output of the AI model” and how similar it is to the original work, which is a view shared by other experts.

He noted, however, that his analogy may assume he bought the novels legitimately, whereas some creators could argue that AI companies illegitimately copied their art or text directly from the Internet.

One problem is that people may not know if their material was used, making it “hard to prove because the protected work isn’t there anymore, it’s just the result of the machine learning that is there,” said Daniel Westman, an independent legal advisor on copyright in Sweden.

Several prominent copyright cases have been filed in the US, with the New York Times notably suing ChatGPT-maker OpenAI for "billions of dollars in statutory and actual damages" for using its articles for training.

OpenAI CEO Sam Altman during a Senate Judiciary Subcommittee on Privacy, Technology and the Law hearing. Patrick Semansky/AP Photo, File

The newspaper’s lawyers argue that the “unlawful use” of its articles will allow AI tools to create content to compete with the New York Times, with the lawsuit including specific examples of “output that recites Times content verbatim”.

But while cases such as this one could hinge on a “fair use” clause in the US that provides certain exceptions for the unlicensed use of works, there are some differences in the European Union that may favour rightsholders.

EU directive allows for ‘opt-out’

While there was an effort to incorporate a “fair use clause” similar to the US into EU law, according to intellectual property expert Marianne Levin from Stockholm University, European countries all came to the table with specific exceptions.

“You have now a sort of smorgasbord in this paragraph [of the EU copyright directive] with a lot of different exceptions,” she explained.

Instead of these exceptions, experts have pointed to a more recent EU directive as a basis for copyright cases related to AI.

A spokesperson for the European Commission told Euronews Next that as a general rule, “the developer or user of the AI should seek the rightsholder’s consent” to use copyright-protected works for training AI systems.

The Commission has previously pointed to a 2019 directive, which has applied in EU states since June 2021 and provides a more specific clause about “text and data mining,” the automated process of combing through large quantities of content to find patterns.

The directive allows for creators such as artists and authors to opt out of their works being used for that purpose.

“Nowadays, a lot of discussion going on in Europe is about if this exception is at all applicable for taking someone else's material and using it for machine learning,” Westman said. He added that at the moment there is no “standardised way to opt-out from every use,” with some experts doubtful about its application.

Internal market commissioner Thierry Breton said this directive did apply to the “AI context” in a response to the European Parliament in March 2023.

If the rightsholders’ consent is necessary, there could be limited licensing of works to train AI tools in Europe.

“There is a natural claim by the artistic, musical society where they say, ‘but now they are using our works, why don't we get paid?’” Levin said.

This was among the concerns highlighted by the writers’ and actors’ strikes in Hollywood last year that shut down the industry for months. Yet the subsequent agreements did not end in an outright ban on using their work for training AI.

"This is a topic we will continue to monitor and, depending on the outcome of the litigation and studies, we can revisit this topic if the courts and copyright office find training requires consent," the actors' guild SAG-AFTRA said on their website.

Content licensing has also already begun, with some organisations entering into agreements with OpenAI, for instance, to use their materials for training. Axel Springer, which owns BILD and WELT, announced such a partnership with the ChatGPT maker in December 2023.

German training database LAION points out that people could help AI companies by using a public domain licence to provide access to their data as well, which they argue would help to democratise and power the industry.

Actor and filmmaker Justine Bateman, right, speaks outside Netflix during a Writers Guild rally. Mark J. Terrill/AP Photo, File

More transparency to come?

Europe’s new AI Act includes a provision on transparency, obliging companies to disclose summaries about “content used for training” machine learning models.

It will become applicable two years after it enters into force following its formal approval in the European Parliament, except for certain prohibitions that come into effect earlier.

Westman said this type of obligation could make things “a little better for the rightsholders,” but that it’s not changing anything on a “fundamental level”.

Jenia Jitsev, a co-founder of LAION, which faces the German photographer’s case in court this year, told Euronews Next in an e-mail that the AI organisation supports transparency for machine learning models, pointing out that they offer the only transparent training datasets.

A European Commission spokesperson added that while the “interplay between AI and copyright is complex and requires constant monitoring,” the current rules are “generally sufficient to deal with the different issues raised”.

They expect that future copyright problems will be determined case by case.

Levin said that, in many ways, makers of AI tools have won by being first, as others play regulatory catch-up.

“They have a certain distance where they have already started, so I think that they have won,” she said.

© Euronews