In preparing to write this article, I visited the website interface of a well-known artificial intelligence (AI) chatbot and asked, ‘Do you harvest the intellectual property of other people to answer questions?’ To paraphrase the answer, I was assured that this chatbot does not actively access or harvest other people’s intellectual property (IP) and that it generates responses based on patterns learned from licensed data, data created by human trainers, and publicly available data. Interestingly, the chatbot concluded by saying, ‘My outputs may resemble existing works, but I’m designed to avoid reproducing copyrighted material verbatim.’ Clearly, it is a question that has been asked before.
In recent years, AI has revolutionised the way we create, communicate, and consume content. From generating realistic images to writing essays and composing music, AI models are now capable of producing outputs that closely mirror human creativity. Behind these seemingly magical capabilities lies a complex, controversial, and unpalatable reality: the massive datasets that power these models are frequently harvested from third-party IP without explicit permission or compensation.
AI models, particularly large language models (LLMs) and generative image models, require immense quantities of data. They use it to learn patterns which, once learnt, can be used to generate content. One of the most common methods for obtaining data is web scraping, in which companies systematically collect publicly available information such as blog posts, news articles and stock images, which is then fed into model training.
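To make the mechanics concrete, the following is a minimal, illustrative Python sketch of the kind of collection loop a scraping pipeline runs. The URL is hypothetical, and a real pipeline would crawl millions of pages concurrently rather than one at a time.

import requests
from bs4 import BeautifulSoup

# Hypothetical starting page; real pipelines work from crawl lists of millions of URLs.
SEED_URL = "https://example.com/blog/some-article"

def scrape_page(url: str) -> str:
    """Fetch a page and reduce it to plain text suitable for a training corpus."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Remove script and style elements, keeping only the visible text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

text = scrape_page(SEED_URL)
print(text[:500])  # In practice, this text is appended to a corpus of billions of documents.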
While much of this content exists on the open internet, scraping often ignores copyright restrictions, terms of service or site-specific rules published in files called ‘robots.txt’, which tell automated crawlers which parts of a site they may access – rules a scraper is free, in practice, to ignore. The sheer volume of content scraped is staggering, often running into billions of text documents or images. Because training datasets are withheld as commercially sensitive, and scrapers are nearly impossible to detect, it is exceedingly difficult for individual creators to track or control how their work is used.
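To illustrate, Python’s standard library includes a robots.txt parser, and a site hoping to keep AI crawlers out might publish rules like those in the sketch below. GPTBot and CCBot are real crawler user-agents (OpenAI’s and Common Crawl’s respectively); the point the sketch makes is that the file only works if the crawler chooses to consult it.

import urllib.robotparser

# A hypothetical robots.txt blocking two well-known AI crawlers.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks before fetching; a non-compliant one simply never asks.
print(parser.can_fetch("GPTBot", "https://example.com/blog/post"))       # False
print(parser.can_fetch("SomeBrowser", "https://example.com/blog/post"))  # True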
Obscuring things further, AI companies often obtain data through partnerships with data brokers. These brokers aggregate public and semi-public information into datasets that can be sold to AI developers, creating an additional layer of opacity between the original creators and the models that learn from their work.
Most AI training datasets are a mix of licensed, public-domain and scraped proprietary content. While licensed or public-domain material raises fewer legal issues, the inclusion of copyrighted work without consent places companies in a legally grey area. From a technical perspective, AI models do not simply copy content; they tokenise and embed it into complex mathematical representations. Under UK copyright law, style is generally not protected: the Copyright, Designs and Patents Act 1988 protects the specific expression of an idea rather than the idea, or style, itself. Nevertheless, even partial reproduction can amount to unauthorised use of another’s IP, as demonstrated by outputs that closely mimic specific artistic styles or replicate portions of text.
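A toy sketch makes the point about tokenising and embedding: with a made-up six-word vocabulary and random numbers standing in for learned embedding vectors, the text survives only as token IDs and coordinates, not as a stored copy. Real tokenisers use tens of thousands of subword tokens, not whole words.

import random

# Toy vocabulary; a real model uses tens of thousands of subword tokens.
VOCAB = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "<unk>": 5}
EMBED_DIM = 4

# Random vectors stand in here for the embeddings a model learns during training.
random.seed(0)
EMBEDDINGS = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]

def tokenise(text: str) -> list[int]:
    """Map each word to an integer token ID; unknown words map to <unk>."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]

def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up the vector for each token: the text now exists only as numbers."""
    return [EMBEDDINGS[i] for i in token_ids]

ids = tokenise("the cat sat on the mat")
print(ids)            # [0, 1, 2, 3, 0, 4]
print(embed(ids)[0])  # the 4-dimensional vector standing in for 'the'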
The UK’s new Data (Use and Access) Act 2025 is unlikely to fully prevent AI companies from scraping online content. The law stops short of introducing an ‘opt-in’ rule requiring rights-holders’ permission before data is used, favouring transparency over outright prohibition. While it includes provisions for disclosing what data has been collected and by whom, these measures rely heavily on voluntary compliance and are difficult to enforce in practice. Many AI developers train their models overseas, creating jurisdictional loopholes that make it hard to apply UK rules (see Getty Images v Stability AI below). Moreover, key amendments that would have imposed stricter restrictions were removed before the Act passed, reflecting industry pressure for flexibility. As a result, while the Act improves visibility and gives creators more grounds to challenge unauthorised scraping, it does not offer a comprehensive barrier to the practice – leaving much of the responsibility on creators to monitor and enforce their rights.
Many hoped that Getty Images (US) Inc & Ors v Stability AI Ltd [2025] EWHC 2863 (Ch), heard in the High Court before the summer recess, would bring a clear answer. The judgment was handed down while this article was being written. The High Court dismissed most of Getty’s secondary copyright claim; interestingly, Getty had dropped its primary claim because the AI training took place outside the UK. The court did, however, find very limited trade mark infringement, arising from the reproduction of Getty’s watermark by older versions of Stability’s model. We should be clear that the High Court did not address the key question of whether the web-scraping of online content, and the subsequent use of that content to train an AI model in the UK, is a primary infringement of copyright or database rights in the UK.
Addressing the challenges posed by data harvesting requires a combination of legal, technical and policy solutions. One approach is to implement licensing and compensation mechanisms: micro-payments for scraped content or collective licensing models could ensure that creators are remunerated for their contributions to AI training datasets. Opt-in systems, under which creators are explicitly credited or paid for their work and which were initially considered during the passage of the Data Bill, offer another pathway to fairness.
Technical solutions also have a role to play. Data labelling, provenance tracking and watermarking can help monitor how, and more importantly where, content is used in training datasets and generated outputs. These measures provide transparency and accountability, allowing creators to assert their rights or receive credit.
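As a simple illustration of the provenance idea, a registry might fingerprint each work with a cryptographic hash and attach authorship metadata. This is a minimal sketch under assumed names (the creator and URL are hypothetical), not a description of any deployed scheme.

import hashlib
import json
from datetime import datetime, timezone

def fingerprint(content: bytes) -> str:
    """A SHA-256 digest gives a stable identifier for an exact copy of a work."""
    return hashlib.sha256(content).hexdigest()

def provenance_record(content: bytes, creator: str, source_url: str) -> dict:
    """Bundle the fingerprint with authorship metadata for a provenance registry."""
    return {
        "sha256": fingerprint(content),
        "creator": creator,
        "source_url": source_url,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }

record = provenance_record(
    b"Original article text...",                # the work being registered
    creator="A. N. Author",                     # hypothetical creator
    source_url="https://example.com/article",   # hypothetical location
)
print(json.dumps(record, indent=2))

An exact hash only identifies verbatim copies, which is why provenance records are paired with watermarking: detecting transformed or paraphrased reuse needs marks embedded in the content itself.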
Regulatory clarification is crucial as well. We need clear guidance on whether training falls within fair dealing or a text and data mining exception, or instead produces infringing derivative works, to reduce uncertainty for both creators and developers. Industry best practices, such as avoiding scraping paywalled content and respecting terms of service, can complement legal frameworks and help maintain ethical standards.
AI companies have built transformative tools that leverage the creative output of countless human authors, artists and developers. Yet the economic model that underpins these innovations often relies on the uncompensated harvesting of IP, creating ethical dilemmas and economic pressures for creators. While AI has the potential to democratise creation and enhance productivity, unchecked data harvesting risks eroding the very foundations of creative labour. Moving forward, sustainable AI development will require balancing innovation with fairness, ensuring that creators are recognised and rewarded for the content that fuels the next generation of intelligent machines. Without thoughtful management, the engines of AI innovation could run on the unpaid labour of countless human creators, undermining the diversity and vitality of the creative economy.

