In preparing to write this article, I visited the website interface of a well-known artificial intelligence (AI) chatbot and asked, ‘Do you harvest the intellectual property of other people to answer questions?’ To paraphrase the answer, I was assured that this chatbot does not actively access or harvest other people’s intellectual property (IP) and that it generates responses based on patterns learned from licensed, created and publicly available data. Interestingly, the chatbot concluded by saying, ‘My outputs may resemble existing works, but I’m designed to avoid reproducing copyrighted material verbatim.’ Clearly, it is a question that has been asked before.

In recent years, AI has revolutionised the way we create, communicate, and consume content. From generating realistic images to writing essays and composing music, AI models are now capable of producing outputs that closely mirror human creativity. Behind these seemingly magical capabilities lies a complex, controversial, and unpalatable reality: the massive datasets that power these models are frequently harvested from third-party IP without explicit permission or compensation.

Methods of harvesting IP

AI models, particularly large language models (LLMs) and generative image models, require immense quantities of data, from which they learn statistical patterns that can then be used to generate new content. One of the most common methods for obtaining this data is web scraping, in which companies systematically collect publicly available material such as blog posts, news articles and stock images and feed it into model training.
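
In practice, the core of a scraper can be remarkably simple. The following sketch, written in Python using only the standard library, shows the basic pattern of fetching a page and stripping out its text; the URL is hypothetical, and real pipelines run this logic across millions of pages in parallel.

    import urllib.request
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collects the text nodes of an HTML page, discarding the markup."""
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            if data.strip():
                self.chunks.append(data.strip())

    # Hypothetical target page, for illustration only.
    url = 'https://example.com/blog/some-article'
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8', errors='ignore')

    parser = TextExtractor()
    parser.feed(html)
    text = ' '.join(parser.chunks)
    print(text[:500])  # harvested text, ready to be added to a training corpus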

While much of this content exists on the open internet, scraping often ignores copyright restrictions, terms of service or site-specific rules stored in files called ‘robots.txt’, which tell automated crawlers which parts of a site they must not access. Crucially, compliance with robots.txt is voluntary: the file can be read and then ignored. The sheer volume of content scraped is staggering, often running into billions of text documents or images. Because training datasets are rarely disclosed, for reasons of commercial sensitivity, and scrapers are nearly impossible to detect, it is exceedingly difficult for individual creators to track or control how their work is used.
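
A robots.txt file is nothing more than a plain-text list of rules, and Python’s standard library can test a crawler against it. The sketch below shows what such rules look like and how a compliant crawler would check them; ‘GPTBot’ is one real AI-crawler user agent, but nothing in the file physically stops a scraper that chooses to ignore it.

    from urllib.robotparser import RobotFileParser

    # A site wishing to exclude AI crawlers might publish rules like these
    # at https://example.com/robots.txt:
    #
    #   User-agent: GPTBot
    #   Disallow: /
    #
    #   User-agent: *
    #   Disallow: /private/

    parser = RobotFileParser()
    parser.set_url('https://example.com/robots.txt')
    parser.read()  # fetch and parse the live file

    # A compliant crawler asks first; a non-compliant one simply does not.
    print(parser.can_fetch('GPTBot', 'https://example.com/blog/some-article'))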

To obscure things further, AI companies often obtain data through partnerships with data brokers. Data brokers aggregate public and semi-public information into datasets that can be sold to AI developers, creating an additional layer of opacity between the original creators and the models that learn from their work.

Most AI training datasets are a mix of licensed, public-domain and scraped proprietary content. While licensed or public-domain material raises fewer legal issues, the inclusion of copyrighted work without consent places companies in a legally grey area. From a technical perspective, AI models do not simply copy content; they tokenise it and embed it into complex mathematical representations. Under UK copyright law, style is generally not protected: the Copyright, Designs and Patents Act 1988 protects the specific expression of an idea rather than the idea, or style, itself. Nevertheless, even partial reproduction can amount to unauthorised use of another’s IP, as demonstrated by outputs that closely mimic specific artistic styles or replicate portions of text.
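
To make the tokenise-and-embed point concrete, the toy Python sketch below (which is not the method of any particular model) shows the general idea: text is split into tokens, each token is mapped to an integer ID, and each ID is associated with a vector of numbers. The model trains on those numbers rather than on a stored copy of the source text.

    import random

    sentence = 'the cat sat on the mat'

    # Step 1: tokenise - split the text into units (real systems use subwords).
    tokens = sentence.split()

    # Step 2: map each distinct token to an integer ID.
    vocab = {token: i for i, token in enumerate(dict.fromkeys(tokens))}
    ids = [vocab[t] for t in tokens]

    # Step 3: embed - associate each ID with a vector of numbers.
    # The vectors here are random; in a real model they are learned.
    def embed(token_id, dim=4):
        rng = random.Random(token_id)  # deterministic per token, for the demo
        return [round(rng.uniform(-1, 1), 3) for _ in range(dim)]

    for token, token_id in vocab.items():
        print(token, token_id, embed(token_id))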

The Government’s Data Bill

The UK’s new Data (Use and Access) Act 2025 is unlikely to fully prevent AI companies from scraping online content. The law stops short of introducing an ‘opt-in’ rule requiring rights-holders’ permission before data is used, favouring transparency over outright prohibition. While it includes provisions for disclosing what data has been collected and by whom, these measures rely heavily on voluntary compliance and are difficult to enforce in practice. Many AI developers train their models overseas, creating jurisdictional loopholes that make it hard to apply UK rules (see Getty Images v Stability AI below). Moreover, key amendments that would have imposed stricter restrictions were removed before the Act passed, reflecting industry pressure for flexibility. As a result, while the Act improves visibility and gives creators more grounds to challenge unauthorised scraping, it does not offer a comprehensive barrier to the practice – leaving much of the responsibility on creators to monitor and enforce their rights.

Caselaw developments

Getty Images (US) Inc & Ors v Stability AI Ltd [2025] EWHC 2863 (Ch) was the case many hoped would bring a clear answer when it was heard in the High Court before the summer recess. The judgment was handed down while this article was being written. The High Court dismissed most of Getty’s secondary copyright claim. Interestingly, Getty had dropped its primary claim because the AI training took place outside the UK. The High Court did, however, find very limited trade mark infringement arising from the reproduction of Getty’s watermark by older versions of the Stability model. We should be clear that the High Court did not address the key question of whether the web-scraping of online content, and the subsequent use of that content to train an AI model in the UK, amounts to a primary infringement of copyright or database rights in the UK.

Potential solutions and policy recommendations

Addressing the challenges posed by data harvesting requires a combination of legal, technical and policy solutions. One approach is to implement licensing and compensation mechanisms. Micro-payments for scraped content or collective licensing models could ensure that creators are remunerated for their contributions to AI training datasets. Opt-in systems, under which creators are explicitly credited or paid for their work, were considered during the passage of the Data (Use and Access) Bill and offer another pathway to fairness.

Technical solutions also have a role to play. Data labelling, provenance tracking and watermarking can help monitor how, and more importantly where, content is used in training datasets and generated outputs. These measures provide transparency and accountability, allowing creators to assert their rights or receive credit.
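
As a simple illustration of provenance tracking, a creator – or a collecting society acting for many creators – could register a cryptographic fingerprint of each work and later test whether that exact content appears in a disclosed training dataset. The Python sketch below shows the idea; real provenance and watermarking schemes are considerably more sophisticated, surviving cropping, re-encoding and paraphrasing in ways a plain hash cannot.

    import hashlib

    def fingerprint(content: bytes) -> str:
        """Return a SHA-256 digest: an exact-match fingerprint of a work."""
        return hashlib.sha256(content).hexdigest()

    # The creator registers fingerprints of their works in advance.
    registry = {
        fingerprint(b'My original short story, chapter one...'): 'story-ch1',
        fingerprint(b'My original short story, chapter two...'): 'story-ch2',
    }

    # Later, each document in a disclosed training dataset is checked.
    dataset = [
        b"Someone else's blog post about gardening.",
        b'My original short story, chapter one...',
    ]

    for doc in dataset:
        match = registry.get(fingerprint(doc))
        if match:
            print('Registered work found in dataset:', match)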

Regulatory clarification is crucial as well. We need clear guidance on whether training falls within fair dealing (or, in the US, ‘fair use’) or creates derivative works, to reduce uncertainty for both creators and developers. Industry best practices, such as avoiding scraping paywalled content and respecting terms of service, can complement legal frameworks and help maintain ethical standards.

Balancing innovation with fairness

AI companies have built transformative tools that leverage the creative output of countless human authors, artists and developers. Yet the economic model that underpins these innovations often relies on the uncompensated harvesting of IP, creating ethical dilemmas and economic pressures for creators. While AI has the potential to democratise creation and enhance productivity, unchecked data harvesting risks eroding the very foundations of creative labour. Moving forward, sustainable AI development will require balancing innovation with fairness, ensuring that creators are recognised and rewarded for the content that fuels the next generation of intelligent machines. Without thoughtful management, the engines of AI innovation could run on the unpaid labour of countless human creators, undermining the diversity and vitality of the creative economy. 


Getty Images lost its legal battle against Stability AI in the High Court, having withdrawn its primary copyright infringement claim because it could not show that the AI training took place in the UK. The judge found that Stability had infringed Getty’s trade marks in a limited number of instances. Getty is pursuing a parallel case against Stability in the US, where the model was trained.