Privacy Concerns in Web Scraping: a GDPR and Serbian Privacy Law Perspective

In both situations, the data sets almost always include personal data. Thus, AI developers should carefully consider their obligations under the GDPR as well as local privacy law, depending on what applies to them.

The million-dollar question

Privacy compliance, among other things, should be considered from the moment training data is collected. The fact that data is publicly available (e.g. data published on YouTube) does not mean that it can be freely used for training. This is a common misconception among AI developers. Training on data sets that include personal data can take place only if the developers have a lawful basis for processing such data. Under the GDPR, this usually comes down to two options: consent or legitimate interest. While satisfying either may appear challenging or even impossible, all AI providers training on personal data should consider privacy concerns very carefully.

The first EU guidance on the lawfulness of web scraping was set out in the report of the EDPB's ChatGPT Taskforce, issued in May 2024. This report indirectly supports innovation and indicates that legitimate interest might be the only possible legal basis for processing data obtained through web scraping, provided that certain safeguards are applied. A couple of months earlier, in March 2024, the UK Information Commissioner's Office (the "ICO") issued a consultation exploring the legality of web scraping. It also concluded that legitimate interest is the only remaining lawful basis for web scraping.

According to both the ICO and the EDPB's report, legitimate interest might be considered a lawful basis for web scraping if the following criteria are met: (i) a legitimate interest exists; (ii) the processing is necessary, with personal data being adequate, relevant and limited to what is required for the purposes for which they are processed; and (iii) the interests are balanced.

Legitimate interest can serve as a lawful basis for data processing only if the interest is clearly defined and justified. Thus, when training AI models on web scraped data, this interest should not be broadly defined or vague. According to the EDPB, it is necessary not only to recognise the interest, but also to concretely justify it in terms of the purpose for which the data is collected. If the intended use of the model cannot be clearly defined in advance, it becomes challenging to justify it.

Web scraping is often considered necessary due to the volume of data required to train these models. However, according to the EDPB, even when large data sets are used, it must be ensured that unnecessary data is not collected, especially data that is not relevant to the specific training purposes. Therefore, the EDPB emphasises the importance of applying measures during data collection and excluding certain types of data from the collection process, such as public social media profiles.

Balancing interests is perhaps the most complex criterion. It is necessary to assess whether the rights and freedoms of individuals outweigh the legitimate interests of the controller. Web scraping is an invisible processing activity: people are often unaware that their data has been collected and processed in this way. As a result, individuals may lose control over their data, which can compromise their privacy rights. Controllers therefore need to apply technical and organisational measures, such as filtering data during collection and excluding certain sources from the process.
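By way of illustration only, the two measures mentioned above (excluding certain sources and filtering data during collection) could be sketched in Python roughly as follows. The domain names, the function names and the regular expressions are assumptions made for this example; they are not taken from the EDPB or ICO materials, and simple pattern matching of this kind catches only the most obvious identifiers. It is a starting point for a technical safeguard, not a substitute for a legitimate interest assessment.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hypothetical exclusion list: sources the controller has decided not to scrape,
# e.g. social media profiles or forums likely to contain special category data.
EXCLUDED_DOMAINS = {"social.example", "health-forum.example"}

# Simplified patterns for obvious identifiers (assumption: illustrative only,
# real personal data is far more varied than e-mail addresses and phone numbers).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def is_excluded(url: str) -> bool:
    """Return True if the URL belongs to an excluded domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def filter_page(url: str, text: str) -> Optional[str]:
    """Skip excluded sources entirely; otherwise return the redacted text."""
    if is_excluded(url):
        return None
    return redact(text)
```

In this sketch, a page from an excluded source is dropped before any of its content is stored, while pages from other sources are redacted before they reach the training data set, so that the filtering happens at collection time rather than afterwards.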

Special approach for special categories of personal data

A particular issue arises with the scraping of special categories of personal data, such as data related to health, political views and religious beliefs. Processing this data requires the explicit consent of the individual, which further complicates the legality of web scraping. Without clear and explicit consent, processing such data may directly violate the GDPR, which strictly demands respect for privacy and individual rights.

One example where this issue arises is search engine scraping. This is what Google engages in when it collects data for the sole purpose of indexing and enabling searches. Unique to search engines, this form of scraping may be considered justified in the context of the public's right to information, as recognised by the Charter of Fundamental Rights of the European Union, but each case must still be carefully evaluated to ensure that the fundamental rights of individuals are not violated. This exception can only be justified with strict protective measures and a clear framework that limits processing to what is necessary to achieve legitimate objectives.

But that's not all

One of the key elements in ensuring GDPR compliance, especially in the context of web scraping, is the obligation to inform individuals whose data is being collected, even when consent is not the basis for processing. Article 13 of the GDPR requires that individuals be informed when their data is collected directly from them. However, when data is collected through web scraping, which often involves gathering data from publicly available sources rather than from the individuals themselves, Article 14 of the GDPR (or Article 24 of the Serbian privacy law) applies. This article governs the obligation to inform individuals about the processing of their data, even when the processing is not immediately apparent or is indirect, as is the case with web scraping.

Depending on the AI product itself, the provider of an AI system might also have other obligations under the GDPR and/or local privacy laws. These obligations include carrying out a legitimate interest assessment (LIA) and a data protection impact assessment (DPIA), possibly with the obligation to obtain prior approval from the competent authority (depending on the AI system itself).

Final remarks

In an era of rapid AI development and widespread digitalisation, the legality of web scraping has become a critical question for AI developers. Despite the potential for innovation that web scraping offers, it is all too often forgotten that every step in this process is deeply rooted in a complex legal framework designed to protect individuals' privacy. Given that the EU AI Act will become applicable for generative AI models within a year in the EU (or three years depending on whether the models were placed on the market before 2 August 2025), or outside of the EU in specific situations, developers collecting data through web scraping should carefully analyse whether their products will be affected by this law. If yes, their products and business operations must be promptly adjusted to reflect these developments.

By Marija Vlajkovic, Partner, and Marija Lukic, Senior Associate, Schoenherr
