
From Data to Dispute: What Can (and Can’t) Be Done in the World of Web Scraping

The development of Artificial Intelligence systems relies heavily on vast amounts of publicly available data, raising increasingly complex legal and ethical questions. The automated collection of digital content has become a routine practice in science, industry, and the media, yet it remains controversial what exactly qualifies as lawful use. Where is the line between data collection and legal infringement, and how can we assess whether a technique is socially acceptable?

Recently, a growing number of lawsuits have challenged the legality of the data used to train artificial intelligence systems, particularly large language models (LLMs). More and more content creators, authors, and publishers claim that some developers have used their work without permission. At the same time, it has become evident that LLMs require an enormous amount of training data, which primarily comes from public internet sources such as blog posts, news articles, forum comments, and Wikipedia entries. A key question, therefore, is which legal and ethical principles should govern data collection, especially for one of the most widely used techniques: web scraping.

Web scraping is an automated method by which computer programs or bots extract structured or semi-structured data from websites without human intervention. Such a tool can, for instance, crawl an entire news website to retrieve all article titles, texts, and their associated links, gather a web shop’s full product list including prices and descriptions, or extract data from statistical websites and organize it into databases for economic modeling purposes. The technique has become widespread in many areas, from linguistic research to economic analysis and investigative journalism. It is used in virtually every industrial and academic field where the large-scale analysis of publicly accessible data is required.
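To make the idea of extracting structured data concrete, the following is a minimal sketch using only Python's standard library. It assumes a hypothetical news page whose headlines are links nested inside `<h2>` tags; the sample HTML, the `HeadlineParser` class, and the `extract_headlines` function are illustrative names, not part of any real site or tool. A real scraper would download the page first, and doing so raises the legal and ethical questions discussed in the rest of this article.

```python
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this HTML instead.
SAMPLE_HTML = """
<html><body>
  <article><h2><a href="/news/1">First headline</a></h2></article>
  <article><h2><a href="/news/2">Second headline</a></h2></article>
</body></html>
"""


class HeadlineParser(HTMLParser):
    """Collects (title, link) pairs from <a> tags nested inside <h2> tags."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False        # True while the parser is inside an <h2>
        self.current_href = None  # href of the <a> currently being read
        self.headlines = []       # accumulated (title, link) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.current_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
        elif tag == "a":
            self.current_href = None

    def handle_data(self, data):
        # Record text only when it sits inside <h2><a href=...>...</a></h2>.
        if self.in_h2 and self.current_href and data.strip():
            self.headlines.append((data.strip(), self.current_href))


def extract_headlines(html):
    """Return a list of (title, link) tuples found in the given HTML string."""
    parser = HeadlineParser()
    parser.feed(html)
    return parser.headlines


if __name__ == "__main__":
    for title, link in extract_headlines(SAMPLE_HTML):
        print(title, link)
```

The same pattern, repeated across every page of a site and written to a database, is what turns a simple parser into a large-scale data collection tool.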

Still, the benefits and the risks of web scraping go hand in hand. This method can serve several socially valuable purposes. It is used by search engines to index internet content, by journalists who uncover abuses of public interest based on open data, by researchers building complex datasets, and by economists analyzing housing markets. In the field of medicine, researchers sometimes analyze data from patient forums to identify the side effects of medications. At the same time, it is important not to overlook harmful practices that raise ethical or legal concerns. These include cases where automated systems collect email addresses for unsolicited marketing campaigns, or where data extracted from personal profiles is used to distribute targeted and manipulative content.

As with many technologies, web scraping is not inherently good or bad; what matters is how it is used. However, assessing its acceptability is often complicated by the fact that web scraping frequently falls into a legal gray area. This is due to the simultaneous application of several different branches of law, such as data protection, copyright, contract law, and competition law, while national regulations and legal interpretations can vary widely across jurisdictions. Moreover, data collection is often technically simple and fast to execute. The process itself can be easy to conceal, and legal consequences often only become clear afterward, on a case-by-case basis.

In addition, most websites have specific terms of use, which often explicitly prohibit automated data extraction. However, the enforcement of these terms is inconsistent and rarely results in immediate legal consequences, since detecting a violation in the first place is far from straightforward. As a result, data collectors frequently operate at the edge of legality, especially when repurposing data originally gathered for one context to serve entirely new goals, such as training AI models.

This uncertain legal landscape is well illustrated by the case of hiQ Labs v. LinkedIn in the United States. hiQ Labs was a company that used publicly available LinkedIn profiles for its own predictive analytics, such as forecasting whether an employee was likely to leave their current job. In 2017, LinkedIn sent a cease-and-desist letter demanding that hiQ stop this activity and threatened legal action, citing violations of the Computer Fraud and Abuse Act (CFAA) and of LinkedIn's terms of service. hiQ sued LinkedIn and obtained a preliminary injunction allowing it to continue scraping. The appellate court later ruled that collecting publicly accessible data does not constitute unauthorized access, as long as the process does not involve bypassing technical barriers such as firewalls. Although the case ultimately ended in an out-of-court settlement, the decision became a key precedent for interpreting the legality of scraping public data. Still, this does not eliminate liability on other legal grounds: breaches of contract, infringement of proprietary interests, or unfair use of data can still be unlawful.

In the European Union, the legal landscape is even more complex. Data protection regulations, copyright directives, and the sui generis database right collectively define the legal framework. In practice, the rules vary from country to country, which contributes to significant legal uncertainty. This is particularly problematic as web scraping continues to spread throughout the digital economy. The GDPR, for instance, imposes strict obligations when processing personal data, and these apply equally to data obtained through scraping. This applies even if the data is publicly accessible, because the nature of personal data does not depend on its availability. Scraping is only lawful if the data controller has a valid legal basis, such as a demonstrable legitimate interest, and is able to document it. If this is not the case, scraping may violate the provisions of the GDPR and lead to sanctions. Copyright law, moreover, restricts not only the copying of original works but also their adaptation and reuse, even when the sources are publicly available.

In addition, the European Union recognizes the so-called sui generis database right, which is independent of copyright and offers separate protection for databases created through substantial investment. This protection is based not on creative content but on the effort and resources required to compile the data, and it can last up to 15 years. As a result, mass copying or extraction of such a database may still constitute an infringement, even if the individual items in the database are not themselves protected by copyright.

The ethical use of this technology clearly requires adherence to certain basic principles. Website terms of use should be respected, even if their legal enforcement is sometimes difficult. It is also important to honor the restrictions indicated by the robots.txt file, which reflects the intent of the website operators, even though its contents are not legally binding. When collecting data, personal information should be avoided, or appropriate anonymization must be ensured. In light of copyright considerations, source attribution is recommended, and the purpose of data collection should always be evaluated in terms of whether it serves a socially beneficial or public interest goal.
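As an illustration of honoring robots.txt, the sketch below uses `urllib.robotparser` from Python's standard library to check whether a URL may be fetched before requesting it. The robots.txt content, the `research-bot` user agent, and the `example.org` URLs are hypothetical. The rules are parsed from a string so the example runs offline; against a real site one would typically point the parser at the site's robots.txt with `set_url()` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example needs no network.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_policy(robots_txt):
    """Parse robots.txt text and return a configured RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    policy = build_policy(SAMPLE_ROBOTS_TXT)
    agent = "research-bot"  # hypothetical user agent name
    for url in ("https://example.org/articles/1",
                "https://example.org/private/data"):
        print(url, "allowed" if may_fetch(policy, agent, url) else "disallowed")
    # The operator's requested pause between requests, if one is declared.
    print("crawl delay:", policy.crawl_delay(agent))
```

A check like this covers only the operator's stated crawling preferences. It does not address terms of use, personal data, or copyright, which have to be assessed separately.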

Thus, the evaluation of web scraping depends not on the technology itself, but on how it is applied. One of the key requirements for developing LLMs is access to large datasets, but this must not come at the expense of privacy, copyright protection, or public trust. A balance must be struck between technological advancement and the protection of individual rights. The information society can only develop sustainably if our decisions are guided by the principles of fairness, autonomy, and justice.


István ÜVEGES, PhD, is a computational linguistics researcher and developer at MONTANA Knowledge Management Ltd. and a researcher at the HUN-REN Centre for Social Sciences. His main interests include the social impacts of Artificial Intelligence (Machine Learning), the nature of Legal Language (legalese), the Plain Language Movement, and sentiment and emotion analysis.