Legal and Ethical Aspects of Web Scraping

Last edit: Jan 27, 2023

If you're involved with web scraping, you may have wondered whether this kind of data collection is legal. After all, data is arguably the biggest asset of any business right now. The data analytics market is expected to grow at a CAGR of 30.41%, from USD 41.39 billion in 2022 to USD 346.33 billion in 2030.

Data Analytics Market Size, 2021 to 2030 (USD Billion)

Web scrapers have revolutionized e-commerce and how businesses access data and information. In the past, data gathering, market research, product development, and investment opportunity evaluation were incredibly labor-intensive, involving manual queries of disparate sources. Web scrapers allow companies to dramatically simplify the process by automating it: the robots crawl countless websites in a fraction of the time it would take humans. This more efficient model lets companies analyze larger and more varied datasets and gain better insight into industry trends than ever before.

In data science, web scraping is an invaluable tool for machine learning projects. By scraping online content, data scientists can collect training and test data for tasks such as predictive analysis and natural language processing. Access to large and varied datasets helps them find patterns and trends, evaluate how their models perform, and make those models more accurate and efficient.

But there are now legal issues around web scraping because some companies don't like to have their data collected. Company owners worry about things like copyright infringement, fraud, and so on.

To get to the bottom of this issue, we've prepared an article for you with explanations.

So, is web scraping legal? In a nutshell, yes. Collecting publicly available information on the Internet is legal as long as the data collected is not used for any harmful purposes and does not cause damage to the business. Web scraping is just a tool to automate what a person can do manually. The tool itself cannot be legal or illegal, but the use of the tool can.

Problems arise when people ignore the Terms of Service (ToS) and scrape without the permission of the site owner. Although web scraping has no clear laws or conditions for its application, it is covered by many legal provisions. For example:

But despite all these laws, if the scraped data constitutes personal information, it's best to disclaim liability.

Scraping personal data

Personal data is any information relating to an identified or identifiable individual. Fragments of information that, taken together, can lead to the identification of an individual are also personal data.

Examples of personal data

Personal data can include official data about a person, such as:

  • first and last name

  • date of birth

  • phone number

  • address

  • social security number, passport number, national ID number

  • work information

  • email address

  • IP address

  • social media accounts

  • credit card number

  • behavioral data

  • medical records

The collection and use of personal information must meet specific legal criteria. The use of the data must be lawful, and there must be some legal basis for it. If your web scraping involves personal information, you must think carefully about how to comply with these legal principles. But we'll talk about the laws a bit later.

Publicly available personal data 

It may seem that making data publicly available on the Internet implies that the owner is willing to let all users access that information. But it is a mistake to think that only private personal data is protected and that you can freely collect personal information from publicly available sources.

Under the General Data Protection Regulation (GDPR), all personal data is protected, and it doesn't matter where it comes from. Under the California Consumer Privacy Act (CCPA), information made available by the government, such as business data, is considered public and, therefore, not protected.

The California Privacy Rights Act (CPRA), which took effect on January 1, 2023, expands the CCPA's definition of publicly available information. For example, data previously made public by an individual is no longer protected, which essentially allows the scraping of such personal data. But only in California.

Copyrighted content scraping

In general, you cannot copy any copyrighted content without permission or a license. Copyrighted data can be music, news, scientific papers, movies, images and photos, databases, and logos.

However, not all information on the Internet is subject to copyright. For example, copyright does not cover product name data, product descriptions, price data, and sales numbers. So you can safely use our Amazon product scraper.

The U.S. fair use doctrine can allow scrapers to make limited use of copyrighted content. To be considered fair use, the use of extracted data should meet criteria such as the following:

  • the content must be substantially altered from the original content

  • the content must only be used for research or marketing purposes and not republished as proprietary content

What do laws say about the legality of web scraping?

There are stiff penalties under the GDPR and CCPA for illegal personal data collection.

GDPR

The GDPR is a regulation that serves as the basis for data protection law across the EU, replacing the 1995 Data Protection Directive.

The GDPR came into force in 2018 and applies to the use of personal information and data privacy in all EU member states. The GDPR was also created to change how businesses and other organizations can handle the information of those who interact with them. 

There are also several special categories of sensitive personal data that the GDPR gives increased protection. These include information about racial or ethnic origin, political views, religious beliefs, trade union membership, genetic and biometric data, health information, and sexual orientation.

All companies, regardless of location, must comply with GDPR if they collect data from EU residents.

The U.S. Laws

The U.S. has many data protection laws in various states. 

At the federal level, the purpose of the Privacy Act of 1974 is to strike a balance between the government's need to keep information about people and the rights of people to be protected from unwarranted invasions of their privacy arising from the collection, storage, use, and disclosure of personal information.

The California Consumer Privacy Act (CCPA) is the law that governs how businesses around the world are allowed to handle the personal information of California residents. The Governor of California signed the law on June 28, 2018.

The CCPA gives California residents control over their personal information: the right to know what personal information is collected about them and whether it is sold or disclosed, the right to access that information, and the right to opt out of its sale.

There are also other federal laws, such as:

HiQ Labs vs. LinkedIn 

LinkedIn entered into a dispute with data analytics company HiQ Labs. LinkedIn sent an official letter demanding that all scraping activities cease. The letter also mentioned that LinkedIn had blocked HiQ Labs' access to public profiles. In response, HiQ Labs took LinkedIn to court, saying that scraping public data was not illegal.

  • In 2019, the U.S. Court of Appeals for the Ninth Circuit ruled in favor of HiQ, stating that collecting publicly available data was not a violation of the Computer Fraud and Abuse Act (CFAA).

  • In June 2021, the Supreme Court granted LinkedIn's petition for certiorari, vacated the judgment, and remanded the case to the Ninth Circuit for further proceedings.

  • On April 18, 2022, the Ninth Circuit affirmed that scraping public data is not a violation of the CFAA. 

Facebook vs. Power Ventures

The conflict began in 2009 when Facebook sued Power Ventures for extracting customer information and posting it on its site. Facebook claimed that these actions led to violations of the CAN-SPAM Act, CFAA, DMCA, UCL, and copyright infringement.

The court left only three claims - for violations of the CAN-SPAM Act, the CFAA, and the California Penal Code - for a final decision. In the end, the decision was made in favor of Facebook, and the court ordered Power Ventures to pay Facebook the sum of $79,640.50. 

eBay vs. Bidder’s Edge Case

eBay's conflict with Bidder's Edge is further proof of why respectful data extraction should be taken seriously.

In April 1999, eBay allowed Bidder's Edge (BE) to crawl its site for 90 days. During that period, the parties tried to negotiate a formal licensing agreement but could not reach one. After the negotiations failed, eBay asked BE to stop crawling its site, and BE complied. Later, however, BE resumed crawling eBay's data.

  • eBay filed a motion for a preliminary injunction, enjoining BE from continuing to use the software robot to crawl the site without permission. 

  • The court held that eBay had proven the possibility of irreparable harm and that BE's activities were unauthorized.

  • As a result, the court concluded that eBay had proven the damages necessary to satisfy the trespass claim. 

Ryanair v. PR Aviation

Ryanair's dispute with PR Aviation, which reached the Court of Justice of the European Union in 2015, provided insight into how scraping can be interpreted in European courts. Visitors to Ryanair's website are subject to its Terms of Use (ToU), which expressly prohibit data collection. PR Aviation scraped Ryanair's site, and Ryanair sued it in the Netherlands for breach of contract.

The court ruled that there were no intellectual property rights in the information collected, namely Ryanair's database of flight times and prices. Therefore, the company that scraped the web data did not violate Ryanair's intellectual property, because the database was not the result of the creative input required for copyright protection.

How to scrape websites legally

Check before scraping the website

To legally scrape data, you have to do more than just follow the law. There are different kinds of agreements and policies that you should also follow when collecting information online.

Terms of Use

Terms of Use (ToU) are a contractual agreement between a service provider and the user that sets out the rules for using the site or service. They clarify what users may do with their accounts and with the site's products and technology, which also helps protect any personal information stored on the site.

Such agreements come in two forms: browsewrap and clickwrap.

Browsewrap agreements are made when you visit a site. Sometimes they appear inconspicuously at the bottom of the screen or in a drop-down menu. In these cases, they are usually not legally binding. 

Clickwrap agreements require the user to check a box or click a button. Under the button or checkbox will be a written agreement to the website’s Terms and Conditions. Once you agree, the Terms and Conditions become legally binding.  

Robots.txt file

Today, robots.txt is an important tool for website owners and developers, serving as a communication bridge between site owners and automated programs such as web crawlers and search engine bots. The file tells crawlers how to interact with the website: which user agents its rules apply to and which paths they may or may not visit.

For legitimate web scraping, check the rules in robots.txt and follow them carefully. If the Terms of Service or the robots.txt file explicitly prohibit content scraping, you should get permission from the website owner before collecting data.
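As a rough illustration, a robots.txt check can be sketched with Python's standard urllib.robotparser module. The robots.txt content, the bot name "my-bot", and the example.com URLs are made up for this sketch. A real scraper would download the file from the target site instead of using a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the sketch runs offline.
# A real scraper would fetch https://<site>/robots.txt instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
products_ok = is_allowed(parser, "my-bot", "https://example.com/products")
private_ok = is_allowed(parser, "my-bot", "https://example.com/private/data")
delay = parser.crawl_delay("my-bot")  # seconds requested by the site, or None
```

With these sample rules, the products page is allowed, the /private/ path is not, and the site asks for a five-second delay between requests. Passing this check does not replace reading the site's Terms of Use.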

Privacy Policy

A privacy policy is the document that sets forth the rules for collecting and processing users' personal information on a website. It is best to read the privacy policy before using the site or registering, as it explains what data the site collects, why it collects it, and how it is used.

Data Use Agreement

A Data Use Agreement (DUA) is a document required by the privacy policy. It is used to transfer data developed by non-profit, government, or private organizations if the data is not publicly available or has restrictions on use.

Ethics of web scraping

Some things can be done ethically or unethically. And web scraping is one of those things. The ethics of automatic data collection manifests itself differently depending on what stage of the scraping process you are in.

Without ethical standards for web scraping, it can be difficult to distinguish malicious scrapers looking to plagiarize or profit from those who use data lawfully to innovate and analyze the market.

From an ethical point of view, given that web scraping already has many uses and professional suppliers in the marketplace, there is nothing wrong with using scraping for business purposes. However, there are rules to follow if you want to collect data ethically.

In fact, web scrapers provide a major solution for users who require data from websites and services that do not have an API available.

Web Scraping Best Practices

Web scraping is an incredibly useful tool for data collection and analysis, but it needs to be done responsibly. It’s important to remember that the web is a shared resource, and it’s in everyone’s best interest to use it respectfully. The following best practices will help ensure your web scraping activities are ethical and in compliance with the law.

Don’t overburden the target website

When scraping data from a website, proceeding gradually is key. Limiting the number of simultaneous requests helps ensure that the scraping process doesn't affect the experience of human visitors, and keeping a delay between requests helps the scraped site stay open and accessible to everyone. Aggressive scraping can impair the user experience and, in the worst case, have the same effect as a denial-of-service (DoS) attack, crashing the website and making its content inaccessible to others. Taking it slow and scraping during the site's quietest hours helps prevent such problems.
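One simple way to keep a delay between requests is a small throttle object that is called before every request. The sketch below is only an illustration: the Throttle class and the two-second delay are assumptions made for the example, not values recommended by any particular site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay_seconds
        self._clock = clock          # injectable so the class can be tested
        self._sleep = sleep
        self._last_request = None    # time of the previous request, if any

    def wait(self):
        """Sleep if the previous request was too recent; return seconds slept."""
        slept = 0.0
        if self._last_request is not None:
            elapsed = self._clock() - self._last_request
            remaining = self.min_delay - elapsed
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_request = self._clock()
        return slept


# Usage sketch: call throttle.wait() right before each request.
throttle = Throttle(min_delay_seconds=2.0)
# for url in urls:            # 'urls' and 'fetch' are hypothetical names
#     throttle.wait()
#     fetch(url)
```

If the site's robots.txt declares a crawl delay, that value is a natural choice for min_delay_seconds.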

Respect copyrights

Any data collected from the Internet is not yours. When scraping the site, ensure you are not collecting copyrighted data. For more information on copyright issues, it is best to review the Terms and Conditions of the site and the Privacy Policy.

Scrape only the data you need

Scrape only the information you really need and will use in your work. It will minimize the risk of overloading the scraped site with undesirable traffic. Also, you will only get the data you use and will not store useless content in databases.
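In code, this can be as simple as keeping an explicit list of the fields you need and dropping everything else before storing a record. The field names and the sample record below are hypothetical.

```python
# Fields this hypothetical project actually uses; everything else is dropped.
NEEDED_FIELDS = ("name", "price", "currency")


def keep_needed_fields(record, needed=NEEDED_FIELDS):
    """Return a copy of the record containing only the needed fields."""
    return {key: record[key] for key in needed if key in record}


scraped = {
    "name": "Desk lamp",
    "price": 19.99,
    "currency": "USD",
    "seller_email": "seller@example.com",  # personal data we do not need
    "seller_phone": "555-0100",            # personal data we do not need
}
stored = keep_needed_fields(scraped)
```

Using an allow-list rather than a block-list means that any new, unexpected field on the page is left out by default.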

Be polite

Before scraping, it's worth being polite and asking if you can collect this data.

You can identify your web scraper with a legitimate User-Agent string. That way, site owners see a User-Agent that tells them about your activity, its purpose, and the organization behind it. This is how you show respect for the site owner.
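A descriptive User-Agent might be set as in the sketch below, which uses Python's standard urllib.request module. The bot name, info URL, and contact address are placeholders. There is no required format, but including a way to contact you is a common courtesy.

```python
from urllib.request import Request


def build_request(url, bot_name, info_url, contact_email):
    """Create a request whose User-Agent says who is scraping and why."""
    user_agent = f"{bot_name} (+{info_url}; {contact_email})"
    return Request(url, headers={"User-Agent": user_agent})


# All values below are placeholders for illustration only.
request = build_request(
    url="https://example.com/products",
    bot_name="ExampleResearchBot/1.0",
    info_url="https://example.org/bot-info",
    contact_email="scraping@example.org",
)
# The request could then be sent with urllib.request.urlopen(request).
```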

Use specialized web scraping tools

If you're collecting a lot of data, it can be nearly impossible to check the standards of each site individually. It pays to use a specialized tool, such as a web scraping API, to avoid getting into trouble. You can also turn to our specialists, who will take care of extracting the information correctly and develop a scraper specifically for your purposes.

Conclusion

We hope this article has given you some insight into the legality of scraping. For example, web scraping is legal if you collect publicly available data from websites for public use or academic research.

Web scraping is illegal if you scrape sensitive information for profit, for example, by collecting personal information without permission and selling it to third parties. Passing off scraped content as your own is also unethical. 

Web scraping has a great future as a valuable and ethical tool for gathering information and even generating new information online. By respecting other sites' terms of service, following the law, and taking an ethical approach to scraping, you won't have any problems with site owners.

Alexandra Datsenko

I’m a content creator who loves to write about web scraping and techniques for extracting information from websites. Here I will share with you some of my favorite tools, tips, and tricks for data extraction and how to use them in your business.