Web Scraping Without Getting Blocked

Last edit: May 04, 2023

When building a web scraper, plan for the possibility of being blocked: not every site welcomes having its data scraped.

To reduce bot traffic, site developers rely on IP address recognition, HTTP request header checks, CAPTCHAs, and other detection methods. These measures can still be bypassed, provided you follow some rules while scraping.

But even if the site sets no restrictions, it is worth showing respect and not harming it. Follow the rules outlined in robots.txt, don't scrape data during peak hours, limit the number of requests coming from the same IP address, and set delays between them.
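
As a rough illustration of checking robots.txt before scraping, the sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the bot name "MyScraper" are made up for the example; a real scraper would download the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

AGENT = "MyScraper"  # hypothetical bot name


def build_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

print(rules.can_fetch(AGENT, "https://example.com/public/page"))   # True
print(rules.can_fetch(AGENT, "https://example.com/private/data"))  # False
print(rules.crawl_delay(AGENT))                                    # 5
```

If the site declares a Crawl-delay, it is a sensible lower bound for the pause between requests.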

Settings to Avoid Blocks

First, set up the scraper correctly.

Set Request Intervals

The most common mistake when creating web scrapers is using fixed intervals. No human visits a site at strictly fixed intervals around the clock.

Therefore, define a range within which the time between requests varies. As a rule, it is better to set the delay to two seconds or more.

Also, don't flip through the pages too fast. Stay on the web page for a while. Such imitation of user behavior will reduce the risk of blocking.
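
A minimal sketch of randomized intervals might look like this. The two-second lower bound follows the advice above; the six-second upper bound and the fetch callable are assumptions made for the example.

```python
import random
import time


def random_delay(min_seconds: float = 2.0, max_seconds: float = 6.0) -> float:
    """Return a random pause length between the two bounds."""
    return random.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random time between requests.

    `fetch` is a hypothetical callable that downloads one page.
    `sleep` is injectable so the loop can be tested without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(random_delay())  # no pause is needed before the first request
        results.append(fetch(url))
    return results
```

Passing a custom sleep function makes it easy to log the chosen pauses or to skip them in tests.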

Set User Agent

The User-Agent header identifies the client software making the request: typically the browser, its version, and the operating system. The server receives it with every request and can use it to tell visitors apart. If a client with the same User-Agent makes too many requests, the server may block it.

Therefore, it is worth giving the web scraper the ability to periodically switch the User-Agent header to a random one from a prepared list. This helps avoid blocking and keeps data collection going.

To view your own User-Agent, open DevTools (F12), go to the Network tab, select any request, and look at its request headers.
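
One possible way to rotate the header is sketched below. The User-Agent strings are example values that follow the format real browsers use; they age quickly, so a real list should be refreshed from current browsers. The function names are made up for the illustration.

```python
import random

# Example desktop User-Agent strings (assumed values, likely outdated).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/112.0",
]


def pick_user_agent(rng=random) -> str:
    """Choose one User-Agent at random from the list."""
    return rng.choice(USER_AGENTS)


def headers_with_random_agent(base_headers=None, rng=random) -> dict:
    """Return a copy of the base headers with a randomly chosen User-Agent."""
    headers = dict(base_headers or {})
    headers["User-Agent"] = pick_user_agent(rng)
    return headers
```

Calling headers_with_random_agent() before each request, or before each session, gives every batch of requests a different browser identity.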

Set Additional Request Headers

Besides the User-Agent, other headers can also give the scraper away. Web scrapers and crawlers often send headers that differ from those sent by real web browsers. Therefore, it is worth taking the time to set all the headers so that the requests do not look automated.

When a real user browses the web, the browser also fills in the "Accept", "Accept-Encoding", "Accept-Language", and "Upgrade-Insecure-Requests" headers, so do not forget about them either. An example of these fields:

accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
upgrade-insecure-requests: 1

Set Referer

The Referer header shows the page from which the user arrived. If you don't know what to put in this field, you can use "google.com". Any other search engine (Yahoo.com, Bing.com, etc.) or a social media site works as well. For example, it might look like this:

Referer: https://www.google.com/
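
The browser-like headers and the Referer can be combined into a single request. The sketch below only builds the request object with Python's standard urllib and does not send it; the header values are copied from the examples in this article, and the function names are assumptions for illustration.

```python
import urllib.request

# Header values taken from the examples in this article.
BROWSER_HEADERS = {
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/avif,image/webp,image/apng,*/*;q=0.8,"
        "application/signed-exchange;v=b3;q=0.9"
    ),
    # Only advertise encodings that your HTTP client can actually decode.
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
}


def browser_headers(referer: str = "https://www.google.com/") -> dict:
    """Return a fresh header set with the given Referer added."""
    headers = dict(BROWSER_HEADERS)
    headers["Referer"] = referer
    return headers


def build_request(url: str, referer: str = "https://www.google.com/"):
    """Build (but do not send) a request that carries browser-like headers."""
    return urllib.request.Request(url, headers=browser_headers(referer))
```

The same dictionary can be passed to other HTTP clients that accept a headers mapping.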

Set Your Fingerprint Right

Whenever a client connects to a target website, it sends a request that includes HTTP headers, and scripts on the page can read further details from the browser. Together these reveal information such as the device's time zone, language, privacy settings, cookies, and more. The browser exposes this information on every visit, and the combination is fairly unique.

For example, a certain combination of such parameters may be unique among approximately 200,000 users. Therefore, it is worth keeping these values consistent and up to date. An alternative is to use third-party scraping services or residential IPs. You can check your own fingerprint with an online fingerprint-testing service.

However, the browser fingerprint is not the only one that has to look right: sites also track TLS and HTTP fingerprints. For example, most scrapers use HTTP/1.1, while most browsers use HTTP/2 when it is available. Requests made over HTTP/1.1 can therefore look suspicious to many sites.

Other Ways to Avoid Blocks

Once the settings are in place, it's time to move on to the main traps and the rules to follow.

Use Headless Browser

First of all, use a headless browser where possible. A headless browser is a real browser that runs without a visible window, so it can imitate user behavior while working in the background, which reduces the risk of blocking.

It also makes it possible to collect data that is loaded with JavaScript or AJAX on dynamic web pages. The most common headless browser is Chrome Headless, which most browser automation libraries (for example, Selenium) can drive.

Headless browsers render pages fully, including style elements such as fonts, layouts, and colors, so they are harder to distinguish from a real user.

Use Proxy Server

If requests keep coming from the same place at short intervals for a long time, the behavior looks like a bot rather than a normal user. To keep the target website from becoming suspicious, you can use a proxy server.

In simple terms, a proxy is an intermediary computer that makes a request to the server instead of the client and returns a result to the client. Thus, the destination server thinks that the request is made from a completely different place, and therefore, by a completely different user.

Proxies can be free or paid. However, it is better not to use free proxies for data scraping, since they tend to be extremely slow and unreliable. Ideally, use residential or mobile proxies. In addition, one proxy is not enough: for scraping, it is better to build a whole proxy pool.

It is also very important to keep track of the IP address from which the request is made. If its location does not match what the site expects, the site can simply block the request. For example, a local service is unlikely to attract many foreign visitors. So it's better to scrape such sites through proxies located in the same region so as not to arouse suspicion.
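
A proxy pool can be rotated in several ways. The sketch below shows a simple round-robin rotation with Python's standard library. The proxy addresses are placeholders from a range reserved for documentation (203.0.113.0/24), so they must be replaced with real proxies; the function names are made up for the example.

```python
import itertools
import urllib.request

# Placeholder addresses; replace them with the proxies you actually rent.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_cycle(pool):
    """Return an iterator that yields proxies round-robin, forever."""
    return itertools.cycle(pool)


def opener_for(proxy_url: str) -> urllib.request.OpenerDirector:
    """Build an opener that routes HTTP and HTTPS traffic through one proxy."""
    handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(handler)


proxies = proxy_cycle(PROXY_POOL)
first_proxy = next(proxies)   # "http://203.0.113.10:8080"
second_proxy = next(proxies)  # "http://203.0.113.11:8080"
```

A scraper would call next(proxies) before each request (or each batch) and open the page with opener_for(proxy).open(url). Picking a proxy at random instead of round-robin is an equally reasonable choice.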

Use CAPTCHA Solving Services

When there are too many requests, the site may present a CAPTCHA to make sure that the request comes from a real user and not a bot. In this case, CAPTCHA-solving services can help: for a small fee, they automatically recognize and solve the challenge.

Avoid Honeypot Traps

To catch bots, many websites use honeypot traps. In general, a honeypot is a link that is invisible on the rendered page but present in the HTML source. A scraper that follows such links automatically can be redirected to decoy or blank pages.

In fact, honeypots are usually easy to spot, because such links carry "masking" CSS properties: for example, "display: none", "visibility: hidden", or a link color identical to the site's background.
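
As a simplified illustration, the sketch below collects links from HTML and skips those hidden with an inline "display: none" or "visibility: hidden" style. It relies only on Python's standard html.parser. It is deliberately limited: it does not detect links hidden through external stylesheets, hidden parent elements, or matching colors, which would require the computed styles from a real browser. The sample HTML and the names are made up.

```python
from html.parser import HTMLParser

# Inline-style fragments that hide an element (compared without spaces).
HIDDEN_MARKERS = ("display:none", "visibility:hidden")


def is_hidden(attrs: dict) -> bool:
    """Return True if the inline style hides the element."""
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return any(marker in style for marker in HIDDEN_MARKERS)


class VisibleLinkCollector(HTMLParser):
    """Collect href values of <a> tags that are not hidden inline."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href and not is_hidden(attr_map):
            self.links.append(href)


def visible_links(html: str) -> list:
    collector = VisibleLinkCollector()
    collector.feed(html)
    return collector.links


SAMPLE = """
<a href="/products">Products</a>
<a href="/trap-1" style="display: none">hidden</a>
<a href="/trap-2" style="visibility:hidden">hidden</a>
<a href="/about">About</a>
"""

print(visible_links(SAMPLE))  # ['/products', '/about']
```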

Try Our Ready-Made Solutions for Your Needs

Shopify scraper is the ultimate solution to quickly and easily extract data from any Shopify-powered store without needing any knowledge of coding or markup! All…

Avoid JavaScript

Scraping JavaScript-generated content, like scraping images, does not in itself cause blocking. However, not all libraries can scrape such data, so a web scraper capable of collecting dynamic data will have more complex code and require more computing power.

Using Ready API in Web Scrapers

If the listed settings and rules seem like too much, and the costs of proxies and CAPTCHA-solving services are too high, there is a simpler route: hand the interaction with the site over to a third-party service.

Scrape-it.Cloud offers a REST API for scraping web pages at any scale. The service takes care of IP blocks, IP rotation, CAPTCHAs, JavaScript rendering, finding and using residential or datacenter proxies, and setting HTTP headers and custom cookies. The user sends a query, and the API returns the data.

Tips & Tricks for Scraping

The last things worth mentioning are the best time to scrape websites and the use of reverse engineering in scraping. Both matter not only for avoiding blocks but also for not harming the site.

Scrape During Off-peak Hours

Because crawlers move through pages faster than real users, they significantly increase the load on the server. If scraping runs while the server is already under high load, the site's services slow down and pages load more slowly.

This not only degrades the experience of real users but also increases the time required for data collection.

Therefore, it is worth collecting data at moments of minimal site load. It is generally recommended to run the scraper after midnight in the site's local time.
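
A simple check for the off-peak window might look like the sketch below. It assumes the site's UTC offset is known and treats 00:00 to 06:00 local time as off-peak; the 06:00 end of the window is an assumption, not a figure from this article. A fixed offset also ignores daylight saving time, so a real implementation may prefer named time zones.

```python
from datetime import datetime, timedelta, timezone


def is_off_peak(now_utc: datetime, site_utc_offset_hours: float,
                start_hour: int = 0, end_hour: int = 6) -> bool:
    """Return True if the site's local time falls inside the off-peak window."""
    site_tz = timezone(timedelta(hours=site_utc_offset_hours))
    local_hour = now_utc.astimezone(site_tz).hour
    return start_hour <= local_hour < end_hour


# 06:30 UTC is 01:30 at UTC-5, so this moment is inside the window.
moment = datetime(2023, 5, 4, 6, 30, tzinfo=timezone.utc)
print(is_off_peak(moment, site_utc_offset_hours=-5))  # True
```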

Scrape at Different Day Times

If a site sees a burst of traffic every day from 8:00 am to 8:20 am, it starts to raise suspicion. Therefore, it is worth shifting the start of each scraping run by a random amount.
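
One way to vary the schedule is to add a random offset to a base start time. In the sketch below, the 90-minute maximum shift and the function name are assumptions chosen for the example.

```python
import random
from datetime import datetime, timedelta


def next_start(base: datetime, max_jitter_minutes: int = 90,
               rng=random) -> datetime:
    """Shift the base start time by a random offset in either direction."""
    offset = rng.uniform(-max_jitter_minutes, max_jitter_minutes)
    return base + timedelta(minutes=offset)


base_time = datetime(2023, 5, 4, 1, 0)
start_time = next_start(base_time)  # somewhere between 23:30 and 02:30
```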

Reverse Engineering for Better Scraping

Reverse engineering is a commonly used development method. In short, reverse engineering involves research of software applications to understand how they function.

When developing a scraper, this approach means analyzing the site's traffic first in order to construct requests later. The developer tools in the browser, or simply DevTools (press F12), can help analyze web pages.

Let's take a closer look at the Google SERP. Open DevTools on the Network tab, search for something on google.com, and look at the completed request. To view the response, click the request and go to the Preview tab:

[Figure: the Preview tab in the DevTools Network panel]

This data shows what exactly the request returns and in what form. The data on the Headers tab shows what has to be sent to construct the request. The main thing is to execute requests correctly and interpret the responses correctly.
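
Once the request has been inspected in DevTools, its URL can be rebuilt in code. The sketch below assembles a Google search URL from two query parameters, "q" (the search text) and "hl" (the interface language). Which parameters a real request needs should be taken from what the Network tab shows; the function name is made up for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(query: str, language: str = "en") -> str:
    """Rebuild a search URL from parameters observed in the Network tab."""
    params = {"q": query, "hl": language}
    return "https://www.google.com/search?" + urlencode(params)


url = build_search_url("web scraping")
print(url)  # https://www.google.com/search?q=web+scraping&hl=en

# Parsing the URL back confirms that the parameters are encoded correctly.
print(parse_qs(urlsplit(url).query))  # {'q': ['web scraping'], 'hl': ['en']}
```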

Reverse Engineering of Mobile Applications

The situation is similar when reverse engineering mobile applications. In this case, it is necessary to intercept the requests that the mobile application sends to the server. Unlike intercepting ordinary browser requests, doing this for mobile applications requires a Man-In-The-Middle proxy, such as Charles Proxy.

Also, don't forget that the requests sent by a mobile application are often more complex and harder to read.

Conclusion and Takeaways

Finally, let's take a look at what security measures sites can take and what countermeasures can be taken to bypass them.

Security Measure            | Countermeasure
----------------------------|--------------------------
Browser fingerprinting      | Headless Browser
Storing data in JavaScript  | Headless Browser
IP rate limits              | Proxy rotation
TLS fingerprinting          | Forge TLS fingerprint
CAPTCHA                     | CAPTCHA solving services

By following the simple rules listed above, you can not only avoid blocking but also significantly increase the efficiency of the scraper.

In addition, when creating a scraper, keep in mind that many sites provide an API for obtaining data. If such an API exists, it is better to use it than to collect data from the site manually.

Valentina Skakun

I'm a technical writer who believes that data parsing can help in getting and analyzing data. I'll explain what parsing is and how to use it.