Web Scraping Without Getting Blocked

Posted on Jul 26, 2022

When building a web scraper, plan for the possibility of being blocked: not every service welcomes scraping of its data.

To reduce bot traffic, site developers use IP address recognition, HTTP request header checks, CAPTCHAs, and other detection methods. These protections can still be bypassed, provided you follow some rules while scraping.

Even if a site sets no restrictions, show respect and do no harm to its pages. Follow the rules outlined in robots.txt, don't scrape during peak hours, limit the requests coming from a single IP address, and set delays between them.
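As a minimal sketch of the robots.txt check, the example below uses Python's standard urllib.robotparser. The robots.txt content, the "my-scraper" agent name, and the example.com URLs are assumptions made for illustration; a real scraper would download the file from the target site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real scraper would fetch it from
# https://<site>/robots.txt, e.g. with RobotFileParser.set_url() and read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-scraper" is an illustrative agent name, not a real product.
allowed = parser.can_fetch("my-scraper", "https://example.com/catalog/")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("my-scraper")  # seconds requested by the site, or None
```

If the site declares a crawl delay, it is reasonable to treat it as the lower bound for the pause between requests.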

Scraper Settings to Avoid Blocks

First, configure the scraper correctly.

Set Request Intervals

The most common mistake when creating web scrapers is using fixed intervals. No human visits a site at strictly fixed intervals 24 hours a day.

Therefore, set a range within which the time between requests varies at random. As a rule, a delay of two seconds or more is better.

Also, don't flip through the pages too fast. Stay on each web page for a while. Imitating user behavior this way reduces the risk of blocking.
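Randomized intervals could be sketched like this. The two-second lower bound comes from the advice above; the six-second upper bound and the polite_crawl and fetch names are assumptions chosen only for the example.

```python
import random
import time


def next_delay(min_seconds=2.0, max_seconds=6.0):
    """Return a random pause length between min_seconds and max_seconds."""
    return random.uniform(min_seconds, max_seconds)


def polite_crawl(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for every URL, pausing a random time between calls.

    fetch is any callable that downloads one page; sleep is injectable
    so the loop can be exercised without actually waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(next_delay())  # no pause before the very first request
        results.append(fetch(url))
    return results
```

Passing the sleep function as a parameter keeps the pacing logic separate from the download code, so either part can be replaced independently.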

Tired of getting blocked while scraping the web?

Try out Web Scraping API with proxy rotation, CAPTCHA bypass, and Javascript rendering.

  • 1,000 Free API Credits
  • No Credit Card Required
  • 30-Day Trial
Try now for free

Get structured data in the format you need!

We offer customized web scraping solutions that can provide any data you need, on time and with no hassle!

  • Regular, custom data delivery
  • Pay after you receive sample dataset
  • A range of output formats
Get a Quote

Set User Agent

The User-Agent header contains information about the client software and the device being used. In other words, it is part of the data the server receives with every visit, and it helps the server identify each visitor. If a client with the same User-Agent makes too many requests, the server may ban it.

Therefore, consider giving the web scraper the ability to periodically switch the User-Agent header to another one chosen at random from a list. This helps to avoid blocking and to continue collecting information.
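One possible way to rotate the header, using only the standard library, is shown below. The User-Agent strings are illustrative samples and the example.com URL is a placeholder; a real list should be kept current with the browser versions actually in use.

```python
import random
from urllib.request import Request

# Illustrative User-Agent strings (assumed values, not an authoritative list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0",
]


def build_request(url):
    """Create a request whose User-Agent is picked at random from the list."""
    return Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})


# The request object can then be passed to urllib.request.urlopen().
request = build_request("https://example.com/")
```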

To view your own User-Agent, open DevTools (F12), go to the Network tab, and look at the request headers of any request.

[Image: Use Real User-Agent]

Set Additional Request Headers

Besides the User-Agent, other headers can give the scraper away. Web scrapers and crawlers often send headers that differ from those sent by real web browsers. It is therefore worth taking the time to set all the headers so that the requests do not look automated.

When a real user browses with a real browser, the "Accept", "Accept-Encoding", "Accept-Language" and "Upgrade-Insecure-Requests" headers are usually filled in as well, so do not forget about them. An example of these fields:

accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
upgrade-insecure-requests: 1
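A minimal sketch of attaching these fields to a request with Python's standard library follows. The header values are copied from the example above; the function name and the example.com URL are assumptions.

```python
from urllib.request import Request

# Header values taken from the example above.
BROWSER_HEADERS = {
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;q=0.9,"
        "image/avif,image/webp,image/apng,*/*;q=0.8,"
        "application/signed-exchange;v=b3;q=0.9"
    ),
    # urllib does not decompress responses by itself: if the server answers
    # with gzip or br, the body has to be decoded by the scraper.
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
}


def with_browser_headers(url):
    """Build a request that carries the browser-like headers."""
    return Request(url, headers=dict(BROWSER_HEADERS))


request = with_browser_headers("https://example.com/")
```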

Set Referer

The Referer header shows the site from which the user came. If you don't know what to put in this field, you can use Google. It can also be any other search engine (Yahoo.com, Bing.com, etc.) or a social media site. For example, it might look like this:

Referer: https://www.google.com/

Set Your Fingerprint Right

Whenever someone connects to a target website, their device sends a request that includes HTTP headers. Together with other properties the browser exposes, they reveal information such as the device's time zone, language, privacy settings, cookies, and more. These values are transmitted each time a site is visited, and in combination they are fairly unique.

For example, a certain combination of such parameters may be unique among approximately 200,000 users. It is therefore worth keeping these values up to date. An alternative is to use third-party scraping services or residential IPs. You can check your own fingerprint with an online fingerprint-testing service.

However, not only the browser fingerprint has to look right, but the TLS fingerprint too. Various sites track TLS and HTTP fingerprints. For example, most scraping libraries use HTTP/1.1, while most browsers use HTTP/2 when available, so requests made over HTTP/1.1 may look suspicious to such sites.

Other Ways to Avoid Blocks

Once all the settings are in place, it's time to move on to the main traps and the rules to follow.

Use Headless Browser

First of all, use a headless browser where possible. Headless browsers make it possible to imitate user behavior, which reduces the risk of blocking. If the browser window gets in the way, it can be hidden so that everything runs in the background.

A headless browser also helps to retrieve data that is loaded with JavaScript or through dynamic AJAX requests. The most common headless browser is headless Chrome, which most browser automation libraries (for example, Selenium) can drive.

[Image: Use Headless Browser]

Headless browsers render style elements such as fonts, layouts, and colors, so they are harder to recognize and distinguish from a real user.

Use Proxy Server

If requests keep coming from the same place at short intervals for a long time, this does not look like a normal user. It looks like a bot. To keep the target website from suspecting anything, you can use a proxy server.

In simple terms, a proxy is an intermediary computer that makes the request to the server on the client's behalf and returns the result to the client. The destination server therefore thinks that the request comes from a completely different place and, consequently, from a completely different user.

[Image: Use proxy]

Proxies can be free or paid. However, it is better not to use free proxies for data scraping: they are extremely slow and unreliable. Ideally, use residential or mobile proxies. In addition, one proxy is not enough. For scraping, it is better to build a whole proxy pool.

It is also very important to keep track of the location of the IP address from which requests are made. If the location does not match what the site expects, the site may simply block the requests. For example, a site that serves local infrastructure is unlikely to be of interest to visitors from abroad. So it is better to use proxies located in the site's own region so as not to arouse suspicion.
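A proxy pool can be rotated in a simple round-robin fashion, as in the sketch below. The addresses are placeholders from the 203.0.113.0/24 documentation range and the ProxyRotator class is a hypothetical helper, not part of any library; real proxies would come from a provider.

```python
from itertools import cycle
from urllib.request import ProxyHandler, build_opener

# Placeholder addresses (documentation range); replace with real proxies.
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


class ProxyRotator:
    """Hand out openers that route traffic through the pool, one proxy at a time."""

    def __init__(self, proxies):
        self._proxies = cycle(proxies)

    def next_opener(self):
        """Return the next proxy address and an opener configured to use it."""
        proxy = next(self._proxies)
        handler = ProxyHandler({"http": proxy, "https": proxy})
        return proxy, build_opener(handler)


rotator = ProxyRotator(PROXY_POOL)
# Typical use: proxy, opener = rotator.next_opener(); opener.open(url)
```

Round-robin is only one option; picking a proxy at random or retiring proxies that start returning errors are equally reasonable policies.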

Use CAPTCHA Solving Services

When there are too many requests, the site may ask the visitor to solve a CAPTCHA to make sure that the request comes from a real user and not a bot. In this case, CAPTCHA solving services can help: for a small fee, they solve the presented CAPTCHA automatically.

Avoid Honeypot Traps

To catch bots, many websites use honeypot traps. In general, a honeypot is a link that is not visible on the rendered page but is present in the HTML source. A scraper that follows such links automatically can be redirected to decoy or blank pages.

In fact, they are easy to spot. Such links carry various "masking" CSS properties, for example "display: none" or "visibility: hidden", or the color of the link is identical to the background of the site.
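The check for masking properties could be sketched with the standard html.parser module. This example only inspects inline style attributes and the hidden attribute; links hidden through a stylesheet or through a color that matches the background would need a rendered page to detect. The class name and the sample markup are assumptions.

```python
from html.parser import HTMLParser

# Inline CSS fragments that hide an element (compared with spaces removed).
HIDDEN_MARKERS = ("display:none", "visibility:hidden")


class VisibleLinkParser(HTMLParser):
    """Collect href values of <a> tags that are not hidden by inline CSS."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        style = (attributes.get("style") or "").replace(" ", "").lower()
        hidden = "hidden" in attributes or any(m in style for m in HIDDEN_MARKERS)
        if href and not hidden:
            self.links.append(href)


def visible_links(html):
    """Return the links a scraper can follow without touching hidden ones."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical page fragment with two honeypot links.
SAMPLE = """
<a href="/products">Products</a>
<a href="/trap-1" style="display: none">secret</a>
<a href="/trap-2" style="visibility:hidden;">secret</a>
<a href="/about">About</a>
"""
```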

Avoid JavaScript

Scraping JavaScript-generated content, like scraping images, does not in itself cause blocking. But note that not all libraries can scrape such data, so a web scraper capable of collecting dynamic data will have more complex code and require more computing power.

Using a Ready-Made API in Web Scrapers

If the listed settings and rules seem like too many, and the costs of proxies and CAPTCHA solving services are too high, you can take an easier route and delegate the interaction with the site to a third-party service.

Scrape-it.Cloud offers a REST API for scraping web pages at any scale. The service takes care of IP blocks, IP rotation, CAPTCHAs, JavaScript rendering, finding and using residential and data center proxies, and setting HTTP headers and custom cookies. The user submits a query, and the API returns the data.

Tips & Tricks for Scraping

The last things worth mentioning are the best time to scrape websites and the use of reverse engineering in scraping. Both matter not only for avoiding blocks, but also for not harming the site.

Scrape During Off-peak Hours

Because crawlers move through pages faster than a real user, they significantly increase the load on the server. If scraping is performed while the server is already under high load, its services slow down and the site loads more slowly.

This not only degrades the experience of real users, but also increases the time required for data collection.

Therefore, collect and extract data when the site load is minimal. It is generally recommended to run the scraper after midnight in the site's local time.

[Image: Scrape at Non-Peak Hours]

Scrape at Different Times of Day

If a site sees heavy load every day from 8:00 am to 8:20 am, that starts to raise suspicion. Therefore, set a random offset so that the scraping time varies from day to day.
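Combining the off-peak advice with a random start time might look like the sketch below. The four-hour window after midnight is an assumed value, and now is expected to be expressed in the site's local time already.

```python
import random
from datetime import datetime, timedelta


def next_run(now, window_start_hour=0, window_hours=4, rng=random):
    """Pick a random start time inside the next off-peak window.

    The window opens at window_start_hour (midnight by default) and lasts
    window_hours. If today's window has already opened, tomorrow's is used.
    """
    start = now.replace(hour=window_start_hour, minute=0, second=0, microsecond=0)
    if start <= now:
        start += timedelta(days=1)
    offset = rng.uniform(0, window_hours * 3600)  # seconds into the window
    return start + timedelta(seconds=offset)


# Example: called in the afternoon, the run lands in the following night.
run_at = next_run(datetime(2022, 7, 26, 15, 30))
```

A scheduler can call next_run once per day and sleep until the returned time, so the scraper never starts at the same minute twice in a row.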

Reverse Engineering for Better Scraping

Reverse engineering is a commonly used development method. In short, it involves studying software applications to understand how they function.

When developing a scraper, this approach means doing a preliminary analysis in order to construct future requests. The browser's developer tools, or simply DevTools (press F12), help to analyze web pages.

Let's take a closer look at the Google SERP. Open DevTools on the Network tab, search for something at google.com, and look at the completed request. To view the response, click on the request and go to the Preview tab:

[Image: Preview tab at Network]

This data helps to understand what exactly the request returns and in what form. The data on the Headers tab helps to understand what must be sent to construct the request. The main thing is to execute requests correctly and interpret the responses correctly.

Reverse Engineering of Mobile Applications

The situation is similar for mobile applications, except that here you need to intercept the requests the mobile application sends to the server. Unlike browser requests, intercepting them requires a man-in-the-middle proxy, such as Charles Proxy.

Also, keep in mind that the requests sent by mobile applications are often more complex and harder to read.

Conclusion and Takeaways

Finally, let's look at the security measures sites can take and the countermeasures that can bypass them.

Security Measure             Countermeasure
Browser fingerprinting       Headless Browser
Storing data in JavaScript   Headless Browser
IP-rate limits               Proxy rotation
TLS fingerprinting           Forge TLS fingerprint
CAPTCHA                      CAPTCHA solving services

By following the simple rules listed above, you can not only avoid blocking, but also significantly increase the efficiency of the scraper.

In addition, when creating a scraper, keep in mind that many sites provide an API for obtaining data. If such an option exists, it is better to use it than to collect data from the site manually.

Valentina Skakun

I'm a technical writer who believes that data parsing can help in getting and analyzing data. I'll explain what parsing is and how to use it.
