When creating a web scraper, it is worth considering the possibility of being blocked - not all services are friendly to having their data scraped.
To reduce bot traffic, developers use IP address recognition, HTTP request header inspection, CAPTCHAs, and other detection methods. These can still be bypassed, but doing so requires following some rules during scraping.
Even if a site doesn't explicitly forbid scraping, it is worth showing respect and not harming it. Follow the rules outlined in robots.txt, don't scrape data during peak hours, limit requests coming from the same IP address, and set delays between them.
Settings to Avoid Blocks
First of all, configure the scraper correctly.
Set Request Intervals
The most common mistake when creating web scrapers is using fixed intervals. Real users do not access a site at strictly regular intervals 24 hours a day.
Therefore, randomize the time between iterations within some range. As a rule, it is better to keep the minimum delay at two seconds or more.
Also, don't flip through the pages too fast. Stay on the web page for a while. Such imitation of user behavior will reduce the risk of blocking.
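As a sketch of the idea, the delay between requests can be drawn from a random range instead of being fixed (the bounds here are illustrative):

```python
import random
import time

def polite_sleep(min_s: float = 2.0, max_s: float = 6.0) -> float:
    """Pause for a random interval so request timing looks human."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Between page requests:
# fetch(page); polite_sleep()  # waits 2-6 seconds, different each time
```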
Set User Agent
The User-Agent header contains information about the user's browser and device - the data the server receives at the time of the visit. It helps the server identify each visitor, and if a visitor with the same User-Agent makes too many requests, the server may ban it.
Therefore, it is worth giving the web scraper the ability to periodically switch the User-Agent header to a random one from a list of real browser strings. This helps avoid blocking and continue collecting information.
To view your own User Agent, go to DevTools (F12) and then to the Network tab.
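A minimal sketch of User-Agent rotation might look like this (the strings below are examples of real browser User-Agents; in practice, keep the list current):

```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers() -> dict:
    """Pick a random User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```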
Set Additional Request Headers
However, besides the User-Agent, other headers can sabotage the scraper's work. Unfortunately, web scrapers and crawlers often send headers that differ from those sent by real web browsers. Therefore, it is worth taking the time to set all the headers so that requests do not look automated.
As a rule, when a real user browses, the "Accept", "Accept-Encoding", "Accept-Language", and "Upgrade-Insecure-Requests" headers are also filled in, so do not forget about them either. An example of such a field:
accept-encoding: gzip, deflate, br
The Referer header shows the site from which the user came. If you don't know what to enter in this field, you can use "google.com". It can also be any other search engine (yahoo.com, bing.com, etc.) or any social media site.
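Putting the pieces together, a browser-like header set could be assembled as follows (the values mirror what a Chrome-like desktop browser typically sends; adjust them to match the browser whose User-Agent you use):

```python
def browser_like_headers(referer: str = "https://www.google.com/") -> dict:
    """Headers that resemble those of a real desktop browser."""
    return {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/120.0.0.0 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
                  "image/avif,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "en-US,en;q=0.9",
        "Upgrade-Insecure-Requests": "1",
        "Referer": referer,
    }
```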
Set Your Fingerprint Right
Whenever someone connects to a target website, their device sends a request that includes HTTP headers. These headers contain information such as the device's time zone, language, privacy settings, cookies, and more. They are transmitted by the browser each time a site is visited, and together they form a fairly unique fingerprint.
For example, a given combination of such parameters may be unique among roughly 200,000 users. Therefore, it is worth keeping this information consistent and up to date. Alternatives are to use third-party scraping services or residential IPs. You can check your own fingerprint with one of the online fingerprinting test services.
However, not only browser fingerprints should look right, but TLS fingerprints too, since various sites track TLS/HTTP fingerprints as well. For example, most scraping libraries default to HTTP/1.1, while most browsers use HTTP/2 when available. Requests made over HTTP/1.1 therefore look suspicious to many sites.
Other Ways to Avoid Blocks
So, if all the settings are done, it's time to move on to the main traps and rules to follow.
Use Headless Browser
First of all, if possible, it is worth using a headless browser. Headless browsers imitate real user behavior, reducing the risk of blocking, and since they have no visible window, everything runs in the background.
Because such browsers fully render pages - fonts, layouts, colors, and other style elements - their traffic is much harder to distinguish from that of a real user.
Use Proxy Server
If requests keep coming from the same place at short intervals for a long time, the behavior looks less like a normal user and more like a bot. To keep the target website from suspecting anything, one can use a proxy server.
In simple terms, a proxy is an intermediary computer that makes the request to the server instead of the client and returns the result to the client. The destination server thus thinks the request comes from a completely different place and, therefore, from a completely different user.
Proxies come in free and paid varieties, but it is better not to use free proxies for data scraping - they are extremely slow and unreliable. Ideally, use residential or mobile proxies. In addition, a single proxy is not enough; for scraping, it is better to build a whole proxy pool.
It is also very important to keep track of the location of the IP address from which requests are made. If the location does not match what the site expects, it may simply block them. For example, a site serving local infrastructure is unlikely to expect foreign visitors, so use local proxies for such sites so as not to arouse suspicion.
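A proxy pool can be rotated with a simple round-robin; the addresses below are placeholders for illustration, not real proxies:

```python
import itertools

# Hypothetical pool - replace with real residential/mobile proxies.
PROXY_POOL = [
    "http://user:pass@203.0.113.10:8080",
    "http://user:pass@203.0.113.11:8080",
    "http://user:pass@203.0.113.12:8080",
]
_cycle = itertools.cycle(PROXY_POOL)

def next_proxies() -> dict:
    """Return a proxies mapping (requests-style), rotating through the pool."""
    proxy = next(_cycle)
    return {"http": proxy, "https": proxy}

# Usage with the requests library:
# requests.get(url, proxies=next_proxies())
```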
Use CAPTCHA Solving Services
When there are too many requests, a site may ask the visitor to solve a CAPTCHA to make sure the request comes from a real user and not a bot. In this case, CAPTCHA-solving services can help: for a small fee, they automatically recognize and solve the presented CAPTCHA.
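Each solving service defines its own API, so the payload below is purely illustrative - the endpoint and field names are assumptions, not any real provider's schema; consult the service's documentation for the actual format:

```python
import json

def build_solver_payload(api_key: str, site_key: str, page_url: str) -> str:
    """Assemble a JSON task for a hypothetical CAPTCHA-solving API."""
    return json.dumps({
        "clientKey": api_key,           # illustrative field names only
        "task": {
            "type": "RecaptchaV2Task",
            "websiteKey": site_key,     # the site's public CAPTCHA key
            "websiteURL": page_url,
        },
    })

# The payload would then be POSTed to the solver's endpoint, and the
# returned token submitted with the scraped page's form.
```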
Avoid Honeypot Traps
To catch bots, many websites use honeypot traps. In general, a honeypot is a link that is present in the HTML source code but invisible to users on the rendered page. When harvested automatically, these hooks can redirect a web scraper to decoy or blank pages.
In fact, they are fairly easy to spot: such links carry "masking" CSS properties, for example "display: none", "visibility: hidden", or a link color identical to the site's background.
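A minimal stdlib-only filter for such links might check inline styles (a sketch: real pages can also hide links via external stylesheets, which this does not cover):

```python
from html.parser import HTMLParser

class VisibleLinkParser(HTMLParser):
    """Collect hrefs of links not hidden by inline 'masking' CSS."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            return  # likely a honeypot trap - skip it
        if attrs.get("href"):
            self.links.append(attrs["href"])

def visible_links(html: str) -> list:
    """Return hrefs that a real user could actually see and click."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```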
Using Ready-Made APIs in Web Scrapers
If the settings and rules listed above seem like too much, and the costs of proxies and CAPTCHA-solving services too high, one can take the easier path and offload the interaction with the site to third-party scraping services.
Tips & Tricks for Scraping
The last things worth mentioning are the best time to scrape websites and the use of reverse engineering in scraping. These matter not only for avoiding blocks but also for not harming the site.
Scrape During Off-peak Hours
Because crawlers move through pages faster than real users, they significantly increase the load on the server. If parsing is performed at a time when the server is already under high load, its services slow down and the site loads more slowly.
This not only degrades the site for real users but also increases the time required for data collection.
Therefore, it is worth collecting and extracting data when the site's load is minimal. It is generally recommended to run the parser after midnight in the site's local time.
Scrape at Different Day Times
If the site comes under heavy load every day from 8:00 to 8:20 am, it starts to raise suspicion. Therefore, it is worth adding a random offset within which the scraping start time varies.
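One way to sketch this: pick a nightly base hour and add a random offset so each run starts at a slightly different time (the numbers are illustrative):

```python
import random
from datetime import datetime, timedelta

def next_run_time(base_hour: int = 2, jitter_minutes: int = 90) -> datetime:
    """Tomorrow's start time around base_hour, shifted by random jitter."""
    base = datetime.now().replace(
        hour=base_hour, minute=0, second=0, microsecond=0
    ) + timedelta(days=1)
    offset = timedelta(minutes=random.uniform(-jitter_minutes, jitter_minutes))
    return base + offset

# A scheduler would then sleep until next_run_time() before starting the run.
```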
Reverse Engineering for Better Scraping
Reverse engineering is a commonly used development method. In short, it means studying a software application to understand how it functions.
In scraper development, this approach means analyzing a site's traffic first in order to compile future requests. The browser's developer tools, or simply DevTools (press F12), help with analyzing web pages.
Let's take a closer look at the Google SERP. Open DevTools on the Network tab, search for something on google.com, and look at the completed request. To view the response, click on the received request and go to the Preview tab.
This data helps to understand what exactly the request should return and in what form. The data on the Headers tab shows what needs to be sent to compile the request. The main thing is to execute requests correctly and interpret the responses correctly.
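Once the request has been studied in DevTools, it can be reproduced in code. This sketch rebuilds a search URL from query parameters (the parameter names are assumptions based on what one typically sees for a Google search in the Network tab):

```python
from urllib.parse import urlencode

def build_search_url(query: str) -> str:
    """Reassemble the GET request observed in DevTools."""
    params = {"q": query, "hl": "en"}  # parameters seen in the Network tab
    return "https://www.google.com/search?" + urlencode(params)
```

The same URL can then be fetched with any HTTP client, together with the browser-like headers discussed earlier.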
Reverse Engineering Mobile Applications
Reverse engineering a mobile application is similar, except that here one has to intercept the requests the application sends to the server. Unlike intercepting normal browser requests, this requires a man-in-the-middle proxy, such as Charles Proxy.
Also, keep in mind that the requests sent by mobile applications tend to be more complex and obfuscated.
Conclusion and Takeaways
Finally, let's recap the countermeasures that help bypass the security measures sites typically take:
Randomize request intervals and scrape during off-peak hours
Rotate User-Agents and set browser-like request headers
Forge the browser and TLS fingerprint
Use headless browsers and a proxy pool
Use CAPTCHA-solving services
Avoid honeypot links
By following the simple rules listed above, you can not only avoid blocking but also significantly increase the scraper's efficiency.
In addition, when creating a scraper, keep in mind that many sites provide an API for obtaining data. If such an opportunity exists, it is better to use the API than to collect data from the site's pages manually.