Web scraping not only lets one read various data from web pages but also helps organize it for further analysis. The output can be stored in whatever format is most convenient, whether a table (for example, a CSV file) or an API.
At the same time, web scraping in Python is more than just extracting data with CSS selectors. It is a reliable and easy way to get a huge amount of data quickly.
Web Page Scraping Fundamentals
To parse data, one needs to know the form in which it is stored, as well as the basic principles of how it is transmitted. Information is transferred in the browser using HTTP (HyperText Transfer Protocol), which is based on client-server communication: there is a client (the one who requests data) and a server (the one who provides data).
For example, the server can transmit an HTML page. HTML (HyperText Markup Language) is the markup language of a web page; it tells the browser what to display on the loaded site.
The client can be a browser, parser, or something else that can request information. The server is a resource that the client accesses to obtain information (for example, the Nginx or Apache web server).
It looks like this:
- The client opens the connection.
- The client requests data.
- The server returns the requested data.
- The server closes the connection.
The request might look like this:
:authority: scrape-it.cloud
:method: GET
:path: /blog/
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9,en;q=0.8;q=0.7
cookie: PHPSESSID=lj21547rbeor092lf7q1tbv2kj; _gcl_au=1.1.46260893.1654510660; _ga=GA1.1.87541067.1654500661; _clck=16uoci|1|f25|0; _ga_QSH330BHPP=GS1.1.1654773897.3.1.1654695637.58; _clsk=ac0mn0|1654695838342|7|1|h.clarity.ms/collect
dnt: 1
referer: https://scrape-it.cloud/
sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="102", "Google Chrome";v="102"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: same-origin
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36
To view the request generated when one goes to a web page, open DevTools (either right-click on the page and select Inspect, or press F12). In DevTools, go to the Network tab, refresh the page, and select its address from the list. After that, the request generated by the browser will be shown.

The data stored in the cookies and referrer (in the browser, spelled referer) headers is important for the parser: cookies confirm authentication, and the referrer matters in case the website restricts access to information depending on which page the user came from.
Accept and User-Agent are also useful. Accept declares the content types the client expects in the response (text/plain, text/html, image/jpeg, etc.), and User-Agent stores information about the client.
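For a scraper built on the Requests library (covered below), these headers can be passed explicitly. Here is a minimal sketch, assuming the same site and the session cookie from the example above (the values are only illustrative):
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Referer": "https://scrape-it.cloud/",
}
cookies = {"PHPSESSID": "lj21547rbeor092lf7q1tbv2kj"}  # session cookie, as in the request above

response = requests.get("https://scrape-it.cloud/blog/", headers=headers, cookies=cookies)
print(response.status_code)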
Tools for Scraping
With the growing popularity of web scraping, the number of libraries and frameworks for it is growing too. However, only a few are truly popular, complete, well-documented, and widely used: Beautiful Soup, Requests, Scrapy, lxml, Selenium, URLlib, and Pyppeteer.
To find the most suitable one, and to understand their advantages and disadvantages, it is worth considering each of them in more detail.

Extract data with Requests
Requests is the basic scraping library that everyone comes across in one way or another.
What is the Requests library?
The Requests library was created to make it easier to send HTTP requests. It is a simple library, so it doesn't take much practice to work with it. It supports all the methods used by RESTful APIs (GET, POST, PUT, and DELETE).
When using the Requests library, one doesn't need to build query strings for URLs by hand. Also, over the years of use, the Requests library has acquired a huge amount of useful and well-written documentation.
Requests often comes preinstalled with Python distributions, but if for some reason it is not there, one can install it with pip. To do this, go to the terminal and enter the line:
pip install requests
Once the library is installed, it can be used in projects:
import requests
To fetch a page, use the requests.get method (note that the URL must include the scheme):
import requests

page = requests.get("https://example.com")
print(page)  # <Response [200]> on success
What is the Requests library for?
The Requests library supports file uploads, connection timeouts, cookies and sessions, authentication, SSL browser verification, and all methods of interaction with the REST API (PUT, GET, DELETE, POST).
However, it has one disadvantage: it cannot handle dynamic content, because Requests does not execute JavaScript.
So it is a good choice in all cases where dynamic data does not need to be parsed.
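As an illustration of these capabilities, here is a minimal sketch (the URL and credentials are placeholders) using a session, a timeout, and basic authentication:
import requests

session = requests.Session()                      # keeps cookies between requests
session.auth = ("user", "password")               # basic authentication (placeholder credentials)
session.headers.update({"Accept": "text/html"})

response = session.get("https://example.com/account", timeout=10)  # fail if no answer within 10 seconds
print(response.status_code)
print(response.cookies.get_dict())                # cookies set by the server in this response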
Start with Beautiful Soup
Today the Beautiful Soup library or simply BS4 is the most popular of all the Python libraries used for scraping.
What is the Beautiful Soup library?
The Beautiful Soup library was created for parsing HTML. Because it handles many things automatically (for example, invalid HTML), it is well suited for beginners.
The output has a tree format, making it easy to find elements and extract the needed information. BS4 also detects the encoding automatically, which allows it to process even HTML pages with special characters.
The disadvantage of BS4 is low flexibility and scalability, as well as slowness. However, the built-in parser can be easily replaced with a faster one.
To install BS4, just enter this line in the terminal:
pip install beautifulsoup4
After that, it can be used for scraping. For example, to scrape all titles, a little code is enough:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find_all('title'))
Note that the page content is fetched with the Requests library here, so it must also be included in the project.
To output nicely formatted page code, one can use the following:
print(soup.prettify())
Let's say one needs to collect all titles of products that are stored on the page.

At the same time, print(soup.prettify()) returned the following code:
<!DOCTYPE html>
<html>
<head>
<title>A sample shop</title>
</head>
<body>
<div class="product-item">
<img src="example.com\item1.jpg">
<div class="product-list">
<h3>Pen</h3>
<span class="price">10$</span>
<a href="example.com\item1.html" class="button">Buy</a>
</div>
</div>
<div class="product-item">
<img src="example.com\item2.jpg">
<div class="product-list">
<h3>Book</h3>
<span class="price">20$</span>
<a href="example.com\item2.html" class="button">Buy</a>
</div>
</div>
</body>
</html>
Next, get the types of the top-level page elements:
[type(item) for item in list(soup.children)]
BS4 will return something like this:
[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
The first two contain information about the page itself, and only the last contains information about its elements. To get information about product names, select all data related to bs4.element.Tag:
html = list(soup.children)[2]
# items count from 0
To see the child elements of <html> and their order:
list(html.children)
Returns:
['\n', <head> <title>A sample shop</title> </head>, '\n', <body> <div>…</div> </body>, '\n']
To get the <body> element, which is the fourth item in the list (index 3):
body = list(html.children)[3]
The next elements are checked in the same way:
divit = list(body.children)[1]   # first <div class="product-item"> (whitespace strings also count as children)
divli = list(divit.children)[3]  # nested <div class="product-list">
h3 = list(divli.children)[1]     # <h3> with the product name
To extract the product name from it, one needs to do the following:
h3.get_text()
However, Beautiful Soup automates most of these steps, and all the above code can be replaced with a more concise version:
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('h3')[0].get_text()
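Continuing this example, both product names and prices from the sample page can be collected in one pass; a minimal sketch that reuses the soup object from above:
for item in soup.find_all('div', class_='product-item'):
    title = item.find('h3').get_text()
    price = item.find('span', class_='price').get_text()
    print(title, price)  # e.g. "Pen 10$" and "Book 20$"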
What is the Beautiful Soup library for?
Beautiful Soup is the best choice for those who are just getting started with scrapers, because it does most of the work automatically. The library is also suitable for extracting data from poorly structured sites.
However, Beautiful Soup is not well suited for large web scraping projects.
Collect all the data with Scrapy
Scrapy is one of the best frameworks for scraping with Python.
What is the Scrapy framework?
Scrapy is an open-source framework that allows one to load HTML pages and save the data in the desired form (for example, a CSV file). Since requests are executed and processed in parallel, it has a high execution speed. It is the most suitable option for solving architecturally complex data collection and processing tasks.
To install Scrapy, just enter in the terminal:
pip install scrapy
To get started with Scrapy:
scrapy shell

To get the HTML content of a website, one can use the fetch function inside the Scrapy shell.
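For example, with a placeholder URL (any page address will work here):
fetch("https://example.com/")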
Now, to make sure Scrapy has saved the page, either view it in the browser or print its HTML:
view(response)
print(response.text)
To get more specific information with Scrapy, one needs to use CSS selectors.
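For instance, a minimal sketch that pulls the text of all <h3> tags and all price elements from the fetched page (the selectors are assumptions based on the sample shop markup shown earlier):
titles = response.css('h3::text').getall()
prices = response.css('span.price::text').getall()
print(titles, prices)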
What is the Scrapy framework for?
Scrapy is quite resource-intensive. It may require a separate server to maintain sufficient performance.
Moreover, this framework is not the best fit for beginners: they can be pushed away by anything from installation problems on some systems to the framework being overkill for simple tasks.
But despite these disadvantages, Scrapy is still one of the best frameworks for large projects. It can be used for request management, preserving user sessions, following redirects, and handling output pipelines.
Lxml Library
Lxml is a fast, powerful, yet simple parsing library.
What is the Lxml library?
Lxml is a parsing library. It can work with HTML and XML files. Like Scrapy, Lxml is ideal for extracting data from large datasets. However, unlike Beautiful Soup, it cannot parse poorly designed HTML.
To install the lxml library, go to the terminal and write:
pip install lxml
Let's return to the example with the pen and the book. First of all, include the libraries in the project:
from lxml import html
import requests
Then retrieve the web page with data:
page = requests.get('http://example.com/item1.html')
tree = html.fromstring(page.content)
The information is in two elements - the title is in <h3> and the price is in <span>:
<h3>Pen</h3>
<span class="price">10$</span>
Get the data:
titles = tree.xpath('//h3/text()')
prices = tree.xpath('//span[@class="price"]/text()')
To display on the screen:
print('Titles: ', titles)
print('Prices: ', prices)
This will display:
Titles: ['Pen', 'Book']
Prices: ['10$','20$']
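To go a step further and store the result, the extracted titles and prices can be paired up and written to a CSV file; a minimal sketch (the file name is arbitrary):
import csv

with open('products.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'price'])
    for title, price in zip(titles, prices):
        writer.writerow([title, price])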
What is the Lxml library for?
In cases where performance matters, Lxml is a great option. It is also useful when one needs to process huge amounts of data.
However, despite the library's functionality and speed, it is not very popular among beginners due to its sparse documentation, which makes it quite difficult to get started with.
Selenium for Scraping
Some websites are written using JavaScript, a language that allows developers to dynamically fill fields and menus. While most Python libraries can only fetch data from static web pages, Selenium makes it possible to work with dynamic data.
What is the Selenium library?
Selenium is a Python browser-automation library, often referred to together with WebDriver, that allows one to simulate user behavior on a page, because a real browser is launched to do the work. WebDriver is a browser automation protocol standardized by the W3C; it sits between the client and the browser and translates client commands into browser actions.
This allows one to fully process the data on the page. Despite this, Selenium is a beginner-friendly tool. To install it, go to the terminal and write:
pip install selenium
Then install ChromeDriver for the Chrome browser or geckodriver for Mozilla Firefox. Now Selenium is ready for use.
To load a page, run the script:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.example.com/items")
This script launches the browser and loads the URL. However, it is often desirable to hide the scraping process from the user. For this, the so-called headless mode is used: it removes the browser's graphical interface and lets it run in the background.
In Selenium, it can be enabled through the options keyword argument. So the final example code will be:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True  # hide GUI
options.add_argument("--window-size=1920,1080")  # set window size to native GUI size
options.add_argument("start-maximized")  # ensure window is full-screen
driver = webdriver.Chrome(options=options)
driver.get("https://www.example.com/items")
To parse dynamic data, one should launch the browser and tell it to go to example.com, then wait for the page to load and get its content.
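For the waiting step, Selenium provides explicit waits; a minimal sketch that waits up to 10 seconds for at least one <h3> element to appear (the selector is an assumption based on the sample page above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "h3"))
)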
To extract HTML elements (for example, titles) using CSS selectors, one can use the following:
titles = driver.find_elements_by_css_selector('h3')
for title in titles:
    print(title.text)
Remember that the browser will keep running until it is explicitly closed:
driver.quit()
What is the Selenium library for?
The main disadvantage is that the tool is very slow, consumes a lot of memory and CPU time. However, it is the best tool for parsing data from pages generated with JavaScript.
Parse URLs into components with URLlib
Before scraping data, it is necessary to parse the links that will be used for further scraping. And urllib is one of the best tools for working with URLs. Read also about using cURL in Python here.
What is the URLlib library?
URLlib is a package with several modules. It offers a basic set of functionality for working with web pages, such as authentication, redirects, cookies, and so on. It is suitable for parsing a limited number of pages followed by simple data processing.
It is a built-in Python library, so no installation is required; just import it:
import urllib.request
It supports the following URL schemes: file, ftp, gopher, hdl, http, https, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais, ws, wss.
A request to get data using URLlib in the general case looks like this:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Where:
- URL: page address;
- data: empty for GET; a bytes object for POST;
- timeout: timeout in seconds;
- cafile: path to a CA certificate file, needed when requesting an HTTPS link;
- capath: path to a directory of CA certificates;
- context: is of type ssl.SSLContext, used to specify SSL settings.
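As a minimal sketch (example.com is only a placeholder), fetching a page and splitting a URL into its components might look like this:
from urllib.request import urlopen
from urllib.parse import urlparse

with urlopen("https://example.com/", timeout=10) as response:
    html = response.read().decode("utf-8")
print(html[:100])  # first 100 characters of the page

parts = urlparse("https://example.com/blog/post?id=1#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)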
What is the URLlib library for?
URLlib gives more control over requests than the Requests library, but it is also more complicated to use.
And with all its pros and cons, URLlib is the best library for collecting links.
Alternative Solution: Pyppeteer
Puppeteer is a tool developed by Google based on Node.js, and Pyppeteer is essentially Puppeteer for Python.
What is Pyppeteer?
Pyppeteer is a Python wrapper for the JavaScript (Node.js) Puppeteer library. It works similarly to Selenium and supports both headless and non-headless modes. However, Pyppeteer only supports the Chromium browser.
Pyppeteer uses Python's asynchronous machinery (asyncio), so it requires Python 3.5 or higher. To install it, just write:
pip3 install pyppeteer
Then try to run this in the interpreter:
import pyppeteer
If Pyppeteer was installed successfully, there will be no error. The simplest example of Pyppeteer usage is:
import asyncio
import pyppeteer

async def main():
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto('https://example.com/')
    await page.screenshot({'path': 'items/item1/item1.png'})
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
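Since Pyppeteer renders the page in a real browser, the resulting HTML can then be handed to any parser; here is a minimal sketch combining it with Beautiful Soup (the URL and the h3 selector are placeholders):
import asyncio
import pyppeteer
from bs4 import BeautifulSoup

async def scrape_titles():
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto('https://example.com/')
    html = await page.content()  # full HTML after JavaScript has run
    await browser.close()
    soup = BeautifulSoup(html, 'html.parser')
    return [h3.get_text() for h3 in soup.find_all('h3')]

print(asyncio.get_event_loop().run_until_complete(scrape_titles()))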
What is Pyppeteer for?
Pyppeteer already ships with a bundled Chromium browser. It also consumes less CPU time and RAM than Selenium. But it cannot work with other browsers, and its documentation is incomplete, because it is an unofficial Puppeteer wrapper from a Japanese developer.
So, Pyppeteer will be the best choice if it is not possible to use Selenium due to its resource consumption, but at the same time one needs to write a powerful scraper.
Summary
So what is the best library? Unfortunately, there is no clear answer to this question. The choice of library depends on the goals, scope and resource capabilities of each person.
For example, for a beginner, the Requests library and Beautiful Soup will be enough to solve simple problems. Those who want to try something more complex, and whose project is large enough, can choose Pyppeteer or Selenium.
Below is a table that can help one make a decision and choose the most suitable library or framework.
Characteristic | Requests | Beautiful Soup | Scrapy | lxml | Selenium | URLlib | Pyppeteer |
---|---|---|---|---|---|---|---|
Purpose | Simplify making HTTP requests | Parsing | Scraping | Parsing | Browser automation | Parsing URLs | Browser automation |
Beginner-friendly | Yes | Yes | No | No | Yes | No | No |
Speed | Fast | Slow | Fast | Very fast | Slow | Fast | Fast |
Documentation | Excellent | Excellent | Good | Good | Good | May be better | May be better |
JavaScript Support | No | No | No | No | Yes | No | Yes |
CPU and Memory Usage | Low | Low | High | Low | High | Low | High |
Useful for projects | Large and small | Small | Large and small | Large and small | Large and small | Large and small | Large and small |
Tricks & Tips for Better Scraping
When parsing data, there are some common tasks that everyone faces, and their solutions are just as common. Knowing them saves time and simplifies the work.
Getting Internal Links
The BeautifulSoup library can be used to get the relevant content, for example, to extract internal links from a page. To simplify the task, let's assume that internal links are links starting with a slash.
internalLinks = [
a.get('href') for a in soup.find_all('a')
if a.get('href') and a.get('href').startswith('/')]
print(internalLinks)
Once one has the links, one can remove the duplicates and scrape them.
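A minimal sketch of deduplicating the collected links before scraping them (building on the internalLinks list from the snippet above):
unique_links = set(internalLinks)   # drop duplicates
for link in unique_links:
    print(link)  # or fetch each page here, e.g. with requests.get()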
Getting Links to Social Networks and Email
Another typical task is to collect email addresses and social network links. To do this, it is necessary to go through all the links, checking for the presence of mailto (by which one can pull out the email) and social network domains. Such URLs need to be added to the list and displayed.
links = [a.get('href') for a in soup.find_all('a')]
to_extract = ["facebook.com", "twitter.com", "mailto:"]
social_links = []
for link in links:
for social in to_extract:
if link and social in link:
social_links.append(link)
print(social_links)
Automatic Table Scraping
As a rule, tables are well-formatted and structured, so they are easy to scrape. The rows of a table are in the <tr> tag, and the cells are in <td> or <th> tags. In general, an HTML table looks like this:
<table>
<tr>
<td>1 row 1 column</td>
<td>1 row 2 column</td>
</tr>
<tr>
<td>2 row 1 column</td>
<td>2 row 2 column</td>
</tr>
</table>
To collect data from the table, iterate over all the rows, go through the cells in each of them, and display the content on the screen:
table = soup.find("table", class_="sortable")
output = []
for row in table.findAll("tr"):
    new_row = []
    for cell in row.findAll(["td", "th"]):
        new_row.append(cell.get_text().strip())
    output.append(new_row)
print(output)
Getting Information from Metadata
Not all data is visible on the page itself; some of it is stored in metadata, such as meta tags or Schema.org markup. This metadata can be retrieved this way:
metaDescription = soup.find("meta", {'name': 'description'})
print(metaDescription['content'])
This will parse the data from the meta description.
Getting Hidden Product Information
To get hidden information, find it through an element in the HTML code and parse it like any other. For example, to parse the product brand hidden in a tag with the itemprop attribute:
brand = soup.find('meta', itemprop="brand")
print(brand['content'])
Conclusion and Takeaways
There are many tools for data parsing in Python, from libraries to frameworks, that allow one to save data in tables, expose it via an API, or output it in any other way.
However, everyone decides which tool suits them best. Simplicity and convenience are paid for with limited functionality or speed, while greater speed and functionality are paid for with resource consumption.
Still, by selecting the right tool for each individual project, one can get the best result.