Web Scraping Using Selenium Python

Posted on Jun 24, 2022

Python is one of the most common languages for building scrapers. There are many libraries, frameworks, and utilities for it, from simple ones, like the Requests library or PycURL, to more feature-rich and advanced ones, like Selenium or Pyppeteer (a Python port of Puppeteer).

However, the most commonly used of these (after Requests, of course) is Selenium, which lets you scrape not only static web pages but dynamically loaded content too. You can then save the received data in a convenient format, such as a CSV file.

Preparation for Scraping

To scrape data from websites with Selenium you need three things: the Selenium package, the Chrome browser, and ChromeDriver. To install Selenium, just use:

pip install selenium

The Chrome browser and ChromeDriver can be downloaded from their official sites; make sure the ChromeDriver version matches your installed Chrome version.

Tired of getting blocked while scraping the web?

Try out Web Scraping API with proxy rotation, CAPTCHA bypass, and Javascript rendering.

  • 1,000 Free API Credits
  • No Credit Card Required
  • 30-Day Trial
Try now for free

Get structured data in the format you need!

We offer customized web scraping solutions that can provide any data you need, on time and with no hassle!

  • Regular, custom data delivery
  • Pay after you receive sample dataset
  • A range of output formats
Get a Quote

Static Scraping with Selenium

Static scraping ignores JavaScript: it fetches web pages from the server without rendering them in a browser. If there is no dynamically loaded data to parse, a headless browser only consumes extra resources and slows scraping down. So, if you only need static web page content, it is better to try another library; Beautiful Soup, for example, is one of the best choices for static pages.

Dynamic Scraping with Selenium

The reason static scraping can't get all the data is that the page source only contains the content received from the server, while the final DOM (Document Object Model) built by the browser may be very different: once the page is loaded, JavaScript is free to manipulate the DOM as it sees fit.

First Steps

First, let's try something easy: start the Chrome WebDriver and go to the example.com page. Open the Python interpreter and write:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

DRIVER_PATH = r'C:\chromedriver.exe'  # or any other path to the driver
driver = webdriver.Chrome(service=Service(DRIVER_PATH))

If everything is correct, a Chrome window like this will open:

Chrome WebDriver

Then just pass the target URL:

driver.get('https://example.com/')

WebDriver will open the page itself.

Getting page using WebDriver

And to make the example more useful, let's get the HTML code of this page. This command prints the web page's HTML:

print(driver.page_source)

The result:

<html><head>
    <title>Example Domain</title>

    <meta charset="utf-8">
    <meta http-equiv="Content-type" content="text/html; charset=utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.</p>
    <p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div>
</body></html>

Using the Selenium API, one can select and scrape specific data with CSS selectors or XPath expressions. They help to find a web element and extract its content.

More Selenium Functions

However, Selenium web scraping is more than just getting a web page's HTML code. As a rule, you need to obtain only specific data and then structure it. Besides, it is not always convenient to have a browser window open, so it is often better to run it in the background: Selenium can work with a headless browser.

Extract Data Using Headless Browser

Headless Browser is a web browser without a graphical user interface (GUI) that is controlled using a command-line interface.

As a rule, this approach is used so that an open browser window does not interfere with the scraping process and does not waste PC resources. In headless mode, the browser is stripped of all GUI elements and runs silently in the background, which significantly reduces resource consumption.

To use headless mode, create a ChromeOptions object and pass it to the driver:

options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)  # plus your driver path, as before

After that, the browser window will not open, and scraping will take place in the background.

Using a Proxy in Web Page Scraping

It is important to use a proxy when scraping web pages for several reasons:

  1. It helps to get around limits on the number of requests to a site.

  2. Without one, a large number of requests from a single IP may be mistaken for a DDoS attack and blocked.

  3. It improves the security and privacy of your online activity.

Unfortunately, Selenium itself has limited proxy support: as a driver option, you can only add a proxy without authentication.

proxy = "12.345.67.890:1234"  # your proxy address
options.add_argument("--proxy-server=%s" % proxy)

However, for greater convenience, the Selenium Wire package supports proxy servers with authentication. To use it, install the additional package:

pip install selenium-wire

After that, one can pass a proxy with authentication via an options dictionary:

options = {
    "proxy": {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_url}:{proxy_port}",
        "verify_ssl": False,
    },
}
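For completeness, here is a sketch of how an options dictionary like the one above can be built and wired into Selenium Wire; the credential values are placeholders, not real ones:

```python
# Placeholder credentials: replace them with your own proxy details.
proxy_username = "user"
proxy_password = "pass"
proxy_url = "12.345.67.890"
proxy_port = 1234

seleniumwire_options = {
    "proxy": {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_url}:{proxy_port}",
        "verify_ssl": False,
    },
}

# Selenium Wire provides its own webdriver module that accepts
# the dictionary via the seleniumwire_options keyword:
# from seleniumwire import webdriver
# driver = webdriver.Chrome(seleniumwire_options=seleniumwire_options)
```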

Filling Forms with Selenium Web Driver

Almost every site has forms, whether it's a filter or an authentication form. A web form contains web elements such as input fields, checkboxes, radio buttons, search forms, links, dropdown menus, and submit buttons to collect user data.

In order to send data through a form, you first need to find the form and then enter text into it or select one of the suggested values.

Input field

The most common form element is the input field, a text box that stores data entered by the user. To work with it, you need to know how to enter text (the send_keys() method), remove it (the clear() method), and read the entered value (the get_attribute() method). For example:

from selenium.webdriver.common.by import By

driver.find_element(By.ID, 'here_id_name').send_keys("Test")
driver.find_element(By.ID, 'here_id_name').clear()
nameTest = driver.find_element(By.ID, 'here_id_name').get_attribute("value")

Checkbox

A checkbox is a small box with two available states: selected or deselected.

However, before clicking a checkbox, one needs to know its current state, because a second click will toggle it to the opposite state. The is_selected() method returns a boolean value: True if the checkbox is already selected, False if not:

acc_boolean = driver.find_element(By.ID, 'acception').is_selected()
print(acc_boolean)

To click a button or checkbox on the website, just use:

driver.find_element(By.ID, 'acception').click()

Scraping Infinite Scroll Pages with Selenium

To scrape multiple pages, use scrolling to reach the next portion of content. The best way to do this is JavaScript execution: Selenium can execute code written in JavaScript within the automation framework.

This interface supports scrolling operations such as scrolling down, scrolling to an element, horizontal scrolling, etc. A simple scrolling example:

javaScript = "window.scrollBy(0,1000);"
driver.execute_script(javaScript)
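For a true infinite-scroll page, a single scroll is not enough: you have to keep scrolling until the page height stops growing. Here is a minimal sketch; the pause length and round limit are assumptions to tune to how fast the site loads new content:

```python
import time

def scroll_to_bottom(driver, pause=2.0, max_rounds=20):
    """Scroll down until document.body.scrollHeight stops growing,
    i.e. no more content gets loaded, or max_rounds is reached."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load new content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # height unchanged: nothing more to load
        last_height = new_height
```

The max_rounds cap keeps the loop from running forever on pages that load content indefinitely.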

Conclusion and Takeaways

Selenium is an extremely flexible tool that can get data loaded dynamically. Moreover, using a proxy improves protection while scraping.

Unlike other libraries, Selenium drives a Google Chrome browser (whether in active or headless mode) for scraping, which makes the process more like manual data collection.

Selenium also allows automating actions on a website: searching for elements, filling in forms, clicking buttons, and submitting results. This makes it possible to automate even user authentication.

Valentina Skakun

I'm a technical writer who believes that data parsing can help in getting and analyzing data. I write about what parsing is and how to use it.
