Web Scraping Walmart Using Python: Understanding How to Scrape Data Quickly and Easily

Posted on Dec 06, 2022

In this tutorial, we will use Python to scrape product data from Walmart's website. We will explore different ways to parse and extract the data we need and discuss some of the potential challenges associated with web scraping.

Why Scrape Walmart Product Data?

Walmart offers a vast selection of products in its online store, making it an ideal candidate for web scraping projects. By leveraging web scraping techniques, you can automate the collection of product data such as prices, availability, images, descriptions, rating, reviews, and more. This data can then be used in your applications or analysis projects.

Introduction to Walmart Scraping

To scrape product data, you first need to analyze the page structure. Walmart's product pages all share the same layout, so after analyzing one page, you can write a script that extracts the needed data from any of them.

Walmart product page

We are interested in the following fields for scraping:

  1. Product name
  2. Product price
  3. The number of reviews
  4. Rating
  5. Main image

Of course, you may need other fields, such as description or brand, for specific purposes, but we will only consider the primary product data in this example; the other details can be extracted in the same way. So, as an example, let's build a scraper that reads links from a .csv file, follows each of them, and saves the resulting data to another file.

You can also collect product URLs from category and search pages with a script, but we will not cover that in this guide. We previously published an article about scraping data from Amazon that shows how to navigate search results pages and collect all product links; for Walmart, this can be done similarly.


Walmart is a difficult site to extract data from: it does not permit product scraping and uses an anti-bot system that tracks and blocks IP addresses, which will stop most scrapers trying to access the site. Therefore, before building a Walmart scraper in Python, note that if you write your own scraper, you will also need to take care of bypassing these blocks.

If you don't have the opportunity or time to build a block-bypassing system yourself, you can scrape the data using a web scraping API, which solves these problems for you. How to scrape Walmart using our API is covered in detail at the end of the article.

Read more: Web Scraping with Python: from Fundamentals to Practice

Installing Python Libraries for Scraping

Before searching for elements to scrape, let's create a Python file and include the necessary libraries. We will use:

  1. The Requests library for executing requests.
  2. The BeautifulSoup library for simplifying the parsing of the web page.

Neither library ships with Python, so if you don't have them, install both (along with the lxml parser used below) from the command line:

pip install requests beautifulsoup4 lxml

Include these libraries in a Python file:

from bs4 import BeautifulSoup
import requests

Let's create or clear the file (overwrite) in which we will store the data from Walmart. Let's name this file result.csv and define the columns that will be in it:

with open("result.csv", "w") as f:
  f.write("title; price; rating; reviews; image\n")
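A caveat about writing the file by hand: product titles can themselves contain the ";" separator, which would break the columns. As an alternative, here is a minimal sketch using Python's built-in csv module, which quotes such values automatically (same file and column names as above; the sample row is made up for illustration):

```python
import csv

# csv.writer quotes any value containing the delimiter,
# so a title with ";" in it stays in one column.
with open("result.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=";")
    writer.writerow(["title", "price", "rating", "reviews", "image"])
    # later, one row per scraped product:
    writer.writerow(["Some; tricky title", "299.00", "4.5", "172",
                     "https://example.com/img.png"])
```

Reading the file back with csv.reader (using the same delimiter) restores the columns intact.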

After that, open the links.csv file, which stores the links to the product pages we will scrape. We will go through them in turn and process each one. The following code does it:

with open("links.csv", "r+") as links:
  for link in links:

Now we get the entire code of the page with which we will work further and parse it using BeautifulSoup:

    html_text = requests.get(link).text
    soup = BeautifulSoup(html_text, 'lxml')

At this stage, we already have the product page code, and we can display it, for example, using print(soup).

Extracting Data from Product Pages

Way 1. Parsing the Page Body

To select only the necessary information, let's analyze the page to find selectors and attributes that uniquely describe the essential data.

Let's go to the Walmart website and look at the product page again. Let's find the product title first. To do this, go to DevTools (F12) and find the element code (Ctrl + Shift + C and select the element):

Product title selector

In this case, it is in the <h1> tag, and its itemprop attribute has the value "name". Let's store the text of this tag in the title variable:

title = soup.find('h1', attrs={'itemprop': 'name'}).text

Let's do the same, but now for the price:

Product price selector

Get the content of the <span> tag with the value of the itemprop attribute equal to "price":

    price = soup.find('span', attrs={'itemprop': 'price'}).text

For the reviews and rating, the analysis is performed similarly, so here is the ready-made result:

    reviews = soup.find('a', attrs={'itemprop': 'ratingCount'}).text
    rating = soup.find('span', attrs={'class': 'rating-number'}).text

With the image, everything is a little different. Let's look at the product page and the item code:

Product image selector

This time, you need to get the value of the src attribute, which is stored in the <img> tag with the "db" class and a loading attribute equal to "eager". Thanks to the BeautifulSoup library, this is quite easy to do:

image = soup.find('img', attrs={'class': 'db','loading': 'eager'})["src"]

Keep in mind that not every page returns the image this way. In that case, it is more reliable to take it from the meta tag:

image = soup.find('meta', property="og:image")["content"]
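The two approaches can also be combined: try the <img> tag first and fall back to the meta tag when it is missing. A sketch, shown on a small inline HTML sample (hypothetical URLs) with the built-in html.parser so it runs standalone:

```python
from bs4 import BeautifulSoup

# Inline sample standing in for a product page.
html = """
<html>
  <head><meta property="og:image" content="https://example.com/meta.png"></head>
  <body><img class="db" loading="eager" src="https://example.com/main.png"></body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Prefer the <img> tag; fall back to the og:image meta tag when it is absent.
img_tag = soup.find("img", attrs={"class": "db", "loading": "eager"})
if img_tag and img_tag.get("src"):
    image = img_tag["src"]
else:
    image = soup.find("meta", property="og:image")["content"]
```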

If we start scraping with the above variables, we will get the following product details:

Canon EOS Rebel T100 Digital SLR Camera with 18-55mm Lens Kit, 18 Megapixel Sensor, Wi-Fi, DIGIC4+, SanDisk 32GB Memory Card and Live View Shooting
Now $299.00
172 reviews
(4.5)
https://i5.walmartimages.com/asr/ce0f57f7-ad6f-4e0b-a7ae-f751068597c2_1.b7e1f1bab1fd7f98cb9aef1ae9b783fb.png

Let’s save the received data to the previously created/cleared file:

    try:
        with open("result.csv", "a") as f:
            f.write(str(title)+"; "+str(price)+"; "+str(reviews)+"; "+str(rating)+"; "+str(image)+"\n")
    except Exception as e:
        print("There is no data")

It is advisable to do this in a try…except block: if some data is missing or a link is malformed, an uncaught exception would stop the program. The try…except block lets you catch such an error, report it, and continue. Note that for missing data to be caught, the extraction of the fields should also happen inside the try block, not before it.
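To make that concrete, here is a sketch of per-link error handling with the extraction inside the try block, demonstrated on inline HTML samples instead of live requests (the extract_product helper and the sample pages are made up for illustration):

```python
from bs4 import BeautifulSoup

def extract_product(html):
    # Raises AttributeError when an expected tag is missing.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.find("h1", attrs={"itemprop": "name"}).text,
        "price": soup.find("span", attrs={"itemprop": "price"}).text,
    }

pages = [
    '<h1 itemprop="name">Camera</h1><span itemprop="price">$299.00</span>',
    "<p>Robot or human?</p>",  # e.g. a captcha page with no product data
]

results = []
for page in pages:
    try:
        results.append(extract_product(page))
    except Exception:
        print("There is no data")  # report and move on to the next link
```

Only the valid page ends up in results; the captcha page is reported and skipped instead of crashing the loop.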

If you run this program, the scraped data will be saved in the file result.csv:

Here we see several problems. First, there is extra text in the price, reviews, and rating columns. Second, if there is a discount, the price column may display incorrect information or two prices.

Of course, regular expressions could strip everything but the numeric values, but why complicate the task when there is an easier way?
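For completeness, here is what that regex cleanup could look like; a minimal sketch, using the sample values shown above:

```python
import re

def first_number(text):
    # Pull the first integer or decimal out of strings like "Now $299.00".
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None

price = first_number("Now $299.00")       # strips "Now $"
reviews = int(first_number("172 reviews"))  # strips " reviews"
rating = first_number("(4.5)")            # strips the parentheses
```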

Way 2. Parsing the JSON-LD Structured Data

If you look carefully at the page code, you will notice that the necessary information is stored not only in the page body. Pay attention to the <head>…</head> section, more precisely to its <script nonce type="application/ld+json">…</script> tag.

Walmart “Product” Schema Markup in JSON format

This is the “Product” Schema Markup in JSON format. The product schema allows adding specific product attributes to product listings that can appear as rich results on the search engine results page (SERP). Let's copy it and format it into a convenient form for research:

{
    "@context": "https://schema.org",
    "@type": "Product",
    "image": "https://i5.walmartimages.com/asr/ce0f57f7-ad6f-4e0b-a7ae-f751068597c2_1.b7e1f1bab1fd7f98cb9aef1ae9b783fb.png",
    "name": "Canon EOS Rebel T100 Digital SLR Camera with 18-55mm Lens Kit, 18 Megapixel Sensor, Wi-Fi, DIGIC4+, SanDisk 32GB Memory Card and Live View Shooting",
    "sku": "771229626",
    "gtin13": "013803300550",
    "description": "<p>Creating distinctive stories with DSLR quality photos and Full HD movies is easier than you think with the 18 Megapixel Canon EOS Rebel T100. Share instantly and shoot remotely via your compatible smartphone with Wi-Fi and the Canon Camera Connect app. The powerful 18 Megapixel sensor has up to 19 times more surface area than many smartphones, and you can instantly transfer photos and movies to your smart device. The Canon EOS Rebel T100 has a Scene Intelligent Auto feature that allows you to simply frame and shoot for great results. It also features Guided Live View shooting with Creative Auto mode, and you can add unique finishes with Creative Filters. The Canon EOS Rebel T100 makes it fast and easy to share all the moments that matter.</p>",
    "model": "T100",
    "brand": {
        "@type": "Brand",
        "name": "Canon"
    },
    "offers": {
        "@type": "Offer",
        "url": "https://www.walmart.com/ip/Canon-EOS-Rebel-T100-Digital-SLR-Camera-with-18-55mm-Lens-Kit-18-Megapixel-Sensor-Wi-Fi-DIGIC4-SanDisk-32GB-Memory-Card-and-Live-View-Shooting/771229626",
        "priceCurrency": "USD",
        "price": 299,
        "availability": "https://schema.org/InStock",
        "itemCondition": "https://schema.org/NewCondition",
        "availableDeliveryMethod": "https://schema.org/OnSitePickup"
    },
    "review": [
        {
            "@type": "Review",
            "name": "Great camera for beginners",
            "datePublished": "January 4, 2020",
            "reviewBody": "Love this camera....",
            "reviewRating": {
                "@type": "Rating",
                "worstRating": 1,
                "ratingValue": 5,
                "bestRating": 5
            },
            "author": {
                "@type": "Person",
                "name": "Sparkles"
            }
        },
        {
            "@type": "Review",
            "name": "Perfect for beginners",
            "datePublished": "January 7, 2020",
            "reviewBody": "I am so in love with this camera!...",
            "reviewRating": {
                "@type": "Rating",
                "worstRating": 1,
                "ratingValue": 5,
                "bestRating": 5
            },
            "author": {
                "@type": "Person",
                "name": "Brazilchick32"
            }
        },
        {
            "@type": "Review",
            "name": "Great camera",
            "datePublished": "January 17, 2020",
            "reviewBody": "I really love all the features this camera has. Every time I use it, I'm discovering a new one. I'm pretty technologically challenged, but this hasn't hindered me. The zoom and focus give very detailed and sharp images. I cannot wait to take it on my next trip as right now I've only photographed the dog a million times",
            "reviewRating": {
                "@type": "Rating",
                "worstRating": 1,
                "ratingValue": 5,
                "bestRating": 5
            },
            "author": {
                "@type": "Person",
                "name": "userfriendly"
            }
        }
    ],
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": 4.5,
        "bestRating": 5,
        "reviewCount": 172
    }
}

Now it is obvious that this information is sufficient, and it is stored in a more convenient form. That is, we need to read the contents of the <head>…</head> section, select the tag that stores the data in JSON, and then set the variables to the following values:

  1. title: the value of the [name] attribute.
  2. price: the value of the [offers][price] attribute.
  3. reviews: the value of the [aggregateRating][reviewCount] attribute.
  4. rating: the value of the [aggregateRating][ratingValue] attribute.
  5. image: the value of the [image] attribute.

To be able to work with JSON, it will be enough to include the built-in library:

import json

Let's create a data variable in which we put the JSON data:

data = json.loads(soup.find('script', attrs={'type': 'application/ld+json'}).text)
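soup.find returns the first matching script, but pages sometimes embed several application/ld+json blocks (a BreadcrumbList next to the Product schema, for example). A hedged sketch that scans all of them and keeps the one whose @type is "Product", shown on an inline sample:

```python
import json
from bs4 import BeautifulSoup

# Inline sample with two ld+json scripts, as real pages may have.
html = """
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
<script type="application/ld+json">{"@type": "Product", "name": "Canon EOS Rebel T100"}</script>
"""
soup = BeautifulSoup(html, "html.parser")

data = None
for script in soup.find_all("script", attrs={"type": "application/ld+json"}):
    candidate = json.loads(script.text)
    if candidate.get("@type") == "Product":  # keep only the Product schema
        data = candidate
        break
```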

After that, we will enter the data into the appropriate variables:

    title = data['name']
    price = data['offers']['price']
    reviews = data['aggregateRating']['reviewCount']
    rating = data['aggregateRating']['ratingValue']
    image = data['image']

The rest will remain the same. Check the script execution:

Results table with extracted product data

Full script:

from bs4 import BeautifulSoup
import requests
import json

with open("result.csv", "w") as f:
  f.write("title; price; rating; reviews; image\n")
with open("links.csv", "r+") as links:
  for link in links:
    try:
        html_text = requests.get(link.strip()).text
        soup = BeautifulSoup(html_text, 'lxml')
        data = json.loads(soup.find('script', attrs={'type': 'application/ld+json'}).text)
        title = data['name']
        price = data['offers']['price']
        reviews = data['aggregateRating']['reviewCount']
        rating = data['aggregateRating']['ratingValue']
        image = data['image']
        with open("result.csv", "a") as f:
            f.write(str(title)+"; "+str(price)+"; "+str(reviews)+"; "+str(rating)+"; "+str(image)+"\n")
    except Exception as e:
        print("There is no data")

Walmart Anti-Bot Protection

However, as mentioned at the beginning of the article, the service monitors activity that looks like bot behavior and blocks it, offering to solve a captcha instead. So instead of a page with data, you will sometimes get a different page:

Walmart 'Press & Hold' CAPTCHA

This is an offer to solve a captcha. To reduce the chance of seeing it, keep your request rate low and make your requests look like those of a regular browser.

You can also add headers to the code, which will slightly reduce the chance of blocking:

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"}

Then the request will look a little different:

    html_text = requests.get(link, headers=headers).text
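Beyond a single fixed header, you can also rotate the User-Agent between requests and pause between them. A sketch, with a hypothetical pool of UA strings (any real browser strings will do):

```python
import random

# Hypothetical pool of desktop browser User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
]

def pick_headers():
    # A fresh User-Agent for every request.
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = pick_headers()
delay = random.uniform(1, 3)  # pause 1-3 seconds between requests
# time.sleep(delay); then: requests.get(link, headers=headers)
```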

Unfortunately, even this does not always help to avoid blocking. This is where a web scraping API comes in.

Scrape Walmart Product Data Using the Web Scraping API

Let's look at how to use a web scraping API that takes care of avoiding blocks. Most of the script described earlier will not change; we will only modify the part that fetches the page.

Create a scrape-it.cloud account; you will receive 1,000 free credits. You will need an API key, which you can find in the Dashboard section of your account:

Get Scrape-It.Cloud API key on the dashboard

We will also need the JSON library, so let's add it to the project:

import json

Now let's assign the API value to the URL variable, and put the API key and content type in the headers:

url = "https://api.scrape-it.cloud/scrape"
headers = {
  'x-api-key': 'YOUR-API-KEY',
  'Content-Type': 'application/json'
}

To add a dynamic query for each link, build a Python dictionary and serialize it with json.dumps; this is more reliable than assembling the JSON string by hand (and keeps the booleans in valid JSON form):

    payload = json.dumps({
      "url": link.strip(),
      "block_resources": False,
      "wait": 0,
      "screenshot": True,
      "proxy_country": "US",
      "proxy_type": "datacenter"
    })
    response = requests.request("POST", url, headers=headers, data=payload)

Get the page code of the Walmart page from the "content" field of the response:

html_text = json.loads(response.text)["scrapingResult"]["content"]

Let's look at the complete code:

from bs4 import BeautifulSoup
import requests
import json

url = "https://api.scrape-it.cloud/scrape"
headers = {
  'x-api-key':  'YOUR-API-KEY',
  'Content-Type': 'application/json'
}

with open("result.csv", "w") as f:
  f.write("title; price; rating; reviews; image\n")
with open("links.csv", "r+") as links:
  for link in links:
    payload = json.dumps({
      "url": link.strip(),
      "proxy_country": "US",
      "proxy_type": "datacenter"
    })
    response = requests.request("POST", url, headers=headers, data=payload)
    try:
        html_text = json.loads(response.text)["scrapingResult"]["content"]
        soup = BeautifulSoup(html_text, 'lxml')
        data = json.loads(soup.find('script', attrs={'type': 'application/ld+json'}).text)
        title = data['name']
        price = data['offers']['price']
        reviews = data['aggregateRating']['reviewCount']
        rating = data['aggregateRating']['ratingValue']
        image = data['image']
        with open("result.csv", "a") as f:
            f.write(str(title)+"; "+str(price)+"; "+str(reviews)+"; "+str(rating)+"; "+str(image)+"\n")
    except Exception as e:
        print("There is no data")

In addition to the specified attributes, others can be set in the request body. You can read more in the documentation or try it in your account on the web scraping API tab.

Scraping Walmart using API

You can make the request visually and then get the code in one of the supported programming languages.

Conclusion and Takeaways

Finding and scraping data to be used for e-commerce is a time-consuming task, especially when done manually. It's also hard to know whether it's accurate, objective data that will help you make the right business decisions.

Without high-quality data, it's hard to develop and implement the marketing strategies that your business needs to grow, succeed, and meet clients' needs. It can also hurt your reputation if customers or partners discover that your data is unreliable.

Scraping Walmart product data with Python is a great way to quickly gather large amounts of valuable information about the products available on its website. Using libraries like Requests and BeautifulSoup, developers can extract specific pieces of information from a page and turn them into structured datasets ready for use in applications or analysis projects. And using a web scraping API to automate the collection process means you no longer need to worry about blocking, captchas, proxies, headers, and much more.

Valentina Skakun

I'm a technical writer who believes that data parsing can help in getting and analyzing data. I'll tell about what parsing is and how to use it.
