The Complete Guide to Scraping Amazon Product Data using Python

Posted on
Aug 22, 2022

Data scraping can be used to track market conditions and competitor prices and to respond to changes in a timely manner. Scraping Amazon data is relevant both for those who list their goods on the site and for those who do dropshipping.

There are several tools that can help to scrape data from Amazon:

  1. Web scraping services. Anyone can scrape Amazon data without special skills, but these services come at a high cost.
  2. Writing a scraper in a programming language such as Python, NodeJS, or C#. This requires special knowledge and skills, and involves additional expenses (proxies and CAPTCHA-solving services will be needed).
  3. Using a web scraping API, which combines the advantages of the first two options.

Using the web scraping API is one of the best options for scraping data from Amazon. You can also use our no-code Amazon scraper. These solutions have a reasonably low cost (compared to web scraping services) and don't require as much special knowledge as creating a scraper from scratch.


Scraping Amazon Data Using Web Scraping API

To use the web scraping API, sign up for Scrape-It.Cloud. To do this, click "Try for Free" and fill in your login information. Now you can create a simple request to scrape a URL. Go to the "Request Builder" tab and enter the address of the Amazon category page you want to scrape in the URL field:

Run the request from the site by clicking "Run Script". However, this returns the full page code, not specific data. To get specific data, let's examine the page to be scraped.

Build the Request for Scraping Data

Let's collect data (title and price) from an Amazon page listing books. On this page, open DevTools (press F12, or right-click an empty area of the page and choose Inspect).

Now let's find the CSS selectors for the title of the book and its price. To view the code of a specific element on the page, click the element-selection button or press Ctrl+Shift+C. Then choose the title of the book.

Click on the element's code to get the element's CSS selector:

The resulting CSS selector is:

h2 > a > span

Similarly, get the CSS selector for the price:

a.a-size-base:nth-child(1) > span.a-price:nth-child(2) > span:nth-child(1)

Return to the site and use Extraction Rules to extract the required data:

Now the query returns data on the name and cost of goods. To reduce the risk of blocking, let's use US residential proxies:

To improve the request, let's use Python and add features such as moving to a new page and saving data to a CSV file.

Use Python to Scrape Amazon Product Data with Web Scraping API

First, install the libraries. This example uses four of them:

  1. Requests. Used to make requests to the site.
  2. JSON. Needed to process JSON data.
  3. Pandas. A library for working with data; it will be used to write the collected information to a file.
  4. Time. Used to add a small delay after each request.

The time and json libraries are built into Python. If pip itself is outdated, update it first from the command line:

pip install --upgrade pip

To install the Requests and Pandas libraries, use the commands:

pip install requests
pip install pandas

Get the Request in Python

To get the request in Python so it can be improved further, open the "Python" tab in the Request Builder:

Here the http.client library is used instead of Requests. Either works, but for more practice let's rebuild the request with the Requests library: it is more popular, easier to work with, and has a large online community.

For more experience with requests, download Postman. Copy the request from the Request Builder on the cURL tab. In Postman, click Import and paste the copied request on the Raw tab:

Then press Import and go to the code snippet:

Select Python Requests from the drop-down menu to view code using this library:

So, the request code for scraping the price and name of Amazon products in Python using the Requests library will be:

import requests
import json

url = "https://api.scrape-it.cloud/scrape"

payload = json.dumps({
  "extract_rules": {
    "Title": "h2 > a > span",
    "Price": "a.a-size-base:nth-child(1) > span.a-price:nth-child(2) > span:nth-child(1)"
  },
  "wait": 0,
  "screenshot": True,
  "block_resources": False,
  "window_height": 1080,
  "window_width": 1920,
  "url": "https://www.amazon.com/s?k=books&i=stripbooks&page=1"
})
headers = {
  'x-api-key': 'YOUR-API-KEY',
  'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)

Let's see what exactly this code does:

  1. Imports the Requests and JSON libraries.
  2. Sets the API URL.
  3. Serializes the request body and saves it to the payload variable.
  4. Sets the headers. Put your API key from the dashboard here.
  5. Executes the request.
  6. Prints the results to the console.

The screenshot, block_resources, window_height, and window_width parameters are optional and can be removed.
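Since those parameters are optional, a minimal request body containing only the extraction rules and the target URL would look like this (a sketch using the same selectors found above):

```python
import json

# Minimal request body: only the extraction rules and the target URL.
payload = json.dumps({
    "extract_rules": {
        "Title": "h2 > a > span",
        "Price": "a.a-size-base:nth-child(1) > span.a-price:nth-child(2) > span:nth-child(1)"
    },
    "url": "https://www.amazon.com/s?k=books&i=stripbooks&page=1"
})

print(payload)
```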

Go to the Next Page of Amazon with Python

Let's improve the basic request. Look at the Amazon site and go to the next few pages to see what exactly changes in the address:

https://www.amazon.com/s?k=books&i=stripbooks&page=1
https://www.amazon.com/s?k=books&i=stripbooks&page=2
https://www.amazon.com/s?k=books&i=stripbooks&page=3
…

To handle pagination, create a loop that changes the page number after each request. Python uses a for ... in ... construction for this:

for i in range(1, 15):

Here 1 is the first page number and 14 (15 - 1) is the last. That is, the variable i is first assigned the value 1; after each pass through the loop body, i increases by 1, and so on until it reaches 14, after which the loop exits. Put the entire request (except the library imports) inside the for loop, indented by one level (4 spaces is the Python convention).

Let's add the base_url variable, which will store the unchanged part of the URL:

base_url = "https://www.amazon.com/s?k=books&i=stripbooks"

Build the request body in a temp variable so that the page number changes with the value of i. Then validate and re-serialize it into the payload variable:

temp = """{
    "extract_rules": {
        "Title": "h2 > a > span",
        "Price": "a.a-size-base:nth-child(1) > span.a-price:nth-child(2) > span:nth-child(1)"
    },
    "wait": 0,
    "url": """ + "\"" + base_url + "&page={0}".format(i) + "\"" + """,
    "proxy_country": "US",
    "proxy_type": "residential"
}"""
payload = json.dumps(json.loads(temp))

After executing the request on the current page, an automatic transition to the next one will be performed, and so on up to page 14 inclusive. To make it clear on which page the code is currently being executed, let's add the output to the console of the current page:

print('Processing {0}...'.format('page {0}'.format(i)))

When the query is executed, the line Processing page i... is displayed, where i is replaced by the current page number.
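As a side note, building JSON by concatenating strings is error-prone; an equivalent and arguably cleaner sketch builds the body as a Python dict and lets json.dumps handle all the quoting (same fields as above):

```python
import json

base_url = "https://www.amazon.com/s?k=books&i=stripbooks"

for i in range(1, 15):
    # json.dumps handles the quoting that the string version does by hand.
    payload = json.dumps({
        "extract_rules": {
            "Title": "h2 > a > span",
            "Price": "a.a-size-base:nth-child(1) > span.a-price:nth-child(2) > span:nth-child(1)"
        },
        "wait": 0,
        "url": "{0}&page={1}".format(base_url, i),
        "proxy_country": "US",
        "proxy_type": "residential"
    })
    print('Processing {0}...'.format('page {0}'.format(i)))
```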

Saving Amazon Product Data to .csv File

The code above prints the received data to the console but does not save it anywhere; even the value in the response variable is overwritten on every request.

To be able to access the data later, let's save it to a CSV file. To do this, import the Pandas library under the conventional alias pd (in the same place where the other libraries are imported):

import pandas as pd

After the data has been received from the request, let's add a variable that will store it:

data = json.loads(response.text)

Now the data can be worked with. Consider the JSON structure of the response:

{
  "status": "ok",
  "scrapingResult": {
    "extractedData": {
      "Title": [
        "Things We Never Got Over"
      ],
      "Price": [
        "$13.89"
      ]
    },
    "content": "<!DOCTYPE html>...</html>",
    "headers": {
      "accept-ch": "ect,rtt,downlink,device-memory,sec-ch-device-memory,viewport-width,sec-ch-viewport-width,dpr,sec-ch-dpr",
      ...
      "x-xss-protection": "1;"
    },
    "cookies": [
      {
        "name": "ubid-main",
        "value": "131-6441959-9790827",
        "domain": ".amazon.com",
        "path": "/",
        "expires": 1692096501.073783,
        "size": 28,
        "httpOnly": false,
        "secure": true,
        "session": false,
        "sameParty": false,
        "sourceScheme": "Secure",
        "sourcePort": 443
      }
    ]
  }
}

In this case, the data needed is:

  "scrapingResult": {
    "extractedData": {
      "Title": [
        "Things We Never Got Over"
      ],
      "Price": [
        "$13.89"
      ]
    }
  }
The required data is in data['scrapingResult']['extractedData']['Title'] and data['scrapingResult']['extractedData']['Price']. Get all the strings from the Title attribute and append them one by one to the title list:

for item in data['scrapingResult']['extractedData']['Title']:
    title.append([item])

In the same way, put the data in the price list:

for item in data['scrapingResult']['extractedData']['Price']:
    price.append([item])
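Note that title and price must exist before anything is appended to them; initialize them as empty lists once, before the page loop (a small but necessary detail the snippets above assume). The Title/Price values below are a made-up stand-in shaped like the API response:

```python
# Create the accumulator lists once, before the page loop starts.
title = []
price = []

# Hypothetical extracted data, shaped like the API response shown above.
extracted = {"Title": ["Things We Never Got Over"], "Price": ["$13.89"]}

for item in extracted["Title"]:
    title.append([item])
for item in extracted["Price"]:
    price.append([item])

print(title, price)
```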

After that, put the data one by one into new columns in a .csv file using the Pandas library:

df = pd.DataFrame(title, columns=['title'])
df['price'] = pd.DataFrame(price, columns=['price'])
df.to_csv('file.csv',encoding='utf-8-sig', mode='a', header=False,  index=False)

It is important to open the file in append mode ("a"), which means all new data is added to the end of the file. Otherwise, the file would be overwritten every time the request is executed.
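Because header=False suppresses the column names on every append, one option (a sketch, not part of the original code) is to write the header row once, before the loop, and then append data rows to it:

```python
import pandas as pd

# Write an empty DataFrame carrying only the column names once ("w" mode).
pd.DataFrame(columns=['title', 'price']).to_csv(
    'file.csv', encoding='utf-8-sig', mode='w', index=False)

# Inside the loop, rows are then appended without repeating the header.
df = pd.DataFrame({'title': ['Things We Never Got Over'], 'price': ['$13.89']})
df.to_csv('file.csv', encoding='utf-8-sig', mode='a', header=False, index=False)
```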

Using Exceptions to Catch Errors

If the query fails on one of the pages, the program will exit. To prevent this from happening, use a try ... except block.

The errors most often encountered during execution are:

  1. An empty value. The site returned an empty value, and Python cannot process it.
  2. KeyError. The site returned the request with an error, meaning it was not completed, so the expected keys are missing from the response.

To avoid the first error, use a conditional if statement to test for the content of data['scrapingResult']['extractedData']['Title'] and data['scrapingResult']['extractedData']['Price'], and do the processing values only if the content is not empty:

if (data['scrapingResult']['extractedData']['Title'] is not None) and (data['scrapingResult']['extractedData']['Price'] is not None):

To catch the situation when the request was not executed, add a try ... except block around all data processing and saving operations.

    try:
        if (data['scrapingResult']['extractedData']['Title'] is not None) and (data['scrapingResult']['extractedData']['Price'] is not None):
            for item in data['scrapingResult']['extractedData']['Title']:
                title.append([item])
            for item in data['scrapingResult']['extractedData']['Price']:
                price.append([item])
            df = pd.DataFrame(title, columns=['title'])
            df['price'] = pd.DataFrame(price, columns=['price'])
            df.to_csv('file.csv',encoding='utf-8-sig', mode='a', header=False,  index=False)
        else:
            print('Error {0}...'.format('page {0}'.format(i)))
    except KeyError:
        print('Key error {0}...'.format('page {0}'.format(i)))

With this approach, even if an exception is encountered during the execution of the program, the program will not stop processing but will skip the page and move on to the next one.
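Besides catching KeyError, the status field of the response (visible in the JSON structure shown earlier) can also be checked before any processing. A minimal sketch with a hypothetical failed response:

```python
import json

# Hypothetical raw API response for a request that failed.
raw = '{"status": "error"}'
data = json.loads(raw)

# Process the page only if the API reports success.
if data.get("status") != "ok":
    print("Request failed, skipping this page")
```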

Adding a Delay Between Requests

There are two ways to add a delay between requests: change the wait in the body of the request, or use the built-in library time.

The delay can be set on the site in the Request Builder.

Or change the wait value:

  "extract_rules": {
    "Title": "h2 > a > span",
    "Price": "a.a-size-base:nth-child(1) > span.a-price:nth-child(2) > span:nth-child(1)"
  },
  "wait": 10000,

Changing the value of wait is trivial, so let's instead look at how to use the sleep function from the built-in time library.

Add the sleep function import:

from time import sleep

Then at the end of the program add a delay command for 2 seconds (or any other time):

sleep(2)

And to make the delay random, use the built-in random library:

from random import randint

And change the delay to random in the range from 2 to 10 seconds:

sleep(randint(2,10))

Now, after executing the request, the program will "fall asleep" for a random time from 2 to 10 seconds.

Scraping More Product Details

So, we now have a fully working scraper that crawls all the category pages and collects the available product data. However, this information may not be enough. Let's refine the code so that the scraper works like this:

  1. Set the category link and the number of pages to crawl.
  2. The scraper collects product links from the entire page.
  3. A crawl is performed on all collected links.
  4. The following data is collected from each product page:
    1. Product name.
    2. Price.
    3. Description.
    4. Rating.
    5. Number of reviews.
    6. Link to the image.
  5. The data is saved to a file.
  6. The scraper moves to the next page.

Let's start by analyzing the Amazon product page. The product name is in the <h1> tag and has the id “title”:

That is, the name of the product can be found using the #title selector.

To simplify the rest of the elements, let's use the browser's built-in function to copy the CSS selector:

So the resulting selectors look like this:

  1. Product name - #title.

  2. The price - #a-autoid-8-announce > span.a-color-base.

  3. The description - #bookDescription_feature_div.

  4. Rating - #acrPopover > span.a-declarative > a > i.a-icon.a-icon-star.

  5. The number of reviews - #acrCustomerReviewText.

  6. Link to image. It can't be captured with the extraction rules, so a little later we'll use regular expressions to find it.

Once the CSS selectors are known, it is advisable to check them. This can be done from the browser console in DevTools on the Console tab. Use the command $("selector") to find the first element matching that selector on the page, and $$("selector") to find all matching elements.

And set up extraction rules:

{
  "Title": "#title",
  "Price": "#a-autoid-8-announce > span.a-color-base",
  "Description": "#bookDescription_feature_div",
  "Rating": "#acrPopover > span.a-declarative > a > i.a-icon.a-icon-star",
  "Review": "a > #acrCustomerReviewText"
}

Let's change the code so that links to products are collected from each category page, and then each product page is visited and scraped in turn.

Collecting ASINs with Regular Expressions

Each product on Amazon has its own unique code, the ASIN. To go to a product page, it is enough to know this code and substitute it into the link https://www.amazon.com/dp/ASIN, where ASIN stands for the product's unique code.
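For example, with a made-up ASIN, the product URL is built like this:

```python
# Hypothetical ASIN used purely for illustration.
asin = "B0B1CGFYDS"
product_url = "https://www.amazon.com/dp/" + asin
print(product_url)  # → https://www.amazon.com/dp/B0B1CGFYDS
```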

When a request is made to the scrape-it.cloud API, the response also contains the content attribute, which holds the full code of the page from which the data was collected. The unique product codes can be extracted from this attribute with regular expressions.

Let's look at the products page.

All product information is stored in a div tag with a data-asin attribute that contains the product's unique code. The structure of each element is identical, so to get all the codes it is enough to find, in the page code, all the text between div data-asin=" and " data-index. This is easy to do with regular expressions.

To use regular expressions, import the re library:

import re

Then do a search. A text search is performed with the following command:

re.findall(r'WHAT-TO-FIND', WHERE-TO-SEARCH)

Let's fill in the pattern and put the result in the asin variable:

asin = re.findall(r'div\sdata-asin=\W(\S+?)\W\sdata-index=\W', data['scrapingResult']['content'])

This command means:

  1. Find all occurrences of text of the form div data-asin="..." data-index=", capturing the characters between the quotes.
  2. Store the resulting list in the asin variable.

The special characters \W, \S, \s, and +? were used in re.findall:

  • \W matches any character that is not a letter, digit, or underscore (here, the quote marks).
  • \S matches any non-whitespace character.
  • \s matches any whitespace character (space, tab, end of line, etc.).
  • +? matches as few characters as possible (a non-greedy quantifier), since the length of the code is not known in advance.
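A quick sanity check of this pattern on a fragment shaped like Amazon's listing markup (the ASIN here is made up):

```python
import re

# A fragment shaped like the listing markup described above (hypothetical ASIN).
html = '<div data-asin="B09ABC1234" data-index="1" class="s-result-item">'

asin = re.findall(r'div\sdata-asin=\W(\S+?)\W\sdata-index=\W', html)
print(asin)  # → ['B09ABC1234']
```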

Processing each page With Product Data

First, set up a crawl over each product page. To do this, use for ... in ... again. This time the number of iterations is the number of goods, which matches the number of unique codes stored in asin:

for j in range(len(asin)):

This iterates over every index from 0 up to len(asin) - 1, that is, over every collected code. (Note that range(1, len(asin)) would skip the first product, since Python indexes from 0.)

The page processing code differs from the category-crawling code only in the extract_rules field and the larger number of columns to save. So let's look at the image link and compose a regular expression for it as well. The element's code:

<img alt="The Reunion by [Kiersten Modglin]" src="https://m.media-amazon.com/images/I/412Jcikm7xL.jpg" onload="this.onload='';setCSMReq('af');" data-a-image-name="ebooksImageBlockFront" class="a-dynamic-image frontImage" id="ebooksImgBlkFront" width="217px" data-a-dynamic-image="{&quot;https://m.media-amazon.com/images/I/412Jcikm7xL._SY346_.jpg&quot;:[217,346],&quot;https://m.media-amazon.com/images/I/412Jcikm7xL.jpg&quot;:[313,500]}" data-a-manual-replacement="true" style="height: 266.667px; width: 167.245px; overflow: hidden; position: relative; top: 0px; left: 0px;">

Final regular expression:

image = re.findall(r'src=\W(\S+?)\Wonload', data['scrapingResult']['content'])
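One caveat: tested against the <img> tag above, which has a space before onload, the lazy group can end up swallowing the closing quote. A slightly stricter variant that anchors on the quotes themselves avoids this (a sketch, run on a shortened version of that tag):

```python
import re

# Shortened version of the <img> tag shown above.
html = ('<img src="https://m.media-amazon.com/images/I/412Jcikm7xL.jpg" '
        'onload="this.onload=\'\';setCSMReq(\'af\');">')

# Anchor on the quotes so the capture cannot include the closing quote.
image = re.findall(r'src="([^"]+?)"\s*onload', html)
print(image)  # → ['https://m.media-amazon.com/images/I/412Jcikm7xL.jpg']
```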

And the code for crawling the product pages will be the following (note that the href value must be wrapped in quotes so that the body remains valid JSON):

asin = re.findall(r'div\sdata-asin=\W(\S+?)\W\sdata-index=\W', data['scrapingResult']['content'])
for j in range(len(asin)):
    href = "https://www.amazon.com/dp/" + asin[j]
    print(href)

    temp = """{
    "extract_rules": {
        "Title": "#title",
        "Price": "#a-autoid-8-announce > span.a-color-base",
        "Description": "#bookDescription_feature_div",
        "Rating": "#acrPopover > span.a-declarative > a > i.a-icon.a-icon-star",
        "Review": "a > #acrCustomerReviewText"
    },
    "wait": 0,
    "url": """ + "\"" + href + "\"" + """,
    "proxy_country": "US",
    "proxy_type": "residential"
    }"""
    payload = json.dumps(json.loads(temp))

    headers = {
        'x-api-key': 'YOUR-API-KEY',
        'Content-Type': 'application/json'
    }

    response = requests.request("POST", url, headers=headers, data=payload)

    print('Processing {0}...'.format('item {0}'.format(j)))
    data = json.loads(response.text)
    try:
        if (data['scrapingResult']['extractedData']['Title'] is not None) and (data['scrapingResult']['extractedData']['Price'] is not None):
            for item in data['scrapingResult']['extractedData']['Title']:
                title.append([item])
            for item in data['scrapingResult']['extractedData']['Price']:
                price.append([item])
            for item in data['scrapingResult']['extractedData']['Description']:
                descrip.append([item])
            for item in data['scrapingResult']['extractedData']['Rating']:
                rating.append([item])
            for item in data['scrapingResult']['extractedData']['Review']:
                review.append([item])
            df = pd.DataFrame(title, columns=['title'])
            df['price'] = pd.DataFrame(price, columns=['price'])
            df['description'] = pd.DataFrame(descrip, columns=['description'])
            df['rating'] = pd.DataFrame(rating, columns=['rating'])
            df['review'] = pd.DataFrame(review, columns=['review'])
            image = re.findall(r'src=\W(\S+?)\Wonload', data['scrapingResult']['content'])
            df['image'] = pd.DataFrame(image, columns=['image'])
            df.to_csv('file.csv', encoding='utf-8-sig', mode='a', header=False, index=False)
        else:
            print('Error {0}...'.format('item {0}'.format(j)))
    except KeyError:
        print('Key error {0}...'.format('item {0}'.format(j)))

The complete code for scraping Amazon product data can be found on GitHub.

Conclusion and Takeaways

Finally, let's save the code in a .py file and run it from the command line.

A file.csv is created in the folder the script was run from, containing all the collected data:

The data is saved in .csv format, separated by commas:

To process it further, open the file in a spreadsheet application and save it as .xlsx.

Valentina Skakun

I'm a technical writer who believes that data parsing can help in getting and analyzing data. I'll tell you about what parsing is and how to use it.
