How to scrape Amazon product reviews using Selenium + Python

While working on a project, I needed some test data. Specifically, I needed reviews: short and long, positive and negative, with star ratings and all those nice things. I didn’t want to use random data since it wouldn’t be realistic, so I decided to quickly borrow some from Amazon.

Because I didn’t want to spend too much time on this, my first thought was to google for existing solutions or libraries. I found dozens of repositories on GitHub, but most of them were outdated or didn’t work at all. I also tried the Scrapy library, but it looks like Amazon has some protection against it and forces you to log in on every move.

I gave up and decided to write a simple Python script using Selenium WebDriver. It’s not the best solution, but it works (at least in 2024), and it’s simple and easy to understand.

Installation

First, create a virtual environment and activate it:

python -m venv .venv
source .venv/bin/activate

Second, install the required packages from the requirements.txt file:

pip install -r requirements.txt

The requirements file contains the following packages:

selenium
pandas

We need Selenium to interact with the browser and scrape the data, and Pandas to collect the data in a DataFrame and save it to a JSON file afterwards.
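For reference, each scraped review will end up as a flat record like the one below (all values here are illustrative, not real data):

```json
{
  "asin": "B01N5IB20Q",
  "title": "Great product",
  "text": "Works exactly as described.",
  "rating": "5.0 out of 5 stars",
  "location_and_date": "Reviewed in the United States on January 1, 2024",
  "verified": true,
  "author": "Jane Doe",
  "link": "https://www.amazon.com/gp/customer-reviews/..."
}
```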

Script

Here is the script that scrapes Amazon product reviews:

import random
import time
from typing import List
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement
import pandas as pd

# Base URL of the Amazon website.
base_url = "https://www.amazon.com"
# ASINs of the products to scrape reviews from.
asins = ['B01N5IB20Q']
# Number of pages to scrape reviews from. Set to 0 to scrape all reviews. 
pages = 10
# Result list to store the scraped reviews.
result = []

driver = webdriver.Firefox()
driver.maximize_window()
try:
    for asin in asins:
        url = f"{base_url}/product-reviews/{asin}"
        print(f"Scraping reviews for ASIN: {asin}, URL: {url}...")
        driver.get(url)
        page = 1
        
        while page <= pages or pages == 0:
            try:
                # random sleep to behave more like a human
                time.sleep(random.randint(2, 5))
                review_elements: List[WebElement] = driver.find_elements(By.CSS_SELECTOR, "#cm_cr-review_list div.review")
                for review_element in review_elements:
                    verifiedBadgeCnt = len(review_element.find_elements(By.CSS_SELECTOR, "span[data-hook=avp-badge]"))
                    ratingElem = review_element.find_elements(By.CSS_SELECTOR, "*[data-hook=review-star-rating]>span")
                    reviewTitleElem = review_element.find_elements(By.CSS_SELECTOR, "*[data-hook=review-title]>span")
                    reviewLinkElem = review_element.find_elements(By.CSS_SELECTOR, "a[data-hook=review-title]")
                    authorElem = review_element.find_elements(By.CSS_SELECTOR, "span.a-profile-name")
                    locDateElem = review_element.find_elements(By.CSS_SELECTOR, "span[data-hook=review-date]")
                    item = {
                        'asin': asin,
                        'title': reviewTitleElem[1].text if len(reviewTitleElem) > 1 else None,
                        'text': "".join(map(lambda x: x.text, review_element.find_elements(By.CSS_SELECTOR, "span[data-hook=review-body]"))).strip(),
                        'rating': ratingElem[0].get_attribute("textContent") if ratingElem else None,
                        'location_and_date': locDateElem[0].text if locDateElem else None,
                        'verified': verifiedBadgeCnt > 0,
                        'author': authorElem[0].text if authorElem else None,
                        'link': reviewLinkElem[0].get_attribute("href") if reviewLinkElem else None,
                    }
                    result.append(item)
                page += 1
                next_page_element = driver.find_elements(By.CSS_SELECTOR, ".a-pagination .a-last a")
                if next_page_element:
                    next_page = next_page_element[0]
                    href = next_page.get_attribute("href")
                    print(f"Clicking next page [{href}]")
                    next_page.click()
                else:
                    # No "next page" link means we reached the last page of reviews.
                    break
            except Exception as e:
                print(f"Error scraping page {page} for ASIN {asin}.")
                print(f"Error: {e}")
                break

    print(f"Total reviews scraped: {len(result)}")
    df = pd.DataFrame.from_records(result, columns=['asin', 'title', 'text', 'rating', 'location_and_date', 'verified', 'author', 'link'])
    filename = f"review_{time.strftime('%Y%m%d%H%M%S')}.json"
    df.to_json(filename, orient='records')
    print(f"Saved to {filename}")
finally:
    driver.quit() 

The script is quite simple: it scrapes reviews for the specified ASINs and saves them to a JSON file. Adjust the asins list to scrape reviews for different products. The pages variable controls how many pages of reviews to scrape; set it to 0 to scrape all of them.
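Note that the rating and location_and_date fields come back as raw display strings, e.g. "4.0 out of 5 stars" and "Reviewed in the United States on January 1, 2024". A small post-processing sketch, assuming Amazon’s current English-language formats (these helpers are my own additions, not part of the script above), could normalize them:

```python
import re
from datetime import datetime

def parse_rating(raw):
    """Extract the numeric star rating from a string like '4.0 out of 5 stars'."""
    if not raw:
        return None
    match = re.match(r"([\d.]+) out of 5", raw)
    return float(match.group(1)) if match else None

def parse_location_and_date(raw):
    """Split a string like 'Reviewed in the United States on January 1, 2024'
    into a (location, date) pair."""
    if not raw:
        return None, None
    match = re.match(r"Reviewed in (?:the )?(.+) on (.+)", raw)
    if not match:
        return None, None
    location = match.group(1)
    date = datetime.strptime(match.group(2), "%B %d, %Y").date()
    return location, date
```

If Amazon changes the wording or you scrape a non-English storefront, the patterns above will simply return None instead of crashing.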

To run the script, execute the following command:

python scrape_amazon_reviews.py

Conclusion

Now you have a simple Python script to borrow a few product reviews from Amazon. You can use this data for testing, prototyping, or educational purposes. Keep in mind that scraping data from websites might be against their terms of service, so use it responsibly and don’t abuse it. Happy scraping!