Scenario
A website has a product listing page with links to individual product detail pages. You need to scrape data from both the list page and detail pages to create a complete dataset.
Task
Write a Python script at /home/interview/scrape_products.py that scrapes the product listing at http://shop.local/products/, follows links to individual product pages, extracts information from both pages, and saves the combined data to /home/interview/products.csv.
Note: The beautifulsoup4 and requests packages are already installed.
Example
Expected output in /home/interview/products.csv:
id,name,price,brand,description,stock_status,rating
1,Wireless Mouse,$29.99,TechBrand,High-precision wireless mouse...,In Stock,4.5
2,Mechanical Keyboard,$89.99,KeyMaster,RGB mechanical keyboard...,In Stock,4.8
...
Step 1: Explore the website structure
curl http://shop.local/products/
The listing page shows product cards in a grid; each card carries the product name, the price, and a link to the detail page. Each detail page adds the product ID, brand, description, stock status, and rating.
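Before writing selectors, it can help to tally which class names actually appear in the markup. The following is a minimal, stdlib-only sketch; the `ClassCounter` and `count_classes` names are illustrative, and `SAMPLE_HTML` is a hypothetical stand-in modeled on the selectors the script uses, not the real page. Paste in the output of the curl command above to inspect the actual structure.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Tally (tag, class) pairs seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for attr_name, value in attrs:
            if attr_name == 'class' and value:
                # An element may carry several space-separated classes
                for cls in value.split():
                    self.counts[(tag, cls)] += 1


def count_classes(html):
    """Return a Counter mapping (tag, class) to number of occurrences."""
    parser = ClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical markup, assumed for illustration only
SAMPLE_HTML = """
<div class="product-card">
  <div class="product-name">Wireless Mouse</div>
  <div class="price">$29.99</div>
  <a class="btn" href="/products/1">View Details</a>
</div>
<div class="product-card">
  <div class="product-name">Mechanical Keyboard</div>
  <div class="price">$89.99</div>
  <a class="btn" href="/products/2">View Details</a>
</div>
"""

if __name__ == '__main__':
    for (tag, cls), n in count_classes(SAMPLE_HTML).most_common():
        print(f'{tag}.{cls}: {n}')
```

A class that appears once per product (here, product-card) is usually the right element to loop over.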
Step 2: Create the scraping script
nano /home/interview/scrape_products.py
Write a script that scrapes both the list page and detail pages:
import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Scrape the main product listing page
list_url = 'http://shop.local/products/'
response = requests.get(list_url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.content, 'html.parser')

products = []

# Find all product cards
for card in soup.find_all('div', class_='product-card'):
    # Fields available on the listing page
    name = card.find('div', class_='product-name').text.strip()
    price = card.find('div', class_='price').text.strip()
    detail_link = card.find('a', class_='btn')['href']

    # Build full URL for detail page (works for relative and absolute hrefs)
    detail_url = urljoin(list_url, detail_link)

    # Scrape the detail page
    detail_response = requests.get(detail_url, timeout=10)
    detail_response.raise_for_status()
    detail_soup = BeautifulSoup(detail_response.content, 'html.parser')

    # Extract additional details from product-id div:
    # keep the part before the first '|' and drop the 'Product ID:' label
    product_id_text = detail_soup.find('div', class_='product-id').text
    product_id = product_id_text.split('|')[0].replace('Product ID:', '').strip()

    brand = detail_soup.find('div', class_='brand').text.replace('Brand:', '').strip()

    # The description is the text node immediately after the <br> in the description div
    description = detail_soup.find('div', class_='description').find('br').next_sibling.strip()

    stock_div = detail_soup.find('div', class_='stock')
    stock_status = stock_div.text.replace('Status:', '').strip()

    # Take the second whitespace-separated token and keep the part before '/'
    rating_text = detail_soup.find('div', class_='rating').text
    rating = rating_text.split()[1].split('/')[0]

    # Combine data from both pages
    product = {
        'id': product_id,
        'name': name,
        'price': price,
        'brand': brand,
        'description': description,
        'stock_status': stock_status,
        'rating': rating
    }
    products.append(product)

# Save to CSV
with open('/home/interview/products.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['id', 'name', 'price', 'brand', 'description', 'stock_status', 'rating']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(products)

print(f"Scraped {len(products)} products")
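The script relies on each detail field having an exact text layout, so a single page that deviates will raise an exception or produce a wrong value. One possible hardening, sketched below with stdlib only, moves the string handling into small pure functions that can be checked without network access. The helper names and the sample strings such as 'Rating: 4.5/5' are assumptions inferred from the parsing logic in the script, not confirmed page content.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://shop.local/products/'


def strip_label(text, label):
    """Remove a leading label such as 'Brand:' plus surrounding whitespace."""
    text = text.strip()
    if text.startswith(label):
        text = text[len(label):]
    return text.strip()


def parse_product_id(text):
    """Return the ID from text like 'Product ID: 1 | ...' (assumed layout)."""
    return strip_label(text.split('|')[0], 'Product ID:')


def parse_rating(text):
    """Return the number before the first '/', or '' if none is found."""
    match = re.search(r'(\d+(?:\.\d+)?)\s*/', text)
    return match.group(1) if match else ''


def absolute_url(href):
    """Resolve an href from the listing page against the listing URL."""
    return urljoin(BASE_URL, href)
```

In the script, these would replace the inline `split` and `replace` chains, for example `rating = parse_rating(rating_text)`. Returning an empty string on a failed match keeps the row in the CSV and makes the gap visible during verification instead of aborting the whole run.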
Step 3: Run the script
python3 /home/interview/scrape_products.py
Step 4: Verify the output
head /home/interview/products.csv
wc -l /home/interview/products.csv
wc -l should report 51 lines (1 header + 50 products), and head should show every column populated in the first rows.
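head only shows the first ten lines, so an empty field further down would go unnoticed. A short Python check can cover every row. This is a sketch under the assumption that the file uses the seven columns from the example header; `find_problems` is a hypothetical helper name.

```python
import csv

EXPECTED_COLUMNS = ['id', 'name', 'price', 'brand', 'description',
                    'stock_status', 'rating']


def find_problems(csv_path, expected_rows=50):
    """Return a list of problems found in the CSV; an empty list means it looks fine."""
    problems = []
    with open(csv_path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != EXPECTED_COLUMNS:
            problems.append(f'unexpected header: {reader.fieldnames}')
        rows = list(reader)

    if len(rows) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(rows)}')

    # Row numbers count data rows only (the header is not included)
    for row_no, row in enumerate(rows, start=1):
        empty = [col for col in EXPECTED_COLUMNS if not (row.get(col) or '').strip()]
        if empty:
            problems.append(f'row {row_no}: empty {", ".join(empty)}')
    return problems
```

Calling `find_problems('/home/interview/products.csv')` after the script has run should return an empty list when all 50 products were scraped with every column filled.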