Amazon: Category Stats — Data Analysis Interview Q&A (2026)

Category Stats

Amazon ☯️ Medium Spark DataFrame Joins Aggregation GroupBy

Beginner Mode

Start your terminal to use beginner mode.

Objective

You are analyzing order data for an e-commerce platform that stores product information and order records in separate tables.

Task

Join the products DataFrame with the orders DataFrame on product_id, then group by product category to calculate two metrics: the average price of products ordered (avg_price) and the total number of orders (total_orders_count). The average price should reflect every individual order row, not just distinct products. Save your result as result_df.

File Path

Products: /home/interview/products.csv
Orders: /home/interview/orders.csv
Starter script: /home/interview/category_stats.py

Schema

products.csv

Column	Type
product_id	integer
category	string
price	float

orders.csv

Column	Type
order_id	integer
product_id	integer
quantity	integer

Expected output schema

Column	Type
category	string
avg_price	float
total_orders_count	integer

Example

Given this sample input:

products

product_id	category	price
1	Apparel	25.99
2	Apparel	35.99
3	Footwear	50.00

orders

order_id	product_id	quantity
101	1	2
102	1	1
103	2	3
104	3	1

The output would be:

category	avg_price	total_orders_count
Apparel	29.32	3
Footwear	50.00	1

Apparel has three order rows (two for product 1 at $25.99, one for product 2 at $35.99), so avg_price = (25.99 + 25.99 + 35.99) / 3 = 29.32. Footwear has one order for product 3 at $50.00.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

products_df = spark.read.csv("/home/interview/products.csv", header=True, inferSchema=True)
orders_df = spark.read.csv("/home/interview/orders.csv", header=True, inferSchema=True)

joined_df = products_df.join(orders_df, on="product_id")

result_df = joined_df.groupBy("category").agg(
    F.avg("price").alias("avg_price"),
    F.count("order_id").alias("total_orders_count")
)

# --- Do not edit below this line ---
result_df.coalesce(1).write.csv("/home/interview/output", header=True, mode="overwrite")
spark.stop()

Explanation

Step 1: Loading the DataFrames

products_df = spark.read.csv("/home/interview/products.csv", header=True, inferSchema=True)
orders_df = spark.read.csv("/home/interview/orders.csv", header=True, inferSchema=True)

Each CSV is loaded as its own DataFrame. inferSchema=True tells Spark to detect column types automatically, so price becomes a double and product_id becomes an integer rather than everything being treated as strings.

Step 2: Joining Products with Orders

joined_df = products_df.join(orders_df, on="product_id")

This performs an inner join on product_id, matching each order row to its corresponding product. After the join, every row represents one order and carries the product's category and price alongside the order details. This is important because the same product can appear in multiple order rows, and we want the price counted once per order row, not once per distinct product.

Step 3: Grouping and Aggregating

result_df = joined_df.groupBy("category").agg(
    F.avg("price").alias("avg_price"),
    F.count("order_id").alias("total_orders_count")
)

groupBy("category") collects all rows sharing the same category. Inside agg(), F.avg("price") computes the mean of the price column across all rows in the group. Since the join duplicated the price for every order row, this naturally gives us the per-order average rather than the per-product average. F.count("order_id") counts how many order rows exist in each category. The .alias() calls rename the output columns to match the expected schema.

Step 4: Why avg Works This Way

It is worth understanding why F.avg("price") gives the correct result here. After the join, if Product A ($25.99) has 2 orders and Product B ($35.99) has 1 order in the same category, the joined DataFrame has 3 rows for that category with prices [25.99, 25.99, 35.99]. The average of those 3 values is (25.99 + 25.99 + 35.99) / 3 = 29.32. If you had averaged only the distinct product prices, you would get (25.99 + 35.99) / 2 = 30.99, which is not what we want.

Terminal requires a larger screen

Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.

Linux Terminal Environment

Write and execute your solution in the terminal below.

Track

	Question	Difficulty	Company	Access

Need more practice in this area? Explore more questions →