Objective
You are analyzing order data for an e-commerce platform that stores product information and order records in separate tables.
Task
Join the products DataFrame with the orders DataFrame on product_id, then group by product category to calculate two metrics: the average price of products ordered (avg_price) and the total number of orders (total_orders_count). The average price should reflect every individual order row, not just distinct products. Save your result as result_df.
File Path
- Products:
/home/interview/products.csv
- Orders:
/home/interview/orders.csv
- Starter script:
/home/interview/category_stats.py
Schema
products.csv
| Column |
Type |
| product_id |
integer |
| category |
string |
| price |
float |
orders.csv
| Column |
Type |
| order_id |
integer |
| product_id |
integer |
| quantity |
integer |
Expected output schema
| Column |
Type |
| category |
string |
| avg_price |
float |
| total_orders_count |
integer |
Example
Given this sample input:
products
| product_id |
category |
price |
| 1 |
Apparel |
25.99 |
| 2 |
Apparel |
35.99 |
| 3 |
Footwear |
50.00 |
orders
| order_id |
product_id |
quantity |
| 101 |
1 |
2 |
| 102 |
1 |
1 |
| 103 |
2 |
3 |
| 104 |
3 |
1 |
The output would be:
| category |
avg_price |
total_orders_count |
| Apparel |
29.32 |
3 |
| Footwear |
50.00 |
1 |
Apparel has three order rows (two for product 1 at $25.99, one for product 2 at $35.99), so avg_price = (25.99 + 25.99 + 35.99) / 3 = 29.32. Footwear has one order for product 3 at $50.00.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
products_df = spark.read.csv("/home/interview/products.csv", header=True, inferSchema=True)
orders_df = spark.read.csv("/home/interview/orders.csv", header=True, inferSchema=True)
joined_df = products_df.join(orders_df, on="product_id")
result_df = joined_df.groupBy("category").agg(
F.avg("price").alias("avg_price"),
F.count("order_id").alias("total_orders_count")
)
# --- Do not edit below this line ---
result_df.coalesce(1).write.csv("/home/interview/output", header=True, mode="overwrite")
spark.stop()
Explanation
Step 1: Loading the DataFrames
products_df = spark.read.csv("/home/interview/products.csv", header=True, inferSchema=True)
orders_df = spark.read.csv("/home/interview/orders.csv", header=True, inferSchema=True)
Each CSV is loaded as its own DataFrame. inferSchema=True tells Spark to detect column types automatically, so price becomes a double and product_id becomes an integer rather than everything being treated as strings.
Step 2: Joining Products with Orders
joined_df = products_df.join(orders_df, on="product_id")
This performs an inner join on product_id, matching each order row to its corresponding product. After the join, every row represents one order and carries the product's category and price alongside the order details. This is important because the same product can appear in multiple order rows, and we want the price counted once per order row, not once per distinct product.
Step 3: Grouping and Aggregating
result_df = joined_df.groupBy("category").agg(
F.avg("price").alias("avg_price"),
F.count("order_id").alias("total_orders_count")
)
groupBy("category") collects all rows sharing the same category. Inside agg(), F.avg("price") computes the mean of the price column across all rows in the group. Since the join duplicated the price for every order row, this naturally gives us the per-order average rather than the per-product average. F.count("order_id") counts how many order rows exist in each category. The .alias() calls rename the output columns to match the expected schema.
Step 4: Why avg Works This Way
It is worth understanding why F.avg("price") gives the correct result here. After the join, if Product A ($25.99) has 2 orders and Product B ($35.99) has 1 order in the same category, the joined DataFrame has 3 rows for that category with prices [25.99, 25.99, 35.99]. The average of those 3 values is (25.99 + 25.99 + 35.99) / 3 = 29.32. If you had averaged only the distinct product prices, you would get (25.99 + 35.99) / 2 = 30.99, which is not what we want.