Amazon: Factory Data Cleanup — Data Engineering Interview Q&A (2026)

Factory Data Cleanup

Company: Amazon | Difficulty: Medium | Topics: Spark, Drop Duplicates, Joins

Objective

In a manufacturing company, data is constantly collected about products and manufacturing processes. You are given two DataFrames: the first contains information about the products, and the second logs the manufacturing processes they undergo.

Task

Write a PySpark script that removes exact duplicate rows from both DataFrames and then joins them on ProductID. Save the resulting DataFrame as result_df, with its columns in the order given in the expected output schema.

File Path

Products Dataset: /home/interview/products.csv
Processes Dataset: /home/interview/processes.csv
Starter script: /home/interview/factory_cleanup.py

Schema

products.csv

Column Name	Data Type	Description
ProductID	integer	Unique identifier for each product
ProductName	string	Name of the product
Category	string	Category of the product

processes.csv

Column Name	Data Type	Description
ProcessID	integer	Unique identifier for each manufacturing process
ProductID	integer	Identifier for the product associated with the process
ProcessName	string	Name of the manufacturing process
Duration	float	Duration of the process in hours

Expected Output Schema

Column Name	Data Type	Description
ProductID	integer	Unique identifier for each product
ProductName	string	Name of the product
Category	string	Category of the product
ProcessID	integer	Unique identifier for each manufacturing process
ProcessName	string	Name of the manufacturing process
Duration	float	Duration of the process in hours

Example

Given this sample input:

products_df

ProductID	ProductName	Category
1	Widget A	Type1
2	Widget B	Type1
3	Widget C	Type2
4	Widget D	Type2
1	Widget A	Type1

manufacturing_processes_df

ProcessID	ProductID	ProcessName	Duration
1001	1	Cutting	1.5
1002	2	Cutting	1.6
1003	3	Cutting	1.8
1004	4	Cutting	1.5
1005	1	Shaping	2.0

The output would be:

ProductID	ProductName	Category	ProcessID	ProcessName	Duration
1	Widget A	Type1	1001	Cutting	1.5
1	Widget A	Type1	1005	Shaping	2.0
2	Widget B	Type1	1002	Cutting	1.6
3	Widget C	Type2	1003	Cutting	1.8
4	Widget D	Type2	1004	Cutting	1.5

Notice that the duplicate "Widget A" row in the products table was removed before the join, preventing duplicate rows in the final output.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

products_df = spark.read.csv("/home/interview/products.csv", header=True, inferSchema=True)
processes_df = spark.read.csv("/home/interview/processes.csv", header=True, inferSchema=True)

# Clean both DataFrames by removing duplicate rows
clean_products = products_df.dropDuplicates()
clean_processes = processes_df.dropDuplicates()

# Join the cleaned DataFrames
joined_df = clean_products.join(clean_processes, on="ProductID", how="inner")

# Reorder columns to match the expected schema
result_df = joined_df.select(
    "ProductID", "ProductName", "Category", 
    "ProcessID", "ProcessName", "Duration"
)

# --- Do not edit below this line ---
result_df.coalesce(1).write.csv("/home/interview/output", header=True, mode="overwrite")
spark.stop()

Explanation

Step 1: Removing Duplicates

clean_products = products_df.dropDuplicates()
clean_processes = processes_df.dropDuplicates()

Before joining, make sure the data is clean. When a join key appears in duplicate rows, each duplicate matches every corresponding row on the other side, so duplicates multiply against each other and inflate row counts and any metrics computed from them. PySpark's .dropDuplicates() method, when called without arguments, compares every column and keeps a single copy of each set of identical rows; .distinct() does the same thing.

Step 2: Joining on ProductID

joined_df = clean_products.join(clean_processes, on="ProductID", how="inner")

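To combine the product details with their respective manufacturing steps, we use a standard .join(). Because ProductID has the same name in both DataFrames, we pass that name as the on parameter, which also keeps a single ProductID column in the result. An inner join keeps only rows whose ProductID appears in both tables: products with no process and processes with no matching product are dropped.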

Step 3: Selecting and Reordering Columns

result_df = joined_df.select(
    "ProductID", "ProductName", "Category", 
    "ProcessID", "ProcessName", "Duration"
)

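When you join on a column name, PySpark places the join key (ProductID) first, followed by the remaining columns of the left DataFrame and then those of the right DataFrame. Here that already matches the requested order, but it depends on which DataFrame is on the left and on the column order in the source files. An explicit .select() at the very end makes the final output match the requested schema regardless.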
