Factory Data Cleanup
Beginner Mode

Start your terminal to use beginner mode.

Objective

In a manufacturing company, data is constantly collected about products and manufacturing processes. You are given two DataFrames: the first contains information about the products, and the second logs the manufacturing processes they undergo.

Task

Write a PySpark function that removes any exact duplicate rows from both DataFrames, and then combines them using their ProductID. Save the resulting DataFrame as result_df. Ensure the final columns match the order specified in the output schema.

File Path

  • Products Dataset: /home/interview/products.csv
  • Processes Dataset: /home/interview/processes.csv
  • Starter script: /home/interview/factory_cleanup.py

Schema

products.csv

Column Name Data Type Description
ProductID integer Unique identifier for each product
ProductName string Name of the product
Category string Category of the product

processes.csv

Column Name Data Type Description
ProcessID integer Unique identifier for each manufacturing process
ProductID integer Identifier for the product associated with the process
ProcessName string Name of the manufacturing process
Duration float Duration of the process in hours

Expected Output Schema

Column Name Data Type Description
ProductID integer Unique identifier for each product
ProductName string Name of the product
Category string Category of the product
ProcessID integer Unique identifier for each manufacturing process
ProcessName string Name of the manufacturing process
Duration float Duration of the process in hours

Example

Given this sample input:

products_df

ProductID ProductName Category
1 Widget A Type1
2 Widget B Type1
3 Widget C Type2
4 Widget D Type2
1 Widget A Type1

manufacturing_processes_df

ProcessID ProductID ProcessName Duration
1001 1 Cutting 1.5
1002 2 Cutting 1.6
1003 3 Cutting 1.8
1004 4 Cutting 1.5
1005 1 Shaping 2.0

The output would be:

ProductID ProductName Category ProcessID ProcessName Duration
1 Widget A Type1 1001 Cutting 1.5
1 Widget A Type1 1005 Shaping 2.0
2 Widget B Type1 1002 Cutting 1.6
3 Widget C Type2 1003 Cutting 1.8
4 Widget D Type2 1004 Cutting 1.5

Notice that the duplicate "Widget A" row in the products table was removed before the join, preventing duplicate rows in the final output.

Terminal requires a larger screen

Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.

Linux Terminal Environment

Write and execute your solution in the terminal below.

Sign In

Track

Question Difficulty Company Access
Need more practice in this area? Explore more questions →