Start your terminal to use beginner mode.
Objective
In a manufacturing company, data is constantly collected about products and manufacturing processes. You are given two DataFrames: the first contains information about the products, and the second logs the manufacturing processes they undergo.
Task
Write a PySpark function that removes any exact duplicate rows from both DataFrames, and then combines them using their ProductID. Save the resulting DataFrame as result_df. Ensure the final columns match the order specified in the output schema.
File Path
- Products Dataset:
/home/interview/products.csv - Processes Dataset:
/home/interview/processes.csv - Starter script:
/home/interview/factory_cleanup.py
Schema
products.csv
| Column Name | Data Type | Description |
|---|---|---|
| ProductID | integer | Unique identifier for each product |
| ProductName | string | Name of the product |
| Category | string | Category of the product |
processes.csv
| Column Name | Data Type | Description |
|---|---|---|
| ProcessID | integer | Unique identifier for each manufacturing process |
| ProductID | integer | Identifier for the product associated with the process |
| ProcessName | string | Name of the manufacturing process |
| Duration | float | Duration of the process in hours |
Expected Output Schema
| Column Name | Data Type | Description |
|---|---|---|
| ProductID | integer | Unique identifier for each product |
| ProductName | string | Name of the product |
| Category | string | Category of the product |
| ProcessID | integer | Unique identifier for each manufacturing process |
| ProcessName | string | Name of the manufacturing process |
| Duration | float | Duration of the process in hours |
Example
Given this sample input:
products_df
| ProductID | ProductName | Category |
|---|---|---|
| 1 | Widget A | Type1 |
| 2 | Widget B | Type1 |
| 3 | Widget C | Type2 |
| 4 | Widget D | Type2 |
| 1 | Widget A | Type1 |
manufacturing_processes_df
| ProcessID | ProductID | ProcessName | Duration |
|---|---|---|---|
| 1001 | 1 | Cutting | 1.5 |
| 1002 | 2 | Cutting | 1.6 |
| 1003 | 3 | Cutting | 1.8 |
| 1004 | 4 | Cutting | 1.5 |
| 1005 | 1 | Shaping | 2.0 |
The output would be:
| ProductID | ProductName | Category | ProcessID | ProcessName | Duration |
|---|---|---|---|---|---|
| 1 | Widget A | Type1 | 1001 | Cutting | 1.5 |
| 1 | Widget A | Type1 | 1005 | Shaping | 2.0 |
| 2 | Widget B | Type1 | 1002 | Cutting | 1.6 |
| 3 | Widget C | Type2 | 1003 | Cutting | 1.8 |
| 4 | Widget D | Type2 | 1004 | Cutting | 1.5 |
Notice that the duplicate "Widget A" row in the products table was removed before the join, preventing duplicate rows in the final output.
Terminal requires a larger screen
Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.
Linux Terminal Environment
Write and execute your solution in the terminal below.
Track
| Question | Difficulty | Company | Access |
|---|
Amazon