Comprehensive Project Aggregation
Beginner Mode

Start your terminal to use beginner mode.

Objective

You are working as a data analyst for a large construction company. Your task is to analyze data spanning across three different DataFrames relating to the company's projects, workforce, and machinery.

Task

Write a PySpark function that aggregates the following information for each project:

  1. The project's duration in days.
  2. The total number of employees working on the project.
  3. The number of unique roles across all employees in the project.
  4. The total cost of equipment allocated to the project.

Save the resulting DataFrame as result_df and ensure the columns match the exact order of the Output Schema. Use left joins to ensure no projects are dropped if they are missing employees or equipment. Order the final output by project_id in ascending order.

File Path

  • Projects Dataset: /home/interview/projects.csv
  • Employees Dataset: /home/interview/employees.csv
  • Equipment Dataset: /home/interview/equipment.csv
  • Starter script: /home/interview/project_metrics.py

Schema

projects.csv

Column Name Data Type
project_id int
project_name string
start_date date
end_date date
budget int

employees.csv

Column Name Data Type
employee_id int
first_name string
last_name string
role string
project_id int

equipment.csv

Column Name Data Type
equipment_id int
equipment_name string
project_id int
cost int

Expected Output Schema

Column Name Data Type
project_id int
project_name string
start_date date
end_date date
duration_days int
total_employees int
unique_roles int
total_equipment_cost int

Example

Given this sample input:

projects

project_id project_name start_date end_date budget
1 Skyscraper 2022-01-01 2022-12-31 15000000
2 Bridge 2022-03-01 2022-08-31 5000000
3 Tunnel 2022-06-01 2023-01-31 10000000

employees

employee_id first_name last_name role project_id
1 John Doe Engineer 1
2 Jane Smith Architect 1
3 Jim Brown Project Manager 1
4 Emily Davis Engineer 2
5 Alan Johnson Architect 2

equipment

equipment_id equipment_name project_id cost
1 Crane 1 25000
2 Excavator 1 15000
3 Bulldozer 2 20000
4 Loader 2 10000
5 Crane 3 25000

The output would be:

project_id project_name start_date end_date duration_days total_employees unique_roles total_equipment_cost
1 Skyscraper 2022-01-01 2022-12-31 364 3 3 40000
2 Bridge 2022-03-01 2022-08-31 183 2 2 30000
3 Tunnel 2022-06-01 2023-01-31 244 null null 25000

(Note: Tunnel has equipment but no employees, resulting in null values for those specific metrics).

Terminal requires a larger screen

Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.

Linux Terminal Environment

Write and execute your solution in the terminal below.

Sign In

Track

Question Difficulty Company Access
Need more practice in this area? Explore more questions →