Start your terminal to use beginner mode.
Objective
You are analyzing the Private Equity (PE) landscape. You have three datasets: pe_firms (the parent companies), pe_funds (the financial vehicles managed by the firms), and pe_investments (the individual companies those funds bought into).
Task
Write a PySpark script that combines all three DataFrames into a single comprehensive view.
Because the data is messy, some investments may not map to a known fund, and some funds may not map to a known firm. You must use a join strategy that ensures no records are lost from any of the three tables during the merge. After merging, drop any rows where absolutely every column is null.
Save the resulting DataFrame as result_df. Select the columns in the exact order specified in the schema below.
File Path
- Firms:
/home/interview/pe_firms.csv - Funds:
/home/interview/pe_funds.csv - Investments:
/home/interview/pe_investments.csv - Starter script:
/home/interview/pe_merger.py
Schema
pe_firms.csv
| Column Name | Data Type | Description |
|---|---|---|
| firm_id | integer | the unique identifier of the PE firm |
| firm_name | string | the name of the PE firm |
| founded_year | integer | the year the PE firm was founded |
| location | string | the location of the PE firm |
pe_funds.csv
| Column Name | Data Type | Description |
|---|---|---|
| fund_id | integer | the unique identifier of the PE fund |
| firm_id | integer | the unique identifier of the PE firm |
| fund_name | string | the name of the PE fund |
| fund_size | integer | the size of the PE fund in millions of dollars |
| fund_start_year | integer | the year the PE fund was started |
| fund_end_year | integer | the year the PE fund ended or is expected to end |
pe_investments.csv
| Column Name | Data Type | Description |
|---|---|---|
| investment_id | integer | the unique identifier of the PE investment |
| fund_id | integer | the unique identifier of the PE fund |
| company_name | string | the name of the company receiving the investment |
| investment_amount | integer | the amount of the investment in millions of dollars |
| investment_date | string | the date of the investment |
Expected Output Schema
| Column Name | Data Type |
|---|---|
| investment_id | integer |
| fund_id | integer |
| firm_id | integer |
| firm_name | string |
| founded_year | integer |
| location | string |
| fund_name | string |
| fund_size | integer |
| fund_start_year | integer |
| fund_end_year | integer |
| company_name | string |
| investment_amount | integer |
| investment_date | string |
Example
Given this sample input:
pe_firms
| firm_id | firm_name | founded_year | location |
|---|---|---|---|
| 1 | ABC Fund | 2010 | New York |
| 2 | XYZ Fund | 2005 | London |
| 3 | DEF Fund | 2015 | Paris |
pe_funds
| fund_id | firm_id | fund_name | fund_size | fund_start_year | fund_end_year |
|---|---|---|---|---|---|
| 101 | 1 | ABC I | 100 | 2010 | 2015 |
| 102 | 1 | ABC II | 150 | 2015 | 2020 |
| 103 | 2 | XYZ I | 200 | 2010 | 2018 |
pe_investments
| investment_id | fund_id | company_name | investment_amount | investment_date |
|---|---|---|---|---|
| 1001 | 101 | Company A | 10 | 2012-05-15 |
| 1002 | 101 | Company B | 20 | 2013-06-20 |
| 1003 | 102 | Company C | 30 | 2016-07-25 |
The output would be:
| investment_id | fund_id | firm_id | firm_name | founded_year | location | fund_name | fund_size | fund_start_year | fund_end_year | company_name | investment_amount | investment_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1001 | 101 | 1 | ABC Fund | 2010 | New York | ABC I | 100 | 2010 | 2015 | Company A | 10 | 2012-05-15 |
| 1002 | 101 | 1 | ABC Fund | 2010 | New York | ABC I | 100 | 2010 | 2015 | Company B | 20 | 2013-06-20 |
| 1003 | 102 | 1 | ABC Fund | 2010 | New York | ABC II | 150 | 2015 | 2020 | Company C | 30 | 2016-07-25 |
| null | 103 | 2 | XYZ Fund | 2005 | London | XYZ I | 200 | 2010 | 2018 | null | null | null |
| null | null | 3 | DEF Fund | 2015 | Paris | null | null | null | null | null | null | null |
Terminal requires a larger screen
Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.
Linux Terminal Environment
Write and execute your solution in the terminal below.
Track
| Question | Difficulty | Company | Access |
|---|
NVIDIA