Private Equity Firm Mergers
Beginner Mode

Start your terminal to use beginner mode.

Objective

You are analyzing the Private Equity (PE) landscape. You have three datasets: pe_firms (the parent companies), pe_funds (the financial vehicles managed by the firms), and pe_investments (the individual companies those funds bought into).

Task

Write a PySpark script that combines all three DataFrames into a single comprehensive view.

Because the data is messy, some investments may not map to a known fund, and some funds may not map to a known firm. You must use a join strategy that ensures no records are lost from any of the three tables during the merge. After merging, drop any rows where absolutely every column is null.

Save the resulting DataFrame as result_df. Select the columns in the exact order specified in the schema below.

File Path

  • Firms: /home/interview/pe_firms.csv
  • Funds: /home/interview/pe_funds.csv
  • Investments: /home/interview/pe_investments.csv
  • Starter script: /home/interview/pe_merger.py

Schema

pe_firms.csv

Column Name Data Type Description
firm_id integer the unique identifier of the PE firm
firm_name string the name of the PE firm
founded_year integer the year the PE firm was founded
location string the location of the PE firm

pe_funds.csv

Column Name Data Type Description
fund_id integer the unique identifier of the PE fund
firm_id integer the unique identifier of the PE firm
fund_name string the name of the PE fund
fund_size integer the size of the PE fund in millions of dollars
fund_start_year integer the year the PE fund was started
fund_end_year integer the year the PE fund ended or is expected to end

pe_investments.csv

Column Name Data Type Description
investment_id integer the unique identifier of the PE investment
fund_id integer the unique identifier of the PE fund
company_name string the name of the company receiving the investment
investment_amount integer the amount of the investment in millions of dollars
investment_date string the date of the investment

Expected Output Schema

Column Name Data Type
investment_id integer
fund_id integer
firm_id integer
firm_name string
founded_year integer
location string
fund_name string
fund_size integer
fund_start_year integer
fund_end_year integer
company_name string
investment_amount integer
investment_date string

Example

Given this sample input:

pe_firms

firm_id firm_name founded_year location
1 ABC Fund 2010 New York
2 XYZ Fund 2005 London
3 DEF Fund 2015 Paris

pe_funds

fund_id firm_id fund_name fund_size fund_start_year fund_end_year
101 1 ABC I 100 2010 2015
102 1 ABC II 150 2015 2020
103 2 XYZ I 200 2010 2018

pe_investments

investment_id fund_id company_name investment_amount investment_date
1001 101 Company A 10 2012-05-15
1002 101 Company B 20 2013-06-20
1003 102 Company C 30 2016-07-25

The output would be:

investment_id fund_id firm_id firm_name founded_year location fund_name fund_size fund_start_year fund_end_year company_name investment_amount investment_date
1001 101 1 ABC Fund 2010 New York ABC I 100 2010 2015 Company A 10 2012-05-15
1002 101 1 ABC Fund 2010 New York ABC I 100 2010 2015 Company B 20 2013-06-20
1003 102 1 ABC Fund 2010 New York ABC II 150 2015 2020 Company C 30 2016-07-25
null 103 2 XYZ Fund 2005 London XYZ I 200 2010 2018 null null null
null null 3 DEF Fund 2015 Paris null null null null null null null

Terminal requires a larger screen

Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.

Linux Terminal Environment

Write and execute your solution in the terminal below.

Sign In

Track

Question Difficulty Company Access
Need more practice in this area? Explore more questions →