Airline Multi-Table Joins and String Lengths
SAP ☯️ Medium SparkComplex Joins
Beginner Mode

Start your terminal to use beginner mode.

Objective

As a Data Analyst at a major airline, you are presented with three DataFrames. The first, flights, contains the flight data including the flight_id, origin_airport, and destination_airport. The second, airports, consists of details of each airport including airport_id and airport_name. The third, planes, contains details about planes, including plane_id and plane_model.

Task

Write a PySpark function that combines these DataFrames to calculate the length of the string names for the origin airport, destination airport, and plane model.

In the output:

  • The flight_id column corresponds to the id of each flight in the flights DataFrame.
  • origin_airport_name_length and destination_airport_name_length columns represent the lengths of the names of the origin and destination airports respectively.
  • plane_model_length is the length of the plane model's name.

Please note that the origin_airport and destination_airport columns in the flights DataFrame correspond to the airport_id in the airports DataFrame. The flight_id in the flights DataFrame corresponds to the plane_id in the planes DataFrame.

Remember that the lengths of the strings should be calculated after removing leading and trailing whitespaces. Save your resulting DataFrame as result_df and order the output by flight_id in ascending order.

File Path

  • Flights Dataset: /home/interview/flights.csv
  • Airports Dataset: /home/interview/airports.csv
  • Planes Dataset: /home/interview/planes.csv
  • Starter script: /home/interview/busy_airline.py

Schema

flights.csv

Column Type
flight_id int
origin_airport string
destination_airport string

airports.csv

Column Type
airport_id string
airport_name string

planes.csv

Column Type
plane_id int
plane_model string

Expected Output Schema

Column Type
flight_id int
origin_airport_name_length int
destination_airport_name_length int
plane_model_length int

Example

Given this sample input:

flights

flight_id origin_airport destination_airport
1 A1 B1
2 A2 B2
3 A3 B3

airports

airport_id airport_name
A1 San Francisco
B1 Los Angeles
A2 New York
B2 Boston
A3 Miami
B3 Orlando

planes

plane_id plane_model
1 Airbus A320
2 Boeing 737
3 Airbus A380

The expected output would be:

flight_id origin_airport_name_length destination_airport_name_length plane_model_length
1 13 11 11
2 8 6 10
3 5 7 11

Terminal requires a larger screen

Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.

Linux Terminal Environment

Write and execute your solution in the terminal below.

Sign In

Track

Question Difficulty Company Access
Need more practice in this area? Explore more questions →