Deduplicating Activity Logs
Beginner Mode

Start your terminal to use beginner mode.

Objective

A top tech company is facing a challenge with user churn. They maintain three different sources of data: user accounts, user activities, and user exit surveys. All these data sources are represented as three separate DataFrames.

Task

Unfortunately, due to some system glitches, there are duplicates within the df_activities DataFrame. Duplicate rows are defined as rows that have the exact same user_id, activity_date, and activity_type.

Write a PySpark function that removes the duplicates from df_activities and combines all three DataFrames. Ensure that if there is no corresponding row for a user in any of the input DataFrames, the respective columns contain null values in the output DataFrame.

Save your result as result_df. The final output must be sorted by user_id in ascending order, then by activity_date in descending order, and finally by activity_type in ascending order (to break ties deterministically).

File Path

  • Accounts Dataset: /home/interview/accounts.csv
  • Activities Dataset: /home/interview/activities.csv
  • Exit Surveys Dataset: /home/interview/exit_surveys.csv
  • Starter script: /home/interview/churn_analysis.py

Schema

accounts.csv

Column Name Type
user_id String
account_created_date Date
location String

activities.csv

Column Name Type
user_id String
activity_date Date
activity_type String

exit_surveys.csv

Column Name Type
user_id String
exit_date Date
exit_reason String

Expected Output Schema

Column Name Type
user_id String
account_created_date Date
location String
activity_date Date
activity_type String
exit_date Date
exit_reason String

Example

Given this sample input:

df_accounts

user_id account_created_date location
U001 2023-01-01 New York
U002 2023-01-05 Chicago
U003 2023-01-10 San Francisco

df_activities

user_id activity_date activity_type
U001 2023-02-01 Login
U001 2023-02-01 Login
U002 2023-02-05 File Upload
U002 2023-02-05 File Upload
U003 2023-02-10 Logout

df_exit_surveys

user_id exit_date exit_reason
U001 2023-03-01 Moved to a competitor
U002 2023-03-05 Not user-friendly
U003 2023-03-10 High pricing

The expected output would be:

user_id account_created_date location activity_date activity_type exit_date exit_reason
U001 2023-01-01 New York 2023-02-01 Login 2023-03-01 Moved to a competitor
U002 2023-01-05 Chicago 2023-02-05 File Upload 2023-03-05 Not user-friendly
U003 2023-01-10 San Francisco 2023-02-10 Logout 2023-03-10 High pricing

Terminal requires a larger screen

Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.

Linux Terminal Environment

Write and execute your solution in the terminal below.

Sign In

Track

Question Difficulty Company Access
Need more practice in this area? Explore more questions →