Consolidating User Interactions

Objective

You are a web developer working with various teams on your company's website. You have access to three separate DataFrames, each representing different types of user interactions with your website: visits, likes, and comments.

Task

All three DataFrames represent distinct user interactions and share the user_id and page_id columns. However, their timestamp columns have different names: visit_time, like_time, and comment_time.

Write a PySpark function that combines these three DataFrames into one unified table.

  1. The timestamp column should be standardized to interaction_time.
  2. You must add a new column called interaction_type that indicates the type of interaction ('visit', 'like', or 'comment').

Save your resulting DataFrame as result_df. Ensure the column order matches the expected output schema: user_id, page_id, interaction_time, interaction_type. Sort the final DataFrame chronologically by interaction_time (ascending). If multiple interactions share the same interaction_time, break ties by interaction_type (ascending) and then by user_id (ascending).

File Paths

  • Visits Dataset: /home/interview/page_visits.csv
  • Likes Dataset: /home/interview/page_likes.csv
  • Comments Dataset: /home/interview/page_comments.csv
  • Starter script: /home/interview/user_interactions.py

Schema

page_visits.csv

Column Name    Data Type
user_id        string
page_id        string
visit_time     timestamp

page_likes.csv

Column Name    Data Type
user_id        string
page_id        string
like_time      timestamp

page_comments.csv

Column Name    Data Type
user_id        string
page_id        string
comment_time   timestamp

Expected Output Schema

Column Name        Data Type
user_id            string
page_id            string
interaction_time   timestamp
interaction_type   string

Example

Given this sample input:

page_visits

user_id  page_id  visit_time
U1       P1       2023-01-01 12:00:00
U2       P3       2023-01-02 15:30:00
U3       P2       2023-01-03 10:45:00

page_likes

user_id  page_id  like_time
U1       P2       2023-01-02 14:20:00
U2       P1       2023-01-03 16:40:00
U3       P3       2023-01-04 18:55:00

page_comments

user_id  page_id  comment_time
U1       P3       2023-01-03 13:00:00
U2       P2       2023-01-04 17:10:00
U3       P1       2023-01-05 19:25:00

The expected output would be:

user_id  page_id  interaction_time     interaction_type
U1       P1       2023-01-01 12:00:00  visit
U1       P2       2023-01-02 14:20:00  like
U2       P3       2023-01-02 15:30:00  visit
U3       P2       2023-01-03 10:45:00  visit
U1       P3       2023-01-03 13:00:00  comment
U2       P1       2023-01-03 16:40:00  like
U2       P2       2023-01-04 17:10:00  comment
U3       P3       2023-01-04 18:55:00  like
U3       P1       2023-01-05 19:25:00  comment
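
The ordering in the example can be cross-checked without Spark. The plain-Python sketch below is only an illustration of the sort rule, not the PySpark solution itself. It tags the sample rows and sorts them with a tuple key that mirrors the task: interaction_time, then interaction_type, then user_id. The helper name tag is hypothetical.

```python
from datetime import datetime

# Sample rows copied from the example tables: (user_id, page_id, timestamp string).
visits = [
    ("U1", "P1", "2023-01-01 12:00:00"),
    ("U2", "P3", "2023-01-02 15:30:00"),
    ("U3", "P2", "2023-01-03 10:45:00"),
]
likes = [
    ("U1", "P2", "2023-01-02 14:20:00"),
    ("U2", "P1", "2023-01-03 16:40:00"),
    ("U3", "P3", "2023-01-04 18:55:00"),
]
comments = [
    ("U1", "P3", "2023-01-03 13:00:00"),
    ("U2", "P2", "2023-01-04 17:10:00"),
    ("U3", "P1", "2023-01-05 19:25:00"),
]


def tag(rows, interaction_type):
    """Parse each timestamp and append the interaction type (hypothetical helper)."""
    return [
        (user_id, page_id, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), interaction_type)
        for user_id, page_id, ts in rows
    ]


combined = tag(visits, "visit") + tag(likes, "like") + tag(comments, "comment")

# Tuple keys compare element by element, which gives the tie-breaking behaviour:
# interaction_time first, then interaction_type, then user_id.
ordered = sorted(combined, key=lambda r: (r[2], r[3], r[0]))

for user_id, page_id, ts, kind in ordered:
    print(user_id, page_id, ts, kind)
```

The sample data contains no duplicate timestamps, so the tie-breakers never fire here. They matter only when two interactions share the same interaction_time.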
