Analyzing Self-Interactions on Social Media

Objective

You are given a DataFrame that represents user interactions on a popular social media platform. Each row represents a single interaction between two users.

Task

Write a PySpark function that finds users who have interacted with themselves. A self-interaction occurs when a user creates a post and then likes, comments on, or shares that same post.

Filter the DataFrame to find rows where user1_id matches user2_id. Calculate the total number of self-interactions for each of these users.

Save the resulting DataFrame as result_df. The output must match the Expected Output Schema exactly: rename user1_id to user_id, and cast self_interaction_count to integer type with .cast("int"). Sort the final output by user_id in ascending order.

File Paths

  • Interactions Dataset: /home/interview/interactions.csv
  • Starter script: /home/interview/self_interactions.py

Schema

interactions.csv

Column Name         Data Type
interaction_id      integer
user1_id            integer
user2_id            integer
interaction_type    string
timestamp           timestamp

Expected Output Schema

Column Name               Data Type
user_id                   integer
self_interaction_count    integer

Example

Given this sample input:

input_df

interaction_id    user1_id    user2_id    interaction_type    timestamp
1                 1001        2002        like                2023-01-01 10:00:00
2                 1002        1002        comment             2023-01-01 11:00:00
3                 1003        2003        share               2023-01-02 10:00:00
4                 1004        1004        like                2023-01-02 11:00:00
5                 1005        2005        comment             2023-01-03 10:00:00

The expected output would be:

user_id    self_interaction_count
1002       1
1004       1
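To see why only users 1002 and 1004 appear, the same filter-and-count logic can be traced on the sample rows in plain Python, without Spark. This is an illustrative cross-check rather than part of the required solution; the names sample_rows and count_self_interactions are hypothetical.

```python
from collections import Counter

# The five sample rows from input_df:
# (interaction_id, user1_id, user2_id, interaction_type, timestamp)
sample_rows = [
    (1, 1001, 2002, "like", "2023-01-01 10:00:00"),
    (2, 1002, 1002, "comment", "2023-01-01 11:00:00"),
    (3, 1003, 2003, "share", "2023-01-02 10:00:00"),
    (4, 1004, 1004, "like", "2023-01-02 11:00:00"),
    (5, 1005, 2005, "comment", "2023-01-03 10:00:00"),
]


def count_self_interactions(rows):
    """Return (user_id, count) pairs for rows where user1_id == user2_id."""
    counts = Counter(u1 for _, u1, u2, _, _ in rows if u1 == u2)
    # Sort by user_id ascending, as the task requires.
    return sorted(counts.items())


print(count_self_interactions(sample_rows))  # [(1002, 1), (1004, 1)]
```

Rows 2 and 4 are the only ones where user1_id equals user2_id, so each of those users ends up with a count of 1.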
