Objective
One of the most common operations in Spark is aggregating data by a key and finding the top result. This involves a groupBy followed by a sort, both of which are wide transformations that trigger shuffles (Spark redistributes data across partitions so rows with the same key end up together). Understanding how to chain these operations efficiently is fundamental to writing performant Spark jobs.
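One way to see these shuffles is to print the physical plan of the chained query and look for Exchange operators, which mark shuffle boundaries. The sketch below is a minimal illustration, assuming the df from the starter script and a column named status (the task does not state the column name, so treat it as a placeholder). The helper name shuffle_steps is hypothetical.

```python
import io
from contextlib import redirect_stdout


def shuffle_steps(df, column="status"):
    """Return the physical-plan lines that mention an Exchange (a shuffle).

    `column` is an assumed column name; adjust it to match the CSV header.
    """
    # groupBy + count aggregates per key; orderBy sorts the aggregated rows.
    query = df.groupBy(column).count().orderBy("count", ascending=False)

    # DataFrame.explain() prints the plan instead of returning it,
    # so capture stdout to inspect the text.
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        query.explain()

    return [line.strip() for line in buffer.getvalue().splitlines()
            if "Exchange" in line]
```

Calling shuffle_steps(df) in the starter script would typically list one Exchange for the aggregation and one for the sort, though the exact plan text depends on the Spark version and configuration.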
Task
An orders dataset with 5,000 records is available at /home/interview/orders.csv. A starter script at /home/interview/shuffle_metrics.py already creates a SparkSession and loads the CSV into a DataFrame called df.
Find the order status with the highest count and print the result in the exact format: most common status = <status>, count = <N>
Example
most common status = completed, count = 969
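One possible approach is sketched below, not as the reference solution. It assumes the status column is named status (the task does not say, so check the CSV header) and that df is the DataFrame from the starter script. The helper names most_common_status and format_result are hypothetical.

```python
def most_common_status(df, column="status"):
    """Return (status, count) for the most frequent value in `column`.

    `column` is an assumed column name; adjust it to match the CSV header.
    """
    # groupBy(...).count() adds a column literally named "count".
    counts = df.groupBy(column).count()

    # Sort by count descending; break ties alphabetically so the
    # result is deterministic. first() returns None on an empty DataFrame.
    top = counts.orderBy(["count", column], ascending=[False, True]).first()
    if top is None:
        raise ValueError("DataFrame is empty")

    # Use key access: Row is a tuple subclass, so top.count would be
    # the tuple method rather than the column value.
    return top[column], top["count"]


def format_result(status, count):
    """Build the exact output line the task asks for."""
    return f"most common status = {status}, count = {count}"
```

Appending print(format_result(*most_common_status(df))) to the starter script would then print the line in the required format. Using first() after the sort brings only the top row back to the driver, so the full set of counts is never collected.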