Repartition

Objective

When Spark reads a small file, it often creates just one partition, so a single task processes the entire dataset. In production, you will frequently want more partitions so that Spark can process data in parallel across multiple cores. The repartition() method gives you that control: it returns a new DataFrame whose rows are redistributed across the number of partitions you specify. Within a stage, Spark runs one task per partition, so the partition count sets the upper bound on how many tasks can run in parallel.
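
The behaviour described above can be sketched on a throwaway DataFrame rather than the orders file. This is a minimal illustration, assuming PySpark is installed and a local master is acceptable; the function names, the app name, and the use of spark.range() as a stand-in for a small file are all hypothetical choices made for this example, not part of the starter script.

```python
def format_task_count(n: int) -> str:
    """Build a report line in the 'task count = N' style (hypothetical helper)."""
    return f"task count = {n}"


def repartitioned_count(num_partitions: int) -> int:
    """Repartition a small demo DataFrame and return its new partition count."""
    # Imported lazily so format_task_count() stays usable without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[2]")            # assumption: run locally on two cores
        .appName("repartition-sketch")  # hypothetical app name
        .getOrCreate()
    )
    try:
        # A 5,000-row DataFrame forced into one partition, standing in for a small file.
        df = spark.range(0, 5000, 1, 1)
        # repartition() returns a new DataFrame; df itself keeps its single partition.
        repartitioned = df.repartition(num_partitions)
        return repartitioned.rdd.getNumPartitions()
    finally:
        spark.stop()
```

Calling print(format_task_count(repartitioned_count(8))) would start a local Spark session, repartition the demo data, and print the count in the report format. Note that repartition() triggers a full shuffle, which is why it can both raise and lower the partition count.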

Task

An orders dataset with 5,000 records is available at /home/interview/orders.csv. A starter script has been created for you at /home/interview/repartition_tasks.py with a SparkSession and the CSV already loaded into a DataFrame called df. Repartition the DataFrame to 8 partitions and print the resulting task count in the exact format: task count = 8

Example

task count = 8
