Mining Operations Aggregation

Objective

A mining company extracts rare minerals from various locations worldwide. It maintains two DataFrames to track its operations: mines, which holds metadata about each location, and extraction, which logs the daily mineral output.

Task

Write a PySpark function that calculates the total quantity of each mineral extracted per location. The two DataFrames are related through the mine identifier: extraction.mine_id refers to mines.id.

The total_quantity column should contain the sum of all quantities of a particular mineral extracted at a specific location, cast to Double. Sort the resulting rows by location (ascending) and then by mineral (ascending). Save your result as result_df.

File Paths

  • Mines Dataset: /home/interview/mines.csv
  • Extraction Dataset: /home/interview/extraction.csv
  • Starter Script: /home/interview/mining_aggregation.py

Schema

mines.csv

Column Name | Data Type
id          | Integer
name        | String
location    | String

extraction.csv

Column Name | Data Type
mine_id     | Integer
date        | Date
mineral     | String
quantity    | Double

Expected Output Schema

Column Name    | Data Type
location       | String
mineral        | String
total_quantity | Double

Example

Given this sample input:

mines

id | name       | location
1  | Mine Alpha | Australia
2  | Mine Beta  | Canada
3  | Mine Gamma | South Africa

extraction

mine_id | date       | mineral | quantity
1       | 2023-06-30 | Gold    | 1000.0
2       | 2023-06-30 | Silver  | 1200.0
3       | 2023-06-30 | Diamond | 800.0
1       | 2023-06-29 | Gold    | 900.0
2       | 2023-06-29 | Silver  | 1300.0
3       | 2023-06-29 | Diamond | 750.0

The expected output would be:

location     | mineral | total_quantity
Australia    | Gold    | 1900.0
Canada       | Silver  | 2500.0
South Africa | Diamond | 1550.0
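The totals in this example can be cross-checked without Spark. The plain-Python sketch below repeats the same steps (look up each record's location, sum per location and mineral, sort) on the sample rows; the variable names are illustrative only.

```python
from collections import defaultdict

# Sample rows copied from the example tables above.
mines = [
    (1, "Mine Alpha", "Australia"),
    (2, "Mine Beta", "Canada"),
    (3, "Mine Gamma", "South Africa"),
]
extraction = [
    (1, "2023-06-30", "Gold", 1000.0),
    (2, "2023-06-30", "Silver", 1200.0),
    (3, "2023-06-30", "Diamond", 800.0),
    (1, "2023-06-29", "Gold", 900.0),
    (2, "2023-06-29", "Silver", 1300.0),
    (3, "2023-06-29", "Diamond", 750.0),
]

# Step 1: map each mine id to its location (the join key).
location_by_id = {mine_id: location for mine_id, _name, location in mines}

# Step 2: accumulate quantity per (location, mineral) pair.
totals = defaultdict(float)
for mine_id, _date, mineral, quantity in extraction:
    totals[(location_by_id[mine_id], mineral)] += quantity

# Step 3: sort by location, then mineral (tuples sort element by element).
expected = sorted((loc, mineral, total) for (loc, mineral), total in totals.items())

for row in expected:
    print(row)
# ('Australia', 'Gold', 1900.0)
# ('Canada', 'Silver', 2500.0)
# ('South Africa', 'Diamond', 1550.0)
```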
