Filtering and Formatting Archaeological Data
NVIDIA · Easy · Spark

Objective

In archaeology, efficient data handling is essential. An archaeological team stores records of its collected artifacts in a data warehouse, and your task is to apply a transformation to that data. You are given a DataFrame named artifacts.

Task

Write a PySpark function that converts the Material column to uppercase and filters the dataset to include only artifacts whose Quantity is strictly greater than 100.

Save the resulting DataFrame as result_df. Make sure its columns and data types exactly match the Expected Output Schema.

File Paths

  • Artifacts Dataset: /home/interview/artifacts.csv
  • Starter script: /home/interview/archaeology.py

Schema

artifacts.csv

Column Name   Data Type
ID            String
Item          String
Period        String
Material      String
Quantity      Integer

Expected Output Schema

Column Name   Data Type
ID            String
Item          String
Period        String
Material      String
Quantity      Integer

Example

Given this sample input:

artifacts

ID   Item      Period        Material   Quantity
1    Pottery   Prehistoric   clay       150
2    Weapon    Medieval      metal      90
3    Jewel     Roman         gold       200

The expected output would be:

ID   Item      Period        Material   Quantity
1    Pottery   Prehistoric   CLAY       150
3    Jewel     Roman         GOLD       200
