NVIDIA: Filtering and Formatting Archaeological Data — Data Analysis Interview Q&A (2026)

Filtering and Formatting Archaeological Data

Beginner Mode

Start your terminal to use beginner mode.

Objective

In the field of archaeology, efficient handling of data is paramount. To manage the collected artifacts, an archaeological team uses a Data Warehouse. Your task is to perform a transformation operation on this data. You will be provided with a DataFrame artifacts.

Task

Write a PySpark function that converts the Material column to upper case and filters the dataset to only include artifacts where the Quantity is strictly greater than 100.

Save your resulting DataFrame as result_df. Ensure the output exactly matches the requested Output Schema.

File Path

Artifacts Dataset: /home/interview/artifacts.csv
Starter script: /home/interview/archaeology.py

Schema

artifacts.csv

Column Name	Data Type
ID	String
Item	String
Period	String
Material	String
Quantity	Integer

Expected Output Schema

Column Name	Data Type
ID	String
Item	String
Period	String
Material	String
Quantity	Integer

Example

Given this sample input:

artifacts

ID	Item	Period	Material	Quantity
1	Pottery	Prehistoric	clay	150
2	Weapon	Medieval	metal	90
3	Jewel	Roman	gold	200

The expected output would be:

ID	Item	Period	Material	Quantity
1	Pottery	Prehistoric	CLAY	150
3	Jewel	Roman	GOLD	200

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

artifacts = spark.read.csv("/home/interview/artifacts.csv", header=True, inferSchema=True)

# Step 1: Filter for records where Quantity is strictly greater than 100
filtered_df = artifacts.filter(F.col("Quantity") > 100)

# Step 2: Convert the 'Material' column strings to uppercase
result_df = filtered_df.withColumn("Material", F.upper(F.col("Material")))

# Step 3: Select columns to ensure schema order
result_df = result_df.select("ID", "Item", "Period", "Material", "Quantity")

# --- Do not edit below this line ---
result_df.coalesce(1).write.csv("/home/interview/output", header=True, mode="overwrite")
spark.stop()

Explanation

Step 1: Filtering the Data

filtered_df = artifacts.filter(F.col("Quantity") > 100)

The first objective is to discard any rows where the quantity is 100 or less. PySpark's .filter() method evaluates a condition for every row in the DataFrame. By passing F.col("Quantity") > 100, we instruct PySpark to only keep the rows that return True for that specific condition.

Step 2: String Transformation

result_df = filtered_df.withColumn("Material", F.upper(F.col("Material")))

Next, we need to standardize the text formatting in the Material column. We use .withColumn() to overwrite the existing Material column. Inside it, we apply the F.upper() function to the column, which converts all lowercase characters into uppercase letters.

Step 3: Selecting the Final Columns

result_df = result_df.select("ID", "Item", "Period", "Material", "Quantity")

Finally, we chain a .select() statement to ensure the columns are arranged in the exact order requested by the prompt's Output Schema.

Terminal requires a larger screen

Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.

Linux Terminal Environment

Write and execute your solution in the terminal below.

Track

	Question	Difficulty	Company	Access

Need more practice in this area? Explore more questions →