Objective
In the field of archaeology, efficient handling of data is paramount. To manage the collected artifacts, an archaeological team uses a Data Warehouse. Your task is to perform a transformation operation on this data. You will be provided with a DataFrame artifacts.
Task
Write a PySpark function that converts the Material column to upper case and filters the dataset to only include artifacts where the Quantity is strictly greater than 100.
Save your resulting DataFrame as result_df. Ensure the output exactly matches the requested Output Schema.
File Path
- Artifacts Dataset:
/home/interview/artifacts.csv
- Starter script:
/home/interview/archaeology.py
Schema
artifacts.csv
| Column Name |
Data Type |
| ID |
String |
| Item |
String |
| Period |
String |
| Material |
String |
| Quantity |
Integer |
Expected Output Schema
| Column Name |
Data Type |
| ID |
String |
| Item |
String |
| Period |
String |
| Material |
String |
| Quantity |
Integer |
Example
Given this sample input:
artifacts
| ID |
Item |
Period |
Material |
Quantity |
| 1 |
Pottery |
Prehistoric |
clay |
150 |
| 2 |
Weapon |
Medieval |
metal |
90 |
| 3 |
Jewel |
Roman |
gold |
200 |
The expected output would be:
| ID |
Item |
Period |
Material |
Quantity |
| 1 |
Pottery |
Prehistoric |
CLAY |
150 |
| 3 |
Jewel |
Roman |
GOLD |
200 |
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
artifacts = spark.read.csv("/home/interview/artifacts.csv", header=True, inferSchema=True)
# Step 1: Filter for records where Quantity is strictly greater than 100
filtered_df = artifacts.filter(F.col("Quantity") > 100)
# Step 2: Convert the 'Material' column strings to uppercase
result_df = filtered_df.withColumn("Material", F.upper(F.col("Material")))
# Step 3: Select columns to ensure schema order
result_df = result_df.select("ID", "Item", "Period", "Material", "Quantity")
# --- Do not edit below this line ---
result_df.coalesce(1).write.csv("/home/interview/output", header=True, mode="overwrite")
spark.stop()
Explanation
Step 1: Filtering the Data
filtered_df = artifacts.filter(F.col("Quantity") > 100)
The first objective is to discard any rows where the quantity is 100 or less. PySpark's .filter() method evaluates a condition for every row in the DataFrame. By passing F.col("Quantity") > 100, we instruct PySpark to only keep the rows that return True for that specific condition.
Step 2: String Transformation
result_df = filtered_df.withColumn("Material", F.upper(F.col("Material")))
Next, we need to standardize the text formatting in the Material column. We use .withColumn() to overwrite the existing Material column. Inside it, we apply the F.upper() function to the column, which converts all lowercase characters into uppercase letters.
Step 3: Selecting the Final Columns
result_df = result_df.select("ID", "Item", "Period", "Material", "Quantity")
Finally, we chain a .select() statement to ensure the columns are arranged in the exact order requested by the prompt's Output Schema.