Objective
You are given two DataFrames, df_temperature and df_pressure, related to some thermodynamics experiments.
Task
Write a PySpark function that combines these DataFrames and creates the Result column using the formula for the Ideal Gas Law: Pressure * Temperature = Result. For each ExperimentID, multiply the Pressure and Temperature to calculate Result.
If there's an ExperimentID present in one DataFrame but not in the other, ignore that ExperimentID. Your solution should only contain ExperimentIDs that exist in both DataFrames.
Save your resulting DataFrame as result_df. Order the final output in ascending order by ExperimentID.
File Path
- Temperature Dataset:
/home/interview/temperature.csv
- Pressure Dataset:
/home/interview/pressure.csv
- Starter script:
/home/interview/thermodynamics.py
Schema
temperature.csv
| Column Name |
Data Type |
| ExperimentID |
integer |
| Temperature |
double |
pressure.csv
| Column Name |
Data Type |
| ExperimentID |
integer |
| Pressure |
double |
Expected Output Schema
| Column Name |
Data Type |
| ExperimentID |
integer |
| Result |
double |
Example
Given this sample input:
df_temperature
| ExperimentID |
Temperature |
| 1 |
273.15 |
| 2 |
293.15 |
| 3 |
313.15 |
df_pressure
| ExperimentID |
Pressure |
| 1 |
1.0 |
| 3 |
2.0 |
| 4 |
1.5 |
The expected output would be:
| ExperimentID |
Result |
| 1 |
273.15 |
| 3 |
626.3 |
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
df_temperature = spark.read.csv("/home/interview/temperature.csv", header=True, inferSchema=True)
df_pressure = spark.read.csv("/home/interview/pressure.csv", header=True, inferSchema=True)
# Step 1: Perform an inner join to keep only matching ExperimentIDs
joined_df = df_temperature.join(df_pressure, on="ExperimentID", how="inner")
# Step 2: Calculate the Result column by multiplying Pressure and Temperature
result_df = joined_df.withColumn("Result", F.col("Pressure") * F.col("Temperature"))
# Step 3: Select the required columns and order by ExperimentID ascending
result_df = result_df.select("ExperimentID", "Result").orderBy(F.col("ExperimentID").asc())
# --- Do not edit below this line ---
result_df.coalesce(1).write.csv("/home/interview/output", header=True, mode="overwrite")
spark.stop()
Explanation
Step 1: Merging with an Inner Join
joined_df = df_temperature.join(df_pressure, on="ExperimentID", how="inner")
The prompt strictly requires that we only calculate the result for experiments that exist in both datasets, discarding the rest. An inner join on the ExperimentID column perfectly handles this. Any ExperimentID missing from either the temperature or pressure table will naturally be dropped during the join.
Step 2: Column Multiplication
result_df = joined_df.withColumn("Result", F.col("Pressure") * F.col("Temperature"))
To compute the final thermodynamics value, we can perform basic arithmetic directly between PySpark columns. By multiplying F.col("Pressure") by F.col("Temperature"), we generate the required output and save it to a new column named "Result".
Step 3: Selecting and Sorting
result_df = result_df.select("ExperimentID", "Result").orderBy(F.col("ExperimentID").asc())
Finally, we drop the intermediate Pressure and Temperature columns by chaining a .select() statement that strictly pulls the two columns required by the Output Schema. We then append .orderBy() to satisfy the ascending sort requirement.