Scenario
Raw e-commerce order data contains inconsistencies, null values, and missing derived fields that need to be processed before analysis.
Task
Write a Python script at /home/interview/etl_pipeline.py that builds a complete ETL pipeline to:
- Read raw data from
/home/interview/orders.json
- Clean null values: fill missing
discount values with 0, drop rows where customer_email or product_name is null
- Standardize formats:
- Convert
customer_email to lowercase
- Remove all non-digit characters from
customer_phone
- Parse
order_date and ship_date from mixed date formats to datetime
- Convert
product_category to title case
- Calculate derived fields:
final_price = unit_price × quantity - discount (rounded to 2 decimals)
order_month = extracted from order_date (format: YYYY-MM)
customer_segment = Budget (<50), Standard (50-200), Premium (>200) based on final_price
days_to_ship = difference in days between ship_date and order_date
- Export the cleaned data to
/home/interview/cleaned_orders.csv
Use method chaining and the .pipe() function to create a single transformation pipeline.
Note: pandas is already installed.
Step 1: Examine the raw data
python3 -c "import pandas as pd; df = pd.read_json('/home/interview/orders.json'); print(df.head()); print(df.info())"
Review the data quality issues and column types.
Step 2: Create the ETL pipeline script
nano /home/interview/etl_pipeline.py
Write a complete pipeline using method chaining:
import pandas as pd
from datetime import datetime
def clean_nulls(df):
"""Fill discount nulls with 0, drop rows with null email or product"""
df['discount'] = df['discount'].fillna(0)
df = df.dropna(subset=['customer_email', 'product_name'])
return df
def standardize_formats(df):
"""Standardize email, phone, dates, and categories"""
# Email to lowercase
df['customer_email'] = df['customer_email'].str.lower()
# Phone: remove all non-digits
df['customer_phone'] = df['customer_phone'].str.replace(r'[^\d]', '', regex=True)
# Product category to title case
df['product_category'] = df['product_category'].str.title()
# Parse dates from multiple formats
df['order_date'] = pd.to_datetime(df['order_date'], format='mixed', errors='coerce')
df['ship_date'] = pd.to_datetime(df['ship_date'], format='mixed', errors='coerce')
return df
def segment(price):
if price < 50:
return 'Budget'
elif price <= 200:
return 'Standard'
else:
return 'Premium'
def calculate_derived_fields(df):
"""Calculate final_price, order_month, customer_segment, days_to_ship"""
# Final price (rounded to 2 decimals)
df['final_price'] = (df['unit_price'] * df['quantity'] - df['discount']).round(2)
# Order month
df['order_month'] = df['order_date'].dt.strftime('%Y-%m')
# Customer segment: Budget (<50), Standard (50-200), Premium (>200)
df['customer_segment'] = df['final_price'].apply(segment)
# Days to ship
df['days_to_ship'] = (df['ship_date'] - df['order_date']).dt.days
return df
# Execute ETL pipeline with chaining
df = (
pd.read_json('/home/interview/orders.json')
.pipe(clean_nulls)
.pipe(standardize_formats)
.pipe(calculate_derived_fields)
)
# Export to CSV
df.to_csv('/home/interview/cleaned_orders.csv', index=False)
print(f"ETL pipeline completed: {len(df)} orders processed")
Step 3: Run the pipeline
python3 /home/interview/etl_pipeline.py
Step 4: Verify the output
head /home/interview/cleaned_orders.csv
Should show cleaned data with standardized formats and all derived fields.