Datadog: Build Complete ETL Pipeline with Data Cleaning and Transformations — Data Engineering Interview Q&A (2026)

Build Complete ETL Pipeline with Data Cleaning and Transformations

Datadog 🧠 Hard Programming Python Pandas

Beginner Mode

Start your terminal to use beginner mode.

Scenario

Raw e-commerce order data contains inconsistencies, null values, and missing derived fields that need to be processed before analysis.

Task

Write a Python script at /home/interview/etl_pipeline.py that builds a complete ETL pipeline to:

Read raw data from /home/interview/orders.json
Clean null values: fill missing discount values with 0, drop rows where customer_email or product_name is null
Standardize formats:
- Convert customer_email to lowercase
- Remove all non-digit characters from customer_phone
- Parse order_date and ship_date from mixed date formats to datetime
- Convert product_category to title case
Calculate derived fields:
- final_price = unit_price × quantity - discount (rounded to 2 decimals)
- order_month = extracted from order_date (format: YYYY-MM)
- customer_segment = Budget (<50), Standard (50-200), Premium (>200) based on final_price
- days_to_ship = difference in days between ship_date and order_date
Export the cleaned data to /home/interview/cleaned_orders.csv

Use method chaining and the .pipe() function to create a single transformation pipeline.

Note: pandas is already installed.

Step 1: Examine the raw data

python3 -c "import pandas as pd; df = pd.read_json('/home/interview/orders.json'); print(df.head()); print(df.info())"

Review the data quality issues and column types.

Step 2: Create the ETL pipeline script

nano /home/interview/etl_pipeline.py

Write a complete pipeline using method chaining:

import pandas as pd
from datetime import datetime

def clean_nulls(df):
    """Fill discount nulls with 0, drop rows with null email or product"""
    df['discount'] = df['discount'].fillna(0)
    df = df.dropna(subset=['customer_email', 'product_name'])
    return df

def standardize_formats(df):
    """Standardize email, phone, dates, and categories"""
    # Email to lowercase
    df['customer_email'] = df['customer_email'].str.lower()
    
    # Phone: remove all non-digits
    df['customer_phone'] = df['customer_phone'].str.replace(r'[^\d]', '', regex=True)
    
    # Product category to title case
    df['product_category'] = df['product_category'].str.title()
    
    # Parse dates from multiple formats
    df['order_date'] = pd.to_datetime(df['order_date'], format='mixed', errors='coerce')
    df['ship_date'] = pd.to_datetime(df['ship_date'], format='mixed', errors='coerce')
    
    return df

def segment(price):
    if price < 50:
        return 'Budget'
    elif price <= 200:
        return 'Standard'
    else:
        return 'Premium'

def calculate_derived_fields(df):
    """Calculate final_price, order_month, customer_segment, days_to_ship"""
    # Final price (rounded to 2 decimals)
    df['final_price'] = (df['unit_price'] * df['quantity'] - df['discount']).round(2)
    
    # Order month
    df['order_month'] = df['order_date'].dt.strftime('%Y-%m')
    
    # Customer segment: Budget (<50), Standard (50-200), Premium (>200)
    df['customer_segment'] = df['final_price'].apply(segment)
    
    # Days to ship
    df['days_to_ship'] = (df['ship_date'] - df['order_date']).dt.days
    
    return df

# Execute ETL pipeline with chaining
df = (
    pd.read_json('/home/interview/orders.json')
    .pipe(clean_nulls)
    .pipe(standardize_formats)
    .pipe(calculate_derived_fields)
)

# Export to CSV
df.to_csv('/home/interview/cleaned_orders.csv', index=False)

print(f"ETL pipeline completed: {len(df)} orders processed")

Step 3: Run the pipeline

python3 /home/interview/etl_pipeline.py

Step 4: Verify the output

head /home/interview/cleaned_orders.csv

Should show cleaned data with standardized formats and all derived fields.

Terminal requires a larger screen

Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.

Linux Terminal Environment

Write and execute your solution in the terminal below.

Track

	Question	Difficulty	Company	Access

Need more practice in this area? Explore more questions →