Build Complete ETL Pipeline with Data Cleaning and Transformations
Beginner Mode

Start your terminal to use beginner mode.

Scenario

Raw e-commerce order data contains inconsistencies, null values, and missing derived fields that need to be processed before analysis.

Task

Write a Python script at /home/interview/etl_pipeline.py that builds a complete ETL pipeline to:

  1. Read raw data from /home/interview/orders.json
  2. Clean null values: fill missing discount values with 0, drop rows where customer_email or product_name is null
  3. Standardize formats:
    • Convert customer_email to lowercase
    • Remove all non-digit characters from customer_phone
    • Parse order_date and ship_date from mixed date formats to datetime
    • Convert product_category to title case
  4. Calculate derived fields:
    • final_price = unit_price × quantity - discount (rounded to 2 decimals)
    • order_month = extracted from order_date (format: YYYY-MM)
    • customer_segment = Budget (<50), Standard (50-200), Premium (>200) based on final_price
    • days_to_ship = difference in days between ship_date and order_date
  5. Export the cleaned data to /home/interview/cleaned_orders.csv

Use method chaining and the .pipe() function to create a single transformation pipeline.

Note: pandas is already installed.

Terminal requires a larger screen

Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.

Linux Terminal Environment

Write and execute your solution in the terminal below.

Sign In

Track

Question Difficulty Company Access
Need more practice in this area? Explore more questions →