DoorDash: Regex Extract — Data Engineering Interview Q&A (2026)

Regex Extract

DoorDash | Easy | Spark | Regex | String Operations

Objective

A geologist is working with a dataset containing information about different rock samples. The dataset contains a description field with a mixture of letters and numbers representing the rock type and its approximate age.

Task

Extract the numeric parts from the description column to create a new column called age.

In the resulting DataFrame, the age column should contain only the numeric part extracted using a regular expression. If there is no numeric part in the description, the age column should contain an empty string ("").

Save your result as result_df, ensuring the final columns are ordered exactly as sample_id, description, and age.

File Path

Dataset: /home/interview/samples.csv
Starter script: /home/interview/extract_age.py

Schema

samples.csv

Column Name	Data Type
sample_id	string
description	string

Expected Output Schema

Column Name	Data Type
sample_id	string
description	string
age	string

Constraints:

The input DataFrame will have at least 1 row and at most $10^4$ rows.
The sample_id column will only contain unique alphanumeric strings with 1 to 50 characters.
The description column will contain alphanumeric strings with 1 to 100 characters.
The numeric part, if present, will be a positive integer.

Example

Given this sample input:

input_df

sample_id	description
S1	Basalt_450Ma
S2	Sandstone_300Ma
S3	Limestone
S4	Granite_200Ma
S5	Marble_1800Ma

The output would be:

sample_id	description	age
S1	Basalt_450Ma	450
S2	Sandstone_300Ma	300
S3	Limestone
S4	Granite_200Ma	200
S5	Marble_1800Ma	1800

Solution

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("PrepareshSpark").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

# Read every column as a string (no inferSchema) so the types match the stated schema.
df = spark.read.csv("/home/interview/samples.csv", header=True)

result_df = df.withColumn(
    "age", 
    F.regexp_extract(F.col("description"), r"(\d+)", 1)
)

result_df = result_df.select("sample_id", "description", "age")

# --- Do not edit below this line ---
result_df.coalesce(1).write.csv("/home/interview/output", header=True, mode="overwrite")
spark.stop()

Explanation

Step 1: Using regexp_extract

result_df = df.withColumn(
    "age", 
    F.regexp_extract(F.col("description"), r"(\d+)", 1)
)

When you need to pull a specific pattern out of a messy string, PySpark's regexp_extract function is a natural fit. It takes three arguments:

The column you want to search (F.col("description")).
The regular expression pattern to look for.
The group index to extract.
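To make the role of the third argument concrete, here is a minimal plain-Python sketch that imitates regexp_extract on a single string using the standard re module. The helper name regexp_extract_py and the pattern r"_(\d+)Ma" are hypothetical, chosen only for illustration. They are not part of PySpark or of the solution above, and Python's regex engine is only an approximation of the Java engine that Spark uses.

```python
import re


def regexp_extract_py(text, pattern, idx):
    """Hypothetical stand-in for regexp_extract applied to one string value."""
    match = re.search(pattern, text)
    if match is None:
        # Mirror the empty-string result when the pattern does not match.
        return ""
    # A group that did not participate in the match is also treated as empty.
    return match.group(idx) or ""


# Illustrative pattern (an assumption): literal text around the capture group
# makes the difference between group 0 and group 1 visible.
pattern = r"_(\d+)Ma"

whole_match = regexp_extract_py("Basalt_450Ma", pattern, 0)  # group 0: entire match
first_group = regexp_extract_py("Basalt_450Ma", pattern, 1)  # group 1: digits only

print(whole_match)  # _450Ma
print(first_group)  # 450
```

With the solution's pattern r"(\d+)", the whole match and the first group are the same text, so the index matters mainly once the pattern contains literal text outside the parentheses.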

Step 2: Defining the Regex Pattern

The pattern r"(\d+)" is used to find the numbers:

\d matches a single digit, 0 through 9.
+ means "one or more of the preceding element" (so the pattern matches "450" instead of just "4").
() creates a "capture group", the part of the match that regexp_extract can return by its index.
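The three pattern pieces can be compared side by side with Python's standard re module. This is only a sketch on one sample value: for plain ASCII input such as the example descriptions, Python's \d behaves the same way as described above, though the two regex engines are not identical in general.

```python
import re

description = "Basalt_450Ma"

# \d alone matches exactly one digit.
single_digit = re.search(r"\d", description).group(0)

# \d+ matches the full run of consecutive digits.
digit_run = re.search(r"\d+", description).group(0)

# (\d+) matches the same run and exposes it as capture group 1.
captured = re.search(r"(\d+)", description).group(1)

print(single_digit)  # 4
print(digit_run)     # 450
print(captured)      # 450
```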

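By passing 1 as the third argument to regexp_extract, we tell PySpark to return the contents of the first capture group. Because regexp_extract returns an empty string when the pattern does not match, we don't need any extra F.when() logic to handle cases like "Limestone". Note that only the first match in each string is returned, so a description containing two separate runs of digits would yield only the first run.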
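As a quick sanity check of the empty-string behavior, the sketch below applies the same pattern to the example rows in plain Python. The helper extract_age is a hypothetical name that imitates the per-row result and is not PySpark code. The value "Basalt2_450Ma" is an invented input, used only to show that the first run of digits is the one returned.

```python
import re

# Example rows from the problem statement: (sample_id, description).
rows = [
    ("S1", "Basalt_450Ma"),
    ("S2", "Sandstone_300Ma"),
    ("S3", "Limestone"),
    ("S4", "Granite_200Ma"),
    ("S5", "Marble_1800Ma"),
]


def extract_age(description):
    """Hypothetical helper: first run of digits, or "" when there is none."""
    match = re.search(r"(\d+)", description)
    return match.group(1) if match else ""


# Build (sample_id, description, age) tuples in the required column order.
result_rows = [(sid, desc, extract_age(desc)) for sid, desc in rows]
for row in result_rows:
    print(row)

# Invented input with two digit runs: only the first run comes back.
first_only = extract_age("Basalt2_450Ma")
print(first_only)  # 2
```

The row for S3 comes back as ("S3", "Limestone", ""), matching the empty age cell in the expected output.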

Step 3: Selecting Final Schema Order

result_df = result_df.select("sample_id", "description", "age")

Finally, we use .select() to ensure the output columns match the exact order requested in the expected output schema.
