Objective
A geologist is working with a dataset containing information about different rock samples. The dataset contains a description field with a mixture of letters and numbers representing the rock type and its approximate age.
Task
Extract the numeric parts from the description column to create a new column called age.
In the resulting DataFrame, the age column should contain only the numeric part extracted using a regular expression. If there is no numeric part in the description, the age column should contain an empty string ("").
Save your result as result_df, ensuring the final columns are ordered exactly as sample_id, description, and age.
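One possible approach, sketched with pandas: `Series.str.extract` with a single capture group pulls out the first run of digits, and `fillna("")` turns non-matches into empty strings. The helper name `extract_age` and the in-memory sample rows are illustrative assumptions, not part of the starter script; the sketch also assumes the first digit run in a description is the age.

```python
import pandas as pd


def extract_age(input_df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of input_df with an 'age' column of extracted digits."""
    result = input_df.copy()
    # (\d+) captures the first run of digits; rows without digits yield NaN,
    # which fillna("") converts to the required empty string.
    result["age"] = (
        result["description"].str.extract(r"(\d+)", expand=False).fillna("")
    )
    # Enforce the required column order.
    return result[["sample_id", "description", "age"]]


# Illustrative rows; in the exercise the data would instead come from
# pd.read_csv("/home/interview/samples.csv", dtype=str).
input_df = pd.DataFrame(
    {
        "sample_id": ["S1", "S3", "S5"],
        "description": ["Basalt_450Ma", "Limestone", "Marble_1800Ma"],
    }
)
result_df = extract_age(input_df)
print(result_df)
```

Reading the CSV with `dtype=str` keeps `sample_id` and `description` as strings, so the extracted `age` values stay strings as the output schema requires.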
File Path
- Dataset: /home/interview/samples.csv
- Starter script: /home/interview/extract_age.py
Schema
samples.csv
| Column Name | Data Type |
|---|---|
| sample_id | string |
| description | string |
Expected Output Schema
| Column Name | Data Type |
|---|---|
| sample_id | string |
| description | string |
| age | string |
Constraints:
- The input DataFrame will have at least 1 row and at most $10^4$ rows.
- The sample_id column will only contain unique alphanumeric strings with 1 to 50 characters.
- The description column will contain alphanumeric strings with 1 to 100 characters.
- The numeric part, if present, will be a positive integer.
Example
Given this sample input:
input_df
| sample_id | description |
|---|---|
| S1 | Basalt_450Ma |
| S2 | Sandstone_300Ma |
| S3 | Limestone |
| S4 | Granite_200Ma |
| S5 | Marble_1800Ma |
The output would be:
| sample_id | description | age |
|---|---|---|
| S1 | Basalt_450Ma | 450 |
| S2 | Sandstone_300Ma | 300 |
| S3 | Limestone | |
| S4 | Granite_200Ma | 200 |
| S5 | Marble_1800Ma | 1800 |
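The example rows above can be checked against the regular expression with the standard library alone. This is a minimal sketch; the helper name `first_number` is hypothetical, and it assumes the age is the first run of digits in the description.

```python
import re


def first_number(description: str) -> str:
    """Return the first run of digits in description, or "" if there is none."""
    match = re.search(r"\d+", description)
    return match.group() if match else ""


# The five rows from the example input.
rows = [
    ("S1", "Basalt_450Ma"),
    ("S2", "Sandstone_300Ma"),
    ("S3", "Limestone"),
    ("S4", "Granite_200Ma"),
    ("S5", "Marble_1800Ma"),
]

ages = [first_number(description) for _, description in rows]
print(ages)
```

The list holds one string per row, with an empty string for Limestone, which has no numeric part.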