Palantir: Extract Schema Information from Parquet File Using PyArrow — Data Engineering Interview Q&A (2026)

86. Extract Schema Information from Parquet File Using PyArrow

Scenario

A Parquet file contains structured data and you need to extract and document its schema information for analysis purposes.

Task

Write a Python script at /home/interview/extract_schema.py using pyarrow that reads /home/interview/data.parquet, extracts the schema information (column names, data types, compression codec, row count, and file size), and saves the output as JSON to /home/interview/schema_info.json.

Note: The pyarrow module is already installed.

Example

Expected output format in /home/interview/schema_info.json:

{
  "file": "/home/interview/data.parquet",
  "row_count": 390,
  "file_size_bytes": 45632,
  "file_size_kb": 44.56,
  "compression_codec": "SNAPPY",
  "columns": [
    {
      "name": "id",
      "type": "int64"
    },
    {
      "name": "name",
      "type": "string"
    },
    ...
  ]
}

Step 1: Create the Python script

nano /home/interview/extract_schema.py

Write a script using pyarrow to extract schema information:

import json
import os

import pyarrow.parquet as pq

PARQUET_PATH = '/home/interview/data.parquet'
OUTPUT_PATH = '/home/interview/schema_info.json'

# Open the file; this reads the footer metadata, not the row data
parquet_file = pq.ParquetFile(PARQUET_PATH)

# Arrow schema: column names and data types
schema = parquet_file.schema_arrow

# File-level metadata
metadata = parquet_file.metadata
row_count = metadata.num_rows
file_size = os.path.getsize(PARQUET_PATH)

# Compression codec of the first column chunk in the first row group.
# Parquet sets compression per column, so other columns may differ,
# and a file with no row groups has no codec to report.
if metadata.num_row_groups > 0:
    compression = metadata.row_group(0).column(0).compression
else:
    compression = None

# Build schema info dictionary
schema_info = {
    "file": PARQUET_PATH,
    "row_count": row_count,
    "file_size_bytes": file_size,
    "file_size_kb": round(file_size / 1024, 2),
    "compression_codec": compression,
    "columns": [
        {"name": field.name, "type": str(field.type)}
        for field in schema
    ]
}

# Write to JSON file
with open(OUTPUT_PATH, 'w') as f:
    json.dump(schema_info, f, indent=2)

print("Schema information extracted successfully")

Step 2: Run the script

python3 /home/interview/extract_schema.py

Step 3: Verify the output

cat /home/interview/schema_info.json

The output should be a JSON document containing the column names, data types, compression codec, row count, and file size, in the format shown in the example.
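
Beyond reading the output by eye, the structure can be checked programmatically. The following is a minimal sketch using only the standard library; the function name validate_schema_info and its specific checks are assumptions for illustration, not part of the task's grading. The sample dictionary reuses the values from the example output, trimmed to its two visible columns.

```python
import json

REQUIRED_KEYS = {
    "file", "row_count", "file_size_bytes",
    "file_size_kb", "compression_codec", "columns",
}


def validate_schema_info(info):
    """Return a list of problems found in a schema_info dictionary."""
    missing = REQUIRED_KEYS - set(info)
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    if not isinstance(info["row_count"], int) or info["row_count"] < 0:
        problems.append("row_count must be a non-negative integer")

    # file_size_kb should be file_size_bytes / 1024, rounded to 2 decimals.
    expected_kb = round(info["file_size_bytes"] / 1024, 2)
    if abs(info["file_size_kb"] - expected_kb) > 0.01:
        problems.append(f"file_size_kb should be about {expected_kb}")

    # Every column entry should carry exactly a name and a type.
    for column in info["columns"]:
        if set(column) != {"name", "type"}:
            problems.append(f"unexpected column entry: {column}")
    return problems


# Values from the example output, limited to the two columns it shows.
sample = json.loads("""
{
  "file": "/home/interview/data.parquet",
  "row_count": 390,
  "file_size_bytes": 45632,
  "file_size_kb": 44.56,
  "compression_codec": "SNAPPY",
  "columns": [
    {"name": "id", "type": "int64"},
    {"name": "name", "type": "string"}
  ]
}
""")

problems = validate_schema_info(sample)
print(problems or "schema_info looks consistent")
```

To check the real output, load /home/interview/schema_info.json with json.load and pass the result to the same function.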
