Data Engineering Interview Questions (30+ Questions)
Last Updated: June 8, 2026 • 30 Questions • Real Company Interviews
Are you preparing for a data engineering interview? Data engineering involves creating, building, and maintaining the infrastructure and systems that facilitate data collection, storage, and analysis. It focuses on developing robust data pipelines and ensuring seamless data flow from source to destination. For both newcomers and experienced professionals, mastering data engineering is vital in today's data-focused world.
Table of Contents
- Top K Frequent Elements in Stream (medium)
- LFU Cache (hard)
- Event Stream Deduplicator (medium)
- Moving Average from Data Stream (easy)
- Skew-Aware Key Partitioner (medium)
- Hash Join Simulator (medium)
- Log Aggregator (medium)
- Find Median from Data Stream (hard)
- Min Stack (medium)
- Design Twitter (medium)
- Design Add and Search Words Data Structure (medium)
- Implement Trie (Prefix Tree) (medium)
- LRU Cache (medium)
- Continuous Subarray Sum (medium)
- Accounts Merge (medium)
- Subarray Sum Equals K (medium)
- Sliding Window Median (hard)
- Managing High I/O Processes (easy)
- Average Order Value (easy)
- Find Cheapest Product (easy)
- New vs. Returning Customer Report (easy)
- Above Average Price Products (medium)
- Frequent Price Change Detector (medium)
- Self-Join for Manager Names (easy)
- Join Employees and Departments (easy)
- Find Highest Salary (easy)
- Aggregate Employee Salaries (medium)
- Filter Orders by Date Range (easy)
- Count Distinct Product Categories (medium)
- Join Three Tables (easy)
In this article, we present Data Engineering Interview Questions with answers typically asked in data engineer interviews at high-paying companies. Whether you're preparing for your first data engineering role or looking to progress in your career, this guide will help you navigate the most frequently asked data engineering interview questions and stand out in competitive job interviews.
💡 Pro Tips for Data Engineering Interviews
- Practice each question and understand the underlying concepts
- Review company-specific technologies and methodologies
- Prepare follow-up questions and edge cases
- Practice explaining your solutions clearly and concisely
Interview Questions & Answers
1. Top K Frequent Elements in Stream
import heapq
from collections import defaultdict
class TopKFrequent:
def init(self, k):
self.k = k
self.freq = defaultdict(int)
def add(self, num):
self.freq[num] += 1
def topK(self):
items = [(-count, num) for num, count in self.freq.items()]
top = heapq.nsmallest(self.k, items)
return [num for _, num in top]
2. LFU Cache
from collections import defaultdict, OrderedDict
class LFUCache:
def init(self, capacity):
self.cap = capacity
self.min_freq = 0
self.key_val = {}
self.key_freq = {}
self.freq_keys = defaultdict(OrderedDict)
def _update(self, key):
freq = self.key_freq[key]
self.freq_keys[freq].pop(key)
if not self.freq_keys[freq] and self.min_freq == freq:
self.min_freq += 1
self.key_freq[key] = freq + 1
self.freq_keys[freq + 1][key] = None
def get(self, key):
if key not in self.key_val:
return -1
self._update(key)
return self.key_val[key]
def put(self, key, value):
if self.cap <= 0:
return
if key in self.key_val:
self.key_val[key] = value
self._update(key)
return
if len(self.key_val) >= self.cap:
evict_key, _ = self.freq_keys[self.min_freq].popitem(last=False)
del self.key_val[evict_key]
del self.key_freq[evict_key]
self.key_val[key] = value
self.key_freq[key] = 1
self.freq_keys[1][key] = None
self.min_freq = 1
3. Event Stream Deduplicator
class Deduplicator:
def init(self, ttl):
self.ttl = ttl
self.seen = {}
def process(self, timestamp, eventId):
expired = [k for k, t in self.seen.items() if timestamp - t > self.ttl]
for k in expired:
del self.seen[k]
if eventId in self.seen:
self.seen[eventId] = timestamp
return False
self.seen[eventId] = timestamp
return True
4. Moving Average from Data Stream
from collections import deque
class MovingAverage:
def init(self, size):
self.size = size
self.queue = deque()
self.window_sum = 0
def next(self, val):
self.queue.append(val)
self.window_sum += val
if len(self.queue) > self.size:
self.window_sum -= self.queue.popleft()
return self.window_sum / len(self.queue)
5. Skew-Aware Key Partitioner
class SkewHandler:
def init(self, numBuckets, hotThreshold, splitFactor):
self.n = numBuckets
self.threshold = hotThreshold
self.split = splitFactor
self.counts = {}
self.robin = {}
self.load = [0] * numBuckets
def _hash(self, key):
h = 0
for c in key:
h = h * 31 + ord(c)
return h % self.n
def assign(self, key):
self.counts[key] = self.counts.get(key, 0) + 1
base = self._hash(key)
if self.counts[key] > self.threshold:
idx = self.robin.get(key, 0)
bucket = (base + idx) % self.n
self.robin[key] = (idx + 1) % self.split
else:
bucket = base
self.load[bucket] += 1
return bucket
def getLoad(self):
return list(self.load)
6. Hash Join Simulator
from collections import defaultdict
class HashJoin:
def init(self):
self.table = defaultdict(list)
def build(self, rows, keyIndex):
self.table.clear()
for row in rows:
self.table[row[keyIndex]].append(row)
def probe(self, rows, keyIndex):
result = []
for row in rows:
for build_row in self.table.get(row[keyIndex], []):
result.append(build_row + row)
result.sort()
return result
7. Log Aggregator
from collections import defaultdict
import bisect
class LogAggregator:
def init(self):
self.logs = defaultdict(list)
def record(self, timestamp, key):
self.logs[key].append(timestamp)
def count(self, key, start, end):
if key not in self.logs:
return 0
timestamps = self.logs[key]
left = bisect.bisect_left(timestamps, start)
right = bisect.bisect_right(timestamps, end)
return right - left
8. Find Median from Data Stream
import heapq
class MedianFinder:
def init(self):
self.small = [] # max-heap (negated values)
self.large = [] # min-heap
def addNum(self, num):
heapq.heappush(self.small, -num)
heapq.heappush(self.large, -heapq.heappop(self.small))
if len(self.large) > len(self.small):
heapq.heappush(self.small, -heapq.heappop(self.large))
def findMedian(self):
if len(self.small) > len(self.large):
return float(-self.small[0])
return (-self.small[0] + self.large[0]) / 2.0
9. Min Stack
class MinStack:
def init(self):
self.stack = []
self.min_stack = []
def push(self, val: int) -> None:
self.stack.append(val)
if self.min_stack:
self.min_stack.append(min(val, self.min_stack[-1]))
else:
self.min_stack.append(val)
def pop(self) -> None:
self.stack.pop()
self.min_stack.pop()
def top(self) -> int:
return self.stack[-1]
def getMin(self) -> int:
return self.min_stack[-1]
10. Design Twitter
class Twitter:
def init(self):
self.count = 0
self.tweetMap = {}
self.followMap = {}
def postTweet(self, userId: int, tweetId: int) -> None:
if userId not in self.tweetMap:
self.tweetMap[userId] = []
self.tweetMap[userId].append([self.count, tweetId])
self.count -= 1
def getNewsFeed(self, userId: int) -> list[int]:
res = []
minHeap = []
if userId not in self.followMap:
self.followMap[userId] = set()
self.followMap[userId].add(userId)
for followeeId in self.followMap[userId]:
if followeeId in self.tweetMap and len(self.tweetMap[followeeId]) > 0:
index = len(self.tweetMap[followeeId]) - 1
count, tweetId = self.tweetMap[followeeId][index]
heapq.heappush(minHeap, [count, tweetId, followeeId, index - 1])
while minHeap and len(res) < 10:
count, tweetId, followeeId, index = heapq.heappop(minHeap)
res.append(tweetId)
if index >= 0:
next_count, next_tweetId = self.tweetMap[followeeId][index]
heapq.heappush(minHeap, [next_count, next_tweetId, followeeId, index - 1])
self.followMap[userId].remove(userId)
return res
def follow(self, followerId: int, followeeId: int) -> None:
if followerId not in self.followMap:
self.followMap[followerId] = set()
self.followMap[followerId].add(followeeId)
def unfollow(self, followerId: int, followeeId: int) -> None:
if followerId in self.followMap and followeeId in self.followMap[followerId]:
self.followMap[followerId].remove(followeeId)
11. Design Add and Search Words Data Structure
class TrieNode:
def init(self):
self.children = {}
self.is_end = False
class WordDictionary:
def init(self):
self.root = TrieNode()
def addWord(self, word: str) -> None:
curr = self.root
for char in word:
if char not in curr.children:
curr.children[char] = TrieNode()
curr = curr.children[char]
curr.is_end = True
def search(self, word: str) -> bool:
def dfs(j, root):
curr = root
for i in range(j, len(word)):
char = word[i]
if char == '.':
for child in curr.children.values():
if dfs(i + 1, child):
return True
return False
else:
if char not in curr.children:
return False
curr = curr.children[char]
return curr.is_end
return dfs(0, self.root)
12. Implement Trie (Prefix Tree)
class TrieNode:
def init(self):
self.children = {}
self.is_end_of_word = False
class Trie:
def init(self):
self.root = TrieNode()
def insert(self, word: str) -> None:
curr = self.root
for char in word:
if char not in curr.children:
curr.children[char] = TrieNode()
curr = curr.children[char]
curr.is_end_of_word = True
def search(self, word: str) -> bool:
curr = self.root
for char in word:
if char not in curr.children:
return False
curr = curr.children[char]
return curr.is_end_of_word
def startsWith(self, prefix: str) -> bool:
curr = self.root
for char in prefix:
if char not in curr.children:
return False
curr = curr.children[char]
return True
13. LRU Cache
class Node:
def init(self, key=0, val=0):
self.key = key
self.val = val
self.prev = None
self.next = None
class LRUCache:
def init(self, capacity: int):
self.cap = capacity
self.cache = {}
self.left = Node()
self.right = Node()
self.left.next = self.right
self.right.prev = self.left
def remove(self, node):
prev_node = node.prev
next_node = node.next
prev_node.next = next_node
next_node.prev = prev_node
def insert(self, node):
prev_mru = self.right.prev
prev_mru.next = node
self.right.prev = node
node.prev = prev_mru
node.next = self.right
def get(self, key: int) -> int:
if key in self.cache:
self.remove(self.cache[key])
self.insert(self.cache[key])
return self.cache[key].val
return -1
def put(self, key: int, value: int) -> None:
if key in self.cache:
self.remove(self.cache[key])
self.cache[key] = Node(key, value)
self.insert(self.cache[key])
if len(self.cache) > self.cap:
lru = self.left.next
self.remove(lru)
del self.cache[lru.key]
14. Continuous Subarray Sum
def check_subarray_sum(nums: list[int], k: int) -> bool:
remainder_map = {0: -1}
prefix_sum = 0
for i in range(len(nums)):
prefix_sum += nums[i]
remainder = prefix_sum % k
if remainder in remainder_map:
if i - remainder_map[remainder] >= 2:
return True
else:
remainder_map[remainder] = i
return False
15. Accounts Merge
class UnionFind:
def init(self):
self.parent = {}
def find(self, x):
if x not in self.parent:
self.parent[x] = x
if self.parent[x] != x:
self.parent[x] = self.find(self.parent[x])
return self.parent[x]
def union(self, x, y):
rootX = self.find(x)
rootY = self.find(y)
if rootX != rootY:
self.parent[rootY] = rootX
def accounts_merge(accounts: list[list[str]]) -> list[list[str]]:
uf = UnionFind()
email_to_name = {}
for acc in accounts:
name = acc[0]
first_email = acc[1]
for i in range(1, len(acc)):
email = acc[i]
email_to_name[email] = name
uf.union(first_email, email)
merged_emails = {}
for email in email_to_name:
root = uf.find(email)
if root not in merged_emails:
merged_emails[root] = []
merged_emails[root].append(email)
res = []
for root, emails in merged_emails.items():
res.append([email_to_name[root]] + sorted(emails))
return res
16. Subarray Sum Equals K
def subarray_sum(nums: list[int], k: int) -> int:
count = 0
prefix_sum = 0
prefix_map = {0: 1}
for num in nums:
prefix_sum += num
diff = prefix_sum - k
if diff in prefix_map:
count += prefix_map[diff]
prefix_map[prefix_sum] = prefix_map.get(prefix_sum, 0) + 1
return count
17. Sliding Window Median
def median_sliding_window(nums: list[int], k: int) -> list[float]:
small = []
large = []
lazy = {}
for i in range(k):
heapq.heappush(small, -nums[i])
for i in range(k // 2):
heapq.heappush(large, -heapq.heappop(small))
def get_median():
if k % 2 == 1:
return float(-small[0])
return (-small[0] + large[0]) / 2.0
res = [get_median()]
for i in range(k, len(nums)):
out_num = nums[i - k]
in_num = nums[i]
lazy[out_num] = lazy.get(out_num, 0) + 1
balance = 0
if out_num <= -small[0]:
balance -= 1
else:
balance += 1
if small and in_num <= -small[0]:
balance += 1
heapq.heappush(small, -in_num)
else:
balance -= 1
heapq.heappush(large, in_num)
if balance < 0:
heapq.heappush(small, -heapq.heappop(large))
elif balance > 0:
heapq.heappush(large, -heapq.heappop(small))
while small and lazy.get(-small[0], 0) > 0:
lazy[-small[0]] -= 1
heapq.heappop(small)
while large and lazy.get(large[0], 0) > 0:
lazy[large[0]] -= 1
heapq.heappop(large)
res.append(get_median())
return res
18. Managing High I/O Processes
Learn how to identify and manage high I/O consuming processes on a Linux server using command-line tools. This guide covers sorting running processes by disk activity and mitigating disk bottlenecks through throttling, stopping, or rescheduling heavy jobs, improving overall system responsiveness and application performance.
19. Average Order Value
Objective
To answer the SQL interview question effectively, you need to craft a query that calculates and displays the average expenditure per customer from a table named Orders. This table records customer purchases. The results should display each customer_id alongside their avg_order_value, rounded to two decimal places. Finally, ensure the output is sorted in ascending order based on customer_id.
Additional Information
Table Schema:
- Orders:
order_id(INTEGER): Unique identifier for each order.customer_id(INTEGER): Identifier for the customer who made the order.total_amount(DECIMAL): Total amount spent on the order.order_date(DATE): The date when the order was placed.
Constraints:
- Each
customer_idmay have multiple associated orders. - There is at least one order present in the table.
Output Requirements:
- Columns:
customer_id,avg_order_value avg_order_valuemust be rounded to two decimal places.- Results must be ordered by
customer_idin ascending order.
Detailed Query Explanation
The query you need to write should meet the specific output requirements and constraints mentioned. Here's the detailed SQL query designed to solve the problem:
SELECT
customer_id,
ROUND(AVG(total_amount), 2) AS avg_order_value
FROM
Orders
GROUP BY
customer_id
ORDER BY
customer_id;
Keyword Optimization
When formulating the answer:
- "SQL query to determine the average expenditure per customer"
- "display
customer_idalongside theiravg_order_value" - "rounded to two decimal places"
- "sorted in ascending order based on
customer_id" - "table schema includes
order_id,customer_id,total_amount, andorder_date"
These phrases and keywords are naturally embedded in the explanation and query to ensure the content is both informative and optimized for SEO.
20. Find Cheapest Product
Objective
Retrieve the name and price of the product(s) with the lowest price from the products table.
Additional Information
- The
productstable contains the following columns:id(integer): Unique identifier for each product.name(string): Name of the product.price(decimal): Price of the product.
- There may be multiple products sharing the same lowest price.
- Write an SQL query that returns the
nameandpriceof the cheapest product(s).
In order to tackle this SQL interview question effectively, you need to identify the product or products with the lowest price from the products table. Here's the step-by-step process to achieve this:
- First, determine the lowest price among all the products.
- Then, retrieve the
nameandpriceof the product(s) with this minimum price.
Now, let's write the precise SQL query to accomplish this objective:
SELECT
name,
price
FROM
products
WHERE
price = (SELECT MIN(price) FROM products);
This SQL query efficiently retrieves the name and price of the product(s) with the lowest price from the products table by leveraging a subquery to find the minimum price. This approach ensures that even if there are multiple products with the same lowest price, all of them will be included in the result set.
21. New vs. Returning Customer Report
Generating a Detailed Customer Order Report in SQL
The given objective of this interview question is to write an SQL query to generate a comprehensive report that lists each order along with specific associated details such as order date, customer ID, customer name, and customer type, which indicates whether the order is the customer's first or a subsequent order. This can be achieved by linking information from two database tables, orders and customers.
Here’s how you can achieve this:
SQL Query to Generate the Report
WITH first_orders AS (
SELECT
customer_id,
MIN(order_date) AS first_order_date
FROM
orders
GROUP BY
customer_id
)
SELECT
o.order_date,
o.customer_id,
c.customer_name,
CASE
WHEN o.order_date = fo.first_order_date THEN 'New'
ELSE 'Returning'
END AS customer_type
FROM
orders o
JOIN
customers c ON o.customer_id = c.id
LEFT JOIN
first_orders fo ON o.customer_id = fo.customer_id
ORDER BY
o.order_date,
o.customer_id;
Explanation of the SQL Query Structure
The query begins with a
WITHclause, also known as a Common Table Expression (CTE), namedfirst_orders. This CTE identifies the earliest order date for each customer by using the SQL aggregation functions:- Columns Selected:
customer_idfirst_order_date(calculated as the minimumorder_datefor eachcustomer_id)
- Grouped By:
customer_id
- Columns Selected:
The main
SELECTstatement retrieves the following details:order_date: The date when the order was made (orders.order_date).customer_id: Customer identifier (orders.customer_id).customer_name: Name of the customer (customers.customer_name).customer_type: A CASE statement determining if the order is the customer's first or subsequent order. Iforders.order_datematchesfirst_orders.first_order_date, it is marked as ‘New’; otherwise, it is marked as ‘Returning’.
Join Operations:
- The
orderstable is joined with thecustomerstable oncustomer_idto get the customer name. - The CTE
first_ordersis LEFT JOINED with theorderstable to determine if the order is the first order or not.
- The
Ordering:
- The results are ordered first by
order_dateand then bycustomer_idfor better readability and to match expected output structure.
- The results are ordered first by
Additional Notes
- The use of SQL functions such as
MIN()andGROUP BYensures that the query efficiently handles large datasets by summarizing order data per customer. - The
LEFT JOINis crucial to ensure that even if there are no orders yet, customers who have placed future orders can be correctly identified.
This SQL query is designed to provide all necessary details in an ordered and efficient manner to analyze customer purchasing patterns effectively.
22. Above Average Price Products
SQL Interview Question: Fetching Available Products Above Average Price
Objective
Develop an SQL query to fetch all products currently available in stock and priced above the average price of all available products. The dataset should be organized primarily based on price in descending order and secondarily by rating in descending order. Products lacking a rating should be placed at the end of the sorted list.
Additional Information
- Include only products with
stock_quantitygreater than 0. - The average price must be computed solely based on the products that are in stock.
- When sorting, products with a
NULLrating should appear after those with non-null ratings. - The query should return all columns from the
Productstable. - Account for the possibility that some products might not have a rating value.
This detailed explanation contributes valuable insights into constructing a sophisticated SQL query that efficiently retrieves and sorts products based on inventory status and price, offering critical functionality for database management. Understanding and mastering such queries are essential for optimizing data handling and retrieval processes in various business applications.
Sample Query
SELECT *
FROM Products
WHERE stock_quantity > 0
AND price > (SELECT AVG(price) FROM Products WHERE stock_quantity > 0)
ORDER BY price DESC, rating DESC
NULLS LAST;
By employing the above query, you can ensure that you are extracting relevant product data effectively, adhering to the given criteria, and enhancing your SQL skill set for advanced data manipulation and querying tasks.
23. Frequent Price Change Detector
Identifying Products with Multiple Price Changes Using SQL
Objective
In this interview question, you are provided with two tables: products and price_history. Your task is to write an SQL query that identifies all products that have experienced at least two price changes. A price change is defined as a difference in price from one entry to the next for the same product, based on the chronological order of dates.
Requirements
Tables:
products:id(integer): The unique identifier for each product.name(string): The name of the product.category(string): The category to which the product belongs.
price_history:product_id(integer): The identifier linking to theproductstable.price(decimal): The price of the product at a specific date.date(date): The date when the price was recorded.
Constraints:
- Each
product_idinprice_historycorresponds to anidinproducts. - Prices are recorded in chronological order for each product.
- Each
Output:
- The product’s name.
- The category of the product.
- The total number of price changes, returned as a string.
Conditions:
- Only consider consecutive price entries for determining a price change.
- A product must have at least two price changes to be included in the result.
Order:
- Results should be ordered by the number of price changes in descending order.
- In case of ties, results should be ordered by the product ID in ascending order.
This query helps in understanding the dynamics of product pricing over time and can be a valuable insight for inventory management, pricing strategies, and market analysis. Performing well on such questions requires a meticulous approach to grouping and ordering, which is fundamental in SQL querying.
24. Self-Join for Manager Names
SQL Query to Retrieve Employee and Manager Names Ordered by Employee ID
In a typical SQL job interview, you might encounter questions designed to evaluate your ability to manipulate and retrieve data using SQL queries. One common question involves retrieving each employee's name alongside their respective manager's name from a given employees table.
To solve this, you'll need to create a SQL query that:
- Retrieves all employees.
- Includes each employee's manager name (if one exists).
- Ensures that employees without managers are still listed with a
NULLvalue for the manager's name. - Orders the results by the employee's unique identifier in ascending order.
Sample SQL Query
Here’s a detailed SQL query that meets the requirements:
SELECT
e.id AS employee_id,
e.name AS employee_name,
m.name AS manager_name
FROM
employees e
LEFT JOIN
employees m ON e.manager_id = m.id
ORDER BY
e.id ASC;
Explanation of the Query
SELECT: Specifies the columns you want to retrieve. Here we select the
idandnamefrom theemployeestable (aliased ase) as well as thenamefrom the manager.e.id AS employee_id: Retrieves the employee's ID and labels it as
employee_id.e.name AS employee_name: Retrieves the employee's name and labels it as
employee_name.m.name AS manager_name: Retrieves the manager's name (if it exists) and labels it as
manager_name.FROM employees e: Specifies the primary table we are querying from (
employees), aliasing it ase.LEFT JOIN employees m ON e.manager_id = m.id: Performs a left join on the
employeestable itself to get the manager's details. The join condition matchese.manager_idwithm.id, ensuring managers are matched to their subordinates.ORDER BY e.id ASC: Orders the result by the
idof the employees in ascending order.
This query will yield a list of all employees with their respective manager names, if available. Employees without a manager will have NULL in the manager_name column, and the results will be neatly ordered by employee ID.
Conclusion
Answering SQL interview questions effectively often involves not just solving the problem but doing so in a clear, efficient, and accurate manner. This SQL query example demonstrates your ability to handle self-joins, NULL values, and sorting, making it a comprehensive answer to the posed question.
By fully understanding and implementing the core requirements—retrieving employees and their managers, handling employees without managers, and ordering the results by employee ID—you’ll showcase your SQL proficiency and logical problem-solving skills, critical assets for any data-centric role.
25. Join Employees and Departments
Understanding the Question
Objective
The goal is to write an SQL query to retrieve and display a list of employees who earn more than $50,000. For each employee that meets the specified salary condition, the output should include several details: employee ID, first name, last name, job title, hire date, and the name of their department. If an employee does not belong to any department, the department name should be presented as NULL. Furthermore, the employees should be listed in descending order based on their hire date, showcasing the most recently hired employees first.
Additional Information
Tables:
employeestable:employee_id(Integer): Unique identifier for each employee.first_name(String): Employee's first name.last_name(String): Employee's last name.department_id(Integer): Identifier for the department to which the employee belongs (can beNULL).job_title(String): The title of the employee's job.salary(Integer): The employee's salary.hire_date(Date): The date the employee was hired.
departmentstable:department_id(Integer): Unique identifier for each department.department_name(String): The name of the department.location(String): The location of the department.
Requirements
- Use a
LEFT JOINto merge theemployeesanddepartmentstables on thedepartment_id. - Filter to only include employees with a
salarygreater than 50000. - Display
department_nameasNULLfor employees who are not assigned to any department. - Order the results by
hire_datein descending order.
Example
Input:
employeestable:employee_id first_name last_name department_id job_title salary hire_date 1 John Smith 1 Senior Developer 85000 2020-01-15 2 Mary Johnson 2 Project Manager 75000 2021-03-20 3 Peter Brown NULL Consultant 65000 2022-06-10 4 Sarah Davis 1 Developer 45000 2021-09-01 departmentstable:department_id department_name location 1 Engineering New York 2 Project Management Boston 3 Marketing Chicago
Expected Output:
| employee_id | first_name | last_name | department_name | job_title | hire_date |
|---|---|---|---|---|---|
| 3 | Peter | Brown | NULL | Consultant | 2022-06-10 |
| 2 | Mary | Johnson | Project Management | Project Manager | 2021-03-20 |
| 1 | John | Smith | Engineering | Senior Developer | 2020-01-15 |
SQL Query:
SELECT
e.employee_id,
e.first_name,
e.last_name,
d.department_name,
e.job_title,
e.hire_date
FROM
employees e
LEFT JOIN
departments d ON e.department_id = d.department_id
WHERE
e.salary > 50000
ORDER BY
e.hire_date DESC;
26. Find Highest Salary
Objective
Write an SQL query to retrieve the names and salaries of the highest-paid employee(s) from the employees table.
Additional Information
- The
employeestable consists of the following columns:id(integer): The unique identifier for each employee.name(string): The name of the employee.salary(integer): The salary of the employee.
- If multiple employees share the highest salary, include all of them in the result.
- The output should contain two columns:
nameandsalary. - The results can be returned in any order.
This SQL query requires identifying the highest salary first and then selecting the names and salaries of all employees who earn this highest salary.
To start with, the query can use a subquery to determine the maximum salary from the employees table. Subsequently, a main query can be employed to fetch the names and salaries of employees whose salaries match the maximum salary obtained from the subquery.
Here is the complete SQL query to achieve this objective:
SQL Query
SELECT name, salary
FROM employees
WHERE salary = (SELECT MAX(salary) FROM employees);
This query effectively retrieves the names and salaries of all employees who have the highest salary in the employees table, satisfying the given requirements.
27. Aggregate Employee Salaries
Objective
Given two tables, employees and departments, construct a SQL query to display each department's name, the total number of employees in that department, the aggregate salary of all its employees, and the average salary rounded to two decimal places. The output should be sorted alphabetically by the department name.
Additional Information
Tables:
employees:
id(INTEGER): Unique identifier for each employee.name(VARCHAR): Name of the employee.department_id(INTEGER): Identifier linking to the employee's department.salary(INTEGER): Salary of the employee.
departments:
id(INTEGER): Unique identifier for each department.name(VARCHAR): Name of the department.location(VARCHAR): Location of the department.
Requirements:
- Use appropriate JOIN operations to link
employeeswithdepartments. - Calculate the total number of employees (
employee_count) in each department. - Compute the sum of salaries (
total_salary) for each department. - Determine the average salary (
average_salary) for each department, rounded to two decimal places. - Alias the department name as
department. - Order the final results by the
departmentname in ascending order.
Output Columns:
department(VARCHAR): Name of the department.employee_count(INTEGER): Number of employees in the department.total_salary(INTEGER): Combined salaries of all employees in the department.average_salary(DECIMAL(10,2)): Average salary within the department, rounded to two decimal places.
28. Filter Orders by Date Range
How to Construct an SQL Query to Fetch Order Details within a Specific Date Range
If you're preparing for an interview, crafting a precise SQL query to fetch order details is a skill you need to master. The task involves fetching order_id, customer_name, order_date, and total_amount from the orders table, only for orders placed between January 1, 2023, and June 30, 2023. The results must be sorted by order_date in ascending order. Here’s a step-by-step guide on how to nail this:
SQL Query Example
To achieve this, you need to follow these pointers:
- Select the desired columns:
order_id,customer_name,order_date, andtotal_amount. - Filter the results to only include orders where the
order_datefalls inclusively between '2023-01-01' and '2023-06-30'. - Order the results by
order_datein ascending order to get the earliest orders at the top.
Here is the SQL query you need:
SELECT order_id, customer_name, order_date, total_amount
FROM orders
WHERE order_date BETWEEN '2023-01-01' AND '2023-06-30'
ORDER BY order_date ASC;
Explanation
- FROM orders: Indicates the data source, which is the
orderstable. - SELECT order_id, customer_name, order_date, total_amount: Specifies the columns you want to retrieve.
- WHERE order_date BETWEEN '2023-01-01' AND '2023-06-30': Filters the results to include only those orders placed within the defined date range.
- ORDER BY order_date ASC: Ensures that the output is sorted by
order_datein ascending order.
This query effectively pulls the required data while maintaining the specified conditions and ordering.
Conclusion
Being able to craft an SQL query that accurately fetches order details based on specific criteria is crucial for data analysis and reporting. This particular query showcases your ability to filter data by date and sort it effectively, ensuring that you can produce meaningful and organized insights from your data tables. Practice writing and running similar queries to enhance your SQL skills and be interview-ready.
29. Count Distinct Product Categories
Analyzing Primary Categories and Subcategories in a Products Table: A Detailed Guide
Objective
In this analysis, we aim to examine the products table to identify and count all primary categories that do not have a parent category. Additionally, we will calculate the total number of unique subcategories directly associated with these primary categories.
Key Points
- Primary Category: Defined as a category with no
parent_category. - Subcategory: A distinct category whose
parent_categorymatches a primary category. - Output: The analysis should yield two specific metrics:
total_root_categories: The count of distinct primary categories.total_subcategories: The count of unique subcategories linked to the primary categories.
Table Description
The products table consist of these essential columns:
id(integer): Unique identifier for each product.name(string): Name of the product.category(string): The category to which the product belongs.parent_category(string or null): The parent category of the product's category, if applicable.
Analysis Goals
- Identify primary categories with
parent_categoryasnull. - Determine the distinct subcategories that have their
parent_categoryset to a primary category.
This analysis provides valuable insights into the category hierarchy within the products table and supports better categorization strategies.
By performing this analysis, businesses can understand and organize their product categories more effectively, ensuring a well-structured product categorization system that enhances searchability and user experience.
Conclusion
The aim is to extract the total_root_categories and total_subcategories to comprehend the structure and hierarchy of the product categories. This process is pivotal in optimizing product categorization for better inventory management and improved customer navigation.
This detailed guide serves as a basis for extracting meaningful insights from the products table, ensuring accurate and efficient data analysis.
30. Join Three Tables
Objective
Construct a SQL query to retrieve a comprehensive list of customer orders. For each order, display the customer's name, the date the order was placed, the name of the product ordered, the quantity of the product, and the unit price. Ensure that the results are organized first by the customer's name in ascending order and then by the order date in ascending order.
Additional Information
- Tables and Schemas:
Customers
customer_id(INTEGER): Unique identifier for each customer.customer_name(VARCHAR): Name of the customer.email(VARCHAR): Email address of the customer.
Orders
order_id(INTEGER): Unique identifier for each order.customer_id(INTEGER): Identifier linking the order to a customer.order_date(DATE): Date when the order was placed.
OrderDetails
order_id(INTEGER): Identifier linking the detail to an order.product_name(VARCHAR): Name of the product ordered.quantity(INTEGER): Quantity of the product ordered.unit_price(DECIMAL): Price per unit of the product.
Constraints
- Each customer can have multiple orders.
- Each order can include multiple products.
- Ensure that all relevant records from the three tables are accurately joined based on their relationships.
Output Requirements
- Columns to include in the result:
customer_nameorder_dateproduct_namequantityunit_price
- The result should be sorted by
customer_name(ascending) and then byorder_date(ascending).
Example SQL Query
SELECT
c.customer_name,
o.order_date,
od.product_name,
od.quantity,
od.unit_price
FROM
Customers c
JOIN
Orders o ON c.customer_id = o.customer_id
JOIN
OrderDetails od ON o.order_id = od.order_id
ORDER BY
c.customer_name ASC,
o.order_date ASC;
Ready to Practice More?
Explore interview questions from specific companies or try our hands-on labs to build practical experience.