Clients Transaction Data
Beginner Mode

Start your terminal to use beginner mode.

Objective

An accounting firm handles transactional data from various clients. The firm receives the data in two separate DataFrames: df_transactions contains the financial events, and df_clients contains the client metadata.

Task

It is observed that there are a number of duplicated and incorrect primary keys in both DataFrames, as well as improperly formatted dates. Write a PySpark function that combines the DataFrames while strictly enforcing the following data quality rules:

  1. A valid TransactionID in df_transactions is an integer value greater than 0, and each TransactionID should be unique. Any row with an invalid or duplicate TransactionID should be dropped.

  2. A valid ClientID in both DataFrames is an integer value greater than 0, and each ClientID in df_clients should be unique. Any row with an invalid or duplicate ClientID in df_clients should be dropped.

  3. df_transactions should only include rows with ClientIDs that exist in the cleaned df_clients DataFrame.

  4. The Date column should be strictly in the format yyyy-mm-dd. Rows with an invalid string format (like 2023-7-1 or 05/12/2023) must be dropped.

Save your resulting DataFrame as result_df. Ensure the output matches the exact schema order requested, cast the Amount column to an Integer, and order the final output by TransactionID in ascending order.

File Path

  • Transactions Dataset: /home/interview/df_transactions.csv
  • Clients Dataset: /home/interview/df_clients.csv
  • Starter script: /home/interview/clean_transactions.py

Schema

df_transactions.csv

Column Name Data Type
TransactionID Integer
ClientID Integer
Date String
Amount Float

df_clients.csv

Column Name Data Type
ClientID Integer
ClientName String
Industry String

Expected Output Schema

Column Name Data Type
TransactionID Integer
ClientID Integer
Date String
Amount Integer
ClientName String
Industry String

Example

Given this sample input:

df_transactions

TransactionID ClientID Date Amount
1 1 2023-07-01 100.0
2 1 2023-07-02 150.0
2 2 2023-07-01 200.0
3 3 2023-07-03 250.0
-4 4 2023-07-04 300.0
5 2 2023-25-01 350.0
6 1 2023-07-02 400.0
7 6 2023-07-01 450.0
8 7 2023-07-03 500.0
9 -8 2023-07-04 550.0

df_clients

ClientID ClientName Industry
1 Client1 Tech
2 Client2 Finance
3 Client3 Real Estate
4 Client4 Healthcare
5 Client5 Tech
1 Client6 Finance
6 Client7 Real Estate
-7 Client8 Healthcare
8 Client9 Tech
2 Client10 Finance

The expected output would be:

TransactionID ClientID Date Amount ClientName Industry
1 1 2023-07-01 100 Client1 Tech
2 1 2023-07-02 150 Client1 Tech
3 3 2023-07-03 250 Client3 Real Estate
5 2 2023-25-01 350 Client2 Finance
6 1 2023-07-02 400 Client1 Tech
7 6 2023-07-01 450 Client7 Real Estate

Terminal requires a larger screen

Open this page on a desktop or tablet (≥ 768px) to launch the terminal and practice hands-on.

Linux Terminal Environment

Write and execute your solution in the terminal below.

Sign In

Track

Question Difficulty Company Access
Need more practice in this area? Explore more questions →