Scenario
A company tracks customer churn across three separate data sources: user accounts, activity logs, and exit surveys. The activity logs contain duplicate records that must be cleaned before combining.
Task
Write a Snowflake SQL query that:
- Deduplicates
{{ ref("activities") }} by removing rows with identical user_id, activity_date, and activity_type (keep one row per unique combination)
- Left joins the deduplicated activities and
{{ ref("exit_surveys") }} onto {{ ref("accounts") }} using user_id
- Returns all seven columns and sorts by
user_id ascending, then activity_date descending
Schema
accounts
| Column |
Type |
Description |
| user_id |
String |
Unique user identifier |
| account_created_date |
Date |
Date the account was created |
| location |
String |
User's city |
activities
| Column |
Type |
Description |
| user_id |
String |
User identifier |
| activity_date |
Date |
Date of the activity |
| activity_type |
String |
Type of activity performed |
exit_surveys
| Column |
Type |
Description |
| user_id |
String |
User identifier |
| exit_date |
Date |
Date the user churned |
| exit_reason |
String |
Reason given for leaving |
Example
accounts:
| user_id |
account_created_date |
location |
| A01 |
2024-01-10 |
Portland |
| A02 |
2024-01-18 |
Denver |
| A03 |
2024-02-03 |
Atlanta |
| A04 |
2024-02-20 |
Seattle |
| A05 |
2024-03-01 |
Miami |
activities (with duplicates):
| user_id |
activity_date |
activity_type |
| A01 |
2024-03-12 |
Login |
| A01 |
2024-03-12 |
Login |
| A02 |
2024-03-18 |
Export |
| A02 |
2024-03-18 |
Export |
| A03 |
2024-04-01 |
Deploy |
| A04 |
2024-04-10 |
Login |
| A01 |
2024-04-15 |
Deploy |
exit_surveys:
| user_id |
exit_date |
exit_reason |
| A01 |
2024-05-01 |
Switched provider |
| A03 |
2024-05-10 |
Too expensive |
| A04 |
2024-05-15 |
Missing features |
Expected Output:
| user_id |
account_created_date |
location |
activity_date |
activity_type |
exit_date |
exit_reason |
| A01 |
2024-01-10 |
Portland |
2024-04-15 |
Deploy |
2024-05-01 |
Switched provider |
| A01 |
2024-01-10 |
Portland |
2024-03-12 |
Login |
2024-05-01 |
Switched provider |
| A02 |
2024-01-18 |
Denver |
2024-03-18 |
Export |
NULL |
NULL |
| A03 |
2024-02-03 |
Atlanta |
2024-04-01 |
Deploy |
2024-05-10 |
Too expensive |
| A04 |
2024-02-20 |
Seattle |
2024-04-10 |
Login |
2024-05-15 |
Missing features |
| A05 |
2024-03-01 |
Miami |
NULL |
NULL |
NULL |
NULL |
Note: Duplicate activity rows are reduced to one per unique combination. A01 has two distinct activities so appears twice. A02 has no exit survey (NULLs). A05 has no activities and no exit survey (NULLs in both). Results are sorted by user_id ASC, then activity_date DESC.
WITH deduped_activities AS (
SELECT
user_id,
activity_date,
activity_type
FROM {{ ref("activities") }}
QUALIFY ROW_NUMBER() OVER (
PARTITION BY user_id, activity_date, activity_type
ORDER BY activity_date
) = 1
),
combined AS (
SELECT
a.user_id,
a.account_created_date,
a.location,
act.activity_date,
act.activity_type,
e.exit_date,
e.exit_reason
FROM {{ ref("accounts") }} a
LEFT JOIN deduped_activities act
ON a.user_id = act.user_id
LEFT JOIN {{ ref("exit_surveys") }} e
ON a.user_id = e.user_id
)
SELECT *
FROM combined
ORDER BY user_id ASC, activity_date DESC
Explanation
Step 1: Deduplicate activities with QUALIFY and ROW_NUMBER
SELECT
user_id,
activity_date,
activity_type
FROM {{ ref("activities") }}
QUALIFY ROW_NUMBER() OVER (
PARTITION BY user_id, activity_date, activity_type
ORDER BY activity_date
) = 1
Snowflake's QUALIFY clause filters on the result of a window function without needing a subquery or CTE wrapper. ROW_NUMBER() assigns a sequential number within each partition of identical rows. Keeping only = 1 removes all duplicates. This is equivalent to SELECT DISTINCT on those three columns, but QUALIFY with ROW_NUMBER is the standard Snowflake pattern for deduplication and extends naturally to cases where you need to pick a specific row (e.g., the most recent).
Step 2: LEFT JOIN all three sources
FROM {{ ref("accounts") }} a
LEFT JOIN deduped_activities act
ON a.user_id = act.user_id
LEFT JOIN {{ ref("exit_surveys") }} e
ON a.user_id = e.user_id
Using {{ ref("accounts") }} as the base table with LEFT JOIN ensures every account appears in the output. Users without any activity records will have NULLs in the activity columns, and users without an exit survey will have NULLs in the survey columns. If a user has multiple distinct activities after deduplication, the LEFT JOIN produces one output row per activity.
Step 3: Sort the result
ORDER BY user_id ASC, activity_date DESC
The primary sort groups all rows for the same user together. The secondary sort (descending on activity_date) places the most recent activity first within each user. NULL activity dates (for users with no activities) naturally sort last in descending order.
Alternative: Using SELECT DISTINCT instead of QUALIFY
WITH deduped_activities AS (
SELECT DISTINCT
user_id,
activity_date,
activity_type
FROM {{ ref("activities") }}
)
SELECT
a.user_id,
a.account_created_date,
a.location,
da.activity_date,
da.activity_type,
es.exit_date,
es.exit_reason
FROM {{ ref("accounts") }} a
LEFT JOIN deduped_activities da
ON a.user_id = da.user_id
LEFT JOIN {{ ref("exit_surveys") }} es
ON a.user_id = es.user_id
ORDER BY a.user_id ASC, da.activity_date DESC
For this particular case, SELECT DISTINCT produces the same result since we want exactly one row per unique combination and do not need to choose between non-duplicate columns. The QUALIFY ROW_NUMBER() approach is more flexible for real-world deduplication where rows differ in some columns.