02-17-2026 09:57 AM
I'm building a fact table in Databricks where the grain is one row per case per 15-minute segment. For example, if a case lasts 90 minutes, I expect about 6 rows.
I generate the 15-minute segments and MERGE into a Delta table (incremental load). However, the final row count is higher than expected. When I group by case + segment type + 15-minute start time, I sometimes see duplicates for what should be a unique segment.
The source data doesn't have duplicates.
What are common causes of this kind of duplication? Join explosion? MERGE behavior? Timezone/DST issues?
02-18-2026 05:48 AM
Hi, have you got a sample of the code you're running to do this? I suspect it's something in the join logic between the 15-minute segment start time and the case start time itself, and how you're joining them. Are you sure the events themselves are unique, and that the 15-minute windows are unique too? If you share your code I'd be happy to help further.
02-19-2026 09:56 AM
Hi @emma_s, thank you for your reply. Attached is the sample code.
-- ============================================================
-- 15-Minute OR Utilization Fact Build (Sanitized)
-- Intended grain:
-- (ORLogKey, FifteenMinuteTimeSegmentType, SegmentStartDate)
-- ============================================================
-- ==========================
-- Base Case Data
-- ==========================
WITH BaseLogData AS (
SELECT
f.LogID AS ORLogKey,
f.InORDateTime AS InORDateTime,
f.OutORDateTime AS OutORDateTime,
f.ProcedureStartDateTime AS ProcedureStartDateTime,
f.ProcedureCompleteDateTime AS ProcedureCompleteDateTime,
d.ORKey AS ORKey,
s.SurgeonKey AS PrimarySurgeonKey,
p.ProcedureKey AS PrimaryProcedureKey
FROM OR_Log o
JOIN OR_Fact f
ON o.LogID = f.LogID
LEFT JOIN OR_Dimension d
ON o.RoomID = d.ORRoomID
LEFT JOIN Surgeon_Dimension s
ON f.PrimarySurgeonID = s.SurgeonID
LEFT JOIN Procedure_Dimension p
ON f.PrimaryProcedureID = p.ProcedureID
WHERE f.InORDateTime IS NOT NULL
AND f.OutORDateTime > f.InORDateTime
),
-- ==========================
-- 15-Minute Number Generator
-- ==========================
Numbers AS (
SELECT explode(sequence(0, 191)) AS n
),
-- ==========================
-- Generate 15-Minute Segments
-- ==========================
FannedSegments AS (
SELECT
b.ORLogKey,
b.ORKey,
b.PrimarySurgeonKey,
b.PrimaryProcedureKey,
b.InORDateTime,
b.OutORDateTime,
date_add(
MINUTE,
15 * n.n,
from_unixtime(
floor(unix_timestamp(b.InORDateTime) / 900) * 900
)
) AS SegmentStartDate,
'Room' AS FifteenMinuteTimeSegmentType
FROM BaseLogData b
JOIN Numbers n
ON date_add(
MINUTE,
15 * n.n,
from_unixtime(
floor(unix_timestamp(b.InORDateTime) / 900) * 900
)
) < b.OutORDateTime
),
-- ==========================
-- Join to Time Dimension
-- ==========================
StageData AS (
SELECT
s.ORLogKey,
s.ORKey,
s.PrimarySurgeonKey,
s.PrimaryProcedureKey,
s.SegmentStartDate,
s.FifteenMinuteTimeSegmentType,
t.TimeOfDayID
FROM FannedSegments s
LEFT JOIN TimeOfDay_Dimension t
ON hour(s.SegmentStartDate) = t.HourNumber
AND minute(s.SegmentStartDate) = t.MinuteNumber
)
-- ==========================
-- Validate Intended Grain
-- ==========================
SELECT
ORLogKey,
FifteenMinuteTimeSegmentType,
SegmentStartDate,
COUNT(*) AS cnt
FROM StageData
GROUP BY 1,2,3
HAVING COUNT(*) > 1;
-- ==========================
-- MERGE INTO FACT
-- ==========================
MERGE INTO OR_Utilization_Fact target
USING StageData source
ON target.ORLogKey = source.ORLogKey
AND target.SegmentStartDate = source.SegmentStartDate
AND target.FifteenMinuteTimeSegmentType = source.FifteenMinuteTimeSegmentType
WHEN MATCHED THEN UPDATE SET
target.ORKey = source.ORKey,
target.PrimarySurgeonKey = source.PrimarySurgeonKey,
target.PrimaryProcedureKey = source.PrimaryProcedureKey,
target.TimeOfDayID = source.TimeOfDayID
WHEN NOT MATCHED THEN INSERT (
ORLogKey,
ORKey,
PrimarySurgeonKey,
PrimaryProcedureKey,
SegmentStartDate,
FifteenMinuteTimeSegmentType,
TimeOfDayID
)
VALUES (
source.ORLogKey,
source.ORKey,
source.PrimarySurgeonKey,
source.PrimaryProcedureKey,
source.SegmentStartDate,
source.FifteenMinuteTimeSegmentType,
source.TimeOfDayID
);
02-18-2026 06:19 AM
MERGE can produce duplicates if the ON clause does not enforce the target grain. Validate that the ON condition uniquely identifies each row at the fact grain.
2 weeks ago
Hi @CuriousEngineer,
This is a common scenario, and there are good solutions. Nice job including the code and the minimal reproducible example -- that makes it much easier to help. There are two places in your query that are very likely introducing duplicate rows before the MERGE ever runs. Let me walk through them.
ROOT CAUSE 1: THE LEFT JOIN TO TIMEOFDAY_DIMENSION
This is the most probable source of your duplicates. In your StageData CTE you join like this:
LEFT JOIN TimeOfDay_Dimension t
ON hour(s.SegmentStartDate) = t.HourNumber
AND minute(s.SegmentStartDate) = t.MinuteNumber
If TimeOfDay_Dimension contains more than one row for the same (HourNumber, MinuteNumber) combination, every matching row fans out your segments. For example, if the dimension has two rows where HourNumber = 10 and MinuteNumber = 30, every segment at 10:30 doubles.
To verify, run this check:
SELECT HourNumber, MinuteNumber, COUNT(*) AS cnt
FROM TimeOfDay_Dimension
GROUP BY HourNumber, MinuteNumber
HAVING COUNT(*) > 1;
If that returns results, you have found your culprit. To fix it, either deduplicate the dimension table, or pick a single TimeOfDayID in the join:
LEFT JOIN (
SELECT HourNumber, MinuteNumber, MIN(TimeOfDayID) AS TimeOfDayID
FROM TimeOfDay_Dimension
GROUP BY HourNumber, MinuteNumber
) t
ON hour(s.SegmentStartDate) = t.HourNumber
AND minute(s.SegmentStartDate) = t.MinuteNumber
ROOT CAUSE 2: THE BASELOGDATA JOINS MAY FAN OUT ROWS
In your BaseLogData CTE you join OR_Log to OR_Fact, then LEFT JOIN to three dimension tables (OR_Dimension, Surgeon_Dimension, Procedure_Dimension). If any of those joins are one-to-many (for example, a RoomID maps to multiple rows in OR_Dimension), you will get duplicate base rows before the segment generation even starts. That duplication then gets multiplied by the number of 15-minute intervals.
Check each join with queries like:
SELECT ORRoomID, COUNT(*) AS cnt
FROM OR_Dimension
GROUP BY ORRoomID
HAVING COUNT(*) > 1;
SELECT SurgeonID, COUNT(*) AS cnt
FROM Surgeon_Dimension
GROUP BY SurgeonID
HAVING COUNT(*) > 1;
SELECT ProcedureID, COUNT(*) AS cnt
FROM Procedure_Dimension
GROUP BY ProcedureID
HAVING COUNT(*) > 1;
If any return rows, add deduplication or tighten the join conditions.
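If one of these checks does return rows and you cannot fix the dimension data right away, a stopgap is to pick one row per natural key inside the join itself. Here is a sketch for the OR_Dimension join, assuming ORRoomID is the intended natural key; the LoadDateTime ordering column is a placeholder, so substitute whatever versioning column your dimension actually has:
LEFT JOIN (
SELECT *
FROM OR_Dimension
-- keep one row per room; LoadDateTime is a hypothetical versioning column
QUALIFY ROW_NUMBER() OVER (
PARTITION BY ORRoomID
ORDER BY LoadDateTime DESC
) = 1
) d
ON o.RoomID = d.ORRoomID
QUALIFY is available in Databricks Runtime 10.0 and above; on older runtimes, wrap the ROW_NUMBER in a subquery and filter on it in an outer SELECT.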
A NOTE ON MERGE BEHAVIOR
Your MERGE ON clause looks correct for the intended grain (ORLogKey, FifteenMinuteTimeSegmentType, SegmentStartDate). However, MERGE INTO on Delta tables requires that the source data (your StageData) has at most one row per key combination. If your source has duplicates, the MERGE will either:
- Throw a DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE error (when duplicates match existing target rows), or
- Insert all duplicate rows when they hit the NOT MATCHED clause (first run or new segments), which is likely what you are seeing.
This means the duplication is created before the MERGE, and the MERGE faithfully inserts all the duplicate source rows on the first load. On subsequent runs it would error out when those duplicates try to match existing target rows.
See the MERGE INTO documentation for details: https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html
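To see those two behaviors concretely, here is a minimal sketch you can run in a scratch schema (the table name demo_target is made up for the demo):
CREATE OR REPLACE TABLE demo_target (k INT, v STRING) USING DELTA;
-- First run: both duplicate source rows hit NOT MATCHED and both get inserted.
MERGE INTO demo_target t
USING (SELECT 1 AS k, 'a' AS v UNION ALL SELECT 1 AS k, 'b' AS v) s
ON t.k = s.k
WHEN MATCHED THEN UPDATE SET t.v = s.v
WHEN NOT MATCHED THEN INSERT (k, v) VALUES (s.k, s.v);
-- demo_target now has two rows with k = 1.
-- A second run of the same MERGE: both source rows match both target rows,
-- and the statement fails with DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE.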
RECOMMENDED DEBUGGING STEPS
1. Run the grain validation query you already have (the GROUP BY / HAVING COUNT(*) > 1) on the StageData CTE output -- you are already doing this which is great.
2. Run it separately on the FannedSegments CTE (before the TimeOfDay join) to isolate whether duplication happens during segment generation or during the dimension join.
3. Run the dimension uniqueness checks shown above.
4. Once you find the source, either fix the dimension data or add deduplication. A simple safety net before MERGE is:
MERGE INTO OR_Utilization_Fact target
USING (
SELECT
ORLogKey,
ORKey,
PrimarySurgeonKey,
PrimaryProcedureKey,
SegmentStartDate,
FifteenMinuteTimeSegmentType,
TimeOfDayID
FROM StageData
-- QUALIFY (Databricks Runtime 10.0+) filters on the window function directly,
-- so only one row per key combination ever reaches the MERGE
QUALIFY ROW_NUMBER() OVER (
PARTITION BY ORLogKey, FifteenMinuteTimeSegmentType, SegmentStartDate
ORDER BY TimeOfDayID
) = 1
) source
ON target.ORLogKey = source.ORLogKey
AND target.SegmentStartDate = source.SegmentStartDate
AND target.FifteenMinuteTimeSegmentType = source.FifteenMinuteTimeSegmentType
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;
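To make step 2 concrete: swap the final SELECT of your build query for one of these checks, one CTE at a time. The first level that returns rows is where the duplication enters:
-- Is the base data already fanned out before segment generation? (should return no rows)
SELECT ORLogKey, COUNT(*) AS cnt
FROM BaseLogData
GROUP BY ORLogKey
HAVING COUNT(*) > 1;
-- Is the segment generation itself unique? (should return no rows)
SELECT ORLogKey, FifteenMinuteTimeSegmentType, SegmentStartDate, COUNT(*) AS cnt
FROM FannedSegments
GROUP BY 1, 2, 3
HAVING COUNT(*) > 1;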
A NOTE ON TIMEZONE / DST
You asked about timezone and DST issues. The floor(unix_timestamp(...) / 900) * 900 approach works on UTC epoch seconds, so it is DST-safe. However, if your source timestamps are stored as strings without timezone info and your cluster session timezone differs from the data timezone, unix_timestamp() could interpret them differently across DST boundaries. You can verify by running:
SELECT current_timezone();
And comparing that to the timezone your source datetimes represent. If there is a mismatch, explicitly cast with:
from_unixtime(floor(unix_timestamp(b.InORDateTime, 'yyyy-MM-dd HH:mm:ss') / 900) * 900)
Or use to_utc_timestamp / from_utc_timestamp to be explicit.
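If you want to see the session-timezone effect directly, this small sketch parses the same timestamp string under two session time zones; the resulting epoch values differ by the UTC offset of the second zone:
SET TIME ZONE 'UTC';
SELECT unix_timestamp('2026-02-17 09:00:00');
SET TIME ZONE 'America/Chicago';
SELECT unix_timestamp('2026-02-17 09:00:00');
-- The two results differ by 21600 seconds (UTC-6 in February), because
-- string-to-timestamp parsing in Spark uses the session time zone.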
SUMMARY
The most likely cause is a many-to-one issue in either the TimeOfDay_Dimension join or one of the dimension joins in BaseLogData. The diagnostic queries above should pinpoint it quickly. The MERGE itself is not creating duplicates -- it is receiving duplicates from the source query.
Hope this helps, and let us know what you find!
* This reply was drafted with an agent system I built, which researches and drafts responses from the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.