<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Row count is higher than expected after breaking cases into 15-minute segments in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/row-count-is-higher-than-expected-after-breaking-cases-into-15/m-p/148636#M11445</link>
    <description>&lt;P&gt;I’m building a fact table in Databricks where the grain is one row per case per 15-minute segment. For example, if a case lasts 90 minutes, I expect about 6 rows.&lt;/P&gt;&lt;P&gt;I generate the 15-minute segments and MERGE into a Delta table (incremental load). However, the final row count is higher than expected. When I group by case + segment type + 15-minute start time, I sometimes see duplicates for what should be a unique segment.&lt;/P&gt;&lt;P&gt;The source data doesn’t have duplicates.&lt;/P&gt;&lt;P&gt;What are common causes of this kind of duplication? Join explosion? MERGE behavior? Timezone/DST issues?&lt;/P&gt;</description>
    <pubDate>Tue, 17 Feb 2026 17:57:00 GMT</pubDate>
    <dc:creator>CuriousEngineer</dc:creator>
    <dc:date>2026-02-17T17:57:00Z</dc:date>
    <item>
      <title>Row count is higher than expected after breaking cases into 15-minute segments</title>
      <link>https://community.databricks.com/t5/get-started-discussions/row-count-is-higher-than-expected-after-breaking-cases-into-15/m-p/148636#M11445</link>
      <description>&lt;P&gt;I’m building a fact table in Databricks where the grain is one row per case per 15-minute segment. For example, if a case lasts 90 minutes, I expect about 6 rows.&lt;/P&gt;&lt;P&gt;I generate the 15-minute segments and MERGE into a Delta table (incremental load). However, the final row count is higher than expected. When I group by case + segment type + 15-minute start time, I sometimes see duplicates for what should be a unique segment.&lt;/P&gt;&lt;P&gt;The source data doesn’t have duplicates.&lt;/P&gt;&lt;P&gt;What are common causes of this kind of duplication? Join explosion? MERGE behavior? Timezone/DST issues?&lt;/P&gt;</description>
      <pubDate>Tue, 17 Feb 2026 17:57:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/row-count-is-higher-than-expected-after-breaking-cases-into-15/m-p/148636#M11445</guid>
      <dc:creator>CuriousEngineer</dc:creator>
      <dc:date>2026-02-17T17:57:00Z</dc:date>
    </item>
    <item>
      <title>Re: Row count is higher than expected after breaking cases into 15-minute segments</title>
      <link>https://community.databricks.com/t5/get-started-discussions/row-count-is-higher-than-expected-after-breaking-cases-into-15/m-p/148700#M11451</link>
      <description>&lt;P&gt;Hi, Have you got a sample of the code you're running to do this? I suspect it's something in your join logic between the 15 minute start time segment and the startime itself and how you're joining it. Are you sure the events are unique themsevles and are the 15 minute windows unique too? If you share you're code I'd be happy to help further.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 18 Feb 2026 13:48:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/row-count-is-higher-than-expected-after-breaking-cases-into-15/m-p/148700#M11451</guid>
      <dc:creator>emma_s</dc:creator>
      <dc:date>2026-02-18T13:48:50Z</dc:date>
    </item>
    <item>
      <title>Re: Row count is higher than expected after breaking cases into 15-minute segments</title>
      <link>https://community.databricks.com/t5/get-started-discussions/row-count-is-higher-than-expected-after-breaking-cases-into-15/m-p/148701#M11452</link>
      <description>&lt;P&gt;&lt;SPAN&gt;MERGE causes duplicates if the ON clause does not enforce the target grain. Validate if ON uniquely identifies each row at the fact grain&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 18 Feb 2026 14:19:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/row-count-is-higher-than-expected-after-breaking-cases-into-15/m-p/148701#M11452</guid>
      <dc:creator>balajij8</dc:creator>
      <dc:date>2026-02-18T14:19:35Z</dc:date>
    </item>
    <item>
      <title>Re: Row count is higher than expected after breaking cases into 15-minute segments</title>
      <link>https://community.databricks.com/t5/get-started-discussions/row-count-is-higher-than-expected-after-breaking-cases-into-15/m-p/148842#M11453</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/176516"&gt;@emma_s&lt;/a&gt;&amp;nbsp; Thank you for your reply. Attached is the sample code.&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;-- ============================================================&lt;BR /&gt;-- 15-Minute OR Utilization Fact Build (Sanitized)&lt;BR /&gt;-- Intended grain:&lt;BR /&gt;-- (ORLogKey, FifteenMinuteTimeSegmentType, SegmentStartDate)&lt;BR /&gt;-- ============================================================&lt;/P&gt;&lt;P&gt;-- ==========================&lt;BR /&gt;-- Base Case Data&lt;BR /&gt;-- ==========================&lt;BR /&gt;WITH BaseLogData AS (&lt;BR /&gt;SELECT&lt;BR /&gt;f.LogID AS ORLogKey,&lt;BR /&gt;f.InORDateTime AS InORDateTime,&lt;BR /&gt;f.OutORDateTime AS OutORDateTime,&lt;BR /&gt;f.ProcedureStartDateTime AS ProcedureStartDateTime,&lt;BR /&gt;f.ProcedureCompleteDateTime AS ProcedureCompleteDateTime,&lt;BR /&gt;d.ORKey AS ORKey,&lt;BR /&gt;s.SurgeonKey AS PrimarySurgeonKey,&lt;BR /&gt;p.ProcedureKey AS PrimaryProcedureKey&lt;BR /&gt;FROM OR_Log o&lt;BR /&gt;JOIN OR_Fact f&lt;BR /&gt;ON o.LogID = f.LogID&lt;BR /&gt;LEFT JOIN OR_Dimension d&lt;BR /&gt;ON o.RoomID = d.ORRoomID&lt;BR /&gt;LEFT JOIN Surgeon_Dimension s&lt;BR /&gt;ON f.PrimarySurgeonID = s.SurgeonID&lt;BR /&gt;LEFT JOIN Procedure_Dimension p&lt;BR /&gt;ON f.PrimaryProcedureID = p.ProcedureID&lt;BR /&gt;WHERE f.InORDateTime IS NOT NULL&lt;BR /&gt;AND f.OutORDateTime &amp;gt; f.InORDateTime&lt;BR /&gt;),&lt;/P&gt;&lt;P&gt;-- ==========================&lt;BR /&gt;-- 15-Minute Number Generator&lt;BR /&gt;-- ==========================&lt;BR /&gt;Numbers AS (&lt;BR /&gt;SELECT explode(sequence(0, 191)) AS n&lt;BR /&gt;),&lt;/P&gt;&lt;P&gt;-- ==========================&lt;BR /&gt;-- Generate 15-Minute Segments&lt;BR /&gt;-- ==========================&lt;BR /&gt;FannedSegments AS (&lt;BR /&gt;SELECT&lt;BR /&gt;b.ORLogKey,&lt;BR /&gt;b.ORKey,&lt;BR /&gt;b.PrimarySurgeonKey,&lt;BR /&gt;b.PrimaryProcedureKey,&lt;BR /&gt;b.InORDateTime,&lt;BR /&gt;b.OutORDateTime,&lt;BR /&gt;date_add(&lt;BR /&gt;MINUTE,&lt;BR /&gt;15 * n.n,&lt;BR /&gt;from_unixtime(&lt;BR /&gt;floor(unix_timestamp(b.InORDateTime) / 900) * 900&lt;BR /&gt;)&lt;BR /&gt;) AS SegmentStartDate,&lt;BR /&gt;'Room' AS FifteenMinuteTimeSegmentType&lt;BR /&gt;FROM BaseLogData b&lt;BR /&gt;JOIN Numbers n&lt;BR /&gt;ON date_add(&lt;BR /&gt;MINUTE,&lt;BR /&gt;15 * n.n,&lt;BR /&gt;from_unixtime(&lt;BR /&gt;floor(unix_timestamp(b.InORDateTime) / 900) * 900&lt;BR /&gt;)&lt;BR /&gt;) &amp;lt; b.OutORDateTime&lt;BR /&gt;),&lt;/P&gt;&lt;P&gt;-- ==========================&lt;BR /&gt;-- Join to Time Dimension&lt;BR /&gt;-- ==========================&lt;BR /&gt;StageData AS (&lt;BR /&gt;SELECT&lt;BR /&gt;s.ORLogKey,&lt;BR /&gt;s.ORKey,&lt;BR /&gt;s.PrimarySurgeonKey,&lt;BR /&gt;s.PrimaryProcedureKey,&lt;BR /&gt;s.SegmentStartDate,&lt;BR /&gt;s.FifteenMinuteTimeSegmentType,&lt;BR /&gt;t.TimeOfDayID&lt;BR /&gt;FROM FannedSegments s&lt;BR /&gt;LEFT JOIN TimeOfDay_Dimension t&lt;BR /&gt;ON hour(s.SegmentStartDate) = t.HourNumber&lt;BR /&gt;AND minute(s.SegmentStartDate) = t.MinuteNumber&lt;BR /&gt;)&lt;/P&gt;&lt;P&gt;-- ==========================&lt;BR /&gt;-- Validate Intended Grain&lt;BR /&gt;-- ==========================&lt;BR /&gt;SELECT&lt;BR /&gt;ORLogKey,&lt;BR /&gt;FifteenMinuteTimeSegmentType,&lt;BR /&gt;SegmentStartDate,&lt;BR /&gt;COUNT(*) AS cnt&lt;BR /&gt;FROM StageData&lt;BR /&gt;GROUP BY 1,2,3&lt;BR /&gt;HAVING COUNT(*) &amp;gt; 1;&lt;/P&gt;&lt;P&gt;-- ==========================&lt;BR /&gt;-- MERGE INTO FACT&lt;BR /&gt;-- ==========================&lt;BR /&gt;MERGE INTO OR_Utilization_Fact target&lt;BR /&gt;USING StageData source&lt;BR /&gt;ON target.ORLogKey = source.ORLogKey&lt;BR /&gt;AND target.SegmentStartDate = source.SegmentStartDate&lt;BR /&gt;AND target.FifteenMinuteTimeSegmentType = source.FifteenMinuteTimeSegmentType&lt;/P&gt;&lt;P&gt;WHEN MATCHED THEN UPDATE SET&lt;BR /&gt;target.ORKey = source.ORKey,&lt;BR /&gt;target.PrimarySurgeonKey = source.PrimarySurgeonKey,&lt;BR /&gt;target.PrimaryProcedureKey = source.PrimaryProcedureKey,&lt;BR /&gt;target.TimeOfDayID = source.TimeOfDayID&lt;/P&gt;&lt;P&gt;WHEN NOT MATCHED THEN INSERT (&lt;BR /&gt;ORLogKey,&lt;BR /&gt;ORKey,&lt;BR /&gt;PrimarySurgeonKey,&lt;BR /&gt;PrimaryProcedureKey,&lt;BR /&gt;SegmentStartDate,&lt;BR /&gt;FifteenMinuteTimeSegmentType,&lt;BR /&gt;TimeOfDayID&lt;BR /&gt;)&lt;BR /&gt;VALUES (&lt;BR /&gt;source.ORLogKey,&lt;BR /&gt;source.ORKey,&lt;BR /&gt;source.PrimarySurgeonKey,&lt;BR /&gt;source.PrimaryProcedureKey,&lt;BR /&gt;source.SegmentStartDate,&lt;BR /&gt;source.FifteenMinuteTimeSegmentType,&lt;BR /&gt;source.TimeOfDayID&lt;BR /&gt;);&lt;/P&gt;</description>
      <pubDate>Thu, 19 Feb 2026 17:56:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/row-count-is-higher-than-expected-after-breaking-cases-into-15/m-p/148842#M11453</guid>
      <dc:creator>Curiousenginr</dc:creator>
      <dc:date>2026-02-19T17:56:16Z</dc:date>
    </item>
    <item>
      <title>Re: Row count is higher than expected after breaking cases into 15-minute segments</title>
      <link>https://community.databricks.com/t5/get-started-discussions/row-count-is-higher-than-expected-after-breaking-cases-into-15/m-p/150135#M11504</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/216479"&gt;@CuriousEngineer&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;This is a common scenario, and there are good solutions. Nice job including the code and the minimal reproducible example -- that makes it much easier to help. There are two places in your query that are very likely introducing duplicate rows before the MERGE ever runs. Let me walk through them.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;ROOT CAUSE 1: THE LEFT JOIN TO TIMEOFDAY_DIMENSION&lt;/P&gt;
&lt;P&gt;This is the most probable source of your duplicates. In your StageData CTE you join like this:&lt;/P&gt;
&lt;P&gt;LEFT JOIN TimeOfDay_Dimension t&lt;BR /&gt;ON hour(s.SegmentStartDate) = t.HourNumber&lt;BR /&gt;AND minute(s.SegmentStartDate) = t.MinuteNumber&lt;/P&gt;
&lt;P&gt;If timeofday_dimension contains more than one row for the same (HourNumber, MinuteNumber) combination, every matching row will fan out your segments. For example, if the dimension has two rows where HourNumber = 10 and MinuteNumber = 30, every segment at 10:30 doubles.&lt;/P&gt;
&lt;P&gt;To verify, run this check:&lt;/P&gt;
&lt;P&gt;SELECT HourNumber, MinuteNumber, COUNT(*) AS cnt&lt;BR /&gt;FROM TimeOfDay_Dimension&lt;BR /&gt;GROUP BY HourNumber, MinuteNumber&lt;BR /&gt;HAVING COUNT(*) &amp;gt; 1;&lt;/P&gt;
&lt;P&gt;If that returns results, you have found your culprit. To fix it, either deduplicate the dimension table, or pick a single TimeOfDayID in the join:&lt;/P&gt;
&lt;P&gt;LEFT JOIN (&lt;BR /&gt;SELECT HourNumber, MinuteNumber, MIN(TimeOfDayID) AS TimeOfDayID&lt;BR /&gt;FROM TimeOfDay_Dimension&lt;BR /&gt;GROUP BY HourNumber, MinuteNumber&lt;BR /&gt;) t&lt;BR /&gt;ON hour(s.SegmentStartDate) = t.HourNumber&lt;BR /&gt;AND minute(s.SegmentStartDate) = t.MinuteNumber&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;ROOT CAUSE 2: THE BASELOGDATA JOINS MAY FAN OUT ROWS&lt;/P&gt;
&lt;P&gt;In your BaseLogData CTE you join OR_Log to OR_Fact, then LEFT JOIN to three dimension tables (OR_Dimension, Surgeon_Dimension, Procedure_Dimension). If any of those joins are one-to-many (for example, a RoomID maps to multiple rows in OR_Dimension), you will get duplicate base rows before the segment generation even starts. That duplication then gets multiplied by the number of 15-minute intervals.&lt;/P&gt;
&lt;P&gt;Check each join with queries like:&lt;/P&gt;
&lt;P&gt;SELECT ORRoomID, COUNT(*) AS cnt&lt;BR /&gt;FROM OR_Dimension&lt;BR /&gt;GROUP BY ORRoomID&lt;BR /&gt;HAVING COUNT(*) &amp;gt; 1;&lt;/P&gt;
&lt;P&gt;SELECT SurgeonID, COUNT(*) AS cnt&lt;BR /&gt;FROM Surgeon_Dimension&lt;BR /&gt;GROUP BY SurgeonID&lt;BR /&gt;HAVING COUNT(*) &amp;gt; 1;&lt;/P&gt;
&lt;P&gt;SELECT ProcedureID, COUNT(*) AS cnt&lt;BR /&gt;FROM Procedure_Dimension&lt;BR /&gt;GROUP BY ProcedureID&lt;BR /&gt;HAVING COUNT(*) &amp;gt; 1;&lt;/P&gt;
&lt;P&gt;If any return rows, add deduplication or tighten the join conditions.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;A NOTE ON MERGE BEHAVIOR&lt;/P&gt;
&lt;P&gt;Your MERGE ON clause looks correct for the intended grain (ORLogKey, FifteenMinuteTimeSegmentType, SegmentStartDate). However, MERGE INTO on Delta tables requires that the source data (your StageData) has at most one row per key combination. If your source has duplicates, the MERGE will either:&lt;/P&gt;
&lt;P&gt;- Throw a DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE error (when duplicates match existing target rows), or&lt;BR /&gt;- Insert all duplicate rows when they hit the NOT MATCHED clause (first run or new segments), which is likely what you are seeing.&lt;/P&gt;
&lt;P&gt;This means the duplication is created before the MERGE, and the MERGE faithfully inserts all the duplicate source rows on the first load. On subsequent runs it would error out when those duplicates try to match existing target rows.&lt;/P&gt;
&lt;P&gt;See the MERGE INTO documentation for details: &lt;A href="https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html" target="_blank"&gt;https://docs.databricks.com/en/sql/language-manual/delta-merge-into.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;RECOMMENDED DEBUGGING STEPS&lt;/P&gt;
&lt;P&gt;1. Run the grain validation query you already have (the GROUP BY / HAVING COUNT(*) &amp;gt; 1) on the StageData CTE output -- you are already doing this which is great.&lt;/P&gt;
&lt;P&gt;2. Run it separately on the FannedSegments CTE (before the TimeOfDay join) to isolate whether duplication happens during segment generation or during the dimension join.&lt;/P&gt;
&lt;P&gt;3. Run the dimension uniqueness checks shown above.&lt;/P&gt;
&lt;P&gt;4. Once you find the source, either fix the dimension data or add deduplication. A simple safety net before MERGE is:&lt;/P&gt;
&lt;P&gt;MERGE INTO OR_Utilization_Fact target&lt;BR /&gt;USING (&lt;BR /&gt;SELECT&lt;BR /&gt;ORLogKey,&lt;BR /&gt;ORKey,&lt;BR /&gt;PrimarySurgeonKey,&lt;BR /&gt;PrimaryProcedureKey,&lt;BR /&gt;SegmentStartDate,&lt;BR /&gt;FifteenMinuteTimeSegmentType,&lt;BR /&gt;TimeOfDayID,&lt;BR /&gt;ROW_NUMBER() OVER (&lt;BR /&gt;PARTITION BY ORLogKey, FifteenMinuteTimeSegmentType, SegmentStartDate&lt;BR /&gt;ORDER BY TimeOfDayID&lt;BR /&gt;) AS rn&lt;BR /&gt;FROM StageData&lt;BR /&gt;) source&lt;BR /&gt;ON target.ORLogKey = source.ORLogKey&lt;BR /&gt;AND target.SegmentStartDate = source.SegmentStartDate&lt;BR /&gt;AND target.FifteenMinuteTimeSegmentType = source.FifteenMinuteTimeSegmentType&lt;BR /&gt;AND source.rn = 1&lt;BR /&gt;WHEN MATCHED THEN UPDATE SET ...&lt;BR /&gt;WHEN NOT MATCHED AND source.rn = 1 THEN INSERT ...;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;A NOTE ON TIMEZONE / DST&lt;/P&gt;
&lt;P&gt;You asked about timezone and DST issues. The floor(unix_timestamp(...) / 900) * 900 approach works on UTC epoch seconds, so it is DST-safe. However, if your source timestamps are stored as strings without timezone info and your cluster session timezone differs from the data timezone, unix_timestamp() could interpret them differently across DST boundaries. You can verify by running:&lt;/P&gt;
&lt;P&gt;SELECT current_timezone();&lt;/P&gt;
&lt;P&gt;And comparing that to the timezone your source datetimes represent. If there is a mismatch, explicitly cast with:&lt;/P&gt;
&lt;P&gt;from_unixtime(floor(unix_timestamp(b.InORDateTime, 'yyyy-MM-dd HH:mm:ss') / 900) * 900)&lt;/P&gt;
&lt;P&gt;Or use to_utc_timestamp / from_utc_timestamp to be explicit.&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;SUMMARY&lt;/P&gt;
&lt;P&gt;The most likely cause is a many-to-one issue in either the TimeOfDay_Dimension join or one of the dimension joins in BaseLogData. The diagnostic queries above should pinpoint it quickly. The MERGE itself is not creating duplicates -- it is receiving duplicates from the source query.&lt;/P&gt;
&lt;P&gt;Hope this helps, and let us know what you find!&lt;/P&gt;
&lt;P&gt;* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.&lt;/P&gt;</description>
      <pubDate>Sun, 08 Mar 2026 04:43:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/row-count-is-higher-than-expected-after-breaking-cases-into-15/m-p/150135#M11504</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-08T04:43:20Z</dc:date>
    </item>
  </channel>
</rss>

