<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Writing to Delta Table and retrieving back the IDs doesn't work in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/writing-to-delta-table-and-retrieving-back-the-ids-doesn-t-work/m-p/113784#M44632</link>
    <description>&lt;P&gt;I'm interested in this problem too. Could someone help?&lt;/P&gt;</description>
    <pubDate>Thu, 27 Mar 2025 08:02:35 GMT</pubDate>
    <dc:creator>jeremy98</dc:creator>
    <dc:date>2025-03-27T08:02:35Z</dc:date>
    <item>
      <title>Writing to Delta Table and retrieving back the IDs doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-to-delta-table-and-retrieving-back-the-ids-doesn-t-work/m-p/113697#M44614</link>
      <description>&lt;P&gt;Hi.&lt;/P&gt;&lt;P&gt;I have a workflow in which I write a few rows into a Delta table with auto-generated IDs. Just after they're written, I need to read the table back to collect those generated IDs, so I join on two columns of the "input" DataFrame we inserted (one is a client_id, the other a timestamp).&lt;/P&gt;&lt;P&gt;The problem is that the read sometimes retrieves the rows, but not always. I ran some smoke tests in an interactive notebook and was never able to reproduce what happens in the workflow.&lt;/P&gt;&lt;P&gt;Here is a screenshot of the Spark UI showing the operations (saveAsTable and the two subsequent collects) and their submit times.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="dcrnc_4-1743006179065.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/15625iB9F01C555B7ECE85/image-size/large?v=v2&amp;amp;px=999" role="button" title="dcrnc_4-1743006179065.png" alt="dcrnc_4-1743006179065.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I ran many different tests in both the workflow and the interactive notebook, still with no conclusive result. I suspected something related to the timestamp and/or the timezone, but everything looked fine to me. I also tried converting the values to Unix timestamps, and the issue still occurs.&lt;/P&gt;&lt;P&gt;Please, could you help me?&lt;/P&gt;&lt;P&gt;Thank you.&lt;BR /&gt;Best Regards.&lt;/P&gt;</description>
      <pubDate>Wed, 26 Mar 2025 16:30:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-to-delta-table-and-retrieving-back-the-ids-doesn-t-work/m-p/113697#M44614</guid>
      <dc:creator>dc-rnc</dc:creator>
      <dc:date>2025-03-26T16:30:01Z</dc:date>
    </item>
    <item>
      <title>Re: Writing to Delta Table and retrieving back the IDs doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-to-delta-table-and-retrieving-back-the-ids-doesn-t-work/m-p/113784#M44632</link>
      <description>&lt;P&gt;I'm interested in this problem too. Could someone help?&lt;/P&gt;</description>
      <pubDate>Thu, 27 Mar 2025 08:02:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-to-delta-table-and-retrieving-back-the-ids-doesn-t-work/m-p/113784#M44632</guid>
      <dc:creator>jeremy98</dc:creator>
      <dc:date>2025-03-27T08:02:35Z</dc:date>
    </item>
    <item>
      <title>Re: Writing to Delta Table and retrieving back the IDs doesn't work</title>
      <link>https://community.databricks.com/t5/data-engineering/writing-to-delta-table-and-retrieving-back-the-ids-doesn-t-work/m-p/136636#M50621</link>
      <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Your workflow issue—writing to a Delta Table, immediately reading back using a join on client_id and timestamp, but sometimes missing rows—suggests a subtle problem, likely related to timing, consistency, or column precision between your input DataFrame and the data as written to Delta. Here’s a thorough breakdown of what could be happening and key steps to resolve it:&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Possible Root Causes&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Write and Read Consistency&lt;/STRONG&gt;&lt;BR /&gt;Delta Lake operations are ACID compliant, but there can be a delay between when writes are "committed" and when subsequent reads see the new data, especially if caching or transaction logs are involved.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Timestamp Precision Issues&lt;/STRONG&gt;&lt;BR /&gt;Timestamps may be rounded or truncated during write operations (e.g., microseconds lost), causing join conditions to fail if you use exact matches. Delta defaults to millisecond precision, but Spark DataFrames may handle timestamps differently or keep more precision than Delta persists.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Time Zone Mismatches&lt;/STRONG&gt;&lt;BR /&gt;If your environment has any timezone mismatches, either during writing or reading, this will break equality joins. Even with explicit conversion to Unix timestamp, subtle TZ issues can persist if conversion functions use different base timezones.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Spark Caching&lt;/STRONG&gt;&lt;BR /&gt;If you're reading from a Delta Table that was cached earlier in the session, you might fetch stale data. Unpersist the cache before reading after a write.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Write-to-Read Race Condition&lt;/STRONG&gt;&lt;BR /&gt;Double-check that your "collect" operation to read back is not happening before the Delta transaction is fully committed. Spark job scheduling can sometimes submit read before write is done, especially in parallel workflows.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Troubleshooting Steps&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Validate Transaction Commit&lt;/STRONG&gt;&lt;BR /&gt;Make sure to only trigger the read-back after the write operation returns and is committed. In workflows, use proper job dependencies/sequencing.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Check Timestamp Rounding&lt;/STRONG&gt;&lt;BR /&gt;Inspect the timestamps in Spark DataFrame vs those in Delta Table with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;.show()&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;on both, paying attention to sub-second digits. If there are differences, join with a tolerance range on timestamp instead of exact equality:&lt;/P&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;text&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;abs(df.timestamp - deltaTable.timestamp) &amp;lt; 1 second
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Unpersist Caches&lt;/STRONG&gt;&lt;BR /&gt;Run&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;spark.catalog.clearCache()&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;before reading after every write.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Audit Timezones&lt;/STRONG&gt;&lt;BR /&gt;Set timezone for Spark explicitly:&lt;/P&gt;
&lt;DIV class="w-full md:max-w-[90vw]"&gt;
&lt;DIV class="codeWrapper text-light selection:text-super selection:bg-super/10 my-md relative flex flex-col rounded font-mono text-sm font-normal bg-subtler"&gt;
&lt;DIV class="translate-y-xs -translate-x-xs bottom-xl mb-xl flex h-0 items-start justify-end md:sticky md:top-[100px]"&gt;
&lt;DIV class="overflow-hidden rounded-full border-subtlest ring-subtlest divide-subtlest bg-base"&gt;
&lt;DIV class="border-subtlest ring-subtlest divide-subtlest bg-subtler"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV class="-mt-xl"&gt;
&lt;DIV&gt;
&lt;DIV class="text-quiet bg-subtle py-xs px-sm inline-block rounded-br rounded-tl-[3px] font-thin" data-testid="code-language-indicator"&gt;text&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;SPAN&gt;&lt;CODE&gt;spark.conf.set("spark.sql.session.timeZone", "UTC")
&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Do this for both writing and reading, then compare timestamps in UTC.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Compare Raw Data&lt;/STRONG&gt;&lt;BR /&gt;After writing, query the Delta Table directly (e.g., with SQL) alongside your insert DataFrame, to see what got persisted.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Row Hashing for Join&lt;/STRONG&gt;&lt;BR /&gt;Instead of joining on timestamp, create a stable hash on (client_id, timestamp)&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;EM&gt;before&lt;/EM&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;the write and use it for post-write lookup.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Recommendations&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Use a surrogate, auto-generated ID (from Delta or Spark) for reliable joining, not just (client_id, timestamp).&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Avoid using floating point or timestamp columns for exact joins, switch to integer (unix timestamps) or use a range.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Always synchronize workflow steps to ensure all writes are durable before reads.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Check Spark job/submission times in Spark UI to ensure that read operations are scheduled after writes, not immediately in parallel.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
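On the surrogate-ID recommendation: Delta supports identity columns declared at table creation. A DDL sketch, with illustrative table and column names (not from the original workflow):

```sql
-- Illustrative DDL; table and column names are assumptions.
-- GENERATED ALWAYS AS IDENTITY asks Delta to assign the surrogate key itself.
CREATE TABLE events (
  id BIGINT GENERATED ALWAYS AS IDENTITY,
  client_id STRING,
  event_ts TIMESTAMP
) USING DELTA;
```

Rows inserted without an explicit id receive a generated value; note that identity values are guaranteed unique but not necessarily consecutive.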
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Next Actions&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Double-check the exact SQL/dataframe join key logic for equality and type matching.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Review Spark logs for “cache” and transaction commit timing.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Try a join with a ±1 second window on timestamps and compare results.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Wed, 29 Oct 2025 20:39:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/writing-to-delta-table-and-retrieving-back-the-ids-doesn-t-work/m-p/136636#M50621</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-10-29T20:39:00Z</dc:date>
    </item>
  </channel>
</rss>

