<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Perform row_number() filter in autoloader in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/perform-row-number-filter-in-autoloader/m-p/98184#M39636</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/131538"&gt;@hkmodi&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Basically, as&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/79106"&gt;@daniel_sahal&lt;/a&gt;&amp;nbsp;said, the bronze layer should reflect the source system. The silver layer is dedicated to deduplication/cleaning/enrichment of the dataset. If you still need to deduplicate at the bronze layer, you have 2 options:&lt;BR /&gt;- use a merge statement (but you said that you cannot use merge)&lt;/P&gt;&lt;P&gt;- after the bronze table is loaded, run a process that deduplicates the table (I don't like this approach ;))&lt;/P&gt;</description>
    <pubDate>Fri, 08 Nov 2024 12:18:57 GMT</pubDate>
    <dc:creator>szymon_dybczak</dc:creator>
    <dc:date>2024-11-08T12:18:57Z</dc:date>
    <item>
      <title>Perform row_number() filter in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/perform-row-number-filter-in-autoloader/m-p/98131#M39617</link>
      <description>&lt;P&gt;I have created an Auto Loader job that reads data from S3 (files with no extension) containing JSON, using (cloudFiles.format, text). This job is supposed to run every 4 hours and read all the new data that has arrived. But before writing into a Delta table, I want to make sure only one occurrence of each id, the one with the max timestamp, is written into the Delta table.&lt;BR /&gt;&lt;BR /&gt;I am doing this with row_number() == 1 in a micro batch, but it gives me the max-timestamp occurrence within each micro batch, not across all the data read when the script runs.&lt;BR /&gt;&lt;BR /&gt;How do I tackle this? My dataset is huge, so I can't use maxFiles or maxBytes. Also, I need to append, not merge, into the final Delta table.&lt;/P&gt;</description>
      <pubDate>Thu, 07 Nov 2024 19:28:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/perform-row-number-filter-in-autoloader/m-p/98131#M39617</guid>
      <dc:creator>hkmodi</dc:creator>
      <dc:date>2024-11-07T19:28:54Z</dc:date>
    </item>
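    <!--
    The deduplication rule the original poster describes (keep exactly one row per id, the one with the max timestamp) can be sketched in plain Python; field names `id` and `ts` are hypothetical stand-ins. In Spark this corresponds to filtering row_number() over a window partitioned by id and ordered by timestamp descending to 1, and, as the replies below note, inside foreachBatch it only deduplicates within a single micro batch.

    ```python
    # Plain-Python sketch of the rule discussed in this thread: keep one
    # record per id, the one with the greatest timestamp. The Spark analogue is
    #   row_number().over(Window.partitionBy("id").orderBy(col("ts").desc())) == 1
    # Hypothetical field names: "id", "ts".

    def dedupe_latest(records):
        """Return one record per id, keeping the max-timestamp occurrence."""
        latest = {}
        for rec in records:
            kept = latest.get(rec["id"])
            if kept is None or rec["ts"] > kept["ts"]:
                latest[rec["id"]] = rec
        return list(latest.values())

    batch = [
        {"id": 1, "ts": 10, "v": "a"},
        {"id": 1, "ts": 30, "v": "b"},
        {"id": 2, "ts": 20, "v": "c"},
    ]
    print(dedupe_latest(batch))
    ```

    Applied per micro batch, this reproduces the behavior the poster complains about: each batch is internally deduplicated, but an id arriving in two different batches is appended twice.
    -->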
    <item>
      <title>Re: Perform row_number() filter in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/perform-row-number-filter-in-autoloader/m-p/98170#M39627</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/131538"&gt;@hkmodi&lt;/a&gt;&lt;BR /&gt;That's what the Medallion Architecture was designed for.&lt;BR /&gt;Basically, you load everything you get into the Bronze layer (meaning that you'll have duplicates in there), then do the deduplication when loading the data into the Silver layer.&lt;/P&gt;</description>
      <pubDate>Fri, 08 Nov 2024 07:02:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/perform-row-number-filter-in-autoloader/m-p/98170#M39627</guid>
      <dc:creator>daniel_sahal</dc:creator>
      <dc:date>2024-11-08T07:02:40Z</dc:date>
    </item>
    <item>
      <title>Re: Perform row_number() filter in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/perform-row-number-filter-in-autoloader/m-p/98184#M39636</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/131538"&gt;@hkmodi&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Basically, as&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/79106"&gt;@daniel_sahal&lt;/a&gt;&amp;nbsp;said, the bronze layer should reflect the source system. The silver layer is dedicated to deduplication/cleaning/enrichment of the dataset. If you still need to deduplicate at the bronze layer, you have 2 options:&lt;BR /&gt;- use a merge statement (but you said that you cannot use merge)&lt;/P&gt;&lt;P&gt;- after the bronze table is loaded, run a process that deduplicates the table (I don't like this approach ;))&lt;/P&gt;</description>
      <pubDate>Fri, 08 Nov 2024 12:18:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/perform-row-number-filter-in-autoloader/m-p/98184#M39636</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-11-08T12:18:57Z</dc:date>
    </item>
    <item>
      <title>Re: Perform row_number() filter in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/perform-row-number-filter-in-autoloader/m-p/98197#M39641</link>
      <description>&lt;P&gt;I did have this approach in mind, but every 4 hours of data I get is equivalent to 1.2 TB.&lt;BR /&gt;I want to avoid writing all of that into a table and then performing deduplication.&lt;BR /&gt;&lt;BR /&gt;Is there a way I can create a temp view and do deduplication before writing that into a table?&lt;/P&gt;</description>
      <pubDate>Fri, 08 Nov 2024 16:30:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/perform-row-number-filter-in-autoloader/m-p/98197#M39641</guid>
      <dc:creator>hkmodi</dc:creator>
      <dc:date>2024-11-08T16:30:41Z</dc:date>
    </item>
  </channel>
</rss>

