<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How to find that given Parquet file got imported into Bronze Layer ? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-find-that-given-parquet-file-got-imported-into-bronze/m-p/67945#M33485</link>
    <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;We recently created a new Databricks project/solution based on the Medallion architecture, with Bronze, Silver, and Gold layer tables. For the Bronze layer we built a Delta Live Tables (DLT) pipeline. The source files are Parquet files in an ADLS location (an External Location). The DLT pipeline reads the Parquet files from this External Location and loads the data into the _RAW and _APPEND_RAW streaming tables.&lt;/P&gt;&lt;P&gt;We found that Parquet files keep arriving one after another at the External Location, but the Bronze job (the DLT pipeline), running in Continuous mode, is not loading data from them into the _RAW tables.&lt;/P&gt;&lt;P&gt;To check, I ran a row count on the _RAW table, as shown below, and found records only for the date on which we turned on the Bronze DLT pipeline (which has been running continuously since).&lt;/P&gt;&lt;P&gt;SELECT bronze_landing_date, Count(*)&lt;/P&gt;&lt;P&gt;FROM abc_raw&lt;/P&gt;&lt;P&gt;GROUP BY bronze_landing_date&lt;/P&gt;&lt;P&gt;The job has been running for the last 10 days, so we should get 10 rows, one per date, but I get only 1 row (the date on which the job started).&lt;/P&gt;&lt;P&gt;So I would like to know: how can we find out whether a given Parquet file was imported into the Bronze layer?&lt;/P&gt;&lt;P&gt;Also, is there anything we are missing in the settings of the Bronze DLT pipeline?&lt;/P&gt;&lt;P&gt;Any pointers would be greatly appreciated.&lt;/P&gt;</description>
    <pubDate>Thu, 02 May 2024 12:50:38 GMT</pubDate>
    <dc:creator>Devsql</dc:creator>
    <dc:date>2024-05-02T12:50:38Z</dc:date>
    <item>
      <title>How to find that given Parquet file got imported into Bronze Layer ?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-find-that-given-parquet-file-got-imported-into-bronze/m-p/67945#M33485</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;We recently created a new Databricks project/solution based on the Medallion architecture, with Bronze, Silver, and Gold layer tables. For the Bronze layer we built a Delta Live Tables (DLT) pipeline. The source files are Parquet files in an ADLS location (an External Location). The DLT pipeline reads the Parquet files from this External Location and loads the data into the _RAW and _APPEND_RAW streaming tables.&lt;/P&gt;&lt;P&gt;We found that Parquet files keep arriving one after another at the External Location, but the Bronze job (the DLT pipeline), running in Continuous mode, is not loading data from them into the _RAW tables.&lt;/P&gt;&lt;P&gt;To check, I ran a row count on the _RAW table, as shown below, and found records only for the date on which we turned on the Bronze DLT pipeline (which has been running continuously since).&lt;/P&gt;&lt;P&gt;SELECT bronze_landing_date, Count(*)&lt;/P&gt;&lt;P&gt;FROM abc_raw&lt;/P&gt;&lt;P&gt;GROUP BY bronze_landing_date&lt;/P&gt;&lt;P&gt;The job has been running for the last 10 days, so we should get 10 rows, one per date, but I get only 1 row (the date on which the job started).&lt;/P&gt;&lt;P&gt;So I would like to know: how can we find out whether a given Parquet file was imported into the Bronze layer?&lt;/P&gt;&lt;P&gt;Also, is there anything we are missing in the settings of the Bronze DLT pipeline?&lt;/P&gt;&lt;P&gt;Any pointers would be greatly appreciated.&lt;/P&gt;</description>
      <pubDate>Thu, 02 May 2024 12:50:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-find-that-given-parquet-file-got-imported-into-bronze/m-p/67945#M33485</guid>
      <dc:creator>Devsql</dc:creator>
      <dc:date>2024-05-02T12:50:38Z</dc:date>
    </item>
    <item>
      <title>Re: How to find that given Parquet file got imported into Bronze Layer ?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-find-that-given-parquet-file-got-imported-into-bronze/m-p/67979#M33503</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/104457"&gt;@Devsql&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;It appears that you are creating DLT bronze tables using a standard &lt;EM&gt;spark.read&lt;/EM&gt; operation. This may explain why the DLT table doesn't include "new files" during a REFRESH operation.&lt;/P&gt;
&lt;P&gt;For incremental ingestion of bronze layer data into your DLT pipeline and tables, we recommend using Auto Loader. You can find more information in the following documents:&lt;/P&gt;
&lt;P&gt;- DLT Update Modes (Full Refresh/Refresh): &lt;A href="https://docs.databricks.com/en/delta-live-tables/updates.html" target="_blank"&gt;https://docs.databricks.com/en/delta-live-tables/updates.html&lt;/A&gt;&lt;BR /&gt;- Auto Loader: &lt;A href="https://docs.databricks.com/en/ingestion/auto-loader/index.html#what-is-auto-loader" target="_blank"&gt;https://docs.databricks.com/en/ingestion/auto-loader/index.html#what-is-auto-loader&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 02 May 2024 23:05:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-find-that-given-parquet-file-got-imported-into-bronze/m-p/67979#M33503</guid>
      <dc:creator>raphaelblg</dc:creator>
      <dc:date>2024-05-02T23:05:35Z</dc:date>
    </item>
    <item>
      <title>Re: How to find that given Parquet file got imported into Bronze Layer ?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-find-that-given-parquet-file-got-imported-into-bronze/m-p/70063#M33975</link>
      <description>&lt;P&gt;Yes &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/97998"&gt;@raphaelblg&lt;/a&gt;, we are already using Auto Loader, and I understand that it will continuously import files from ADLS Gen2 into the Bronze layer database.&lt;/P&gt;&lt;P&gt;But I am still not clear on how to know whether a given Parquet file was imported into the Bronze layer.&lt;/P&gt;&lt;P&gt;Do we just have to assume, based on how Auto Loader works, that the file has been imported, or is there a mechanism to check this?&lt;/P&gt;&lt;P&gt;Any article on this would be helpful.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Devsql&lt;/P&gt;</description>
      <pubDate>Tue, 21 May 2024 06:33:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-find-that-given-parquet-file-got-imported-into-bronze/m-p/70063#M33975</guid>
      <dc:creator>Devsql</dc:creator>
      <dc:date>2024-05-21T06:33:14Z</dc:date>
    </item>
    <item>
      <title>Re: How to find that given Parquet file got imported into Bronze Layer ?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-find-that-given-parquet-file-got-imported-into-bronze/m-p/70132#M34010</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/104457"&gt;@Devsql&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Auto Loader discovers files using one of the&amp;nbsp;&lt;A href="https://docs.databricks.com/en/ingestion/auto-loader/file-detection-modes.html" target="_self"&gt;File Detection Modes&lt;/A&gt; and records the files it discovers in the stream's checkpoint. If you wish to examine the state of your checkpoint, you can use the &lt;A href="https://docs.databricks.com/en/sql/language-manual/functions/cloud_files_state.html" target="_self"&gt;cloud_files_state&lt;/A&gt; SQL function, which lists the files discovered by the Auto Loader stream.&lt;/P&gt;
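&lt;P&gt;As a rough sketch of how such a check might look, the Python helper below only builds a cloud_files_state query for one file; it does not connect to Databricks by itself. The table name my_catalog.bronze.abc_raw and the file name part-0001.parquet are hypothetical placeholders, and the path, discovery_time and commit_time columns are taken from the function's documentation, so their availability may depend on your runtime version.&lt;/P&gt;

```python
def build_ingestion_check_query(table_name: str, file_name: str) -> str:
    """Build a SQL query that lists Auto Loader state rows for one source file.

    table_name: fully qualified streaming table backed by Auto Loader.
    file_name: file name (or path suffix) to look for.
    """
    # Refuse quotes instead of trying to escape them inside the SQL literal.
    if "'" in file_name or '"' in file_name:
        raise ValueError("file_name must not contain quote characters")
    # Note: LIKE treats "_" and "%" in file_name as wildcards, so the
    # match can be slightly broader than an exact suffix comparison.
    return (
        "SELECT path, discovery_time, commit_time "
        f"FROM cloud_files_state(TABLE({table_name})) "
        f"WHERE path LIKE '%{file_name}'"
    )


# Hypothetical table and file names, used only for illustration.
query = build_ingestion_check_query("my_catalog.bronze.abc_raw", "part-0001.parquet")
print(query)
# In a Databricks notebook one could then run: spark.sql(query).show(truncate=False)
```

&lt;P&gt;If the query returns no row for the file, Auto Loader has presumably not discovered it yet. A row with an empty commit_time would suggest the file was discovered but not yet processed.&lt;/P&gt;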
&lt;P&gt;Auto Loader uses&amp;nbsp;&lt;A href="https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing" target="_self"&gt;Checkpointing&lt;/A&gt;&amp;nbsp;to maintain state, which preserves&amp;nbsp;&lt;A href="https://docs.databricks.com/en/structured-streaming/index.html#what-is-structured-streaming" target="_blank"&gt;exactly-once&lt;/A&gt;&amp;nbsp;&lt;SPAN&gt;processing guarantees throughout your Spark Structured Streaming query.&lt;BR /&gt;&lt;BR /&gt;I hope my answer is helpful to you. I've attempted to provide a comprehensive overview of Databricks Auto Loader's features.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 21 May 2024 14:27:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-find-that-given-parquet-file-got-imported-into-bronze/m-p/70132#M34010</guid>
      <dc:creator>raphaelblg</dc:creator>
      <dc:date>2024-05-21T14:27:30Z</dc:date>
    </item>
  </channel>
</rss>

