<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: exposing RAW files using read_files based views, partition discovery and skipping, performance i in Warehousing &amp; Analytics</title>
    <link>https://community.databricks.com/t5/warehousing-analytics/exposing-raw-files-using-read-files-based-views-partition/m-p/68483#M1312</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;Many thanks for your answer.&lt;BR /&gt;I would like to focus on the main problems at hand here. Please note that we are not discussing external tables based on CSV / XML files nor delta tables just the 'read_files' functionality.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;The main issue is the &lt;STRONG&gt;partition skipping&lt;/STRONG&gt; when using &lt;STRONG&gt;read_files.&lt;BR /&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;BR /&gt;&lt;/STRONG&gt;When reading the view, the partitions are discovered correctly, but &lt;STRONG&gt;NOT skipped&lt;/STRONG&gt;.&lt;BR /&gt;This naturally leads to suboptimal perormance.&lt;BR /&gt;All the folders in the location of the read_files query is pointing to, are in the spark partition&amp;nbsp; format:&amp;nbsp;&lt;SPAN&gt;`/data_feed/batch_id=&amp;lt;batch_id_value&amp;gt;/` i.e.:&lt;/SPAN&gt; `batch_id=eda7b3eb7011.2024-04-03T111458.255Z`&lt;/P&gt;&lt;P&gt;There is not much optimisation I can do on the query side, the following code is as simple as it gets:&lt;/P&gt;&lt;PRE&gt;select * from catalog.schema.view_csv_name
where batch_id = "eda7b3eb7011.2024-04-03T111458.255Z" ;&lt;/PRE&gt;&lt;P&gt;When looking at the query profile it seems that the process has stuck at the Optimizing query &amp;amp; pruning files.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="KrzysztofPrzyso_0-1715101700962.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/7473iE3EF0648BD1E66B3/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="KrzysztofPrzyso_0-1715101700962.png" alt="KrzysztofPrzyso_0-1715101700962.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I always cancel it after ~20 minutes.&lt;BR /&gt;Maybe it is a bug? Could you check with the product team if the partition skipping should be actually working here?&lt;/P&gt;</description>
    <pubDate>Tue, 07 May 2024 17:11:29 GMT</pubDate>
    <dc:creator>KrzysztofPrzyso</dc:creator>
    <dc:date>2024-05-07T17:11:29Z</dc:date>
    <item>
      <title>exposing RAW files using read_files based views, partition discovery and skipping, performance issue</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/exposing-raw-files-using-read-files-based-views-partition/m-p/67332#M1293</link>
      <description>&lt;P&gt;Hi,&lt;BR /&gt;As a formal requirement in my project I need to keep original, raw (mainly CSVs and XMLs) files on the lake. Later on they are being ingested into Delta format backed medallion stages, bronze, silver, gold etc.&lt;/P&gt;&lt;P&gt;From the audit, operations and discovery point of view it would make sense to expose these files to be consumed (view only) using SQL.&lt;/P&gt;&lt;P&gt;As my system is batch only, the raw files are in stored in hive like folders: `/data_feed/batch_id=&amp;lt;batch_id_value&amp;gt;/`&lt;/P&gt;&lt;P&gt;Partition schema discovery is supported as per this document:&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/read_files" target="_blank" rel="noopener"&gt;read_files table-valued function - Azure Databricks - Databricks SQL | Microsoft Learn&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I have created the view as follows:&lt;/P&gt;&lt;P&gt;CSV&lt;/P&gt;&lt;PRE&gt;create view catalog.schema.view_csv_name
as
select *, _metadata.* from read_files( 'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/', format =&amp;gt; 'csv', multiLine =&amp;gt; true, pathGlobfilter =&amp;gt; "*.csv");&lt;/PRE&gt;&lt;P&gt;XML&lt;/P&gt;&lt;PRE&gt;create view catalog.schema.view_xml_name
as
select *,  _metadata.* from read_files( 'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/', rowTag =&amp;gt; "tag:tag_name" );&lt;/PRE&gt;&lt;P&gt;Both of the above are working, but there is a performance issues. All using the serveless SQL warehouse.&lt;BR /&gt;The batch_id partition is correctly discovered and shown as column in the query result.&lt;/P&gt;&lt;P&gt;1. When creating a view, ALL folders (potentially terabytes of data) is being scanned. So, each change on deployment is taking a long time.&lt;/P&gt;&lt;P&gt;2. When querying the data using a typical scenario, retrieval data from a single folder the performance is not satisfactory. It seems that the typical spark PARTITION SKIPPING is NOT WORKING here. It seems that all the folders are being scanned.&lt;/P&gt;&lt;PRE&gt;select * from catalog.schema.view_csv_name
where batch_id = "eda7b3eb7011.2024-04-03T111458.255Z" ;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;It works quite well when pointing directly to the folder&lt;/P&gt;&lt;PRE&gt;select * from read_files( 'abfss://container@storageaccount.dfs.core.windows.net/provider_name/feed_name/batch_id=iso_country.2024-04-03T111458.255Z/', format =&amp;gt; 'csv', header =&amp;gt; true );&lt;/PRE&gt;&lt;P&gt;but of course it is unpractical to expose each folder as a separate view.&lt;BR /&gt;On the other hand sometime it would be to much to ask each Analyst to always input all the settings.&lt;/P&gt;&lt;P&gt;Please advise if I can make any changes to improve performance.&lt;/P&gt;&lt;P&gt;I have tried external tables, but currently XML is not supported there, and I really want to keep it read only.&lt;BR /&gt;&lt;SPAN&gt;&lt;BR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/25346"&gt;@Hubert-Dudek&lt;/a&gt; Any thoughts?&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 25 Apr 2024 15:21:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/exposing-raw-files-using-read-files-based-views-partition/m-p/67332#M1293</guid>
      <dc:creator>KrzysztofPrzyso</dc:creator>
      <dc:date>2024-04-25T15:21:03Z</dc:date>
    </item>
    <item>
      <title>Re: exposing RAW files using read_files based views, partition discovery and skipping, performance i</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/exposing-raw-files-using-read-files-based-views-partition/m-p/68483#M1312</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;Many thanks for your answer.&lt;BR /&gt;I would like to focus on the main problems at hand here. Please note that we are not discussing external tables based on CSV / XML files nor delta tables just the 'read_files' functionality.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;The main issue is the &lt;STRONG&gt;partition skipping&lt;/STRONG&gt; when using &lt;STRONG&gt;read_files.&lt;BR /&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;BR /&gt;&lt;/STRONG&gt;When reading the view, the partitions are discovered correctly, but &lt;STRONG&gt;NOT skipped&lt;/STRONG&gt;.&lt;BR /&gt;This naturally leads to suboptimal perormance.&lt;BR /&gt;All the folders in the location of the read_files query is pointing to, are in the spark partition&amp;nbsp; format:&amp;nbsp;&lt;SPAN&gt;`/data_feed/batch_id=&amp;lt;batch_id_value&amp;gt;/` i.e.:&lt;/SPAN&gt; `batch_id=eda7b3eb7011.2024-04-03T111458.255Z`&lt;/P&gt;&lt;P&gt;There is not much optimisation I can do on the query side, the following code is as simple as it gets:&lt;/P&gt;&lt;PRE&gt;select * from catalog.schema.view_csv_name
where batch_id = "eda7b3eb7011.2024-04-03T111458.255Z" ;&lt;/PRE&gt;&lt;P&gt;When looking at the query profile it seems that the process has stuck at the Optimizing query &amp;amp; pruning files.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="KrzysztofPrzyso_0-1715101700962.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/7473iE3EF0648BD1E66B3/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="KrzysztofPrzyso_0-1715101700962.png" alt="KrzysztofPrzyso_0-1715101700962.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I always cancel it after ~20 minutes.&lt;BR /&gt;Maybe it is a bug? Could you check with the product team if the partition skipping should be actually working here?&lt;/P&gt;</description>
      <pubDate>Tue, 07 May 2024 17:11:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/exposing-raw-files-using-read-files-based-views-partition/m-p/68483#M1312</guid>
      <dc:creator>KrzysztofPrzyso</dc:creator>
      <dc:date>2024-05-07T17:11:29Z</dc:date>
    </item>
    <item>
      <title>Re: exposing RAW files using read_files based views, partition discovery and skipping, performance i</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/exposing-raw-files-using-read-files-based-views-partition/m-p/98109#M1678</link>
      <description>&lt;P&gt;I am having the same experience when utilizing read_files on a hive-partitioned datalake of text files on Azure.&amp;nbsp; Oddly, the query profile reflects that the partition filters are being applied appropriately, and the total number of files/bytes read reflects that, but the "&lt;SPAN&gt;Optimizing query &amp;amp; pruning files" step jumps from milliseconds to minutes when using partition filters versus querying a subfolder directly, despite the fact that the file/byte count is the same.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 07 Nov 2024 16:26:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/exposing-raw-files-using-read-files-based-views-partition/m-p/98109#M1678</guid>
      <dc:creator>bobmclaren</dc:creator>
      <dc:date>2024-11-07T16:26:58Z</dc:date>
    </item>
    <item>
      <title>Re: exposing RAW files using read_files based views, partition discovery and skipping, performance i</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/exposing-raw-files-using-read-files-based-views-partition/m-p/98351#M1683</link>
      <description>&lt;P&gt;Many thanks for raising it&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/131511"&gt;@bobmclaren&lt;/a&gt;&amp;nbsp;.&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9" target="_blank"&gt;@Retired_mod&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;Do you have any updates on the above?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 11 Nov 2024 16:16:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/exposing-raw-files-using-read-files-based-views-partition/m-p/98351#M1683</guid>
      <dc:creator>KrzysztofPrzyso</dc:creator>
      <dc:date>2024-11-11T16:16:59Z</dc:date>
    </item>
  </channel>
</rss>

