<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Performance issue with Spark SQL when working with data from Unity Catalog in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114267#M44768</link>
    <description>&lt;P&gt;Yes, you are right: for the file storage buckets created by DLT, the number of files is a bit larger than in the temporary buckets I created. But the total number of files in the data storage buckets managed by Unity Catalog is only about 100.&lt;BR /&gt;I don't understand why, with so few files, reading the data with Spark is still very slow.&lt;/P&gt;</description>
    <pubDate>Wed, 02 Apr 2025 09:26:42 GMT</pubDate>
    <dc:creator>minhhung0507</dc:creator>
    <dc:date>2025-04-02T09:26:42Z</dc:date>
    <item>
      <title>Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114186#M44740</link>
      <description>&lt;P&gt;We're encountering a performance issue with Spark SQL when working with data from Unity Catalog. Specifically, when I use Spark to read data from a Unity Catalog partition created by DLT and then create a view, retrieval on the executors is very slow. However, if I clone the data outside of the Unity Catalog partition (still managed by DLT), the performance improves dramatically.&lt;/P&gt;&lt;P&gt;Has anyone seen similar behavior, or can anyone shed some light on what might be causing the discrepancy? Could it be related to metadata overhead, permission handling, or something else specific to how Unity Catalog manages partitions? Any insights or troubleshooting tips would be greatly appreciated.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Apr 2025 14:30:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114186#M44740</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-01T14:30:33Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114247#M44759</link>
      <description>&lt;P&gt;I have not yet noticed a slowdown due to Unity Catalog itself.&lt;BR /&gt;I did, however, see terrible performance using a shared-mode cluster.&lt;BR /&gt;What you can check is how the data is physically stored: perhaps this partition is skewed or written as many small files.&lt;BR /&gt;Running OPTIMIZE on the Delta table could also help.&lt;BR /&gt;I never use DLT because of the lack of control and visibility, but that is of course my own opinion.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 07:50:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114247#M44759</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-04-02T07:50:29Z</dc:date>
    </item>
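    <!-- The physical-layout check and OPTIMIZE advice above can be sketched in Spark SQL; the table name is a hypothetical placeholder:

    ```sql
    -- Inspect the physical layout first: numFiles and sizeInBytes together
    -- reveal a small-files problem (many files with a tiny average size).
    DESCRIBE DETAIL my_catalog.my_schema.my_table;  -- hypothetical table name

    -- Compact small files into larger ones (Delta Lake):
    OPTIMIZE my_catalog.my_schema.my_table;

    -- Re-run the detail check to verify numFiles dropped:
    DESCRIBE DETAIL my_catalog.my_schema.my_table;
    ```

    OPTIMIZE rewrites small files into larger ones without changing table contents, so it is safe to run on a live table. -->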
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114255#M44761</link>
      <description>&lt;P&gt;Is your dataset large, with many partitions?&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Use the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;DESCRIBE DETAIL&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;command to inspect the metadata of the Unity Catalog table and the cloned table. Compare the number of partitions and other metadata details.&lt;/P&gt;&lt;PRE&gt;DESCRIBE DETAIL &amp;lt;unity_catalog_table&amp;gt;;&lt;BR /&gt;DESCRIBE DETAIL &amp;lt;cloned_table&amp;gt;;&lt;/PRE&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;If the Unity Catalog table has significantly more partitions or more complex metadata, this could explain the performance difference.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;and&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Compare the query plans for the Unity Catalog table and the cloned table using the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;EXPLAIN&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;command:&lt;/P&gt;&lt;PRE&gt;EXPLAIN EXTENDED SELECT * FROM &amp;lt;unity_catalog_table&amp;gt;;&lt;BR /&gt;EXPLAIN EXTENDED SELECT * FROM &amp;lt;cloned_table&amp;gt;;&lt;/PRE&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Look for differences in the query plans, such as additional metadata operations or a lack of partition pruning.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Wed, 02 Apr 2025 08:15:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114255#M44761</guid>
      <dc:creator>saurabh18cs</dc:creator>
      <dc:date>2025-04-02T08:15:45Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114262#M44765</link>
      <description>&lt;P&gt;Let me clarify a bit: the tables are created and stored in a Google Cloud Storage bucket. I use Spark to read data directly from those partitions, but reads are still very slow compared to a clone in another partition not managed by Unity Catalog.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 09:08:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114262#M44765</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-02T09:08:42Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114264#M44767</link>
      <description>&lt;P&gt;That is possible; that is why I mentioned looking at the physical files.&lt;BR /&gt;If the original partition consists of 200 small files and the clone of 1 or 4 bigger files, that is a huge difference.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 09:15:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114264#M44767</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-04-02T09:15:26Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114267#M44768</link>
      <description>&lt;P&gt;Yes, you are right: for the file storage buckets created by DLT, the number of files is a bit larger than in the temporary buckets I created. But the total number of files in the data storage buckets managed by Unity Catalog is only about 100.&lt;BR /&gt;I don't understand why, with so few files, reading the data with Spark is still very slow.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 09:26:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114267#M44768</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-02T09:26:42Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114279#M44771</link>
      <description>&lt;P&gt;Here you can find some info.&lt;BR /&gt;Basically it boils down to: reading a file = overhead.&lt;BR /&gt;You want to minimize overhead without stressing the workers too much (gigantic partitions).&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/globant/how-to-solve-a-large-number-of-small-files-problem-in-spark-21f819eb36d3" target="_blank"&gt;https://medium.com/globant/how-to-solve-a-large-number-of-small-files-problem-in-spark-21f819eb36d3&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 10:17:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114279#M44771</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-04-02T10:17:11Z</dc:date>
    </item>
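    <!-- One way to reduce per-file overhead, sketched here with a hypothetical table name, is to give Delta a larger target file size before compacting:

    ```sql
    -- Assumed table name. delta.targetFileSize steers the file size Delta
    -- aims for during OPTIMIZE and optimized writes; 128 MB shown here.
    ALTER TABLE my_catalog.my_schema.my_table
      SET TBLPROPERTIES ('delta.targetFileSize' = '134217728');

    -- Rewrite the small files at the new target size:
    OPTIMIZE my_catalog.my_schema.my_table;
    ```

    Larger files mean fewer open/read round-trips per scan, which matters most on object storage such as GCS. -->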
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114281#M44772</link>
      <description>&lt;P&gt;Thanks for the details&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;. I will check it out and optimize.&lt;BR /&gt;But actually, the number of files in the clone buckets I created is only a few less than in the buckets created by Delta Live Tables and managed by Unity Catalog.&lt;BR /&gt;In addition, when I look at the details, the scan Spark runs over the buckets created by DLT carries extra filters:&lt;BR /&gt;DataFilters: [isnull(__DeleteVersion#592), (isnull(__MEETS_DROP_EXPECTATIONS#595) OR __MEETS_DROP_EXPECTATIONS..., Format: Parquet, Location: PreparedDeltaFileIndex(1 paths)[gs://cimb-prod-lakehouse/gold-layer/__unitystorage/schemas/8962e5..., PartitionFilters: [], PushedFilters: [IsNull(__DeleteVersion), Or(IsNull(__MEETS_DROP_EXPECTATIONS),EqualTo(__MEETS_DROP_EXPECTATIONS&lt;/P&gt;&lt;P&gt;The clone buckets I created with the same config do not have these filters:&lt;BR /&gt;DataFilters: []&lt;/P&gt;&lt;P&gt;I think this might be why Spark takes so long on the buckets created by DLT, but I've never encountered this before.&lt;BR /&gt;Any thoughts on it?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 10:31:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114281#M44772</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-02T10:31:53Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114282#M44773</link>
      <description>&lt;P&gt;If there is only a small difference in the number of files, that is probably not it (unless there is serious data skew).&lt;BR /&gt;DLT probably adds these filters for data-quality reasons (expectations that might be defined on the DLT pipeline). But as I do not use DLT, I could be wrong here.&lt;BR /&gt;It certainly could be the cause, though, because the filters will be evaluated on read.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Apr 2025 10:40:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114282#M44773</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-04-02T10:40:57Z</dc:date>
    </item>
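    <!-- Whether those expectation filters show up in the scan can be checked from the query plan; a sketch with placeholder table names:

    ```sql
    -- EXPLAIN FORMATTED prints the FileScan node of the physical plan;
    -- compare the DataFilters / PushedFilters entries between the
    -- DLT-managed table and the clone.
    EXPLAIN FORMATTED SELECT * FROM my_catalog.gold.dlt_table;   -- hypothetical
    EXPLAIN FORMATTED SELECT * FROM my_catalog.gold.clone_table; -- hypothetical
    ```

    If only the DLT table's plan carries IsNull(__DeleteVersion) and __MEETS_DROP_EXPECTATIONS predicates, those filters are being evaluated on every read of that table. -->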
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114374#M44793</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;,&lt;BR /&gt;If the problem really is caused by these filters that DLT applies for data quality, is there a way to ignore these &lt;STRONG&gt;filters&lt;/STRONG&gt; when reading the data, and to ignore the &lt;STRONG&gt;delta_log&lt;/STRONG&gt; files when reading that partition?&lt;BR /&gt;I want to verify that the filters DLT adds automatically are what causes the performance problems when processing the data with Spark.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Apr 2025 07:53:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114374#M44793</guid>
      <dc:creator>minhhung0507</dc:creator>
      <dc:date>2025-04-03T07:53:07Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue with Spark SQL when working with data from Unity Catalog</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114376#M44794</link>
      <description>&lt;P&gt;You can read the physical Parquet files with spark.read.parquet().&lt;/P&gt;&lt;P&gt;The trick is to know which files are the current ones.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Apr 2025 08:47:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-with-spark-sql-when-working-with-data-from/m-p/114376#M44794</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-04-03T08:47:35Z</dc:date>
    </item>
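    <!-- Reading the Parquet files directly, bypassing the Delta log and therefore the DLT filters, can be sketched in Spark SQL; the path is a placeholder:

    ```sql
    -- Query the raw Parquet files under a (hypothetical) table location.
    -- This skips the _delta_log entirely, so files the log has already
    -- marked as removed may still be read - use only to verify overhead,
    -- not for correct results.
    SELECT * FROM parquet.`gs://my-bucket/path/to/table/`;
    ```

    If this direct read is fast while the Delta/Unity Catalog read is slow, that points at the expectation filters (or log replay) rather than the file layout. -->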
  </channel>
</rss>

