<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Discrepancy in Performance Reading Delta Tables from S3 in PySpark in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/discrepancy-in-performance-reading-delta-tables-from-s3-in/m-p/64639#M9712</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;Thank you for your quick response.&lt;/P&gt;&lt;P&gt;Does this mean that in the first scenario&amp;nbsp;&lt;SPAN&gt;predicate and projection pushdown are not working?&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 26 Mar 2024 12:50:22 GMT</pubDate>
    <dc:creator>namankhamesara</dc:creator>
    <dc:date>2024-03-26T12:50:22Z</dc:date>
    <item>
      <title>Discrepancy in Performance Reading Delta Tables from S3 in PySpark</title>
      <link>https://community.databricks.com/t5/get-started-discussions/discrepancy-in-performance-reading-delta-tables-from-s3-in/m-p/64581#M9710</link>
      <description>&lt;P&gt;Hello Databricks Community,&lt;/P&gt;&lt;P&gt;I've encountered a puzzling performance difference while reading Delta tables from S3 using PySpark, particularly when applying filters and projections, and I'd like to understand what causes it.&lt;/P&gt;&lt;P&gt;I've tried two methods:&lt;/P&gt;&lt;P&gt;&lt;CODE&gt;spark.read.format("delta").load(my_location).filter(my_filter).select("col1", "col2")&lt;/CODE&gt;&lt;BR /&gt;&lt;CODE&gt;spark.read.format("delta").load(filtered_data_source)&lt;/CODE&gt;&lt;/P&gt;&lt;P&gt;&lt;CODE&gt;my_location&lt;/CODE&gt; holds the whole dataset, whereas &lt;CODE&gt;filtered_data_source&lt;/CODE&gt; holds the data that results from applying the filter and column selection of the first method.&lt;/P&gt;&lt;P&gt;In theory, PySpark leverages predicate pushdown and projection pushdown, which should optimize query execution by fetching only the required data from the source. However, I'm observing a significant time gap between the two scenarios: the complete job takes 50 minutes with the first method and 10 minutes with the second, despite identical configurations.&lt;/P&gt;&lt;P&gt;My question arises from this discrepancy: if predicate pushdown is enabled in PySpark by default, why is there such a large time difference? Could it be that predicate and projection pushdown are not fully supported by default and need additional configuration to enable?&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2024 06:34:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/discrepancy-in-performance-reading-delta-tables-from-s3-in/m-p/64581#M9710</guid>
      <dc:creator>namankhamesara</dc:creator>
      <dc:date>2024-03-26T06:34:56Z</dc:date>
    </item>
    <item>
      <title>Re: Discrepancy in Performance Reading Delta Tables from S3 in PySpark</title>
      <link>https://community.databricks.com/t5/get-started-discussions/discrepancy-in-performance-reading-delta-tables-from-s3-in/m-p/64639#M9712</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;Thank you for your quick response.&lt;/P&gt;&lt;P&gt;Does this mean that in the first scenario&amp;nbsp;&lt;SPAN&gt;predicate and projection pushdown are not working?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Mar 2024 12:50:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/discrepancy-in-performance-reading-delta-tables-from-s3-in/m-p/64639#M9712</guid>
      <dc:creator>namankhamesara</dc:creator>
      <dc:date>2024-03-26T12:50:22Z</dc:date>
    </item>
    <item>
      <title>Re: Discrepancy in Performance Reading Delta Tables from S3 in PySpark</title>
      <link>https://community.databricks.com/t5/get-started-discussions/discrepancy-in-performance-reading-delta-tables-from-s3-in/m-p/108267#M9713</link>
      <description>&lt;P&gt;Use the &lt;CODE&gt;explain&lt;/CODE&gt; method to analyze the execution plans for both methods and identify any inefficiencies or differences in the plans.&lt;/P&gt;
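&lt;P&gt;A minimal sketch of that comparison, assuming the physical plan printed by &lt;CODE&gt;explain&lt;/CODE&gt; lists &lt;CODE&gt;PartitionFilters&lt;/CODE&gt; and &lt;CODE&gt;PushedFilters&lt;/CODE&gt; on its scan node, as Spark file scans typically do. The helper names &lt;CODE&gt;capture_plan&lt;/CODE&gt; and &lt;CODE&gt;pushed_down&lt;/CODE&gt; and the sample plan text are hypothetical and only for illustration; the bracket matching is deliberately naive and stops at the first closing bracket.&lt;/P&gt;
&lt;PRE&gt;
```python
import contextlib
import io
import re


def capture_plan(df):
    """Return the text that df.explain() prints (PySpark writes the plan to stdout)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def pushed_down(plan_text):
    """Extract the PartitionFilters and PushedFilters lists from a plan string.

    Returns None for a key that does not appear in the plan at all.
    """
    found = {}
    for key in ("PartitionFilters", "PushedFilters"):
        match = re.search(re.escape(key) + r": \[(.*?)\]", plan_text)
        found[key] = match.group(1) if match else None
    return found


# Hypothetical plan text, shaped like a Spark physical plan, so the helper
# can be tried without a cluster.
sample_plan = (
    "*(1) Project [col1#0, col2#1]\n"
    "+- *(1) Filter (isnotnull(col1#0) AND (col1#0 = a))\n"
    "   +- FileScan parquet [col1#0,col2#1] Batched: true, "
    "PartitionFilters: [], PushedFilters: [IsNotNull(col1), EqualTo(col1,a)]\n"
)
print(pushed_down(sample_plan))

# With a live SparkSession (not run here), the two reads from the question
# could be compared like this:
#   plan_a = capture_plan(spark.read.format("delta").load(my_location)
#                         .filter(my_filter).select("col1", "col2"))
#   plan_b = capture_plan(spark.read.format("delta").load(filtered_data_source))
#   print(pushed_down(plan_a), pushed_down(plan_b))
```
&lt;/PRE&gt;
&lt;P&gt;An empty &lt;CODE&gt;PushedFilters&lt;/CODE&gt; list for the first read would suggest the filter is applied only after the files are scanned. A populated list means the filter reaches the scan, but the job may still read many more files than the pre-filtered table contains.&lt;/P&gt;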
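&lt;P&gt;You can also review the query metrics in the Spark UI, such as the number of files and the amount of data each scan reads, to see how much data each method actually fetches.&lt;/P&gt;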
&lt;P&gt;&lt;A href="https://www.databricks.com/discover/pages/optimize-data-workloads-guide" target="_blank"&gt;https://www.databricks.com/discover/pages/optimize-data-workloads-guide&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 01 Feb 2025 07:11:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/discrepancy-in-performance-reading-delta-tables-from-s3-in/m-p/108267#M9713</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-02-01T07:11:41Z</dc:date>
    </item>
  </channel>
</rss>

