Discrepancy in Performance Reading Delta Tables from S3 in PySpark

namankhamesara — Tue, 26 Mar 2024 06:34:56 GMT

Hello Databricks Community,

I've encountered a puzzling performance difference while reading Delta tables from S3 using PySpark, particularly when applying filters and projections. I'm seeking insights to understand this variation better.

I've attempted two methods:

spark.read.format("delta").load(my_location).filter(my_filter).select("col1", "col2")
spark.read.format("delta").load(filtered_data_source)
my_location consists of the whole dataset, whereas filtered_data_source contains data after applying filters and selecting specific columns in the first scenario.

In theory, PySpark leverages predicate pushdown and projection pushdown, which should optimize query execution by fetching only the required data from the source. However, I'm observing a significant time gap between the two scenarios: 50 minutes to execute complete job for the first and 10 minutes to execute complete job for the second, despite identical configurations.

My question arises from this discrepancy: If predicate pushdown is included in PySpark by default, why is there a significant time difference? Could it be that predicate pushdown and projection are not fully supported in PySpark by default, and additional configurations are necessary to enable these optimizations?

Re: Discrepancy in Performance Reading Delta Tables from S3 in PySpark

namankhamesara — Tue, 26 Mar 2024 12:50:22 GMT

Hi @Retired_mod
Thank you for your quick response.

Is it like In the first scenario predicate and projection pushdown is not working?

Re: Discrepancy in Performance Reading Delta Tables from S3 in PySpark

NandiniN — Sat, 01 Feb 2025 07:11:41 GMT

Use the explain method to analyze the execution plans for both methods and identify any inefficiencies or differences in the plans.

You can also review the metrics to understand this further.

https://www.databricks.com/discover/pages/optimize-data-workloads-guide

topic Re: Discrepancy in Performance Reading Delta Tables from S3 in PySpark in Get Started Discussions

Discrepancy in Performance Reading Delta Tables from S3 in PySpark

Re: Discrepancy in Performance Reading Delta Tables from S3 in PySpark

Re: Discrepancy in Performance Reading Delta Tables from S3 in PySpark