
Discrepancy in Performance Reading Delta Tables from S3 in PySpark

namankhamesara
New Contributor II

Hello Databricks Community,

I've encountered a puzzling performance difference while reading Delta tables from S3 using PySpark, particularly when applying filters and projections. I'm seeking insights to understand this variation better.

I've attempted two methods:

1. spark.read.format("delta").load(my_location).filter(my_filter).select("col1", "col2")
2. spark.read.format("delta").load(filtered_data_source)

Here, my_location contains the whole dataset, whereas filtered_data_source contains the data that results from applying the filter and column selection of the first method.

In theory, PySpark uses predicate pushdown and projection pushdown, which should optimize query execution by fetching only the required data from the source. However, I'm observing a significant time gap between the two scenarios: the complete job takes 50 minutes with the first method and 10 minutes with the second, despite identical configurations.

My question arises from this discrepancy: If predicate pushdown is included in PySpark by default, why is there a significant time difference? Could it be that predicate pushdown and projection are not fully supported in PySpark by default, and additional configurations are necessary to enable these optimizations?

2 REPLIES

Hi @Retired_mod,

Thank you for your quick response.

Does this mean that predicate and projection pushdown are not working in the first scenario?

NandiniN
Databricks Employee

Use the explain method to analyze the execution plans for both methods and identify any inefficiencies or differences in the plans.
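As a rough sketch of that comparison, the helpers below capture the text that `explain` prints and pull out the scan-related fields, so the two reads can be checked side by side. The helper names `explain_text` and `scan_details` are made up for this example, and the parsing assumes the line-per-field layout of `explain(mode="formatted")` (Spark 3.0+); field names can vary between Spark and Delta versions, so treat this as a starting point rather than a definitive check.

```python
import contextlib
import io
import re


def explain_text(df, mode="formatted"):
    """Capture the plan that df.explain() prints so it can be searched or diffed."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain(mode=mode)
    return buf.getvalue()


def scan_details(plan_text):
    """Collect scan-related fields from a formatted physical plan.

    Assumes each field is printed on its own line as "Name: value".
    Returns a dict mapping each field name to the list of values found
    (one entry per scan node; empty list if the field is absent).
    """
    fields = ("PartitionFilters", "PushedFilters", "DataFilters", "ReadSchema")
    return {
        name: re.findall(rf"^[ \t]*{name}: (.*)$", plan_text, re.MULTILINE)
        for name in fields
    }


# Usage on a cluster (my_location, my_filter and filtered_data_source are the
# placeholders from the question above):
#
#   full = spark.read.format("delta").load(my_location) \
#       .filter(my_filter).select("col1", "col2")
#   pre = spark.read.format("delta").load(filtered_data_source)
#   print(scan_details(explain_text(full)))
#   print(scan_details(explain_text(pre)))
```

If the filter shows up under PushedFilters or PartitionFilters and ReadSchema lists only col1 and col2, pushdown is being applied in the first method. Even then, the scan can only skip files or row groups whose statistics rule out a match, so the first read may still touch far more data than the pre-filtered table.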

You can also review the query metrics for both runs to understand the difference further.

https://www.databricks.com/discover/pages/optimize-data-workloads-guide