Discrepancy in Performance Reading Delta Tables from S3 in PySpark

namankhamesara
New Contributor II

Hello Databricks Community,

I've encountered a puzzling performance difference while reading Delta tables from S3 using PySpark, particularly when applying filters and projections. I'm seeking insights to understand this variation better.

I've attempted two methods:

1. spark.read.format("delta").load(my_location).filter(my_filter).select("col1", "col2")
2. spark.read.format("delta").load(filtered_data_source)

Here, my_location points to the full dataset, whereas filtered_data_source contains the data produced by the first scenario, i.e. after the filter and column selection were applied.

In theory, PySpark leverages predicate pushdown and projection pushdown, which should optimize query execution by fetching only the required data from the source. However, I'm observing a significant time gap between the two scenarios: the complete job takes 50 minutes in the first case and 10 minutes in the second, despite identical configurations.

My question arises from this discrepancy: if predicate pushdown is enabled in PySpark by default, why is there such a large time difference? Could it be that predicate and projection pushdown are not fully applied by default, and additional configuration is needed to enable these optimizations?


Kaniz_Fatma
Community Manager

Hi @namankhamesara

  • In your first scenario, you load the entire dataset (my_location), apply filters, and select specific columns. This involves reading all the data from S3, applying filters in Spark, and then selecting columns.
  • In the second scenario, you load a pre-filtered dataset (filtered_data_source). Since the filtering and column selection are done before loading, less data is transferred from S3 to Spark.
  • The significant time difference you observe likely stems from the amount of data read and transferred. The second scenario is more efficient due to predicate and projection pushdown.
  • While predicate pushdown is generally included by default in PySpark, there could be other factors affecting performance:
    • Data Distribution: If your filtered dataset is smaller and better distributed, it can lead to faster execution.
    • Partitioning: Ensure that your data is properly partitioned to take advantage of parallelism.
    • Column Statistics: Sometimes, additional configurations (like collecting column statistics) can further optimize query execution.

Hi @Kaniz_Fatma 
Thank you for your quick response.

Does that mean predicate and projection pushdown are not working in the first scenario?
