Discrepancy in Performance Reading Delta Tables from S3 in PySpark

namankhamesara
New Contributor II

Hello Databricks Community,

I've encountered a puzzling performance difference while reading Delta tables from S3 using PySpark, particularly when applying filters and projections. I'm seeking insights to understand this variation better.

I've attempted two methods:

1. spark.read.format("delta").load(my_location).filter(my_filter).select("col1", "col2")
2. spark.read.format("delta").load(filtered_data_source)

Here, my_location points to the full dataset, whereas filtered_data_source contains the data produced by the first scenario, i.e. after the filter and column selection were applied.

In theory, PySpark leverages predicate pushdown and projection pushdown, which should optimize query execution by fetching only the required data from the source. However, I'm observing a significant time gap between the two scenarios: the complete job takes 50 minutes in the first case and 10 minutes in the second, despite identical configurations.

My question arises from this discrepancy: if predicate pushdown is enabled in PySpark by default, why is there such a large time difference? Could it be that predicate and projection pushdown are not fully applied by default, and additional configuration is needed to enable these optimizations?


Kaniz_Fatma
Community Manager

Hi @namankhamesara

  • In your first scenario, you load the entire dataset (my_location), apply filters, and select specific columns. This involves reading all the data from S3, applying filters in Spark, and then selecting columns.
  • In the second scenario, you load a pre-filtered dataset (filtered_data_source). Since the filtering and column selection are done before loading, less data is transferred from S3 to Spark.
  • The significant time difference you observe likely stems from the amount of data read and transferred. The second scenario is more efficient due to predicate and projection pushdown.
  • While predicate pushdown is generally included by default in PySpark, there could be other factors affecting performance:
    • Data Distribution: If your filtered dataset is smaller and better distributed, it can lead to faster execution.
    • Partitioning: Ensure that your data is properly partitioned to take advantage of parallelism.
    • Column Statistics: Sometimes, additional configurations (like collecting column statistics) can further optimize query execution.

Hi @Kaniz_Fatma 
Thank you for your quick response.

Does that mean predicate and projection pushdown are not working in the first scenario?
