Hello Databricks Community,
I've encountered a puzzling performance difference while reading Delta tables from S3 using PySpark, particularly when applying filters and projections. I'm seeking insights to understand this variation better.
I've attempted two methods:
spark.read.format("delta").load(my_location).filter(my_filter).select("col1", "col2")
spark.read.format("delta").load(filtered_data_source)
my_location points to the full dataset, whereas filtered_data_source holds the output of the first approach, i.e. data that was written out after the same filter and column selection had already been applied.
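For context, here is a minimal sketch of what the two reads look like on my side; the bucket paths, the date predicate, and the column names below are placeholders rather than my real values:

from pyspark.sql import functions as F

# Approach 1: load the full Delta table and let Spark apply the filter and projection
df1 = (spark.read.format("delta")
    .load("s3://my-bucket/full_dataset/")           # my_location (placeholder path)
    .filter(F.col("event_date") >= "2024-01-01")    # my_filter (placeholder predicate)
    .select("col1", "col2"))

# Approach 2: load a Delta table that was already written with the filter and projection applied
df2 = spark.read.format("delta").load("s3://my-bucket/filtered_dataset/")  # filtered_data_source (placeholder path)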
In theory, PySpark leverages predicate pushdown and projection pushdown, which should optimize query execution by fetching only the required data from the source. However, I'm seeing a significant gap between the two scenarios: the complete job takes 50 minutes with the first approach and 10 minutes with the second, despite identical configurations.
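In case it helps with the diagnosis, this is the kind of check I can run to see whether the pushdown actually happens; I'm assuming that explain(mode="formatted") and the PushedFilters / ReadSchema entries in the scan node of the physical plan are the right things to look at:

# Using df1 from the sketch above: if pushdown is applied, the scan node
# should list the predicate under PushedFilters and only col1 and col2
# under ReadSchema.
df1.explain(mode="formatted")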
My question comes from this discrepancy: if predicate pushdown is enabled in PySpark by default, why is there such a large time difference? Could it be that predicate and projection pushdown are not fully applied by default, and additional configuration is needed to enable these optimizations?