cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Filter not using partition

jenshumrich
New Contributor III

I have the following code:

spark.sparkContext.setCheckpointDir("dbfs:/mnt/lifestrategy-blob/checkpoints")
result_df.repartitionByRange(200, "IdStation")
result_df_checked = result_df.checkpoint(eager=True)
unique_stations = result_df.select("IdStation").distinct().collect()

for station in unique_stations:
    station_id = station["IdStation"]
    # Filter rows for the current station ID
    station_df = result_df.filter(col("IdStation") == station_id)

I noticed, that the checkpoint has 3600 files and both the collection of the unique stations and filter on the IdStation column does not use any information from the repartitionByRange. I aslo tried partition, but it did not improve on the full scanning of all 3600 files. Any ideas?

3 REPLIES 3

jose_gonzalez
Moderator
Moderator

Please check the physical query plan. Add .explain() API to your existing call and check the physical query plan for any filter push-down  values happening in your query.

jenshumrich
New Contributor III

Thanks a lot for your response. It seems the Filter is not pushed down, no?

 
station_df.explain()
== Physical Plan ==
*(1) Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))
+- *(1) Scan ExistingRDD[Date#2718,WindSpeed#2675,Tower_Acceleration#2676,Density#2677,IdStation#2678,WindShear#2684,Upflow#2691,Control_Mode#2699,Tw_Frequency#2708]

jose_gonzalez
Moderator
Moderator

it seems like there is a filter being apply according to this. 

Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))

 I would like to share the following notebook that covers in detail this topic, in case you would like to check it out https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741...

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!