cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Filter not using partition

jenshumrich
New Contributor III

I have the following code:

spark.sparkContext.setCheckpointDir("dbfs:/mnt/lifestrategy-blob/checkpoints")
result_df.repartitionByRange(200, "IdStation")
result_df_checked = result_df.checkpoint(eager=True)
unique_stations = result_df.select("IdStation").distinct().collect()

for station in unique_stations:
    station_id = station["IdStation"]
    # Filter rows for the current station ID
    station_df = result_df.filter(col("IdStation") == station_id)

I noticed, that the checkpoint has 3600 files and both the collection of the unique stations and filter on the IdStation column does not use any information from the repartitionByRange. I aslo tried partition, but it did not improve on the full scanning of all 3600 files. Any ideas?

3 REPLIES 3

jose_gonzalez
Moderator
Moderator

Please check the physical query plan. Add .explain() API to your existing call and check the physical query plan for any filter push-down  values happening in your query.

jenshumrich
New Contributor III

Thanks a lot for your response. It seems the Filter is not pushed down, no?

 
station_df.explain()
== Physical Plan ==
*(1) Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))
+- *(1) Scan ExistingRDD[Date#2718,WindSpeed#2675,Tower_Acceleration#2676,Density#2677,IdStation#2678,WindShear#2684,Upflow#2691,Control_Mode#2699,Tw_Frequency#2708]

jose_gonzalez
Moderator
Moderator

it seems like there is a filter being apply according to this. 

Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))

 I would like to share the following notebook that covers in detail this topic, in case you would like to check it out https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741...

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.