Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Filter not using partition

jenshumrich
Contributor

I have the following code:

from pyspark.sql.functions import col

spark.sparkContext.setCheckpointDir("dbfs:/mnt/lifestrategy-blob/checkpoints")
result_df.repartitionByRange(200, "IdStation")
result_df_checked = result_df.checkpoint(eager=True)
unique_stations = result_df.select("IdStation").distinct().collect()

for station in unique_stations:
    station_id = station["IdStation"]
    # Filter rows for the current station ID
    station_df = result_df.filter(col("IdStation") == station_id)

I noticed that the checkpoint has 3600 files, and that neither the collection of the unique stations nor the filter on the IdStation column uses any information from the repartitionByRange. I also tried partition, but it did not avoid the full scan of all 3600 files. Any ideas?

3 REPLIES

jose_gonzalez
Moderator

Please check the physical query plan. Call .explain() on your DataFrame and look for any filter push-down happening in your query.

jenshumrich
Contributor

Thanks a lot for your response. It seems the Filter is not pushed down, no?

 
station_df.explain()
== Physical Plan ==
*(1) Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))
+- *(1) Scan ExistingRDD[Date#2718,WindSpeed#2675,Tower_Acceleration#2676,Density#2677,IdStation#2678,WindShear#2684,Upflow#2691,Control_Mode#2699,Tw_Frequency#2708]

jose_gonzalez
Moderator

It seems a filter is being applied, according to this:

Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))

I would like to share the following notebook, which covers this topic in detail, in case you would like to check it out: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741...
