<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Filter not using partition in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/66066#M33005</link>
    <description>&lt;P&gt;I have the following code:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;spark.sparkContext.setCheckpointDir("dbfs:/mnt/lifestrategy-blob/checkpoints")
result_df.repartitionByRange(200, "IdStation")
result_df_checked = result_df.checkpoint(eager=True)
unique_stations = result_df.select("IdStation").distinct().collect()

for station in unique_stations:
    station_id = station["IdStation"]
    # Filter rows for the current station ID
    station_df = result_df.filter(col("IdStation") == station_id)&lt;/LI-CODE&gt;&lt;P&gt;I noticed that the checkpoint has 3600 files, and neither collecting the unique stations nor filtering on the IdStation column uses any information from the repartitionByRange. I also tried partitioning, but it did not avoid the full scan of all 3600 files. Any ideas?&lt;/P&gt;</description>
    <pubDate>Thu, 11 Apr 2024 11:46:16 GMT</pubDate>
    <dc:creator>jenshumrich</dc:creator>
    <dc:date>2024-04-11T11:46:16Z</dc:date>
    <item>
      <title>Filter not using partition</title>
      <link>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/66066#M33005</link>
      <description>&lt;P&gt;I have the following code:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;spark.sparkContext.setCheckpointDir("dbfs:/mnt/lifestrategy-blob/checkpoints")
result_df.repartitionByRange(200, "IdStation")
result_df_checked = result_df.checkpoint(eager=True)
unique_stations = result_df.select("IdStation").distinct().collect()

for station in unique_stations:
    station_id = station["IdStation"]
    # Filter rows for the current station ID
    station_df = result_df.filter(col("IdStation") == station_id)&lt;/LI-CODE&gt;&lt;P&gt;I noticed that the checkpoint has 3600 files, and neither collecting the unique stations nor filtering on the IdStation column uses any information from the repartitionByRange. I also tried partitioning, but it did not avoid the full scan of all 3600 files. Any ideas?&lt;/P&gt;</description>
      <pubDate>Thu, 11 Apr 2024 11:46:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/66066#M33005</guid>
      <dc:creator>jenshumrich</dc:creator>
      <dc:date>2024-04-11T11:46:16Z</dc:date>
    </item>
    <item>
      <title>Re: Filter not using partition</title>
      <link>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/66631#M33176</link>
      <description>&lt;P&gt;Please check the physical query plan: add the .explain() API to your existing call and look for any filter push-down happening in your query.&lt;/P&gt;</description>
      <pubDate>Thu, 18 Apr 2024 21:56:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/66631#M33176</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2024-04-18T21:56:56Z</dc:date>
    </item>
    <item>
      <title>Re: Filter not using partition</title>
      <link>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/67325#M33320</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Thanks a lot for your response. It seems the filter is not pushed down, no?&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;station_df.explain()
== Physical Plan ==
*(1) Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))
+- *(1) Scan ExistingRDD[Date#2718,WindSpeed#2675,Tower_Acceleration#2676,Density#2677,IdStation#2678,WindShear#2684,Upflow#2691,Control_Mode#2699,Tw_Frequency#2708]&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 25 Apr 2024 14:40:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/67325#M33320</guid>
      <dc:creator>jenshumrich</dc:creator>
      <dc:date>2024-04-25T14:40:01Z</dc:date>
    </item>
    <item>
      <title>Re: Filter not using partition</title>
      <link>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/67759#M33436</link>
      <description>&lt;P&gt;It seems like there is a filter being applied, according to this:&lt;/P&gt;
&lt;PRE class="lia-code-sample  language-python"&gt;&lt;CODE&gt;Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I would like to share the following notebook, which covers this topic in detail, in case you would like to check it out:&amp;nbsp;&lt;A href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html" target="_blank"&gt;https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/4201913720573284/4413065072037724/latest.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 01 May 2024 00:10:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/filter-not-using-partition/m-p/67759#M33436</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2024-05-01T00:10:11Z</dc:date>
    </item>
  </channel>
</rss>