Filter not using partition
04-11-2024 04:46 AM
I have the following code:
from pyspark.sql.functions import col

spark.sparkContext.setCheckpointDir("dbfs:/mnt/lifestrategy-blob/checkpoints")
result_df.repartitionByRange(200, "IdStation")
result_df_checked = result_df.checkpoint(eager=True)
unique_stations = result_df.select("IdStation").distinct().collect()
for station in unique_stations:
    station_id = station["IdStation"]
    # Filter rows for the current station ID
    station_df = result_df.filter(col("IdStation") == station_id)
I noticed that the checkpoint has 3600 files, and neither the collection of the unique stations nor the filter on the IdStation column uses any information from the repartitionByRange. I also tried partition, but it did not avoid the full scan of all 3600 files. Any ideas?
04-18-2024 02:56 PM
Please check the physical query plan. Call .explain() on your DataFrame and look in the physical plan for any filter push-down happening in your query.
04-25-2024 07:40 AM
Thanks a lot for your response. It seems the Filter is not pushed down, no?
station_df.explain()
== Physical Plan ==
*(1) Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))
+- *(1) Scan ExistingRDD[Date#2718,WindSpeed#2675,Tower_Acceleration#2676,Density#2677,IdStation#2678,WindShear#2684,Upflow#2691,Control_Mode#2699,Tw_Frequency#2708]
04-30-2024 05:10 PM
It seems like there is a filter being applied, according to this:
Filter (isnotnull(IdStation#2678) AND (IdStation#2678 = 1119844))
I would like to share the following notebook, which covers this topic in detail, in case you would like to check it out: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741...

