Filter not using partition
Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
04-11-2024 04:46 AM
I have the following code:
spark.sparkContext.setCheckpointDir("dbfs:/mnt/lifestrategy-blob/checkpoints")
result_df.repartitionByRange(200, "IdStation")
result_df_checked = result_df.checkpoint(eager=True)
unique_stations = result_df.select("IdStation").distinct().collect()
for station in unique_stations:
station_id = station["IdStation"]
# Filter rows for the current station ID
station_df = result_df.filter(col("IdStation") == station_id)I noticed, that the checkpoint has 3600 files and both the collection of the unique stations and filter on the IdStation column does not use any information from the repartitionByRange. I aslo tried partition, but it did not improve on the full scanning of all 3600 files. Any ideas?