Filter not using partition

jenshumrich · ‎04-11-2024

I have the following code:

spark.sparkContext.setCheckpointDir("dbfs:/mnt/lifestrategy-blob/checkpoints")
result_df.repartitionByRange(200, "IdStation")
result_df_checked = result_df.checkpoint(eager=True)
unique_stations = result_df.select("IdStation").distinct().collect()

for station in unique_stations:
    station_id = station["IdStation"]
    # Filter rows for the current station ID
    station_df = result_df.filter(col("IdStation") == station_id)

I noticed, that the checkpoint has 3600 files and both the collection of the unique stations and filter on the IdStation column does not use any information from the repartitionByRange. I aslo tried partition, but it did not improve on the full scanning of all 3600 files. Any ideas?