Re: Partition optimization strategy for task that ...

balajij8 · 04-08-2026

Hi, key points below:

Time Window Chunking - Avoid interpolating a full week of data in a single Spark action. Split the workload into daily or 12-hour slices. This caps peak memory pressure, lets slices run in parallel, and simplifies failure recovery, since only a failed slice needs to be rerun.
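A minimal sketch of the slicing step, in plain Python so it stays independent of any cluster setup. The function name `time_chunks` and the example dates are made up for illustration; each resulting window would then drive one filtered Spark job.

```python
from datetime import datetime, timedelta


def time_chunks(start, end, chunk=timedelta(hours=12)):
    """Yield half-open [chunk_start, chunk_end) windows covering [start, end)."""
    if chunk <= timedelta(0):
        raise ValueError("chunk must be a positive duration")
    cursor = start
    while cursor < end:
        # Clamp the last window so it never runs past the requested end.
        nxt = min(cursor + chunk, end)
        yield cursor, nxt
        cursor = nxt


# One week split into 12-hour slices gives 14 windows.
week = list(time_chunks(datetime(2026, 4, 1), datetime(2026, 4, 8)))
```

One thing to watch: interpolation across a slice boundary probably needs the last measurement before the slice start and the first one after the slice end, so each slice's input filter may need a small overlap.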
Before interpolation, compute how many output rows each signal id will generate in the target window. This follows from the duration between the earliest and latest measurement for each signal within the chunk, divided by the interpolation step.
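The row estimate is simple arithmetic. A rough sketch, assuming a fixed interpolation step and that both endpoints are emitted (the helper name `expected_rows` is hypothetical):

```python
from datetime import datetime


def expected_rows(first_ts, last_ts, step_seconds=1):
    """Estimate interpolated rows between two timestamps, endpoints included."""
    span = (last_ts - first_ts).total_seconds()
    if span < 0:
        raise ValueError("last_ts must not be earlier than first_ts")
    # Number of whole steps that fit in the span, plus the starting row.
    return int(span // step_seconds) + 1


# A signal covering a full 12-hour chunk at 1-second resolution.
rows_12h = expected_rows(datetime(2026, 4, 1, 0), datetime(2026, 4, 1, 12))
```

In Spark the inputs would presumably come from a group-by on id with the min and max timestamp per chunk, which is cheap compared with the interpolation itself.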
Create Synthetic Workload Partition Key - Group signals into buckets based on their expected post-interpolation row count. Target roughly equal memory per bucket (for example, about 150 MB of expected interpolated data per partition).
Repartition by Workload Bucket - Apply a repartition operation using the synthetic bucket key. This gives each Spark task a comparable interpolation workload after expansion, which removes the severe skew that causes disk spill and straggler tasks.
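One way to build the bucket key is a greedy "largest first, into the lightest bucket" assignment. This is only a sketch of that heuristic, not necessarily what you would run in production; `assign_buckets` and the sample sizes are invented for illustration.

```python
import heapq

MB = 1024 ** 2


def assign_buckets(expected_bytes, target_bytes=150 * MB):
    """Map each signal id to a bucket so bucket totals stay close to target_bytes.

    Greedy heuristic: process signals from largest to smallest and always
    place the next one into the bucket with the smallest load so far.
    """
    total = sum(expected_bytes.values())
    n_buckets = max(1, -(-total // target_bytes))  # ceiling division
    heap = [(0, bucket) for bucket in range(n_buckets)]  # (load, bucket id)
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(expected_bytes.items(), key=lambda kv: kv[1], reverse=True)
    for signal_id, size in ordered:
        load, bucket = heapq.heappop(heap)
        assignment[signal_id] = bucket
        heapq.heappush(heap, (load + size, bucket))
    return assignment, n_buckets


# Hypothetical expected sizes after interpolation, in bytes.
sizes = {"a": 140 * MB, "b": 100 * MB, "c": 50 * MB, "d": 10 * MB}
assignment, n_buckets = assign_buckets(sizes)
```

The resulting mapping can be joined back to the data as a bucket column and used as the repartition key. Two caveats: a single signal larger than the target cannot be split this way and would need a finer time slice, and repartitioning by a column hashes its values, so distinct bucket ids can still land in the same partition. It is worth checking the actual partition sizes after the shuffle.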
Keep Adaptive Query Execution (AQE) enabled, with the advisory partition size tuned to your bucket target. Where Photon is enabled, it can accelerate sequence generation and sort operations without manual tuning.
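For reference, these are the standard Spark SQL settings involved. The 150MB value just mirrors the bucket target mentioned above and is an assumption, and `apply_aqe_settings` is a hypothetical helper that expects an existing SparkSession.

```python
# Standard Spark SQL configuration keys; the values are illustrative only.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "150MB",
}


def apply_aqe_settings(spark, settings=AQE_SETTINGS):
    """Apply each setting to the given SparkSession's runtime config."""
    for key, value in settings.items():
        spark.conf.set(key, value)
```

In a notebook this would be called as `apply_aqe_settings(spark)` before the interpolation job. As far as I know, AQE does not change the partition count of a repartition that specifies an explicit number of partitions, so the advisory size mainly affects the other shuffles in the plan.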
Persist the interpolated output to Delta Lake with liquid clustering on id and timestamp, which improves data skipping for downstream queries.
Evaluate whether all signals require 1-second granularity. Consider a tiered interpolation strategy, for example critical signals at 1 s and low-variability signals at 60 s.
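To get a feel for what tiering saves, here is a back-of-the-envelope sketch. The tier names and the `STEP_SECONDS` mapping are hypothetical; how signals get classified into tiers is up to you.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Hypothetical tiers mapped to their interpolation step in seconds.
STEP_SECONDS = {"critical": 1, "low_variability": 60}


def rows_per_day(tier):
    """Interpolated rows one signal produces per day, endpoints included."""
    return SECONDS_PER_DAY // STEP_SECONDS[tier] + 1


critical_rows = rows_per_day("critical")
low_rows = rows_per_day("low_variability")
```

That works out to 86,401 rows per day for a critical signal against 1,441 for a low-variability one, so every signal moved to the 60 s tier cuts its output by roughly a factor of 60.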
Track partition distribution, spill volume, and task duration metrics to iteratively adjust bucket sizing.
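A simple number to watch between runs is the ratio of the largest partition to the median one. A sketch, assuming you have already collected per-partition row counts (for example by grouping on `spark_partition_id()`); the name `skew_ratio` and the sample counts are made up.

```python
from statistics import median


def skew_ratio(partition_rows):
    """Largest partition size divided by the median partition size."""
    if not partition_rows:
        raise ValueError("need at least one partition")
    mid = median(partition_rows)
    if mid == 0:
        raise ValueError("median partition is empty; ratio is undefined")
    return max(partition_rows) / mid


# One partition is roughly four times heavier than the typical one.
ratio = skew_ratio([100, 110, 90, 400])
```

A ratio close to 1 suggests the buckets are balanced. A value that stays high after repartitioning is a hint to revisit the bucket target or to split the heaviest signals further.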