I like the core idea. You are mining signals the platform already emits. I would start with rules first: track the small-files ratio and the average file size trend, and watch skew per partition and shuffle bytes per input gigabyte. Compare job time to input size to c...
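
To make the rules-first idea concrete, here is a minimal Python sketch of three of the signals mentioned above: small-files ratio, per-partition skew, and shuffle bytes per input gigabyte. Everything in it is an assumption for illustration, not something your platform necessarily exposes this way: the `job` dictionary layout, the function names, the 32 MiB small-file cutoff, and the default thresholds would all need tuning against your own job history.

```python
from statistics import mean

GIB = 1024 ** 3
SMALL_FILE_BYTES = 32 * 1024 * 1024  # assumed cutoff for a "small" file


def small_files_ratio(file_sizes):
    """Fraction of files whose size is below the assumed small-file cutoff."""
    if not file_sizes:
        return 0.0
    return sum(1 for s in file_sizes if s < SMALL_FILE_BYTES) / len(file_sizes)


def partition_skew(partition_bytes):
    """Largest partition divided by the mean partition size (1.0 = balanced)."""
    if not partition_bytes:
        return 0.0
    avg = mean(partition_bytes)
    return max(partition_bytes) / avg if avg else 0.0


def shuffle_per_input_gb(shuffle_bytes, input_bytes):
    """Shuffle bytes written per GiB of input read."""
    if input_bytes <= 0:
        return 0.0
    return shuffle_bytes / (input_bytes / GIB)


def evaluate(job, max_small_ratio=0.5, max_skew=4.0, max_shuffle_per_gb=2 * GIB):
    """Return the names of the rules a job violates (thresholds are placeholders)."""
    flags = []
    if small_files_ratio(job["file_sizes"]) > max_small_ratio:
        flags.append("small_files")
    if partition_skew(job["partition_bytes"]) > max_skew:
        flags.append("partition_skew")
    if shuffle_per_input_gb(job["shuffle_bytes"], job["input_bytes"]) > max_shuffle_per_gb:
        flags.append("shuffle_heavy")
    return flags
```

Keeping each signal as its own small function means you can log the raw values over time (for the trend side of things) and only layer thresholds on top once you have seen what normal looks like for your workloads.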