Hubert-Dudek
Databricks MVP

Caching is expensive: it tries to save the data to memory, and to disk if there is no more space left in memory. I know that, in theory, it should improve performance, but it can make things worse. I would drop the cache and just put:

scaled_train_data = pipeline_data.transform(train_df)

scaled_test_data = pipeline_data.transform(test_df)

Then I would analyze the number and size of the partitions of scaled_train_data and scaled_test_data, and repartition so that each partition is between 100 and 200 MB. I would set the number of partitions to a multiple of the total number of cores on the workers (for example, with 16 worker cores, 64 partitions of about 150 MB each).
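The arithmetic behind that example can be sketched as below. This is only an illustration: optimal_partitions and its parameters are names I made up, and the total data size is something you have to estimate yourself (for example from the Spark UI or from the size of the source files).

```python
import math


def optimal_partitions(total_size_bytes, worker_cores, target_partition_mb=150):
    """Hypothetical helper: partition count that is a multiple of worker cores."""
    # Partitions needed so that each one holds roughly target_partition_mb.
    needed = math.ceil(total_size_bytes / (target_partition_mb * 1024 * 1024))
    # Round up to the next multiple of the core count, with at least one
    # partition per core, so every core gets work in each wave of tasks.
    waves = max(1, math.ceil(needed / worker_cores))
    return waves * worker_cores


# 64 partitions of 150 MB is 9600 MB in total; with 16 worker cores:
print(optimal_partitions(9600 * 1024 * 1024, 16))  # 64
```

For small datasets the rounding gives one partition per core, so partitions can end up below 100 MB; that is usually fine, but it is a judgment call.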

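To check the number of partitions, use one of the following (the first is PySpark, the other two are the Scala API; all of them return the partition count, not the size in bytes):

df.rdd.getNumPartitions()

df.rdd.partitions.length

df.rdd.partitions.size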
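To get a feeling for how evenly the data is spread, one option is to count rows per partition. A minimal sketch, assuming df is a PySpark DataFrame; rows_per_partition and skew_summary are hypothetical helper names, and row counts are only a rough proxy for size in MB:

```python
def rows_per_partition(df):
    """Return a list with the row count of each partition of a PySpark DataFrame."""
    # Count rows lazily inside each partition; each partition yields one number.
    return df.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()


def skew_summary(counts):
    """Return (number of partitions, smallest, largest, mean rows per partition)."""
    if not counts:
        return 0, 0, 0, 0.0
    return len(counts), min(counts), max(counts), sum(counts) / len(counts)


# Usage (runs a Spark job, so it is not free on large data):
# counts = rows_per_partition(scaled_train_data)
# print(skew_summary(counts))
```

If the largest partition is many times bigger than the mean, the data is skewed and a repartition is likely to help.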

To repartition use:

scaled_train_data = pipeline_data.transform(train_df).repartition(optimal_number)

scaled_test_data = pipeline_data.transform(test_df).repartition(optimal_number)


My blog: https://databrickster.medium.com/