- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-05-2023 02:38 AM
The cache is expensive and wants to save that data to memory and disk (id there is no more space left in memory). I know that, in theory, it should improve, but it can make things worse. I would just put
scaled_train_data = pipeline_data.transform(train_df)
scaled_test_data = pipeline_data.transform(test_df)
Then I would analyze the number and size of partitions of scaled_train_data and scaled_test_data and then repartition so each partition would be between 100 and 200 MB. The number of partitions I would set as a multiplier of cores on workers (so, for example, 16 cores on workers, and we have 64 partitions 150 MB each).
To analyze use:
df.rdd.getNumPartitions
df.rdd.partitions.length
df.rdd.partitions.size
To repartition use:
scaled_train_data = pipeline_data.transform(train_df).repartition(optimal_number)
scaled_test_data = pipeline_data.transform(test_df).repartition(optimal_number)
My blog: https://databrickster.medium.com/