Need suggestion for a better caching strategy

vishwanath_1 · ‎01-22-2024

I have the following steps to perform:

1. Read a CSV file (a very large file, ~100 GB).

2. Add an index using the zipWithIndex function.

3. Repartition the DataFrame.

4. Pass the result on to another function.
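For reference, the four steps could be sketched roughly as follows in PySpark. This is a minimal sketch, not my exact code: the function name `build_indexed_frame`, the column name `row_idx`, the `header=True` option, and the `MEMORY_AND_DISK` storage level are all assumptions for illustration, and persisting right after the repartition is only one candidate placement for the cache.

```python
def build_indexed_frame(spark, csv_path, num_partitions):
    """Read a CSV, add a row index with zipWithIndex, repartition, and persist.

    All names here are hypothetical; `spark` is an existing SparkSession.
    """
    # Imported inside the function so the sketch can be loaded without PySpark.
    from pyspark import StorageLevel
    from pyspark.sql.types import LongType, StructField, StructType

    # Step 1: read the CSV. Assumes the file has a header row. inferSchema is
    # left off so Spark does not scan the data an extra time to guess types.
    df = spark.read.csv(csv_path, header=True)

    # Step 2: zipWithIndex is an RDD method, so go through df.rdd and rebuild
    # the DataFrame with one extra LongType column holding the index.
    indexed_schema = StructType(
        df.schema.fields + [StructField("row_idx", LongType(), False)]
    )
    indexed_rdd = df.rdd.zipWithIndex().map(
        lambda pair: tuple(pair[0]) + (pair[1],)
    )
    indexed_df = spark.createDataFrame(indexed_rdd, indexed_schema)

    # Step 3: repartition, then mark the result for caching (assumed placement).
    # persist() is lazy: nothing is cached until the first action runs.
    result = indexed_df.repartition(num_partitions).persist(
        StorageLevel.MEMORY_AND_DISK
    )

    # Step 4: the caller passes `result` on to the downstream function.
    return result
```

The idea behind persisting at this point is that the downstream function would reuse the indexed, repartitioned data instead of re-reading the CSV, but whether that is the best place to cache is exactly what I am unsure about.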

Can you suggest the best caching strategy to make these steps run faster?

Below is the cluster configuration I have.

A few more questions:

1. I have always wondered whether a single worker would suffice for this operation.

2. What is the optimal number of partitions to use when repartitioning here?
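To make the second question concrete, here is a rough back-of-the-envelope calculation of a starting point. It is only a sketch under assumptions: the 128 MiB target partition size is a commonly quoted rule of thumb (it matches the default of `spark.sql.files.maxPartitionBytes`), the helper name `estimate_partitions` is made up, and the core count of 48 in the usage lines is a placeholder, not my actual cluster size.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3


def estimate_partitions(total_bytes, target_partition_bytes=128 * MIB, total_cores=None):
    """Estimate a partition count from data size and a target partition size.

    If total_cores is given, round up to a multiple of it so that every core
    gets the same number of tasks in each wave.
    """
    count = math.ceil(total_bytes / target_partition_bytes)
    if total_cores:
        count = math.ceil(count / total_cores) * total_cores
    return max(count, 1)


# ~100 GB input at ~128 MiB per partition.
print(estimate_partitions(100 * GIB))                  # 800
# Same input, rounded up to a multiple of a hypothetical 48 cores.
print(estimate_partitions(100 * GIB, total_cores=48))  # 816
```

Is a number in this range reasonable, or should the partition count be driven by something else, such as what the downstream function does?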