Need suggestions for a better caching strategy

vishwanath_1
New Contributor III

I have the following steps to perform:

1. Read a CSV file (very large, ~100 GB).

2. Add an index using the zipWithIndex function.

3. Repartition the DataFrame.

4. Pass the result on to another function.
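
For reference, the four steps roughly correspond to the PySpark sketch below. This is only a minimal illustration, not my exact code: the names `append_index` and `load_index_repartition`, the `row_idx` column, the `header=True` option, and the placement of `persist` with `MEMORY_AND_DISK` are all assumptions made for the example. Where and whether to persist is exactly what I am unsure about.

```python
def append_index(pair):
    """Flatten one (row, index) pair produced by zipWithIndex into a single tuple."""
    row, idx = pair
    return tuple(row) + (idx,)


def load_index_repartition(spark, path, num_partitions):
    """Hypothetical helper: read CSV, add an index column, repartition, persist."""
    # Imports are local so that append_index can be used without Spark installed.
    from pyspark import StorageLevel
    from pyspark.sql.types import LongType, StructField, StructType

    # Step 1: read the CSV (header=True is an assumption about the file).
    df = spark.read.csv(path, header=True)

    # Step 2: zipWithIndex is an RDD method, so go through df.rdd and
    # rebuild a DataFrame with one extra long column ("row_idx" is a made-up name).
    indexed_rdd = df.rdd.zipWithIndex().map(append_index)
    schema = StructType(
        df.schema.fields + [StructField("row_idx", LongType(), False)]
    )
    indexed_df = spark.createDataFrame(indexed_rdd, schema)

    # Step 3: repartition to the requested number of partitions.
    result = indexed_df.repartition(num_partitions)

    # One possible place to persist (an assumption, not a recommendation):
    # spill to disk when the data does not fit in memory.
    result.persist(StorageLevel.MEMORY_AND_DISK)

    # Step 4: the caller passes this DataFrame on to the next function.
    return result
```

It would be called as `load_index_repartition(spark, "/path/to/file.csv", 800)`, where both the path and the partition count are placeholders.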

Can you suggest the best caching strategy to make these steps run faster?

Below is my cluster configuration:

vishwanath_1_0-1705915220664.png

 

A few more questions:

1. I have always wondered whether a single worker would be sufficient for this operation.

2. What is the optimal number of partitions to use when repartitioning here?
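
To make the second question concrete, here is the kind of back-of-the-envelope estimate I have in mind. It is only a sketch: the ~128 MB target partition size is a commonly quoted rule of thumb rather than something verified for this workload, the core count of 8 is a placeholder (not my actual cluster), and `estimate_partitions` is a made-up helper name.

```python
def estimate_partitions(total_bytes, target_partition_bytes, total_cores):
    """Estimate a partition count from data size and a target partition size.

    The result is rounded up to a multiple of total_cores so that every
    core gets the same number of tasks in each wave.
    """
    if total_bytes <= 0 or target_partition_bytes <= 0 or total_cores <= 0:
        raise ValueError("all arguments must be positive")

    # Ceiling division: partitions needed to keep each one at or under the target.
    by_size = -(-total_bytes // target_partition_bytes)

    # Round up to the next multiple of the core count.
    return -(-by_size // total_cores) * total_cores


GIB = 1024 ** 3
MIB = 1024 ** 2

# ~100 GB file, 128 MB target partitions, 8 cores (placeholder value).
partitions = estimate_partitions(100 * GIB, 128 * MIB, 8)
print(partitions)  # 800
```

Is an estimate along these lines reasonable, or should the number be chosen differently for a file of this size?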