01-22-2024 01:26 AM
I need to perform the following steps:
1. Read a CSV file (very large, ~100 GB).
2. Add an index using the zipWithIndex function.
3. Repartition the DataFrame.
4. Pass the result on to another function.
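For reference, here is a minimal PySpark sketch of steps 1 to 3. It is only an illustration under some assumptions: the names `append_index`, `read_index_repartition`, and `index_col` are hypothetical, the CSV is assumed to have a header row, and the partition count is left as a parameter because the right value is exactly what is being asked.

```python
def append_index(pair):
    """Flatten one (row, index) pair produced by zipWithIndex into a single tuple."""
    row, idx = pair
    return tuple(row) + (idx,)


def read_index_repartition(spark, path, num_partitions, index_col="row_idx"):
    """Read a CSV, append a zipWithIndex-based index column, and repartition."""
    # Imported lazily so append_index can be used without Spark installed.
    from pyspark.sql.types import LongType, StructField, StructType

    # Step 1: read the CSV (header=True is an assumption about the file).
    df = spark.read.csv(path, header=True)

    # Step 2: zipWithIndex lives on the RDD API, so go through df.rdd and
    # rebuild a DataFrame with the original schema plus one extra long column.
    schema = StructType(df.schema.fields + [StructField(index_col, LongType(), False)])
    indexed_rdd = df.rdd.zipWithIndex().map(append_index)
    indexed_df = spark.createDataFrame(indexed_rdd, schema)

    # Step 3: repartition; num_partitions is supplied by the caller.
    return indexed_df.repartition(num_partitions)
```

Step 4 would then take the returned DataFrame as input, for example `process(read_index_repartition(spark, path, n))`, where `process` stands in for the downstream function.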
Can you suggest the best caching strategy to make these steps run faster?
Below is the cluster configuration I have.
A few more questions:
1. I have always wondered whether using a single worker would suffice for this operation.
2. What is the optimal number of partitions to use when repartitioning here?