Reading bulk CSV files from Spark
While trying to read a 100GB csv.gz file with Spark, the read is taking forever. What are the best options to read this file faster?
- 1992 Views
- 1 replies
- 1 kudos
Latest Reply
Part of the problem here is that .gz files are not splittable. If you have one huge 100GB .gz file, it can only be processed by a single task. Can you change your input to use a splittable compression such as .bz2? It'll work much better.
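For illustration, here is a minimal PySpark sketch (the file paths and partition count are hypothetical) showing the single-partition behavior of a .gz input, the parallel read you get from a splittable codec like .bz2, and a repartition workaround if you can't change the compression:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-large-csv").getOrCreate()

# A single 100GB csv.gz is not splittable, so Spark reads it with one task
# and the whole file lands in one partition.
df_gz = (spark.read
         .option("header", "true")
         .csv("/data/input/huge_file.csv.gz"))
print(df_gz.rdd.getNumPartitions())  # typically 1 for a single .gz file

# With a splittable codec such as .bz2, Spark can break the file into many
# input splits, so the read and downstream work run in parallel.
df_bz2 = (spark.read
          .option("header", "true")
          .csv("/data/input/huge_file.csv.bz2"))
print(df_bz2.rdd.getNumPartitions())  # many partitions across the cluster

# If changing the compression isn't possible, repartition right after the
# read so later stages are at least parallel (the initial decompression is
# still single-threaded).
df_parallel = df_gz.repartition(200)
```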