
Reading bulk CSV files from Spark

Srikanth_Gupta_
Valued Contributor

I am trying to read a 100 GB csv.gz file from Spark, and it is taking forever. What are the best options for reading this file faster?

1 REPLY

sean_owen
Honored Contributor II

Part of the problem here is that .gz files are not splittable. If you have one huge 100 GB .gz file, it can only be processed by a single task. Can you change your input to use a splittable compression format like .bz2? It'll work much better.
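
If the file cannot be regenerated at the source, one way to act on this is to recompress it once before handing it to Spark. The following is a minimal sketch, not a definitive recipe: it uses only Python's standard library, the function name `recompress_gz_to_bz2` and the chunk size are illustrative assumptions, and it assumes a one-time single-threaded conversion is acceptable (bzip2 compression is slow, so this step will take a while on 100 GB).

```python
import bz2
import gzip
import shutil


def recompress_gz_to_bz2(src_path: str, dst_path: str,
                         chunk_size: int = 16 * 1024 * 1024) -> None:
    """Rewrite a gzip file as bzip2, streaming in fixed-size chunks.

    The data is never loaded into memory as a whole, so this also
    works for files far larger than RAM. `chunk_size` (16 MiB here)
    is an arbitrary illustrative choice.
    """
    with gzip.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        # copyfileobj reads decompressed bytes from the gzip stream and
        # writes them to the bzip2 stream, chunk_size bytes at a time.
        shutil.copyfileobj(src, dst, chunk_size)
```

With hypothetical file names, `recompress_gz_to_bz2("data.csv.gz", "data.csv.bz2")` produces a file that Spark can read with `spark.read.csv("data.csv.bz2", header=True)`, and the read can then be spread across multiple tasks instead of one.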
