Reading bulk CSV files from Spark
While trying to read a 100GB csv.gz file with Spark, the read is taking forever. What are the best options to read this file faster?
- 1992 Views
- 1 replies
- 1 kudos
Latest Reply
Part of the problem here is that .gz files are not splittable. If you have one huge 100GB .gz file, it can only be processed by a single task. Can you change your input to use a splittable compression such as .bz2? It'll work much better.
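For illustration, here is a minimal PySpark sketch (the file paths and partition count are hypothetical) showing the single-partition behavior of a .gz input, the parallel read you get from a splittable codec like .bz2, and a repartition workaround if you can't change the compression:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-large-csv").getOrCreate()

# A single 100GB csv.gz is not splittable, so Spark reads it with one task
# and the whole file lands in one partition.
df_gz = (spark.read
         .option("header", "true")
         .csv("/data/input/huge_file.csv.gz"))
print(df_gz.rdd.getNumPartitions())  # typically 1 for a single .gz file

# With a splittable codec such as .bz2, Spark can break the file into many
# input splits, so the read and downstream work run in parallel.
df_bz2 = (spark.read
          .option("header", "true")
          .csv("/data/input/huge_file.csv.bz2"))
print(df_bz2.rdd.getNumPartitions())  # many partitions across the cluster

# If changing the compression isn't possible, repartition right after the
# read so later stages are at least parallel (the initial decompression is
# still single-threaded).
df_parallel = df_gz.repartition(200)
```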