Reading bulk CSV files from Spark

I'm trying to read a 100 GB csv.gz file from Spark, and the read is taking forever. What are the best options for reading this file faster?


Part of the problem here is that .gz files are not splittable. If you have one huge 100 GB .gz file, it can only be processed by a single task, so the rest of the cluster sits idle during the read. Can you change your input to use a splittable compression format like .bz2? It'll work much better.
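
If the file can't be regenerated at the source, one option is to recompress it once before handing it to Spark. Below is a minimal sketch using only the Python standard library; the function name `recompress_gz_to_bz2` and the paths are hypothetical, and it assumes the file sits somewhere a plain Python process can reach it. It streams the data in chunks, so the 100 GB file is never held in memory.

```python
import bz2
import gzip
import shutil


def recompress_gz_to_bz2(src_path, dst_path, chunk_size=16 * 1024 * 1024):
    """Stream a gzip file into a bzip2 file, chunk by chunk.

    Hypothetical helper for illustration: src_path and dst_path are
    local filesystem paths, and chunk_size is the copy buffer in bytes.
    """
    # gzip.open / bz2.open in binary mode give file-like objects that
    # decompress on read and compress on write, respectively.
    with gzip.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        # copyfileobj reads at most chunk_size bytes at a time, so memory
        # use stays bounded regardless of the input size.
        shutil.copyfileobj(src, dst, chunk_size)
```

Keep in mind that this conversion is itself single-threaded and will take a while on a file this size, so it only pays off if the data is read more than once. After that, something like `spark.read.csv("/path/to/data.csv.bz2", header=True)` should be able to split the input across multiple tasks.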