Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

I am reading a 130 GB CSV file with multiLine set to true, and it takes 4 hours just to read

vishwanath_1
New Contributor III

Reading the 130 GB file without multiLine set to true takes 6 minutes.

My file has records that span multiple lines.

How can I speed up the read time here?

 

I am using the command below:

# PySpark: the parentheses let the chained calls span several lines
InputDF = (
    spark.read.option("delimiter", "^")
    .option("header", False)
    .option("encoding", "UTF-8")
    .option("multiLine", "true")
    .option("quote", "\"")
    .option("escape", "\"")
    .csv(inputFileName)
)
5 REPLIES

Kaniz_Fatma
Community Manager

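Hi @vishwanath_1, reading large CSV files with multiline records in Databricks can be time-consuming because parsing multiline records is more complex: by default Spark does not split a file read with multiLine across tasks, so a single task has to parse the whole file.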

 

Use Explicit Schema: One way to speed up reading a CSV into a DataFrame is by using an explicit sche.... This can help Spark optimize the reading process.
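As a minimal sketch of the explicit-schema idea, assuming every column can be read as a string and the column count is known in advance: the helper names `build_string_schema_ddl` and `read_multiline_csv` are made up for illustration, and the column count of 3 in the usage note is a placeholder. The read options are the ones from the question. How much this helps for this particular file is not confirmed anywhere in this thread.

```python
def build_string_schema_ddl(num_columns: int) -> str:
    """Build a DDL schema string such as '_c0 STRING, _c1 STRING'."""
    if num_columns < 1:
        raise ValueError("num_columns must be at least 1")
    return ", ".join(f"_c{i} STRING" for i in range(num_columns))


def read_multiline_csv(spark, path: str, schema_ddl: str):
    """Read a '^'-delimited multi-line CSV with an explicit schema.

    `spark` is an existing SparkSession (predefined in a Databricks notebook).
    """
    return (
        spark.read
        .schema(schema_ddl)            # explicit schema, passed as a DDL string
        .option("delimiter", "^")
        .option("header", False)
        .option("encoding", "UTF-8")
        .option("multiLine", True)
        .option("quote", '"')
        .option("escape", '"')
        .csv(path)
    )
```

In a notebook this might be called as `read_multiline_csv(spark, inputFileName, build_string_schema_ddl(3))`, with 3 replaced by the real column count.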

Ensure Proper Quoting: By default, when you use the multiLine option, Spark assumes that you have en.... If your data doesn't follow this, it might lead to incorrect reading and slow performance.

 

Consider Data Partitioning: If your data is too large, consider partitioning it. This allows Spark to read and process data in parallel, which can significantly improve performance. However, this might not be applicable if your data needs to be read as a whole due to multiline records.
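One way the partitioning idea could be made compatible with multiline records is to pre-split the file only at real record boundaries, so that no quoted field is cut in half. The sketch below is an illustration, not something tested on a 130 GB file: `split_on_record_boundaries` is a hypothetical helper, and it assumes the quote character is `"` and that embedded quotes are written as doubled quotes (matching the quote and escape options in the question).

```python
from typing import Iterable, Iterator, List


def split_on_record_boundaries(
    lines: Iterable[str], records_per_chunk: int, quote: str = '"'
) -> Iterator[List[str]]:
    """Group physical lines into chunks that never cut a quoted record in half.

    A newline ends a record only when we are outside a quoted field, i.e.
    when the quote characters seen so far in the record add up to an even
    number. Doubled (escaped) quotes add two, so they do not change parity.
    """
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunk: List[str] = []
    records_in_chunk = 0
    inside_quotes = False
    for line in lines:
        chunk.append(line)
        # An odd number of quotes on this line flips the in-quote state.
        if line.count(quote) % 2 == 1:
            inside_quotes = not inside_quotes
        if not inside_quotes:
            # This line closed a complete record.
            records_in_chunk += 1
            if records_in_chunk == records_per_chunk:
                yield chunk
                chunk, records_in_chunk = [], 0
    if chunk:
        # Remaining lines (including an unterminated record, if any).
        yield chunk
```

Each chunk could then be written to its own file and the resulting directory read with the same multiLine options, which gives Spark several files to process in parallel. The split itself is a sequential pass over the data, so whether it pays off depends on how often the file is read.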

 

Custom Parser: If none of the above solutions work, you might need to consider implementing a custom....

 

However, please note that when using the multiline option, the charset or encoding option might be i....

Lakshay
Esteemed Contributor

Hi @vishwanath_1, can you try setting the config below and check whether it resolves the issue?

set spark.databricks.sql.csv.edgeParserSplittable=true;
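If the read is done from a Python notebook through the DataFrame API, as in the question, the same setting could presumably be applied on the session before the read. A minimal sketch, where `enable_splittable_csv_parser` is a made-up helper name and `spark` is the existing SparkSession:

```python
CSV_SPLITTABLE_CONF = "spark.databricks.sql.csv.edgeParserSplittable"


def enable_splittable_csv_parser(spark) -> None:
    """Set the config from this reply on the current session.

    Call this before spark.read...csv(...) so the read picks it up.
    """
    spark.conf.set(CSV_SPLITTABLE_CONF, "true")
```

Usage would be `enable_splittable_csv_parser(spark)` followed by the usual `spark.read` call.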

vishwanath_1
New Contributor III

After setting spark.databricks.sql.csv.edgeParserSplittable=true, the read now takes about 30 minutes less than the usual 4 hours.

Is there any other setting that can make it faster?

Lakshay
Esteemed Contributor

You can also try enabling Photon, which can help speed up the read operation.

Kaniz_Fatma
Community Manager

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
 
