Data Engineering

Reading a 130 GB CSV file with multiLine=true takes 4 hours just to read

vishwanath_1
New Contributor III

Reading the 130 GB file without multiLine=true takes 6 minutes.

My file contains multi-line records.

How can I speed up the read here?


I am using the command below:

InputDF = (spark.read
    .option("delimiter", "^")      # fields are separated by '^'
    .option("header", "false")     # the file has no header row
    .option("encoding", "UTF-8")
    .option("multiLine", "true")   # records can span multiple lines
    .option("quote", "\"")
    .option("escape", "\"")
    .csv(inputFileName))
4 REPLIES

Lakshay
Databricks Employee

Hi @vishwanath_1, can you try setting the config below and see if it resolves the issue?

set spark.databricks.sql.csv.edgeParserSplittable=true;
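
For reference, a minimal sketch of applying the same setting from a Python notebook cell before re-running the multiLine read from the question unchanged (the SQL SET statement above is equivalent):

# Enable the splittable edge CSV parser for the current session.
# The config name is Databricks-specific and is taken from this reply;
# spark.conf.set applies any session-level Spark SQL configuration.
spark.conf.set("spark.databricks.sql.csv.edgeParserSplittable", "true")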

vishwanath_1
New Contributor III

By setting spark.databricks.sql.csv.edgeParserSplittable=true, the read now takes about 30 minutes less than the usual 4 hours.

Is there any other setting that can be used to make it faster?

Lakshay
Databricks Employee

You can also try using Photon. That can help speed up the read operation.

subash_07
New Contributor II

Hi @Lakshay, where did you find this config? Can you share a link?

