Data Engineering
Reading a 130 GB CSV file with multiLine set to true is taking 4 hours

vishwanath_1
New Contributor III

Reading the 130 GB file without multiLine set to true takes 6 minutes.

My file contains multi-line records.

How can I speed up the read here?

 

I am using the command below:

InputDF = (
    spark.read
        .option("delimiter", "^")
        .option("header", "false")
        .option("encoding", "UTF-8")
        .option("multiLine", "true")
        .option("quote", "\"")
        .option("escape", "\"")
        .csv(inputFileName)
)
4 REPLIES

Lakshay
Databricks Employee

Hi @vishwanath_1, can you try setting the config below and see whether it resolves the issue?

set spark.databricks.sql.csv.edgeParserSplittable=true;
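If it helps, here is a minimal sketch of applying the same setting from Python instead of SQL, assuming the spark SparkSession that Databricks notebooks provide:

# Session-level config suggested above; set it before re-running the
# multi-line CSV read. (Config name taken from this thread; it is assumed
# to make multi-line CSV parsing splittable across tasks.)
spark.conf.set("spark.databricks.sql.csv.edgeParserSplittable", "true")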

vishwanath_1
New Contributor III

After setting spark.databricks.sql.csv.edgeParserSplittable=true;

the read now takes about 30 minutes less than the usual 4 hours.

Is there any other setting that could make it faster?

Lakshay
Databricks Employee

You can also try using Photon, which can help speed up the read operation.
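For context, Photon is selected at the cluster level (as the cluster's runtime engine), not in the read code itself. A minimal sketch using the Databricks SDK for Python; the cluster name, node type, runtime version, and sizing are placeholders to adjust for your workspace:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.compute import RuntimeEngine

w = WorkspaceClient()  # reads workspace credentials from the environment/profile

# Hypothetical Photon-enabled cluster for the CSV ingest job.
cluster = w.clusters.create(
    cluster_name="csv-ingest-photon",
    spark_version="14.3.x-scala2.12",
    node_type_id="i3.xlarge",
    num_workers=8,
    runtime_engine=RuntimeEngine.PHOTON,  # enables the Photon engine
).result()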

subash_07
New Contributor II

Hi @Lakshay, where did you find this config? Can you share a link?

 
