Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Small JSON files issue: taking 2 hours to read 3000 files

Subhasis
New Contributor II

Hello, I am trying to read 3000 JSON files, each containing only one record. It is taking 2 hours to read all the files. How can I perform this operation faster? Please suggest.

2 REPLIES

Subhasis
New Contributor II

This is the code:

df1 = spark.read.format("json").options(inferSchema="true", multiLine="true").load(file1)

 

Hi @Subhasis 

You can start off by specifying the schema upfront instead of using the inferSchema option. But to be honest, this is the classic "small file problem". The best approach you can take is to compact those small files into larger ones.
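For example, a minimal sketch of reading with an explicit schema; the field names are placeholders you would swap for the real structure of your one-record JSON files, and it assumes the same spark session and file1 path as in your snippet:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Placeholder schema -- replace these fields with the actual structure of your JSON records.
schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("event_ts", StringType(), True),
])

# With an explicit schema, Spark skips the extra pass over all 3000 files that inferSchema needs.
df1 = (
    spark.read
    .schema(schema)
    .option("multiLine", "true")
    .json(file1)
)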
Or you can read them all once and save them as Parquet files with a proper partition size.
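A rough sketch of that one-time compaction, reusing the schema above; the paths and the target file count here are just illustrative:

# Read every small JSON file in the source directory in a single job,
# then rewrite the data as a handful of larger Parquet files.
df = (
    spark.read
    .schema(schema)
    .option("multiLine", "true")
    .json("/mnt/raw/events/")        # point at the directory (or a glob) so all files go into one read
)

(
    df.coalesce(8)                   # pick a small number of output files, aiming for large, even sizes
    .write
    .mode("overwrite")
    .parquet("/mnt/curated/events/")
)

# Later reads hit a handful of Parquet files instead of 3000 tiny JSON files.
df_fast = spark.read.parquet("/mnt/curated/events/")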
Take a look at the threads below for inspiration:

Big data [Spark] and its small files problem – Garren's [Big] Data Blog

 

apache spark - Reading Millions of Small JSON Files from S3 Bucket in PySpark Very Slow - Stack Over...
