Data Engineering

RPC disassociation error (container memory threshold exceeded) and garbage collector error when reading a 23 GB multiline JSON file.

rdobbss
New Contributor II

I am reading a 23 GB multiline JSON file, flattening it with a UDF, and writing the DataFrame out as Parquet using PySpark.

The cluster I am using is 3 nodes (8 cores, 64 GB memory), with a limit to scale up to 8 nodes.

I am able to process a 7 GB file with no issue; it takes around 25 minutes.

However, when reading the 23 GB file, it fails with an RPC disassociation error and a garbage collector error.

code:

df = spark.read.load(File, format='json', multiLine=True, encoding='UTF-8').repartition(128)

df = parse_json(df)  # this function flattens the file; it is written in PySpark to utilize parallelism

df.write.json(outfilepath, mode='overwrite')

Reading the file takes around 5 minutes.

The 2nd line takes about 20 seconds (due to lazy evaluation).

Writing takes about 25 minutes and fails after 4 retries.

What I find is that even though I do the repartition, the file is not split into those partitions, and all the load during the write is handled by a single core in the 1st job.

With the 7 GB file as well, I find that the 1st job of the write takes 5-6 minutes to process the file, and then a 2nd job writes the file to the target in parallel using all the partitions.

My assumption is that since a single core only has 8 GB of memory, it is not even able to read the 23 GB file and just gives up.
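
As a quick sanity check (just a sketch, reusing the same File variable as above): a multiline JSON file is not splittable, so the initial scan typically lands in a single partition, and the repartition(128) only takes effect after that scan.

raw = spark.read.load(File, format='json', multiLine=True, encoding='UTF-8')
print(raw.rdd.getNumPartitions())  # expect 1 (or very few) for a single 23 GB file
df = raw.repartition(128)
print(df.rdd.getNumPartitions())   # 128 now, but the scan itself still ran as one task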

The 23 GB file has 600 records, each representing a valid JSON object.
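
One possible workaround (just a sketch, assuming those 600 records can be pre-split upstream into a directory of smaller files, one JSON object per file) is to point the reader at that directory, so each file becomes its own scan task and the read parallelizes:

split_dir = '/mnt/raw/json_split/'  # hypothetical path to the pre-split files
df = spark.read.load(split_dir, format='json', multiLine=True, encoding='UTF-8').repartition(128)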

2 REPLIES

User16753725469
Contributor II

Which type of workers are you using? Please try a memory-optimized instance and see if that helps.
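
For example, a rough sketch of a Clusters API payload with memory-optimized workers (the node type and runtime version below are placeholders, not values from this thread):

cluster_spec = {
    'cluster_name': 'json-flatten-memopt',
    'spark_version': '10.4.x-scala2.12',  # example runtime
    'node_type_id': 'r5.4xlarge',  # memory-optimized worker type (AWS example)
    'autoscale': {'min_workers': 3, 'max_workers': 8},
}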

Vidula
Honored Contributor

Hi @Ravi Dobariya,

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
