
RPC disassociated error (container threshold exceeded) and garbage collector error when reading a 23 GB multiline JSON file

rdobbss
New Contributor II

I am reading a 23 GB multiline JSON file, flattening it with a UDF, and writing the DataFrame as Parquet using PySpark.

The cluster I am using has 3 nodes (8 cores, 64 GB memory) and can scale up to 8 nodes.

I can process a 7 GB file with no issues; it takes around 25 minutes.

However, reading the 23 GB file fails with an RPC disassociation error and a garbage collector error.

Code:

df = spark.read.load(File, format='json', multiLine=True, encoding='UTF-8').repartition(128)

df = parse_json(df)  # This function flattens the file. It is written in PySpark to utilize parallelism.

df.write.json(outfilepath, mode='overwrite')  # write returns None, so the result is not assigned back to df

Reading the file takes around 5 minutes.

The second line takes about 20 seconds (due to lazy evaluation).

Writing takes about 25 minutes and fails after 4 attempts.

What I find is that even though I repartition, the file is not split into those partitions, and in the first job of the write the entire load is handled by a single core.

With the 7 GB file too, the first job of the write takes 5-6 minutes to process the file, and then the second job writes the file to the target in parallel using all the partitions.

My assumption is that, since a single core has only 8 GB of memory, it cannot even read the 23 GB file and just gives up.

The 23 GB file has 600 records, each a valid JSON object.
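
One possible way around the single-core read, sketched here under assumptions rather than as a confirmed fix: rewrite the file as JSON Lines (one record per line) before handing it to Spark. The sketch below is plain Python, and it assumes the file is a sequence of top-level JSON objects separated only by whitespace, not one enclosing array; the real layout of the 23 GB file is not stated above. The function name `multiline_json_to_jsonl` and the `chunk_size` parameter are hypothetical.

```python
import io
import json


def multiline_json_to_jsonl(src, dst, chunk_size=64 * 1024 * 1024):
    """Copy top-level JSON objects from `src` to `dst`, one object per line.

    Assumption: the input is a sequence of JSON objects separated only by
    whitespace (not a single enclosing array). `src` and `dst` are text-mode
    file-like objects. Returns the number of records written.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    count = 0
    eof = False
    while not eof:
        chunk = src.read(chunk_size)
        eof = chunk == ""  # an empty read means the input is exhausted
        buffer += chunk
        while True:
            buffer = buffer.lstrip()  # drop whitespace between records
            if not buffer:
                break
            try:
                record, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                if eof:
                    raise  # truncated or malformed trailing data
                break  # record is incomplete so far; read more input
            # json.dumps emits no raw newlines, so each record stays on one line
            dst.write(json.dumps(record) + "\n")
            buffer = buffer[end:]
            count += 1
    return count
```

Only the current chunk plus any unfinished record is held in memory, so the peak depends on the largest single record rather than on the whole file. An incomplete record is re-parsed each time a new chunk arrives, which is why the default chunk is large. If the records end up one per line, reading the result with spark.read.json without the multiLine option should let Spark split the input across tasks, because line-delimited JSON is splittable while a multiLine file is read as a whole.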

2 REPLIES

User16753725469
Contributor II

Which type of workers are you using? Could you please try a memory-optimized instance?

Vidula
Honored Contributor

Hi @Ravi Dobariya​ 

Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
