I am reading a 23 GB multiline JSON file, flattening it using a UDF, and writing the DataFrame as Parquet using PySpark.
The cluster I am using has 3 nodes (8 cores, 64 GB memory each), with autoscaling allowed up to 8 nodes.
I am able to process a 7 GB file with no issue; it takes around 25 minutes.
However, when reading the 23 GB file, it fails with RPC disassociation and garbage collector errors.
code:
df = spark.read.load(File, format='json', multiLine=True, encoding='UTF-8').repartition(128)
df = parse_json(df)  # this function flattens the file; it is written in PySpark to utilize parallelism
df.write.json(outfilepath, mode='overwrite')
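As a sanity check, this is roughly how I would verify the partition counts right after the read and after the repartition (the path here is just a placeholder, not my real file):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sample_path = '/tmp/sample_multiline.json'  # placeholder path

# multiLine=True parses each input file as a whole, so the read itself cannot split one file across tasks
raw = spark.read.load(sample_path, format='json', multiLine=True, encoding='UTF-8')
print(raw.rdd.getNumPartitions())            # partitions produced by the read

repartitioned = raw.repartition(128)
print(repartitioned.rdd.getNumPartitions())  # 128, but only after shuffling the already-parsed rows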
Reading the file takes around 5 minutes.
The 2nd line takes about 20 seconds (due to lazy evaluation).
Writing takes about 25 minutes and then fails after 4 attempts.
What I find is that even though I do the repartition, the file is not actually split into those partitions, and all of the load in the first job of the write is handled by a single core.
With the 7 GB file too, I see that the first job of the write takes 5-6 minutes to process the file, and then a second job writes the output to the target in parallel using all the partitions.
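One way I can think of to confirm this skew (a sketch, assuming df is the flattened DataFrame coming out of parse_json) is to count rows per partition before the write:

from pyspark.sql.functions import spark_partition_id

# Count rows per partition; if most rows sit in a few partitions, repartition(128) is not spreading the work
(df.withColumn('pid', spark_partition_id())
   .groupBy('pid')
   .count()
   .orderBy('pid')
   .show(200, truncate=False))

Running this of course triggers the read and flatten again, so it only helps as a diagnostic.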
My assumption is that since a single core only has 8 GB of memory, it is not even able to read the 23 GB file and just gives up.
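The rough arithmetic behind that assumption (a sketch; I have not tuned spark.executor.memory or spark.executor.cores explicitly, so below is just how I would read the actual values off the session):

conf = spark.sparkContext.getConf()
print(conf.get('spark.executor.memory', 'not set'))
print(conf.get('spark.executor.cores', 'not set'))

# Rough per-core share on this cluster: 64 GB node / 8 cores
print(64 / 8, 'GB per core')  # ~8 GB, versus a 23 GB file that a single task has to parse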
The 23 GB file has 600 records, each representing a valid JSON object.