Spark Driver Crash Writing Large Text

oriole
New Contributor III

I'm working with a large text variable, converting it into single-line JSON that Spark can process beautifully. I'm using a single-node 256 GB, 32-core Standard_E32d_v4 "cluster", which should be plenty of memory for this dataset (I haven't seen cluster memory usage exceed 130 GB). However, I keep getting crashes: "The spark driver has stopped unexpectedly and is restarting..." with no further info on the failure. This happens when writing an intermediate step to a text file using:

dbutils.fs.put('path/filename.txt',str_variable,True)

I've tried writing it to /tmp/ as well as an Azure blob, same result.

I started going down a GC-tuning road, but I haven't figured out which cluster config increases the max heap size, which is currently 30 GB.
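For reference, on Databricks the driver heap can usually be raised through the cluster's Spark config (cluster edit page, Advanced options, Spark tab). A minimal sketch, assuming the node actually has headroom for the value you set:

```
spark.driver.memory 200g
```

Note that Databricks picks a default based on the instance size and may cap or override this setting on some runtimes, so treat it as a starting point to experiment with rather than a guaranteed fix.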

Any insight on what could be causing this? I'm not sure how else to work around this limitation, since I've already broken the pipeline down into a write-intermediate-step, garbage-collect/reset-memory-state, continue-from-intermediate flow.

1 ACCEPTED SOLUTION

5 REPLIES

pvignesh92
Honored Contributor

@David Toft​ Hi, the current implementation of dbutils.fs is single-threaded: it performs the initial listing on the driver and then launches a Spark job for the per-file operations. So I'd guess the put operation runs on a single core and can eventually break.

Did you try storing the variable as a text DataFrame and writing it to the path with the DataFrame writer?

oriole
New Contributor III

So is single-threaded dbutils.fs significantly different from standard Python

f = open('path/newfile.txt','w')
f.write(str_variable)
f.close()

This actually works just fine and took 15 s with only a marginal increase in memory usage; the text file ends up being 8.3 GB. It's surprising dbutils.fs isn't comparable.
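If peak memory ever becomes a concern with the plain-Python approach, the same write can be done in fixed-size slices so each write call only handles part of the string. A minimal sketch — the chunk size, function name, and path are illustrative, not anything from the thread:

```python
# Write a large in-memory string to disk in fixed-size slices so no
# single write call has to handle the whole 8+ GB string at once.
CHUNK = 64 * 1024 * 1024  # 64 MB per write call (illustrative value)

def write_chunked(path, text, chunk=CHUNK):
    with open(path, 'w') as f:
        for i in range(0, len(text), chunk):
            f.write(text[i:i + chunk])
```

Note that slicing a str still copies each slice, so this bounds per-call overhead rather than eliminating copies entirely.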

I did try using a text df. The problem is the string ends up in a single row: it writes to a single row and reads back into a df as a single row (even with the newlines I've inserted to create single-line JSON). The reason I've been using put then spark.read.json is to convert the text to a single-line JSON df. I'd be happy going straight from large str >> single-line JSON df without the write-text, read-JSON round trip, but I don't know how to do that directly.
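One possible way to skip the intermediate file entirely, assuming the variable already holds one JSON object per line and a live `spark` session is available (as in a Databricks notebook) — a sketch against made-up data, not tested on the original variable:

```python
import json

# Hypothetical stand-in for the real str_variable: one JSON object per line.
str_variable = '\n'.join(json.dumps({"id": i, "val": i * i}) for i in range(3))

# Split the big string into one record per element (pure Python).
lines = [ln for ln in str_variable.split('\n') if ln]

# With a live SparkSession this can be read as JSON with no file involved,
# since spark.read.json also accepts an RDD of JSON strings, mirroring what
# spark.read.json('filepath') does with the written file:
#   df = spark.read.json(spark.sparkContext.parallelize(lines))
```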

pvignesh92
Honored Contributor

Hi @David Toft​ Will df.write.json help you here to store your JSON variable to the path? That could avoid the issues you are facing with df.write.text.

oriole
New Contributor III

@Vigneshraja Palaniraj​ I don't believe single_line_text_df.write.json works; I've tried all the str >> single_line_df >> file combinations. If there's no way around the dbutils.fs limitation (it's unclear why it performs so poorly relative to a Python file object's write()), then I think the only other option is going str >> single_line_json_df in a way that processes '\n' the same way spark.read.json('filepath') does.
