Data Engineering

Spark Driver Crash Writing Large Text

oriole
New Contributor III

I'm working with a large text variable, transforming it into single-line JSON that Spark can process beautifully. I'm using a single-node 256 GB, 32-core Standard_E32d_v4 "cluster", which should be plenty of memory for this dataset (I haven't seen cluster memory usage exceed 130 GB). However, I keep getting crashes: "The spark driver has stopped unexpectedly and is restarting..." with no further info on the failure. This happens when writing an intermediate step to a text file using:

dbutils.fs.put('path/filename.txt', str_variable, True)

I've tried writing it to /tmp/ as well as to an Azure blob, with the same result.

I started going down the GC tuning road, but I haven't figured out the cluster config to increase the max heap size, which is currently 30 GB.
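The usual knob for the driver heap is spark.driver.memory, set in the cluster's Spark config before startup (e.g. spark.driver.memory 200g, where the value is just an example); Databricks derives a default from the node type, so treat any override as something to test rather than a guaranteed fix. A quick way to confirm what the driver JVM actually got (a sketch that relies on PySpark's internal py4j handle, so diagnostic only):

# Report the driver JVM's max heap from the notebook.
# spark.sparkContext._jvm is an internal PySpark handle; use it only as a diagnostic.
heap_bytes = spark.sparkContext._jvm.java.lang.Runtime.getRuntime().maxMemory()
print(f"driver max heap: {heap_bytes / 1024**3:.1f} GiB")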

Any insight on what could be causing this? I'm not sure how else to work around this limitation, since I've already broken the pipeline down into a write-intermediate-step, garbage-collect/reset-memory-state, continue-from-intermediate flow.

1 ACCEPTED SOLUTION (oriole's reply below)

5 REPLIES

pvignesh92
Honored Contributor

@David Toft Hi, the current implementation of dbutils.fs is single-threaded: it performs the initial listing on the driver and then launches a Spark job to perform the per-file operations. So I guess the put operation is running on a single core and could eventually break.

Did you try storing the variable in a text DataFrame and writing it to a path with the DataFrame writer?
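For example, something along these lines (a sketch of that idea; str_variable is the name used in the question, while the column name and output path are placeholders):

# Wrap the string in a single-row DataFrame and let the DataFrame writer
# handle the I/O instead of dbutils.fs.put. Note this produces a directory
# of part files rather than a single named file.
df = spark.createDataFrame([(str_variable,)], ["value"])
df.coalesce(1).write.mode("overwrite").text("dbfs:/path/intermediate_txt")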

oriole
New Contributor III

So is single-threaded dbutils.fs significantly different from standard Python file I/O?

with open('path/newfile.txt', 'w') as f:
    f.write(str_variable)

This actually works just fine and took 15 s, with only a marginal increase in memory usage. The text file ends up being 8.3 GB. Surprising that dbutils.fs isn't comparable.

I did try using a text DataFrame. The problem is that the string ends up in a single row, writes out as a single row, and reads back into a DataFrame as a single row (even with the newlines I've inserted to create single-line JSON). The reason I've been using put followed by spark.read.json is to convert the text into a single-line JSON DataFrame. I'd be happy going straight from the large str to a single-line JSON DataFrame without the write-text, read-JSON round trip, but I don't know how to do that directly.
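One possible way to do that directly (a sketch, not something from the thread; it assumes the inserted '\n' characters delimit complete JSON objects, and str_variable is the name used above):

# Split the newline-delimited JSON string on the driver and hand the lines
# to spark.read.json as an RDD, skipping the intermediate text file.
json_lines = str_variable.splitlines()
df = spark.read.json(spark.sparkContext.parallelize(json_lines))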

pvignesh92
Honored Contributor

Hi @David Toft, would df.write.json help you here to store your JSON variable to the path? That could avoid the issues you are facing with df.write.text.

oriole
New Contributor III

@Vigneshraja Palaniraj I don't believe single_line_text_df.write.json works. I've tried all the str >> single_line_df >> file combinations. If there's no way around the dbutils.fs limitation (it's unclear why it performs so poorly relative to a Python file object's write()), then I think the only other option is going str >> single_line_json_df in a way that processes '\n' the same way spark.read.json('filepath') does.

pvignesh92
Honored Contributor

@David Toft Hi, could you paste a sample of the content you are trying to store in the variable, and the format you want it in at the output path? Maybe I can recreate this and see if something works for me.
