03-19-2023 12:35 PM
I'm working with a large text variable that I'm reshaping into single-line JSON so Spark can process it cleanly. I'm using a single-node 256 GB, 32-core Standard_E32d_v4 "cluster", which should be plenty of memory for this dataset (I haven't seen cluster memory usage exceed 130 GB). However, I keep getting crashes: "The spark driver has stopped unexpectedly and is restarting..." with no further information about the failure. It happens when writing an intermediate step to a text file using:
dbutils.fs.put('path/filename.txt',str_variable,True)
I've tried writing it to /tmp/ as well as an Azure blob, same result.
I started going down the GC tuning road but haven't figured out the cluster config to increase the max heap size, which is currently 30 GB.
Any insight into what could be causing this? I'm not sure how else to work around this limitation, since I've already broken the pipeline into a write-intermediate-step, garbage-collect/reset-memory, continue-from-intermediate flow.
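For context, this is the kind of cluster-level Spark config I was planning to experiment with next (very much a guess on my part; I'm assuming Databricks honors these settings in the cluster's Spark config on a single-node cluster, and the values are only illustrative):
spark.driver.memory 200g
spark.driver.maxResultSize 64g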
03-20-2023 08:46 AM
@David Toft Hi, the current implementation of dbutils.fs is single-threaded: it performs the initial listing on the driver and then launches a Spark job for the per-file operations. So I suspect the put operation is running on a single core and could eventually break.
Have you tried storing the variable in a text DataFrame and writing it to the path with the DataFrame writer?
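Something like this rough sketch is what I had in mind (str_variable and the output path are just placeholders):
# Wrap the large string in a one-column DataFrame and let the DataFrame writer handle the I/O
text_df = spark.createDataFrame([(str_variable,)], ["value"])
text_df.write.mode("overwrite").text("path/intermediate_txt")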
03-21-2023 05:27 AM
So is single-threaded dbutils.fs significantly different from a standard Python file write like this?
# Plain Python write of the same large string to a local file
with open('path/newfile.txt', 'w') as f:
    f.write(str_variable)
This actually works just fine: it took about 15 s with only a marginal increase in memory usage, and the text file ends up being 8.3 GB. I'm surprised dbutils.fs isn't comparable.
I did try using a text DataFrame. The problem is that the string sits in a single row, writes as a single row, and reads back into a DataFrame as a single row (even with the newlines I've inserted to create single-line JSON). The reason I've been using put and then spark.read.json is to convert the text into a single-line JSON DataFrame. I'd be happy to go straight from the large string to a single-line JSON DataFrame without the write-text/read-JSON round trip, but I don't know how to do that directly.
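The kind of thing I'm hoping is possible, roughly (untested sketch on my end; it assumes spark.read.json accepts an RDD of JSON strings and that str_variable already has one JSON object per line):
# Split the in-memory string on the newlines I've already inserted,
# then have Spark parse each line as a JSON record without a disk round trip
lines = [line for line in str_variable.split('\n') if line.strip()]
json_df = spark.read.json(spark.sparkContext.parallelize(lines))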
03-21-2023 06:06 AM
Hi @David Toft, would df.write.json help you here to store your JSON variable to the path? That could avoid the issues you are facing with df.write.text.
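Roughly what I mean, as a sketch (parsed_df is a placeholder for the variable once it has been parsed into a DataFrame):
# write.json emits one JSON object per line, which spark.read.json can load back directly
parsed_df.write.mode("overwrite").json("path/intermediate_json")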
03-21-2023 06:55 AM
@Vigneshraja Palaniraj I don't believe single_line_text_df.write.json works; I've tried all the str >> single_line_df >> file combinations. If there's no way around the dbutils.fs limitation (it's unclear why it performs so poorly relative to a Python file object's write()), then I think the only other option is going str >> single_line_json_df in a way that handles '\n' the same way spark.read.json('filepath') does.
03-22-2023 08:14 AM
@David Toft Hi, could you paste a sample of the content you are trying to store in the variable, and the format you want in the output path? Maybe I can recreate this and see if something works for me.