Data Engineering

Spark Driver Crash Writing Large Text

oriole
New Contributor III

I'm working with a large text variable, transforming it into single-line JSON that Spark can process beautifully. I'm using a single-node 256 GB, 32-core Standard_E32d_v4 "cluster", which should be plenty of memory for this dataset (I haven't seen cluster memory usage exceed 130 GB). However, I keep getting crashes: "The spark driver has stopped unexpectedly and is restarting..." with no further info on the failure. This happens when writing an intermediate step to a text file using:

dbutils.fs.put('path/filename.txt',str_variable,True)

I've tried writing it to /tmp/ as well as an Azure blob, same result.

I started going down the GC-tuning road, but I haven't figured out the cluster config to increase the max heap size, which is currently 30 GB.

Any insight on what could be causing this? I'm not sure how else to work around this limitation, since I've already broken the pipeline down into a write-intermediate-step, garbage-collect/reset-memory-state, continue-from-intermediate flow.
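For the heap-size question, here is a minimal sketch of checking what the driver JVM actually received; spark is the session Databricks provides in a notebook, and _jvm is an internal PySpark handle, so treat this as illustrative rather than an official API:

# Show the configured driver memory (if explicitly set) and the JVM's actual max heap
print(spark.conf.get("spark.driver.memory", "not explicitly set"))
heap_bytes = spark.sparkContext._jvm.java.lang.Runtime.getRuntime().maxMemory()
print(f"driver max heap ~ {heap_bytes / 1024**3:.1f} GiB")

If the heap does need to grow, the usual Spark knob is spark.driver.memory in the cluster's Spark config, though on Databricks the effective value is largely derived from the chosen node type.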


5 REPLIES

pvignesh92
Honored Contributor

@David Toft Hi, the current implementation of dbutils.fs is single-threaded: it performs the initial listing on the driver and then launches a Spark job to perform the per-file operations. So I guess the put operation is running on a single core and could eventually break.

Did you try storing the variable as a text DataFrame and writing it to a path using the DataFrame writer?
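A minimal sketch of that suggestion, assuming str_variable holds the text and using an illustrative output path:

# Wrap the whole string in a one-row DataFrame and let the DataFrame writer produce the file
text_df = spark.createDataFrame([(str_variable,)], ["value"])
text_df.write.mode("overwrite").text("dbfs:/tmp/intermediate_txt")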

oriole
New Contributor III

So is single-threaded dbutils.fs significantly different from a standard Python file write?

f = open('path/newfile.txt','w')
f.write(str_variable)
f.close()

This actually works just fine and took 15 seconds, with a marginal increase in memory usage; the text file ends up being 8.3 GB. It's surprising that dbutils.fs isn't comparable.

I did try using a text DataFrame. The problem is that the string sits in a single row, writes out as a single row, and reads back into a DataFrame as a single row (even with the newlines I've inserted to create single-line JSON). The reason I've been using put and then spark.read.json is to convert the text to a single-line JSON DataFrame. I'd be happy going straight from the large str >> single-line JSON DataFrame without the write-text, read-JSON round trip, but I don't know how to do that directly.

pvignesh92
Honored Contributor

Hi @David Toft, would df.write.json help here to store your JSON variable to the path? That could avoid the issues you are facing with df.write.text.

oriole
New Contributor III

@Vigneshraja Palaniraj I don't believe single_line_text_df.write.json works; I've tried all the str >> single_line_df >> file combinations. If there's no way around the dbutils.fs limitation (it's unclear why it performs so poorly relative to a Python file object's write()), then I think the only other option is going str >> single_line_json_df in a way that processes '\n' the same way spark.read.json('filepath') does.
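One possible sketch of that direct str >> single-line JSON DataFrame route, assuming str_variable holds one JSON object per line; PySpark's spark.read.json also accepts an RDD of JSON strings, so the text never has to touch a file:

# Split the big string into one JSON document per element, distribute it,
# and let the JSON reader infer the schema -- no intermediate write/read needed
json_rdd = spark.sparkContext.parallelize(str_variable.split("\n"))
json_df = spark.read.json(json_rdd)
json_df.printSchema()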

pvignesh92
Honored Contributor

@David Toft Hi, could you paste a sample of the content you are trying to store in the variable, and say what format you want it in at the output path? Maybe I can recreate this and see if something works for me.
