I have a regularly scheduled job that runs a PySpark Notebook that GETs semi-structured JSON data from an external API, loads that data into dataframes, and saves those dataframes to delta tables in Databricks.
I have the schema for the JSON defined in my Notebook, but because the API data is semi-structured, I have to define some of the fields as strings (StringType), since those fields can hold multiple data types in the source data (struct, array, int, string, etc.). A truncated example of the schema definition is below; the "data" field is the one to focus on:
...
StructField("fields", ArrayType(StructType([
StructField("id", StringType(), True),
StructField("name", StringType(), True),
StructField("type", StringType(), True),
StructField("value", StructType([
StructField("data", StringType(), True),
StructField("type", StringType(), True)
...
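For context, the load step is roughly this shape (a simplified sketch, not my exact code; the URL, table name, and trimmed-down schema are placeholders):

import requests
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Cut-down stand-in for the real schema above (only the fields relevant here);
# "data" is declared as StringType even though the source may send a struct/array/int/etc.
schema = StructType([
    StructField("fields", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("value", StructType([
            StructField("data", StringType(), True),
            StructField("type", StringType(), True),
        ]), True),
    ])), True),
])

# GET the semi-structured payload; each record comes back as a Python dict
records = requests.get("https://example.com/api/records").json()  # placeholder URL

# Build the dataframe with the predefined schema and append it to the Delta table
df = spark.createDataFrame(records, schema=schema)
df.write.format("delta").mode("append").saveAsTable("main.default.api_records")  # placeholder names
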
Whenever I run the Notebook directly, rows where those fields contain JSON structures are saved as properly formatted JSON strings, like so (data redacted):
"{'id': *****, 'type': '*****', 'title': '********', 'allDay': ****, 'startTime': '*****', 'endTime': '*****', 'attendees': [{'emailAddress':'*****' ..."
However, when I run this Notebook as part of a job, any of those fields that contain structs get converted into useless strings that can no longer be parsed the way JSON strings can, like so:
"{id=*****, type=*****, title=********, allDay=****, startTime=*****, endTime=*****, attendees=[{emailAddress=****** ..."
Is this expected behavior, and if not, is there anything I can configure in the Job that would force the same behavior I see when I run the Notebook directly?
I tried loading the data into a table with a Variant datatype column instead, but the conversion that flattens this JSON into an unusable string seems to happen when the data is loaded into the Spark dataframe, not when it is written to the target table, so that approach doesn't look like it will work until Variants are supported in dataframes in Spark 4.0.
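For reference, the Variant attempt was roughly this shape (simplified, with placeholder names; the staging dataframe stands in for the one built with the schema above, and TRY_PARSE_JSON assumes a runtime with Variant support):

# Placeholder target table with a VARIANT column
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.default.api_records_variant (
        field_id    STRING,
        field_value VARIANT
    )
""")

# Stand-in for the dataframe produced by the load step; by this point the struct
# has already been flattened into the key=value string, so there is nothing left
# for TRY_PARSE_JSON to recover
stringified_df = spark.createDataFrame(
    [("abc123", "{id=1, type=event}")],
    "field_id STRING, field_value STRING",
)
stringified_df.createOrReplaceTempView("staging")

spark.sql("""
    INSERT INTO main.default.api_records_variant
    SELECT field_id, TRY_PARSE_JSON(field_value)
    FROM staging
""")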