Different JSON Results when Running a Job vs Running a Notebook
02-03-2025 08:12 AM
I have a regularly scheduled job that runs a PySpark Notebook that GETs semi-structured JSON data from an external API, loads that data into dataframes, and saves those dataframes to delta tables in Databricks.
I have the schema for the JSON defined in my Notebook, but because the API data is semi-structured, I have to convert some of the fields into strings, as those fields could have multiple data types in the source data (struct, array, int, string, etc.). A truncated example of this schema definition is below; the "data" field is the one to focus on:
...
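For illustration, a schema of this shape might look something like the following - this is a hypothetical sketch with invented field names, not the actual truncated schema:

```python
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Hypothetical sketch of the kind of schema described above: fields that can
# hold multiple source types (struct, array, int, string) are declared as
# StringType so the raw payload is preserved as text.
schema = StructType([
    StructField("id", LongType(), True),
    StructField("title", StringType(), True),
    # "data" can be a struct, array, or scalar in the source, so it is
    # declared as a string and parsed downstream.
    StructField("data", StringType(), True),
])
```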
Whenever I run the notebook directly, rows where those fields contain JSON structures are saved as properly formatted JSON strings, like so (data redacted):
"{'id': *****, 'type': '*****', 'title': '********', 'allDay': ****, 'startTime': '*****', 'endTime': '*****', 'attendees': [{'emailAddress':'*****' ..."
However, when I run this Notebook as part of a job, any of the fields that are structs get converted into strings that can no longer be parsed the way JSON strings can, like so:
"{id=*****, type=*****, title=********, allDay=****, startTime=*****, endTime=*****, attendees=[{emailAddress=****** ..."
Is this expected behavior, and if not, is there anything I can configure in the Job that would force the same behavior I see when I run the Notebook directly?
I tried using a table with a Variant datatype column to load the data into instead, but it seems like the conversion that flattens this JSON into an unusable string happens when I load the data into a Spark dataframe and not the target table, so that solution doesn't seem like it will work until Variants are supported in dataframes in Spark 4.0.
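For illustration, explicitly serializing the nested fields before building the dataframe is one way the string format could be pinned down regardless of compute. This is only a sketch, assuming the notebook builds dataframes from Python dicts returned by the API; the actual job code may differ:

```python
import json
from pyspark.sql.types import StructType, StructField, StringType

# Assumed shape: api_records is the list of dicts returned by the API call.
api_records = [
    {"data": {"id": 1, "type": "event", "attendees": [{"emailAddress": "a@b.com"}]}},
]

schema = StructType([StructField("data", StringType(), True)])

# json.dumps produces valid JSON text up front, so the stored value no longer
# depends on how the runtime stringifies Python objects during dataframe creation.
rows = [{"data": json.dumps(r["data"])} for r in api_records]

df = spark.createDataFrame(rows, schema=schema)  # `spark` is the notebook's SparkSession
```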
02-03-2025 08:17 AM
Hi @rgower,
Are you using the same cluster compute for both scenarios - running the notebook manually and running it via the workflow job?
02-03-2025 09:03 AM
@Alberto_Umana That may be the root cause - I did some additional testing in a development environment and found that the difference seems to come down to Serverless compute vs All-Purpose/Job Compute clusters.
When I run the Notebook and Job via serverless compute, the JSON is formatted correctly.
When I run the Notebook and Job via All-Purpose Compute (tested on a few different versions: DBR 15.4, 16.0, 16.1, and the 16.2 beta), or the Job via Job Compute, the JSON is formatted incorrectly.
I've been running the job via Job Compute because I was getting OOM errors and driver crashes when running the job via Serverless compute, but I might try to break up the job into multiple jobs to see if I can get Serverless compute to work for me.
All this raises the question - is this new result expected? Should I expect different JSON formatting results on different compute types, or is this something that should be investigated more deeply?
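For reference, here is a small sketch of how I could record which runtime produced a given result during this testing; the DATABRICKS_RUNTIME_VERSION environment variable is an assumption here and may not be present on serverless compute:

```python
import os

# Log the environment a run executed on, so formatting differences can be
# matched back to the compute type.
print("Spark version:", spark.version)
print("DBR version:", os.environ.get("DATABRICKS_RUNTIME_VERSION", "not set (possibly serverless)"))
```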
02-03-2025 09:08 AM
Hi @rgower,
Thanks for your comments... I think it should be investigated further, since serverless should be using DBR 15.4. I'll see if I can find something internally or if I can replicate it... otherwise it would require a case with us to dig deeper.
02-03-2025 09:10 AM
@Alberto_Umana Sounds good - thank you for looking into it, and let me know if there's any additional information I can provide in the meantime!

