
Different JSON Results when Running a Job vs Running a Notebook

rgower
New Contributor II

I have a regularly scheduled job that runs a PySpark Notebook that GETs semi-structured JSON data from an external API, loads that data into dataframes, and saves those dataframes to Delta tables in Databricks.
I have the schema for the JSON defined in my Notebook, but because the API data is semi-structured, I have to convert some of the fields into strings, since those fields can contain multiple data types in the source data (struct, array, int, string, etc.). A truncated example of this schema definition is below, where the "data" field is the one to focus on:

...

StructField("fields", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("type", StringType(), True),
        StructField("value", StructType([
            StructField("data", StringType(), True),
            StructField("type", StringType(), True)
...
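For reference, here is a minimal, runnable sketch of the same pattern, just to make the shape of the problem concrete (the sample payload is made up, and from_json stands in for whatever the actual loading code does):

import json
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Schema shaped like the excerpt above; every field that can hold mixed types
# in the source (struct, array, int, string, ...) is declared as StringType.
schema = StructType([
    StructField("fields", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("type", StringType(), True),
        StructField("value", StructType([
            StructField("data", StringType(), True),  # nested object/array in the source, kept as a string
            StructField("type", StringType(), True),
        ]), True),
    ])), True),
])

# Hypothetical API response (stands in for requests.get(...).json()).
payload = {
    "fields": [{
        "id": "1", "name": "meeting", "type": "event",
        "value": {"data": {"id": 7, "allDay": False}, "type": "object"},
    }]
}

# Parse the JSON text against the schema; the nested "data" object ends up in a string column.
raw = spark.createDataFrame([(json.dumps(payload),)], ["raw"])
df = raw.select(F.from_json("raw", schema).alias("parsed")).select("parsed.*")
df.show(truncate=False)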

Whenever I run the notebook directly, Databricks saves the rows where those fields are JSON structures as properly formatted JSON strings like so (data redacted):

"{'id': *****, 'type': '*****', 'title': '********', 'allDay': ****, 'startTime': '*****', 'endTime': '*****', 'attendees': [{'emailAddress':'*****' ..."

However, when I run this Notebook as part of a job, any of the fields that are structs get converted into useless strings that can no longer be accessed the way JSON strings can, like so:

"{id=*****, type=*****, title=********, allDay=****, startTime=*****, endTime=*****, attendees=[{emailAddress=****** ..."


Is this expected behavior, and if not, is there anything I can configure in the Job that would force the same behavior I see when I run the Notebook directly?
I tried loading the data into a table with a Variant datatype column instead, but the conversion that flattens this JSON into an unusable string seems to happen when I load the data into a Spark dataframe, not when I write to the target table, so that approach doesn't look like it will work until Variants are supported in dataframes in Spark 4.0.
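Another option might be to pre-serialize the mixed-type values to JSON text in Python before the data reaches Spark, so no implicit struct-to-string conversion is involved on any compute type. A rough sketch (the helper and payload below are hypothetical):

import json

# Hypothetical parsed API payload (stands in for response.json()).
api_response = [
    {"fields": [{"id": "1", "name": "meeting", "type": "event",
                 "value": {"data": {"id": 7, "allDay": False}, "type": "object"}}]}
]

def stringify_mixed_fields(record: dict) -> dict:
    # Serialize nested objects/arrays to JSON text on the Python side, so the
    # column handed to Spark is already a valid JSON string regardless of runtime.
    for field in record.get("fields", []):
        value = field.get("value") or {}
        if isinstance(value.get("data"), (dict, list)):
            value["data"] = json.dumps(value["data"])
    return record

records = [stringify_mixed_fields(r) for r in api_response]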

4 REPLIES

Alberto_Umana
Databricks Employee

Hi @rgower

Are you using the same cluster compute for both scenarios, running the notebook manually vs. running it via the workflow job?

rgower
New Contributor II

@Alberto_Umana That may be the root cause. I did some additional testing in a development environment and found that the difference seems to come down to Serverless compute vs. All-Purpose/Job Compute clusters.

When I run the Notebook and Job via Serverless compute, the JSON is formatted correctly.
When I run the Notebook and Job via an All-Purpose Compute cluster (tested on a few different versions: DBR 15.4, 16.0, 16.1, and 16.2 beta), or the Job via Job Compute, the JSON is formatted incorrectly.

I've been running the job via Job Compute because I was getting OOM errors and driver crashes when running the job via Serverless compute, but I might try to break up the job into multiple jobs to see if I can get Serverless compute to work for me.
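If it helps narrow this down, a quick one-cell check in each environment can confirm what each run is actually using (this assumes a Databricks notebook where spark is predefined; the DATABRICKS_RUNTIME_VERSION environment variable may not be set on every compute type):

import os

# Which runtime did this run actually use?
print("Spark version:", spark.version)
print("Databricks runtime:", os.environ.get("DATABRICKS_RUNTIME_VERSION", "unknown"))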

 

All this raises the question: is this new result expected? Should I expect different JSON formatting results on different compute types, or is this something that should be investigated more deeply?

Alberto_Umana
Databricks Employee

Hi @rgower,

Thanks for your comments... I think it should be investigated further, since serverless should be using DBR 15.4. I'll see if I can find something internally or whether I can replicate it... otherwise it would require a support case with us to dig deeper.

rgower
New Contributor II

@Alberto_Umana Sounds good. Thank you for looking into it, and let me know if there's any additional information I can provide in the meantime!