I have a regularly scheduled job that runs a PySpark Notebook that GETs semi-structured JSON data from an external API, loads that data into dataframes, and saves those dataframes to delta tables in Databricks.
I have the schema for the JSON defined in my Notebook, but because the API data is semi-structured, I have to define some of the fields as strings (StringType), since those fields can hold multiple data types in the source data (struct, array, int, string, etc.). A truncated example of the schema definition is below; the "data" field is the one to focus on:
...
StructField("fields", ArrayType(StructType([
StructField("id", StringType(), True),
StructField("name", StringType(), True),
StructField("type", StringType(), True),
StructField("value", StructType([
StructField("data", StringType(), True),
StructField("type", StringType(), True)
...
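For context, the load step is roughly this shape (a simplified sketch, not my exact code; the URL, table name, and trimmed-down schema are placeholders):

import requests
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Cut-down stand-in for the real schema above (only the fields relevant here);
# "data" is declared as StringType even though the source may send a struct/array/int/etc.
schema = StructType([
    StructField("fields", ArrayType(StructType([
        StructField("id", StringType(), True),
        StructField("value", StructType([
            StructField("data", StringType(), True),
            StructField("type", StringType(), True),
        ]), True),
    ])), True),
])

# GET the semi-structured payload; each record comes back as a Python dict
records = requests.get("https://example.com/api/records").json()  # placeholder URL

# Build the dataframe with the predefined schema and append it to the Delta table
df = spark.createDataFrame(records, schema=schema)
df.write.format("delta").mode("append").saveAsTable("main.default.api_records")  # placeholder names
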
Whenever I run the Notebook directly, rows where those fields contain JSON structures are saved as properly formatted JSON strings, like so (data redacted):
"{'id': *****, 'type': '*****', 'title': '********', 'allDay': ****, 'startTime': '*****', 'endTime': '*****', 'attendees': [{'emailAddress':'*****' ..."
However, when I run this Notebook as part of a job, any of those fields that contain structs get converted into useless strings that can no longer be parsed the way JSON strings can, like so:
"{id=*****, type=*****, title=********, allDay=****, startTime=*****, endTime=*****, attendees=[{emailAddress=****** ..."
Is this expected behavior, and if not, is there anything I can configure in the Job that would force the same behavior I see when I run the Notebook directly?
I tried loading the data into a table with a Variant datatype column instead, but the conversion that flattens this JSON into an unusable string seems to happen when the data is loaded into the Spark dataframe, not when it is written to the target table, so that approach doesn't look like it will work until Variants are supported in dataframes in Spark 4.0.
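For reference, the Variant attempt was roughly this shape (simplified, with placeholder names; the staging dataframe stands in for the one built with the schema above, and TRY_PARSE_JSON assumes a runtime with Variant support):

# Placeholder target table with a VARIANT column
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.default.api_records_variant (
        field_id    STRING,
        field_value VARIANT
    )
""")

# Stand-in for the dataframe produced by the load step; by this point the struct
# has already been flattened into the key=value string, so there is nothing left
# for TRY_PARSE_JSON to recover
stringified_df = spark.createDataFrame(
    [("abc123", "{id=1, type=event}")],
    "field_id STRING, field_value STRING",
)
stringified_df.createOrReplaceTempView("staging")

spark.sql("""
    INSERT INTO main.default.api_records_variant
    SELECT field_id, TRY_PARSE_JSON(field_value)
    FROM staging
""")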