
Modify the Json Schema Stored in a File for AutoLoader

kpendergast
Contributor

We are reading from an S3 bucket which contains several million JSON files. The schema from the read is stored in a JSON file in the DBFS FileStore. This file is then used by Auto Loader to write new files nightly to a Delta table. The schema is updated to pick up new fields. The total number of fields is around 260 but varies depending on the application. The majority of the fields are in large nested arrays, which is where we run into issues. The read will infer the column data types in these nested arrays as LongType or DoubleType depending on the values present in the fields. Fields within nested arrays do not support data type changes. For some reason the read infers these columns differently from run to run, which breaks the process, and a new staging table has to be created to support the new schema with the nested arrays.
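
For context, the nightly Auto Loader job rebuilds the schema from the saved file along these lines (a simplified sketch, not our exact job; the checkpoint location and the Delta write are omitted):

import json
from pyspark.sql.types import StructType

# rebuild the StructType from the saved JSON schema file
with open("/dbfs/FileStore/s3/schemas/s3_schema_1222.json", "r") as f:
    saved_schema = StructType.fromJson(json.load(f))

# nightly Auto Loader read of new files using that schema
nightly_df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(saved_schema)
    .load("s3a://data/"))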

Here is the code to read, save, and view the schema in a file:

import json

# read the JSON files in S3 and infer the schema
s3_df = spark.read.format("json").load("s3a://data/*.json")

# save the inferred schema to a file in DBFS
with open("/dbfs/FileStore/s3/schemas/s3_schema_1222.json", "w") as f:
    json.dump(s3_df.schema.jsonValue(), f)

# view the schema stored in the file
with open("/dbfs/FileStore/s3/schemas/s3_schema_1222.json", "r") as f_read:
    for line in f_read:
        print(line)

Here is a snippet of the output:

{"type": "struct", "fields": [{"name": "market", "type": "string", "nullable": true, "metadata": {}}, {"name": "marketDownMarketPriceProviders", "type": "string", "nullable": true, "metadata": {}}, {"name": "marketItineraries", "type": {"type": "array", "elementType": {"type": "struct", "fields": [{"name": "adultAlternativeItineraryDynamicDiscountAmount", "type": "double", "nullable": true, "metadata": {}}, {"name": "adultAlternativeItineraryTotal", "type": "double", "nullable": true, "metadata": {}}, {"name": "adultEligible", "type": "boolean", "nullable": true, "metadata": {}}, {"name": "availableSeatCount", "type": "long", "nullable": true, "metadata": {}}

To correct the field data types, I copy/paste the entire output into a text editor and make some corrections so the existing fields in the nested arrays match the current Delta table. I then paste the result into a notebook cell as a string value assigned to a Python variable and write it back to the file.
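
The write-back step looks roughly like this (the variable name is a placeholder and the string is truncated here):

import json

# corrected_schema_str holds the hand-edited schema pasted from the text editor (truncated)
corrected_schema_str = '{"type": "struct", "fields": [ ... ]}'

with open("/dbfs/FileStore/s3/schemas/s3_schema_1222.json", "w") as f:
    # json.dump() on a Python str serializes it as a JSON string value,
    # wrapping it in quotes and escaping the inner quotes
    json.dump(corrected_schema_str, f)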

This is where my current solution breaks, because the string is not encoded in the file correctly. After writing the string to the file and printing it back out, the output looks like this:

"{\"type\": \"struct\", \"fields\": [{\"name\": \"market\", \"type\": \"string\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"marketDownMarketPriceProviders\", \"type\": \"string\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"marketItineraries\", \"type\": {\"type\": \"array\", \"elementType\": {\"type\": \"struct\", \"fields\": [{\"name\": \"adultAlternativeItineraryDynamicDiscountAmount\", \"type\": \"double\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"adultAlternativeItineraryTotal\", \"type\": \"double\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"adultEligible\", \"type\": \"boolean\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"availableSeatCount\", \"type\": \"long\", \"nullable\": true, \"metadata\": {}

How can I write the schema string to the file so that it is in the correct format for Auto Loader?

I'm open to solutions that others are using, but I would prefer to keep this process if at all possible.

1 ACCEPTED SOLUTION


kpendergast
Contributor

If anyone is curious, I ended up just passing the schema as a string in StructType format to .schema(eval(the_schema)) and not using the file-based approach.
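
Roughly like this (only a couple of illustrative fields shown, not the full schema):

from pyspark.sql.types import *  # StructType, StructField, etc. must be in scope for eval()

# schema kept as a StructType expression in a string (illustrative fields only)
the_schema = """StructType([
    StructField("market", StringType(), True),
    StructField("marketDownMarketPriceProviders", StringType(), True),
])"""

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(eval(the_schema))
    .load("s3a://data/"))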

