Modify the Json Schema Stored in a File for AutoLoader

kpendergast
Contributor

We are reading from an S3 bucket which contains several million JSON files. The schema from the read is stored in a JSON file in the DBFS FileStore. This file is then used by Auto Loader to write new files nightly to a Delta table. The schema is updated to pick up new fields. The total number of fields is around 260 but varies depending on the application. The majority of the fields are in large nested arrays, which is where we run into issues. The read will infer column data types in these nested arrays as long or double depending on the values present in the fields. Fields within nested arrays do not support data type changes, so when a column is inferred differently the process breaks and a new staging table has to be created to support the new schema with nested arrays.

Here is the code to read, save, and view the schema in a file:

import json

# read files in S3
s3_df = spark.read.format("json").load("s3a://data/*.json")

# save schema
with open("/dbfs/FileStore/s3/schemas/s3_schema_1222.json", "w") as f:
    json.dump(s3_df.schema.jsonValue(), f)

# view schema in file
with open("/dbfs/FileStore/s3/schemas/s3_schema_1222.json", "r") as f_read:
    for line in f_read:
        print(line)

Here is a snippet of the output:

{"type": "struct", "fields": [{"name": "market", "type": "string", "nullable": true, "metadata": {}}, {"name": "marketDownMarketPriceProviders", "type": "string", "nullable": true, "metadata": {}}, {"name": "marketItineraries", "type": {"type": "array", "elementType": {"type": "struct", "fields": [{"name": "adultAlternativeItineraryDynamicDiscountAmount", "type": "double", "nullable": true, "metadata": {}}, {"name": "adultAlternativeItineraryTotal", "type": "double", "nullable": true, "metadata": {}}, {"name": "adultEligible", "type": "boolean", "nullable": true, "metadata": {}}, {"name": "availableSeatCount", "type": "long", "nullable": true, "metadata": {}}

To correct the field data types, I copy/paste the entire output into a text editor and make corrections so that existing fields in the nested arrays match the current Delta table. I then paste it into a notebook cell as a string value assigned to a Python variable and write it back to the file.
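As an alternative to hand-editing the output, the schema JSON can be patched programmatically before being written back. This is only a sketch; the helper function and the type override below are illustrative, not part of the original workflow (the schema fragment mirrors the snippet shown in this post):

```python
import json

def patch_field_types(schema, overrides):
    """Recursively walk a Spark schema dict (the structure produced by
    StructType.jsonValue()) and force named leaf fields to given types."""
    for field in schema.get("fields", []):
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype.get("type") == "array":
            # descend into array element structs
            element = ftype.get("elementType")
            if isinstance(element, dict) and element.get("type") == "struct":
                patch_field_types(element, overrides)
        elif isinstance(ftype, dict) and ftype.get("type") == "struct":
            patch_field_types(ftype, overrides)
        elif field["name"] in overrides:
            field["type"] = overrides[field["name"]]
    return schema

# A tiny schema in the same shape as the snippet above.
schema = json.loads("""
{"type": "struct", "fields": [
  {"name": "market", "type": "string", "nullable": true, "metadata": {}},
  {"name": "marketItineraries", "type": {"type": "array", "elementType":
    {"type": "struct", "fields": [
      {"name": "availableSeatCount", "type": "long",
       "nullable": true, "metadata": {}}]}, "containsNull": true},
   "nullable": true, "metadata": {}}]}
""")

# Force a field inferred as long back to double so it matches the
# existing Delta table, then dump the corrected JSON for the schema file.
patched = patch_field_types(schema, {"availableSeatCount": "double"})
print(json.dumps(patched))
```

Because the dict is dumped with `json.dump`/`json.dumps` rather than pasted as a string, the encoding problem described below never arises.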

This is where my current solution breaks, because the string is not encoded correctly in the file. After writing the string to the file and printing it, the output looks like this:

"{\"type\": \"struct\", \"fields\": [{\"name\": \"market\", \"type\": \"string\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"marketDownMarketPriceProviders\", \"type\": \"string\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"marketItineraries\", \"type\": {\"type\": \"array\", \"elementType\": {\"type\": \"struct\", \"fields\": [{\"name\": \"adultAlternativeItineraryDynamicDiscountAmount\", \"type\": \"double\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"adultAlternativeItineraryTotal\", \"type\": \"double\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"adultEligible\", \"type\": \"boolean\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"availableSeatCount\", \"type\": \"long\", \"nullable\": true, \"metadata\": {}

How can I write the schema string to the file so that it is in the correct format for Auto Loader?

I'm open to solutions that others are using, but I'd prefer to keep this process if at all possible.

Accepted Solution

kpendergast
Contributor

If anyone is curious: I ended up just passing the schema as a string to .schema(eval(the_schema)) in StructType format, rather than using the file-based approach.


