Modify the Json Schema Stored in a File for AutoLoader

kpendergast
Contributor

We are reading from an S3 bucket which contains several million JSON files. The schema from the read is stored in a JSON file in the DBFS FileStore. This file is then used by Auto Loader to write new files nightly to a Delta table. The schema is updated to pick up new fields. The total number of fields is around 260 but varies depending on the application. The majority of the fields are in large nested arrays, which is where we run into issues. The read will infer column data types in these nested arrays as long or double depending on the values present in the fields. Fields within nested arrays do not support data type changes, so when a column is inferred differently the process breaks and a new staging table has to be created to support the new schema with nested arrays.

Here is the code to read, save, and view the schema in a file:

import json

# read files in S3
s3_df = spark.read.format("json").load("s3a://data/*.json")

# save schema
with open("/dbfs/FileStore/s3/schemas/s3_schema_1222.json", "w") as f:
    json.dump(s3_df.schema.jsonValue(), f)

# view schema in file
with open("/dbfs/FileStore/s3/schemas/s3_schema_1222.json", "r") as f_read:
    for line in f_read:
        print(line)

Here is a snippet of the output:

{"type": "struct", "fields": [{"name": "market", "type": "string", "nullable": true, "metadata": {}}, {"name": "marketDownMarketPriceProviders", "type": "string", "nullable": true, "metadata": {}}, {"name": "marketItineraries", "type": {"type": "array", "elementType": {"type": "struct", "fields": [{"name": "adultAlternativeItineraryDynamicDiscountAmount", "type": "double", "nullable": true, "metadata": {}}, {"name": "adultAlternativeItineraryTotal", "type": "double", "nullable": true, "metadata": {}}, {"name": "adultEligible", "type": "boolean", "nullable": true, "metadata": {}}, {"name": "availableSeatCount", "type": "long", "nullable": true, "metadata": {}}

To correct the field data types, I copy/paste the entire output into a text editor and make corrections so that existing fields in the nested arrays match the current Delta table. I then paste it into a notebook cell as a string value assigned to a Python variable and write it back to the file.
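As an alternative to hand-editing the output, the schema JSON can be patched programmatically before being written back. This is only a sketch; the helper function and the type override below are illustrative, not part of the original workflow (the schema fragment mirrors the snippet shown in this post):

```python
import json

def patch_field_types(schema, overrides):
    """Recursively walk a Spark schema dict (the structure produced by
    StructType.jsonValue()) and force named leaf fields to given types."""
    for field in schema.get("fields", []):
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype.get("type") == "array":
            # descend into array element structs
            element = ftype.get("elementType")
            if isinstance(element, dict) and element.get("type") == "struct":
                patch_field_types(element, overrides)
        elif isinstance(ftype, dict) and ftype.get("type") == "struct":
            patch_field_types(ftype, overrides)
        elif field["name"] in overrides:
            field["type"] = overrides[field["name"]]
    return schema

# A tiny schema in the same shape as the snippet above.
schema = json.loads("""
{"type": "struct", "fields": [
  {"name": "market", "type": "string", "nullable": true, "metadata": {}},
  {"name": "marketItineraries", "type": {"type": "array", "elementType":
    {"type": "struct", "fields": [
      {"name": "availableSeatCount", "type": "long",
       "nullable": true, "metadata": {}}]}, "containsNull": true},
   "nullable": true, "metadata": {}}]}
""")

# Force a field inferred as long back to double so it matches the
# existing Delta table, then dump the corrected JSON for the schema file.
patched = patch_field_types(schema, {"availableSeatCount": "double"})
print(json.dumps(patched))
```

Because the dict is dumped with `json.dump`/`json.dumps` rather than pasted as a string, the encoding problem described below never arises.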

This is where my current solution breaks, because the string is not encoded correctly in the file. After writing the string to the file and printing it, the output looks like this:

"{\"type\": \"struct\", \"fields\": [{\"name\": \"market\", \"type\": \"string\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"marketDownMarketPriceProviders\", \"type\": \"string\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"marketItineraries\", \"type\": {\"type\": \"array\", \"elementType\": {\"type\": \"struct\", \"fields\": [{\"name\": \"adultAlternativeItineraryDynamicDiscountAmount\", \"type\": \"double\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"adultAlternativeItineraryTotal\", \"type\": \"double\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"adultEligible\", \"type\": \"boolean\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"availableSeatCount\", \"type\": \"long\", \"nullable\": true, \"metadata\": {}

How can I write the schema string to the file so that it is in the correct format for Auto Loader?

I'm open to solutions that others are using, but I'd prefer to keep this process if at all possible.

Accepted Solution

kpendergast
Contributor

If anyone is curious: I ended up just passing the schema as a string to .schema(eval(the_schema)) in StructType format, rather than using the file-based approach.


