We are reading from an S3 bucket that contains several million JSON files. The schema from that read is stored in a JSON file in the DBFS FileStore, and this file is then used by Auto Loader to write new files nightly to a Delta table. The schema file is updated to pick up new fields. There are around 260 fields in total, though it varies by application. Most of the fields live in large nested arrays, which is where we run into issues: the read infers the column data types inside these nested arrays as LongType or DoubleType depending on the values present in the fields, and fields within nested arrays do not support data type changes. The inference varies between runs, which breaks this process, and a new staging table has to be created to support the new schema with the nested arrays.
Here is the code to read, save, and view the schema in a file:
import json

# read files in S3
s3_df = spark.read.format("json").load("s3a://data/*.json")

# save schema
with open("/dbfs/FileStore/s3/schemas/s3_schema_1222.json", "w") as f:
    json.dump(s3_df.schema.jsonValue(), f)

# view schema in file
with open("/dbfs/FileStore/s3/schemas/s3_schema_1222.json", "r") as f_read:
    for line in f_read:
        print(line)
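For reference, since the file holds the plain JSON dict produced by schema.jsonValue(), it round-trips cleanly through the json module. A minimal stand-alone sketch (the file name and the single sample field are placeholders, not the real ~260-field schema):

```python
import json

# Placeholder path; the real file lives under /dbfs/FileStore/s3/schemas/.
schema_path = "s3_schema_example.json"

# Tiny stand-in for the schema dict returned by s3_df.schema.jsonValue().
schema = {
    "type": "struct",
    "fields": [
        {"name": "market", "type": "string", "nullable": True, "metadata": {}},
    ],
}

# Dumping the dict (not a pre-serialized string) writes unescaped JSON.
with open(schema_path, "w") as f:
    json.dump(schema, f)

# Loading it back yields the same dict.
with open(schema_path, "r") as f:
    loaded = json.load(f)
```

From the loaded dict, pyspark.sql.types.StructType.fromJson(loaded) rebuilds the Spark schema object for the read.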
Here is a snippet of the output:
{"type": "struct", "fields": [{"name": "market", "type": "string", "nullable": true, "metadata": {}}, {"name": "marketDownMarketPriceProviders", "type": "string", "nullable": true, "metadata": {}}, {"name": "marketItineraries", "type": {"type": "array", "elementType": {"type": "struct", "fields": [{"name": "adultAlternativeItineraryDynamicDiscountAmount", "type": "double", "nullable": true, "metadata": {}}, {"name": "adultAlternativeItineraryTotal", "type": "double", "nullable": true, "metadata": {}}, {"name": "adultEligible", "type": "boolean", "nullable": true, "metadata": {}}, {"name": "availableSeatCount", "type": "long", "nullable": true, "metadata": {}}
To correct the field data types, I copy/paste the entire output into a text editor and make corrections so that existing fields in the nested arrays match the current Delta table. I then paste the edited schema into a notebook cell as a string assigned to a Python variable and write it back to the file.
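The manual correction step can also be expressed in code by walking the schema dict and pinning the types of named fields. This is only a sketch under my own assumptions: the field names in the overrides dict and the sample schema are hypothetical, and the walk only handles the array-of-struct and struct nesting shown in the snippet above.

```python
import json

# Tiny stand-in for the saved schema string (the real one has ~260 fields).
schema_json = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "market", "type": "string", "nullable": True, "metadata": {}},
        {"name": "marketItineraries", "type": {
            "type": "array",
            "containsNull": True,
            "elementType": {"type": "struct", "fields": [
                # Inferred as double on this run, but the Delta table wants long.
                {"name": "availableSeatCount", "type": "double",
                 "nullable": True, "metadata": {}},
            ]},
        }, "nullable": True, "metadata": {}},
    ],
})

# Hypothetical overrides: fields whose inferred type flips between runs.
overrides = {"availableSeatCount": "long"}

def pin_types(fields, overrides):
    """Recursively override leaf field types by name in a schema-JSON dict."""
    for field in fields:
        t = field["type"]
        if isinstance(t, dict) and t.get("type") == "array":
            pin_types(t["elementType"].get("fields", []), overrides)
        elif isinstance(t, dict) and t.get("type") == "struct":
            pin_types(t.get("fields", []), overrides)
        elif field["name"] in overrides:
            field["type"] = overrides[field["name"]]

schema = json.loads(schema_json)
pin_types(schema["fields"], overrides)
```

The corrected dict can then be written back with json.dump(schema, f), avoiding the copy/paste step entirely.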
This is where my current solution breaks, because the string is not encoded correctly in the file. After writing the string to the file, printing it shows output like this:
"{\"type\": \"struct\", \"fields\": [{\"name\": \"market\", \"type\": \"string\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"marketDownMarketPriceProviders\", \"type\": \"string\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"marketItineraries\", \"type\": {\"type\": \"array\", \"elementType\": {\"type\": \"struct\", \"fields\": [{\"name\": \"adultAlternativeItineraryDynamicDiscountAmount\", \"type\": \"double\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"adultAlternativeItineraryTotal\", \"type\": \"double\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"adultEligible\", \"type\": \"boolean\", \"nullable\": true, \"metadata\": {}}, {\"name\": \"availableSeatCount\", \"type\": \"long\", \"nullable\": true, \"metadata\": {}
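This escaped output is what json.dump produces when handed an already-serialized string rather than a dict, since it serializes the string as a JSON string value. A minimal sketch of the difference, using io.StringIO as a stand-in for the file:

```python
import io
import json

# The edited schema, pasted into the notebook as a Python string.
schema_str = '{"type": "struct", "fields": []}'

# json.dump on a str re-serializes it as a JSON string: quotes get escaped.
escaped = io.StringIO()
json.dump(schema_str, escaped)
print(escaped.getvalue())  # "{\"type\": \"struct\", \"fields\": []}"

# Writing the string directly (or dumping the parsed dict) keeps valid JSON.
verbatim = io.StringIO()
verbatim.write(schema_str)
print(verbatim.getvalue())  # {"type": "struct", "fields": []}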
How can I write the schema string to the file and have it be in the correct format for Auto Loader?
I'm open to solutions that others are using, but I would prefer to keep this process if at all possible.