Hello!
I'm playing with Auto Loader schema inference on a big S3 repository (300+ tables, large CSV files). I'm looking at Auto Loader with great interest, as it could be a big time saver for our ingestion process (the data comes from a transactional DB through a CDC feature).
My code is pretty standard:
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", target_table_path)
    .option("cloudFiles.inferColumnTypes", True)
    .load(source_table_path)
    .writeStream
    .option("checkpointLocation", target_table_path)
    .trigger(availableNow=True)
    .toTable(table)
)
Some of my CSV files use comma delimiters and contain cells holding JSON data. A quick extract:
Op,TIMESTAMP,field_1,field_2,field_3,field_4,date_add,date_upd,percent,matches
U,2023-01-10 16:14:26.000000,539775799,74793,688,+1.00000000e+02,2021-04-13 21:24:39,2023-01-10 16:14:26,78,"[{""name"":""age_40_50"",""value"":0},{""name"":""xxxx"",""value"":0},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":0},{""name"":""xxxx"",""value"":0},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":1},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":4},{""name"":""***"",""value"":4}]"
Auto Loader recognizes my last column as a string, but it doesn't escape the commas inside this column.
Of course, I could regenerate my sources with another delimiter, but I'd rather find an Auto Loader option that handles this more easily. I know I could post-process the data afterwards or use a select to flatten the JSON into a struct, but since I want to infer a lot of tables, per-table exceptions are exactly what I want to avoid.
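To make this concrete, here is the kind of option-based tweak I had in mind (just a rough sketch, assuming Auto Loader passes the standard Spark CSV reader options such as header, quote and escape through to the parser, and reusing the same path/table variables as above):

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", target_table_path)
    .option("cloudFiles.inferColumnTypes", True)
    .option("header", True)   # first row carries the column names
    .option("quote", '"')     # default quote character, stated explicitly
    .option("escape", '"')    # treat "" inside a quoted cell as a literal quote (RFC 4180 style)
    .load(source_table_path)
    .writeStream
    .option("checkpointLocation", target_table_path)
    .trigger(availableNow=True)
    .toTable(table)
)

The idea is that the JSON cell in the extract above is already properly quoted (with doubled inner quotes), so it should stay in a single column once the parser knows that a doubled quote is an escaped quote rather than the end of the field.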
Thanks for your help,