Auto Loader on CSV files doesn't correctly infer cells with JSON data

alxsbn
New Contributor III

Hello!

I'm playing with Auto Loader schema inference on a big S3 repository with 300+ tables and large CSV files. I'm looking at Auto Loader very closely, as it could be a great time saver for our ingestion process (the data comes from a transactional DB through a CDC feature).

My code is pretty standard:

(spark.readStream
        .format("cloudFiles")                                     # Auto Loader source
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", target_table_path)   # where the inferred schema is tracked
        .option("cloudFiles.inferColumnTypes", True)              # infer types instead of defaulting everything to string
        .load(source_table_path)
        .writeStream
        .option("checkpointLocation", target_table_path)
        .trigger(availableNow=True)                               # process all available files, then stop
        .toTable(table)
    )

Some of my CSV files use comma delimiters and contain cells with JSON data. A quick extract:

Op,TIMESTAMP,field_1,field_2,field_3,field_4,date_add,date_upd,percent,matches
U,2023-01-10 16:14:26.000000,539775799,74793,688,+1.00000000e+02,2021-04-13 21:24:39,2023-01-10 16:14:26,78,"[{""name"":""age_40_50"",""value"":0},{""name"":""xxxx"",""value"":0},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":0},{""name"":""xxxx"",""value"":0},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":1},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":4},{""name"":""***"",""value"":4}]"

Auto Loader recognizes my last column as a string, but it doesn't escape the commas inside that column, so the quoted JSON gets split across fields.

Of course, I could regenerate my sources with another delimiter, but I'm looking for an Auto Loader option that handles this in an easier way. I know I could fix the data afterwards, or use a select to flatten the JSON into a struct, as in the rough sketch below (since I want to infer a lot of tables, exceptions are exactly what I want to avoid).
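Something like this is what I mean by flattening afterwards (a rough sketch; the matches column name and the array<struct<name:string,value:int>> schema are my reading of the extract above):

from pyspark.sql import functions as F

# Rough sketch: parse the JSON string in the "matches" column into an
# array of structs; the DDL schema is eyeballed from the sample row.
matches_schema = "array<struct<name:string,value:int>>"

flattened = (spark.table(table)
        .withColumn("matches", F.from_json("matches", matches_schema)))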

Thanks for your help,

Accepted Solution

daniel_sahal
Esteemed Contributor

By default, PySpark uses \ as the escape character. You can change it to ".

Doc: https://docs.databricks.com/ingestion/auto-loader/options.html#csv-options
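Applied to the snippet from the question, that would look roughly like this (a sketch reusing the original variable names):

(spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("escape", "\"")  # parse "" inside quoted fields as a literal quote
        .option("cloudFiles.schemaLocation", target_table_path)
        .option("cloudFiles.inferColumnTypes", True)
        .load(source_table_path)
    )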


alxsbn
New Contributor III

So obvious. Thanks! Adding the following option solved it:

.option("escape","\"")
