cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Autloader on CSV file didn't infer well cell with JSON data

alxsbn
New Contributor III

Hello !

I playing with autoloader schema inference on a big S3 repo with +300 tables and large CSV files. I'm looking at autoloader with great attention, as it can be a great time saver on our ingestion process (data comes from a transactional DB generated through a CDC feature).

My code is pretty standard:

(spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", target_table_path)
        .option("cloudFiles.inferColumnTypes", True)
        .load(source_table_path)
        .writeStream
        .option("checkpointLocation", target_table_path)
        .trigger(availableNow=True) 
        .toTable(table)
    )

I have some CSV files using comma delimiters having some cells inside whom I have JSON data. A quick extract :

Op,TIMESTAMP,field_1,field_2,field_3,field_4,date_add,date_upd,percent,matches
U,2023-01-10 16:14:26.000000,539775799,74793,688,+1.00000000e+02,2021-04-13 21:24:39,2023-01-10 16:14:26,78,"[{""name"":""age_40_50"",""value"":0},{""name"":""xxxx"",""value"":0},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":0},{""name"":""xxxx"",""value"":0},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":1},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":4},{""name"":""***"",""value"":4}]"

Autoloader recognize my last column as a string and didn't escape the comma inside this column.

Of course, I can regenerate my sources with another delimiter, but I look at autoloader options to do this on the more easy way. I know I can playing with data afterwards or using some select to flatten the JSON as a struct (since I want to infer a lot of tables, exceptions it's what I want to avoid).

Thanks for you help,

1 ACCEPTED SOLUTION

Accepted Solutions

daniel_sahal
Esteemed Contributor

PySpark by default is using \ as an escape character. You can change it to "

Doc: https://docs.databricks.com/ingestion/auto-loader/options.html#csv-options

View solution in original post

2 REPLIES 2

daniel_sahal
Esteemed Contributor

PySpark by default is using \ as an escape character. You can change it to "

Doc: https://docs.databricks.com/ingestion/auto-loader/options.html#csv-options

alxsbn
New Contributor III

So obvious. Thanks adding following option solve it

.option("escape","\"")