<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Autloader on CSV file didn't infer well cell with JSON data in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autloader-on-csv-file-didn-t-infer-well-cell-with-json-data/m-p/12678#M7450</link>
    <description>&lt;P&gt;So obvious. Thanks, adding the following option solved it:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;.option("escape", "\"")&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Thu, 12 Jan 2023 09:17:23 GMT</pubDate>
    <dc:creator>alxsbn</dc:creator>
    <dc:date>2023-01-12T09:17:23Z</dc:date>
    <item>
      <title>Autloader on CSV file didn't infer well cell with JSON data</title>
      <link>https://community.databricks.com/t5/data-engineering/autloader-on-csv-file-didn-t-infer-well-cell-with-json-data/m-p/12676#M7448</link>
      <description>&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm playing with Auto Loader schema inference on a big S3 repo with 300+ tables and large CSV files. I'm looking at Auto Loader closely, as it could be a great time saver in our ingestion process (the data comes from a transactional DB and is generated through a CDC feature).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My code is pretty standard:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;(spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", target_table_path)
        .option("cloudFiles.inferColumnTypes", True)
        .load(source_table_path)
        .writeStream
        .option("checkpointLocation", target_table_path)
        .trigger(availableNow=True) 
        .toTable(table)
    )&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Some of my CSV files are comma-delimited and have cells that contain JSON data. A quick extract:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Op,TIMESTAMP,field_1,field_2,field_3,field_4,date_add,date_upd,percent,matches
U,2023-01-10 16:14:26.000000,539775799,74793,688,+1.00000000e+02,2021-04-13 21:24:39,2023-01-10 16:14:26,78,"[{""name"":""age_40_50"",""value"":0},{""name"":""xxxx"",""value"":0},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":0},{""name"":""xxxx"",""value"":0},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":1},{""name"":""xxxx"",""value"":4},{""name"":""xxxx"",""value"":4},{""name"":""***"",""value"":4}]"&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Auto Loader recognizes my last column as a string and doesn't escape the commas inside it.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Of course, I can regenerate my sources with another delimiter, but I'm looking for an Auto Loader option that handles this more easily. I know I can manipulate the data afterwards or use a select to flatten the JSON into a struct, but since I want to infer a lot of tables, per-table exceptions are exactly what I want to avoid.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks for your help,&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jan 2023 10:40:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autloader-on-csv-file-didn-t-infer-well-cell-with-json-data/m-p/12676#M7448</guid>
      <dc:creator>alxsbn</dc:creator>
      <dc:date>2023-01-11T10:40:56Z</dc:date>
    </item>
    <item>
      <title>Re: Autloader on CSV file didn't infer well cell with JSON data</title>
      <link>https://community.databricks.com/t5/data-engineering/autloader-on-csv-file-didn-t-infer-well-cell-with-json-data/m-p/12677#M7449</link>
      <description>&lt;P&gt;By default, PySpark's CSV reader uses \ as the escape character. You can change it to " with the escape option.&lt;/P&gt;&lt;P&gt;Doc: &lt;A href="https://docs.databricks.com/ingestion/auto-loader/options.html#csv-options" target="test_blank"&gt;https://docs.databricks.com/ingestion/auto-loader/options.html#csv-options&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jan 2023 11:43:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autloader-on-csv-file-didn-t-infer-well-cell-with-json-data/m-p/12677#M7449</guid>
      <dc:creator>daniel_sahal</dc:creator>
      <dc:date>2023-01-11T11:43:05Z</dc:date>
    </item>
    <item>
      <title>Re: Autloader on CSV file didn't infer well cell with JSON data</title>
      <link>https://community.databricks.com/t5/data-engineering/autloader-on-csv-file-didn-t-infer-well-cell-with-json-data/m-p/12678#M7450</link>
      <description>&lt;P&gt;So obvious. Thanks, adding the following option solved it:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;.option("escape", "\"")&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 12 Jan 2023 09:17:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autloader-on-csv-file-didn-t-infer-well-cell-with-json-data/m-p/12678#M7450</guid>
      <dc:creator>alxsbn</dc:creator>
      <dc:date>2023-01-12T09:17:23Z</dc:date>
    </item>
  </channel>
</rss>

