<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Load CSV files with slightly different schemas in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/load-csv-files-with-slightly-different-schemas/m-p/6420#M2570</link>
    <description>&lt;P&gt;@Morten Stakkeland​&amp;nbsp;if i am not wrong, you want to handle schema changes from source to target. can you please check schema evolution in Databricks &lt;/P&gt;</description>
    <pubDate>Tue, 04 Apr 2023 19:24:45 GMT</pubDate>
    <dc:creator>karthik_p</dc:creator>
    <dc:date>2023-04-04T19:24:45Z</dc:date>
    <item>
      <title>Load CSV files with slightly different schemas</title>
      <link>https://community.databricks.com/t5/data-engineering/load-csv-files-with-slightly-different-schemas/m-p/6419#M2569</link>
      <description>&lt;P&gt;I have a set of CSV files generated by a system, where the schema has evolved over the years. Some columns have been added, and at least one column has been renamed in newer files. Is there any way to elegantly load these files into a dataframe? &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have tried spark.read.csv() using different options. My next thought would be to load the individual files using pandas, possibly using applyInPandas. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Any thoughts or ideas?&lt;/P&gt;</description>
      <pubDate>Tue, 04 Apr 2023 19:22:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/load-csv-files-with-slightly-different-schemas/m-p/6419#M2569</guid>
      <dc:creator>MRTN</dc:creator>
      <dc:date>2023-04-04T19:22:03Z</dc:date>
    </item>
    <item>
      <title>Re: Load CSV files with slightly different schemas</title>
      <link>https://community.databricks.com/t5/data-engineering/load-csv-files-with-slightly-different-schemas/m-p/6420#M2570</link>
      <description>&lt;P&gt;@Morten Stakkeland​&amp;nbsp;if i am not wrong, you want to handle schema changes from source to target. can you please check schema evolution in Databricks &lt;/P&gt;</description>
      <pubDate>Tue, 04 Apr 2023 19:24:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/load-csv-files-with-slightly-different-schemas/m-p/6420#M2570</guid>
      <dc:creator>karthik_p</dc:creator>
      <dc:date>2023-04-04T19:24:45Z</dc:date>
    </item>
    <item>
      <title>Re: Load CSV files with slightly different schemas</title>
      <link>https://community.databricks.com/t5/data-engineering/load-csv-files-with-slightly-different-schemas/m-p/6421#M2571</link>
      <description>&lt;P&gt;Hi @Morten Stakkeland​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please refer below blog that might help you-&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/ingestion/auto-loader/schema.html" alt="https://docs.databricks.com/ingestion/auto-loader/schema.html" target="_blank"&gt;Configure schema inference and evolution in Auto Loader | Databricks on AWS&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 05 Apr 2023 05:28:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/load-csv-files-with-slightly-different-schemas/m-p/6421#M2571</guid>
      <dc:creator>Ajay-Pandey</dc:creator>
      <dc:date>2023-04-05T05:28:10Z</dc:date>
    </item>
    <item>
      <title>Re: Load CSV files with slightly different schemas</title>
      <link>https://community.databricks.com/t5/data-engineering/load-csv-files-with-slightly-different-schemas/m-p/6422#M2572</link>
      <description>&lt;P&gt;Thanks for the input. I previously experimented with the Auto Loader, but was stopped by the fact that the headers in my csv files contains spaces and illegal characters. Hence the error&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;AnalysisException: 
Found invalid character(s) among " ,;{}()\n\t=" in the column names of your
schema. 
Please enable column mapping by setting table property 'delta.columnMapping.mode' to 'name'.
For more details, refer to &lt;A href="https://docs.microsoft.com/azure/databricks/delta/delta-column-mapping" target="test_blank"&gt;https://docs.microsoft.com/azure/databricks/delta/delta-column-mapping&lt;/A&gt;
Or you can use alias to rename it.&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Passing an option to set the column mapping to name did not resolve it&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;    .writeStream
    .format("delta")
    .option("checkpointLocation",checkpoint_location)
    .option("optimizeWrite","True")
    .option("schemaEvolutionMode","addNewColumns")
    .option("delta.columnMapping.mode", "name")
    .trigger(once=True)
    .toTable(table_name))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Creating the target table with mapping name prior to starting the stream did not help, as I got error messages related to schema mismatch. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I could not think of any way to rename all my 80+ columns "on the fly". &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 05 Apr 2023 07:39:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/load-csv-files-with-slightly-different-schemas/m-p/6422#M2572</guid>
      <dc:creator>MRTN</dc:creator>
      <dc:date>2023-04-05T07:39:29Z</dc:date>
    </item>
    <item>
      <title>Re: Load CSV files with slightly different schemas</title>
      <link>https://community.databricks.com/t5/data-engineering/load-csv-files-with-slightly-different-schemas/m-p/6423#M2573</link>
      <description>&lt;P&gt;For reference - for anybody struggling with the same issues. All online examples using auto loader are written as one block statement on the form: &lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "&amp;lt;path_to_checkpoint&amp;gt;")
  .load("&amp;lt;path_to_source_data&amp;gt;")
  .writeStream
  .option("checkpointLocation", "&amp;lt;path_to_checkpoint&amp;gt;")
  .start("&amp;lt;path_to_target")
)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The solution was to split this into three as follows &lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df=(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  # The schema location directory keeps track of your data schema over time
  .option("cloudFiles.schemaLocation", "&amp;lt;path_to_checkpoint&amp;gt;")
  .load("&amp;lt;path_to_source_data&amp;gt;"))
&amp;nbsp;
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace(" ", "_").replace("(","%28").replace(")","%29").replace("/","%2F"))
&amp;nbsp;
df.writeStream
  .option("checkpointLocation", "&amp;lt;path_to_checkpoint&amp;gt;")
  .start("&amp;lt;path_to_target")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 12 Apr 2023 08:08:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/load-csv-files-with-slightly-different-schemas/m-p/6423#M2573</guid>
      <dc:creator>MRTN</dc:creator>
      <dc:date>2023-04-12T08:08:17Z</dc:date>
    </item>
  </channel>
</rss>

