<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Autoloader to concatenate CSV files that updates regularly into a single parquet dataframe. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-to-concatenate-csv-files-that-updates-regularly-into/m-p/75589#M34996</link>
    <description>&lt;P&gt;Here is the code (forgot to add)&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;(spark.readStream.format(&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;cloudFiles&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.option(&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;cloudFiles.format&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;csv&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.option(&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;header&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;true&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;) &lt;/SPAN&gt;&lt;SPAN&gt;# Assuming the CSV files have headers&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.schema(schema) &lt;/SPAN&gt;&lt;SPAN&gt;# Specify the schema here&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.option(&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;cloudFiles.schemaLocation&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;, checkpoint_dir)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.load(source_files)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.writeStream&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.format(&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;parquet&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;) &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.option(&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;checkpointLocation&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;, 
checkpoint_dir)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.trigger(&lt;/SPAN&gt;&lt;SPAN&gt;availableNow&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;True&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.start(output_path)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
    <pubDate>Mon, 24 Jun 2024 13:05:26 GMT</pubDate>
    <dc:creator>Kjetil</dc:creator>
    <dc:date>2024-06-24T13:05:26Z</dc:date>
    <item>
      <title>Autoloader to concatenate CSV files that updates regularly into a single parquet dataframe.</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-to-concatenate-csv-files-that-updates-regularly-into/m-p/75588#M34995</link>
      <description>&lt;P&gt;I have multiple large CSV files, all with the same schema. One or more of them changes a few times a day. The changes are both appends (new rows) and updates to existing rows. I want to combine all the CSV files into a single dataframe and write it to Parquet, and I want to ensure that no rows are duplicated when a file is updated. Say I have three files: a.csv, b.csv and c.csv. Now c.csv is updated. The code below combines the three files into one dataframe, but I am not sure what happens when c.csv changes. Will the new contents of c.csv overwrite the rows that previously came from c.csv? That is what I want.&lt;/P&gt;</description>
      <pubDate>Mon, 24 Jun 2024 13:03:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-to-concatenate-csv-files-that-updates-regularly-into/m-p/75588#M34995</guid>
      <dc:creator>Kjetil</dc:creator>
      <dc:date>2024-06-24T13:03:43Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader to concatenate CSV files that updates regularly into a single parquet dataframe.</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-to-concatenate-csv-files-that-updates-regularly-into/m-p/75589#M34996</link>
      <description>&lt;P&gt;Here is the code (forgot to add)&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;(spark.readStream.format(&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;cloudFiles&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.option(&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;cloudFiles.format&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;csv&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.option(&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;header&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;true&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;) &lt;/SPAN&gt;&lt;SPAN&gt;# Assuming the CSV files have headers&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.schema(schema) &lt;/SPAN&gt;&lt;SPAN&gt;# Specify the schema here&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.option(&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;cloudFiles.schemaLocation&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;, checkpoint_dir)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.load(source_files)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.writeStream&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.format(&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;parquet&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;) &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.option(&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;checkpointLocation&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;, 
checkpoint_dir)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.trigger(&lt;/SPAN&gt;&lt;SPAN&gt;availableNow&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;True&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;.start(output_path)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Mon, 24 Jun 2024 13:05:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-to-concatenate-csv-files-that-updates-regularly-into/m-p/75589#M34996</guid>
      <dc:creator>Kjetil</dc:creator>
      <dc:date>2024-06-24T13:05:26Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader to concatenate CSV files that updates regularly into a single parquet dataframe.</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-to-concatenate-csv-files-that-updates-regularly-into/m-p/75595#M34999</link>
      <description>&lt;P&gt;Auto Loader expects new files, not updates or overwrites of existing files.&lt;/P&gt;&lt;P&gt;So basically, Auto Loader looks for new filenames in the directory and processes all new files.&lt;BR /&gt;Depending on settings and file size, those files are processed one by one or all at once.&lt;/P&gt;&lt;P&gt;If you want to make sure you have no duplicates, you will have to write a function that runs on each batch and pass it to foreachBatch.&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/structured-streaming/foreach.html" target="_blank"&gt;https://docs.databricks.com/en/structured-streaming/foreach.html&lt;/A&gt;&lt;/P&gt;</description>
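The per-batch deduplication suggested in this reply could look roughly like the sketch below. It is only an illustration under several assumptions that are not in the thread: the rows carry a unique key (the hypothetical column `id`), the sink is an already existing Delta table at the hypothetical path `TARGET_PATH` rather than plain Parquet (a Parquet sink can only append, so it cannot replace old rows), and the `delta` Python package is available, as it is on Databricks. Picking up a modified c.csv at all would, as far as I know, additionally require Auto Loader's `cloudFiles.allowOverwrites` option.

```python
# Hypothetical names for illustration; none of them come from the thread.
KEY_COLUMNS = ["id"]                 # assumed column(s) that uniquely identify a row
TARGET_PATH = "/tmp/combined_delta"  # assumed location of an existing Delta table


def merge_condition(key_columns):
    """Build the join condition that matches a source row (s) to a target row (t)."""
    return " AND ".join(f"t.{c} = s.{c}" for c in key_columns)


def upsert_batch(batch_df, batch_id, key_columns=KEY_COLUMNS, target_path=TARGET_PATH):
    """Upsert one micro-batch so a re-delivered row replaces its older version."""
    # Imported lazily so merge_condition can be used without Spark installed.
    from delta.tables import DeltaTable

    # Keep a single (arbitrary) row per key inside the batch, because MERGE
    # fails when several source rows match the same target row.
    latest = batch_df.dropDuplicates(key_columns)

    target = DeltaTable.forPath(batch_df.sparkSession, target_path)
    (target.alias("t")
        .merge(latest.alias("s"), merge_condition(key_columns))
        .whenMatchedUpdateAll()      # existing key: overwrite the old row
        .whenNotMatchedInsertAll()   # new key: append the row
        .execute())
```

With this in place, the write side of the original stream would become `.writeStream.foreachBatch(upsert_batch).option("checkpointLocation", checkpoint_dir).trigger(availableNow=True).start()` instead of `.format("parquet") ... .start(output_path)`. Rows that were deleted from a CSV file are not removed from the target by this sketch.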
      <pubDate>Mon, 24 Jun 2024 14:08:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-to-concatenate-csv-files-that-updates-regularly-into/m-p/75595#M34999</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2024-06-24T14:08:10Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader to concatenate CSV files that updates regularly into a single parquet dataframe.</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-to-concatenate-csv-files-that-updates-regularly-into/m-p/75754#M35046</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/105685"&gt;@Kjetil&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Please let us know if you still have an issue, or whether&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/14792"&gt;@-werners-&lt;/a&gt;'s&amp;nbsp;response can be marked as the best solution. Thank you.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Jun 2024 23:26:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-to-concatenate-csv-files-that-updates-regularly-into/m-p/75754#M35046</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2024-06-25T23:26:06Z</dc:date>
    </item>
  </channel>
</rss>

