<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Read just the new file ??? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14627#M9099</link>
    <pubDate>Thu, 23 Sep 2021 18:03:31 GMT</pubDate>
    <dc:creator>Ryan_Chynoweth</dc:creator>
    <dc:date>2021-09-23T18:03:31Z</dc:date>
    <item>
      <title>Read just the new file ???</title>
      <link>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14624#M9096</link>
      <description>&lt;P&gt;Hi guys,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;How can I read just the new file in a batch process?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Can you help me, please?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Thu, 23 Sep 2021 15:21:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14624#M9096</guid>
      <dc:creator>William_Scardua</dc:creator>
      <dc:date>2021-09-23T15:21:30Z</dc:date>
    </item>
    <item>
      <title>Re: Read just the new file ???</title>
      <link>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14625#M9097</link>
      <description>&lt;P&gt;What type of file? Is the file stored in a storage account? &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Typically, you would read and write data with something like the following code: &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;# read a parquet file
df = spark.read.format("parquet").load("/path/to/file")
&amp;nbsp;
# write the data as a Delta table at a path
df.write.format("delta").save("/path/to/delta/table")
&amp;nbsp;
# write the data as a managed table
df.write.format("delta").saveAsTable("table_name")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please see this &lt;A href="https://docs.databricks.com/data/data.html" alt="https://docs.databricks.com/data/data.html" target="_blank"&gt;documentation&lt;/A&gt; for more information.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Sep 2021 15:41:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14625#M9097</guid>
      <dc:creator>Ryan_Chynoweth</dc:creator>
      <dc:date>2021-09-23T15:41:47Z</dc:date>
    </item>
    <item>
      <title>Re: Read just the new file ???</title>
      <link>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14626#M9098</link>
      <description>&lt;P&gt;Thank you for your feedback @Ryan Chynoweth&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For example, imagine this situation:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;time1- Some CSV files land in my HDFS landing directory (landing/file1.csv, landing/file2.csv)&lt;/P&gt;&lt;P&gt;time2- My PySpark batch job reads the HDFS landing directory and writes to the HDFS bronze directory (bronze/)&lt;/P&gt;&lt;P&gt;time3- New CSV files arrive in the HDFS landing directory (landing/file3.csv, landing/file4.csv)&lt;/P&gt;&lt;P&gt;time4- At this point the PySpark batch job needs to read only the new files (landing/file3.csv, landing/file4.csv) and append them to the HDFS bronze directory (bronze/)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;In a stream (writeStream) there is the 'checkpointLocation' option, but what about in a batch? Do I need to develop a Python control for this situation?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Does that make sense?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Thu, 23 Sep 2021 17:24:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14626#M9098</guid>
      <dc:creator>William_Scardua</dc:creator>
      <dc:date>2021-09-23T17:24:15Z</dc:date>
    </item>
    <item>
      <title>Re: Read just the new file ???</title>
      <link>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14627#M9099</link>
      <description>&lt;P&gt;My apologies, I misread your question originally.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For your use case I would use &lt;A href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-copy-into.html" alt="https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-copy-into.html" target="_blank"&gt;COPY INTO&lt;/A&gt;, which only loads files that have not been loaded yet. You could also use &lt;A href="https://docs.databricks.com/delta/delta-streaming.html" alt="https://docs.databricks.com/delta/delta-streaming.html" target="_blank"&gt;structured streaming&lt;/A&gt; or the Databricks &lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html" alt="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html" target="_blank"&gt;Auto Loader&lt;/A&gt;, but those would be a little more complex.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;With structured streaming you can use ".trigger(once=True)" to run the streaming API as a batch process. The checkpoint location on the write tracks which files have been processed.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;With Auto Loader you can use the directory listing mode to identify which files are new. You will still want to use the .trigger(once=True) argument here as well.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here are examples of how to use the COPY INTO command:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;-- copy into delta by providing a file location
&amp;nbsp;
COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM (
  SELECT _c0::bigint key, _c1::int index, _c2 textData
  FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
)
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'
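&amp;nbsp;
-- Sketch, not part of the original reply: COPY INTO is idempotent, so re-running
-- the statement above skips files that were already loaded into the target.
-- Assumption for illustration only: if the CSV files had a header row (the
-- examples in this post assume they do not, hence _c0, _c1, _c2), reader options
-- could be passed with FORMAT_OPTIONS as below.
-- COPY_OPTIONS ('force' = 'true') would reload every matching file regardless of
-- load history, so it is left out here on purpose.
&amp;nbsp;
COPY INTO delta.`abfss://container@storageAccount.dfs.core.windows.net/deltaTables/target`
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'
FORMAT_OPTIONS ('header' = 'true')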
&amp;nbsp;
-- copy into delta by providing a table name; the target must be an existing Delta table, so create it first
&amp;nbsp;
CREATE TABLE target_table (
  _c0 BIGINT,
  _c1 INT,
  _c2 STRING
)
USING DELTA
&amp;nbsp;
COPY INTO target_table
FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
FILEFORMAT = CSV
PATTERN = 'folder1/file_[a-g].csv'&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 23 Sep 2021 18:03:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14627#M9099</guid>
      <dc:creator>Ryan_Chynoweth</dc:creator>
      <dc:date>2021-09-23T18:03:31Z</dc:date>
    </item>
    <item>
      <title>Re: Read just the new file ???</title>
      <link>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14628#M9100</link>
      <description>&lt;P&gt;Wow, that's right @Ryan Chynoweth, I can use 'once=True' in the streaming API &lt;span class="lia-unicode-emoji" title=":grinning_face_with_smiling_eyes:"&gt;😄&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you very much, man&lt;/P&gt;</description>
      <pubDate>Thu, 23 Sep 2021 18:52:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14628#M9100</guid>
      <dc:creator>William_Scardua</dc:creator>
      <dc:date>2021-09-23T18:52:35Z</dc:date>
    </item>
    <item>
      <title>Re: Read just the new file ???</title>
      <link>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14629#M9101</link>
      <description>&lt;P&gt;Happy to help! &lt;/P&gt;</description>
      <pubDate>Mon, 27 Sep 2021 18:38:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-just-the-new-file/m-p/14629#M9101</guid>
      <dc:creator>Ryan_Chynoweth</dc:creator>
      <dc:date>2021-09-27T18:38:34Z</dc:date>
    </item>
  </channel>
</rss>

