<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Autoloader and "cleanSource" in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19151#M12804</link>
    <description>&lt;P&gt;Yes, but since cleanSource is part of the native Spark implementation for file streaming, I would think the docs should specify either way? &lt;/P&gt;</description>
    <pubDate>Wed, 01 Jun 2022 13:03:26 GMT</pubDate>
    <dc:creator>laurencewells</dc:creator>
    <dc:date>2022-06-01T13:03:26Z</dc:date>
    <item>
      <title>Autoloader and "cleanSource"</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19149#M12802</link>
      <description>&lt;P&gt;Hi All, &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We are trying to use the Spark 3 structured streaming option ".option('cleanSource','archive')" to archive processed files. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This works as expected with the standard Spark implementation, but it does not appear to work with Autoloader. I cannot find any documentation that says whether this is supported or not, so I can't tell whether it is a bug or expected. We have tried various tweaks, to no avail.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is this a bug or expected?&lt;/P&gt;&lt;P&gt;Is there an alternative approach using Autoloader? &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks Larry&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from datetime import datetime
from pyspark.sql.functions import lit

# schema, path and archivePath are assumed to be defined earlier
df = (
  spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cleanSource", "archive")  # accepted, but files are not archived
  .option("sourceArchiveDir", archivePath)
  .option("header", "true")
  .schema(schema)
  .load(path)
  .withColumn("loadDate", lit(datetime.utcnow()))
)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 31 May 2022 14:45:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19149#M12802</guid>
      <dc:creator>laurencewells</dc:creator>
      <dc:date>2022-05-31T14:45:38Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader and "cleanSource"</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19150#M12803</link>
      <description>&lt;P&gt;&lt;A href="https://docs.databricks.com/ingestion/auto-loader/options.html#common-auto-loader-options" alt="https://docs.databricks.com/ingestion/auto-loader/options.html#common-auto-loader-options" target="_blank"&gt;https://docs.databricks.com/ingestion/auto-loader/options.html#common-auto-loader-options&lt;/A&gt;&lt;/P&gt;&lt;P&gt;cleanSource is not a listed option so it won't do anything.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Maybe &lt;A href="https://docs.databricks.com/ingestion/auto-loader/production.html#event-retention" alt="https://docs.databricks.com/ingestion/auto-loader/production.html#event-retention" target="_blank"&gt;event retention&lt;/A&gt; is something you can use?&lt;/P&gt;</description>
      <pubDate>Wed, 01 Jun 2022 12:59:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19150#M12803</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-06-01T12:59:48Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader and "cleanSource"</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19151#M12804</link>
      <description>&lt;P&gt;Yes, but since cleanSource is part of the native Spark implementation for file streaming, I would think the docs should specify either way? &lt;/P&gt;</description>
      <pubDate>Wed, 01 Jun 2022 13:03:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19151#M12804</guid>
      <dc:creator>laurencewells</dc:creator>
      <dc:date>2022-06-01T13:03:26Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader and "cleanSource"</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19152#M12805</link>
      <description>&lt;P&gt;Autoloader is only available on Databricks, not in the OSS version of Spark, so it is entirely possible that the option is not supported there.&lt;/P&gt;&lt;P&gt;Maybe a Databricks dev can step in and clear this up?&lt;/P&gt;</description>
      <pubDate>Wed, 01 Jun 2022 13:12:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19152#M12805</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-06-01T13:12:47Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader and "cleanSource"</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19153#M12806</link>
      <description>&lt;P&gt;Apologies, I meant that the cleanSource option is part of native Spark 3.0, so if it doesn't work in Autoloader I would expect the docs to say that it's not supported, or an error to be raised if it's included in the code. Currently Autoloader accepts the option and does nothing. &lt;/P&gt;</description>
      <pubDate>Mon, 06 Jun 2022 14:23:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19153#M12806</guid>
      <dc:creator>laurencewells</dc:creator>
      <dc:date>2022-06-06T14:23:58Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader and "cleanSource"</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19154#M12807</link>
      <description>&lt;P&gt;It seems the docs state what is supported, not what is not supported.&lt;/P&gt;&lt;P&gt;But I agree that could be a discussion point.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Btw, the standard Spark read function doesn't return an error either when you pass in an invalid option; the option is simply ignored.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2022 11:11:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-and-quot-cleansource-quot/m-p/19154#M12807</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-06-07T11:11:14Z</dc:date>
    </item>
  </channel>
</rss>

