Data Engineering

Autoloader and "cleanSource"

laurencewells
New Contributor III

Hi All,

We are trying to use the Spark 3 Structured Streaming option .option("cleanSource", "archive") to archive processed files.

This works as expected with the standard Spark implementation, but it does not appear to work with Autoloader. I cannot find any documentation specifying whether this is supported, whether it is a bug, or whether it is expected behaviour. We have tried various tweaks to no avail.

Is this a bug or expected behaviour?

Is there an alternate approach using Autoloader?

Thanks Larry

from datetime import datetime
from pyspark.sql.functions import lit

df = (
    spark.readStream
    .format("cloudFiles")                      # Auto Loader source
    .option("cloudFiles.format", "csv")
    .option("cleanSource", "archive")          # the option in question
    .option("sourceArchiveDir", archivePath)
    .option("header", "true")
    .schema(schema)
    .load(path)
    .withColumn("loadDate", lit(datetime.utcnow()))
)
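
For reference, a minimal sketch of the standard (non-Auto Loader) file source variant that works for us, using the same path, archivePath, and schema placeholders; cleanSource and sourceArchiveDir are documented options of the Spark 3 file stream source:

df_standard = (
    spark.readStream
    .format("csv")                             # plain Spark file source, not Auto Loader
    .option("header", "true")
    .option("cleanSource", "archive")          # archive each file after it is processed
    .option("sourceArchiveDir", archivePath)   # where archived files are moved
    .schema(schema)
    .load(path)
)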

5 REPLIES

-werners-
Esteemed Contributor III

https://docs.databricks.com/ingestion/auto-loader/options.html#common-auto-loader-options

cleanSource is not a listed option, so it won't do anything.

Maybe event retention is something you can use?
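
For context, a minimal sketch of what the event-retention route could look like, assuming cloudFiles.maxFileAge (the Auto Loader retention option from the docs linked above) and the same path/schema placeholders. Note this only expires Auto Loader's file-discovery state; it does not move or delete the source files:

df_retention = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.maxFileAge", "30 days")  # stop tracking files older than 30 days
    .option("header", "true")
    .schema(schema)
    .load(path)
)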

laurencewells
New Contributor III

Yes, but since cleanSource is part of the native Spark implementation for file streaming, I think the docs should specify either way whether it is supported?

-werners-
Esteemed Contributor III

Autoloader is only available on Databricks, not in the OSS version of Spark, so it is entirely possible that the option is not supported.

Maybe a Databricks dev can step in and clear this up?

laurencewells
New Contributor III

Apologies, I meant that the cleanSource option is part of native Spark 3.0, so if it doesn't work in Autoloader I would expect the docs to say it is not supported, or an error to be raised when it is included in the code. Currently it accepts the option and does nothing.

-werners-
Esteemed Contributor III

It seems the docs state what is supported, not what is not supported.

But I agree that could be a discussion point.

Btw, the standard Spark read function doesn't return errors either when you pass in an invalid option; it is simply ignored.
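
To illustrate, a minimal sketch; "notARealOption" is a made-up name here, and the batch reader accepts it without complaint:

df_batch = (
    spark.read
    .format("csv")
    .option("notARealOption", "true")  # unknown to the CSV source, silently ignored
    .option("header", "true")
    .load(path)                        # no error is raised
)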
