Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Autoloader and "cleanSource"

laurencewells
New Contributor III

Hi All,

We are trying to use the Spark 3 structured streaming feature/option ".option('cleanSource','archive')" to archive processed files.

This works as expected with the standard Spark implementation, but it does not appear to work with Auto Loader. I cannot find any documentation specifying whether this is supported, or whether it is a bug or expected behavior. We have tried various tweaks to no avail.

Is this a bug or expected behavior?

Is there an alternate approach using Auto Loader?

Thanks Larry

from datetime import datetime
from pyspark.sql.functions import lit

df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cleanSource", "archive")
    .option("sourceArchiveDir", archivePath)
    .option("header", "true")
    .schema(schema)
    .load(path)
    .withColumn("loadDate", lit(datetime.utcnow()))
)

5 REPLIES

-werners-
Esteemed Contributor III

https://docs.databricks.com/ingestion/auto-loader/options.html#common-auto-loader-options

cleanSource is not a listed option, so it won't do anything.

Maybe event retention is something you can use?
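If "event retention" means limiting how long Auto Loader tracks discovered files, a minimal sketch using the documented cloudFiles.maxFileAge option might look like the following. Note this is an assumption about the suggested approach: maxFileAge only bounds how long file-discovery state is kept in the checkpoint; it does not move or delete the source files the way cleanSource does.

```python
# Sketch (config fragment): cap how long Auto Loader retains
# file-discovery state; source files themselves are left in place.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.maxFileAge", "30 days")  # stop tracking files older than this
    .schema(schema)
    .load(path)
)
```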

Yes, but since cleanSource is part of the native Spark implementation for file streaming, I think the docs should specify it either way?

-werners-
Esteemed Contributor III

Auto Loader is only available on Databricks, not in the OSS version of Spark, so that is entirely possible.

Maybe a Databricks dev can step in and clear this up?

Apologies, I meant that the cleanSource option is part of native Spark 3.0, so if it doesn't work with Auto Loader I would expect the docs to say it's not supported, or for the code to raise an error when it's included. Currently it accepts the option and does nothing.

-werners-
Esteemed Contributor III

It seems the docs state what is supported, not what is not supported.

But I agree that could be a discussion point.

Btw, the standard Spark read function doesn't raise an error either when you pass in an invalid option; it is simply ignored.
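As for an alternate approach: one pattern (a sketch, not an official Databricks recommendation) is to archive files yourself from a foreachBatch sink, collecting the distinct source paths of each micro-batch and moving them after the batch is written. The archive_files helper below is a hypothetical name; the Spark wiring in the comments assumes the path, schema, and archivePath variables from the original post.

```python
import os
import shutil

def archive_files(file_paths, archive_dir):
    """Move processed source files into archive_dir, keeping their file names.

    Returns the list of destination paths. Hypothetical helper meant to be
    called from a foreachBatch function after the batch has been written.
    """
    os.makedirs(archive_dir, exist_ok=True)
    archived = []
    for src in file_paths:
        dst = os.path.join(archive_dir, os.path.basename(src))
        shutil.move(src, dst)
        archived.append(dst)
    return archived

# Sketch of the Auto Loader wiring (requires a Spark/Databricks session):
#
# from pyspark.sql.functions import input_file_name
#
# def process_batch(batch_df, batch_id):
#     batch_df.write.mode("append").saveAsTable("target_table")
#     files = [r.src for r in
#              batch_df.select(input_file_name().alias("src")).distinct().collect()]
#     archive_files(files, archivePath)
#
# (spark.readStream
#     .format("cloudFiles")
#     .option("cloudFiles.format", "csv")
#     .schema(schema)
#     .load(path)
#     .writeStream
#     .option("checkpointLocation", checkpointPath)
#     .foreachBatch(process_batch)
#     .start())
```

One caveat with this pattern: on cloud storage the paths returned by input_file_name are URIs, so the move would need dbutils.fs.mv or the storage SDK rather than shutil; the helper above illustrates the logic with local paths.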
