Autoloader and "cleanSource"

laurencewells
New Contributor III

Hi All,

We are trying to use the Spark 3 Structured Streaming option ".option('cleanSource','archive')" to archive processed files.

This works as expected using the standard Spark file-source implementation, but it does not appear to work with Auto Loader. I cannot find any documentation specifying whether this is supported or not, or whether it is a bug or expected behaviour. We have tried various tweaks to no avail.

Is this a bug or expected behaviour?

Is there an alternate approach using Auto Loader?

Thanks Larry

from datetime import datetime
from pyspark.sql.functions import lit

df = (
  spark.readStream
  .format("cloudFiles")                            # Databricks Auto Loader source
  .option("cloudFiles.format", "csv")
  .option("cleanSource", "archive")                # file-source archiving options being tried here
  .option("sourceArchiveDir", archivePath)
  .option("header", "true")
  .schema(schema)
  .load(path)
  .withColumn("loadDate", lit(datetime.utcnow()))  # stamp each row with the load time
)

5 REPLIES

-werners-
Esteemed Contributor III

https://docs.databricks.com/ingestion/auto-loader/options.html#common-auto-loader-options

cleanSource is not a listed option so it won't do anything.

Maybe event retention is something you can use?
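
For example, a minimal sketch assuming the event-retention option meant here is cloudFiles.maxFileAge (it bounds how long Auto Loader keeps tracking already-discovered files in its checkpoint; it does not move or archive the source files):

# Sketch only: cloudFiles.maxFileAge expires tracking state for old files,
# it does not archive them the way cleanSource does on the standard file source.
df = (
  spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.maxFileAge", "14 days")  # retention window for file events
  .schema(schema)
  .load(path)
)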

Yes, but since cleanSource is part of the native Spark implementation for file streaming, I think the docs should specify either way?

-werners-
Esteemed Contributor III

Auto Loader is only available on Databricks, not in the OSS version of Spark, so it is entirely possible that it does not support this option.

Maybe a Databricks dev can step in and clear this up?

Apologies, I meant that the cleanSource option is part of native Spark 3.0, so if it doesn't work in Auto Loader I would expect the docs to say it isn't supported, or an error if it is included in the code. Currently it accepts the option and does nothing.

-werners-
Esteemed Contributor III

It seems the docs state what is supported, not what is not supported.

But I agree that could be a discussion point.

Btw, the standard Spark read function doesn't return errors either when you pass in an invalid option; it is simply ignored.
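
For instance, a quick sketch of that behaviour on a plain batch read (the option name below is made up purely for illustration):

# The unrecognised option is accepted without error and simply has no effect.
df = (
  spark.read
  .format("csv")
  .option("header", "true")
  .option("notARealOption", "true")  # hypothetical option, silently ignored by Spark
  .load(path)
)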
