Autoloader and "cleanSource"

laurencewells — Tue, 31 May 2022 14:45:38 GMT

Hi All,

We are trying to use the Spark 3 Structured Streaming option ".option('cleanSource','archive')" to archive processed files.

This works as expected with the standard Spark file source, but it does not appear to work with Auto Loader. I cannot find any documentation that says whether it is supported, so I can't tell whether this is a bug or expected behaviour. We have tried various tweaks to no avail.

Is this a bug or expected?

Is there an alternative approach using Auto Loader?

Thanks, Larry

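from datetime import datetime

from pyspark.sql.functions import lit

# `schema`, `path`, and `archivePath` are defined elsewhere in the notebook.
df = (
  spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cleanSource", "archive")  # accepted by Auto Loader but has no effect
  .option("sourceArchiveDir", archivePath)
  .option("header", "true")
  .schema(schema)
  .load(path)
  # lit(datetime.utcnow()) is evaluated once, when the stream is defined
  .withColumn("loadDate", lit(datetime.utcnow()))
)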
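
For comparison, here is a minimal sketch of the setup the post describes as working: the same options passed to the standard Spark file source ("csv") rather than "cloudFiles". The helper names `file_source_options` and `build_archiving_stream` are hypothetical and only for illustration; `spark`, `schema`, `path`, and the archive directory are assumed to come from the caller, as in the snippet above.

```python
def file_source_options(archive_dir):
    """Options for the open-source file stream source (hypothetical helper)."""
    return {
        "cleanSource": "archive",         # the file source also accepts "delete" or "off"
        "sourceArchiveDir": archive_dir,  # needed when cleanSource is "archive"
        "header": "true",
    }


def build_archiving_stream(spark, schema, path, archive_dir):
    """Build a CSV stream with the plain file source instead of cloudFiles."""
    return (
        spark.readStream
        .format("csv")
        .options(**file_source_options(archive_dir))
        .schema(schema)
        .load(path)
    )
```

This only swaps the source format, so it gives up Auto Loader's file discovery and schema features; whether that trade-off is acceptable depends on the workload.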

Re: Autoloader and "cleanSource"

-werners- — Wed, 01 Jun 2022 12:59:48 GMT

https://docs.databricks.com/ingestion/auto-loader/options.html#common-auto-loader-options

cleanSource is not a listed option so it won't do anything.

Maybe event retention is something you can use?

Re: Autoloader and "cleanSource"

laurencewells — Wed, 01 Jun 2022 13:03:26 GMT

Yes, but since cleanSource is part of the native Spark implementation for file streaming, I think the docs should specify either way whether it is supported?

Re: Autoloader and "cleanSource"

-werners- — Wed, 01 Jun 2022 13:12:47 GMT

Autoloader is only available on Databricks, not in the OSS version of Spark, so it is entirely possible that it does not support every option of the OSS file source.

Maybe a Databricks dev can step in and clear this up?

Re: Autoloader and "cleanSource"

laurencewells — Mon, 06 Jun 2022 14:23:58 GMT

Apologies, I meant that the cleanSource option is part of native Spark 3.0, so if it doesn't work in Autoloader I would expect the docs to say that it's not supported, or an error to be raised if it's included in the code. Currently Autoloader accepts it and does nothing.

Re: Autoloader and "cleanSource"

-werners- — Tue, 07 Jun 2022 11:11:14 GMT

It seems the docs state what is supported, not what is not supported.

But I agree that could be a discussion point.

Btw, the standard Spark read function doesn't return an error either when you pass in an invalid option; the option is simply ignored.
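
Because unrecognised options are silently ignored, one way to catch a case like cleanSource early is to check the option names yourself before handing them to the reader. The following is only a sketch: `SUPPORTED_OPTIONS` and `check_options` are hypothetical names, and the allow-list is a small illustrative subset that would need to be filled in from the Auto Loader options page linked earlier in the thread.

```python
# Illustrative subset only; fill this in from the Auto Loader options page.
SUPPORTED_OPTIONS = {
    "cloudFiles.format",
    "cloudFiles.schemaLocation",
    "cloudFiles.maxFilesPerTrigger",
    "header",
}


def check_options(options, supported=SUPPORTED_OPTIONS):
    """Raise ValueError for option names that are not in the allow-list.

    Names are compared case-insensitively, since Spark reader options
    are case-insensitive.
    """
    supported_lower = {name.lower() for name in supported}
    unknown = sorted(k for k in options if k.lower() not in supported_lower)
    if unknown:
        raise ValueError("Unsupported reader options: " + ", ".join(unknown))
    return dict(options)
```

The checked dictionary can then be passed on with `.options(**check_options(opts))`, so a typo or an unsupported option fails loudly instead of doing nothing.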