Hi,
I am running Auto Loader continuously; it checks for new files every minute. I need to store the time each file was received/processed, but the column I add always contains the date when the Auto Loader stream started.
Here is my code:
from datetime import date
from pyspark.sql.functions import lit

df = (spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("cloudFiles.validateOptions", "true")
    .option("cloudFiles.region", "us-east-1")
    .option("cloudFiles.backfillInterval", "1 day")
    .option("cloudFiles.fetchParallelism", 100)
    .option("cloudFiles.useNotifications", "true")
    .schema(streamSchema)
    .load(raw_path)
    .withColumn("process_date", lit(date.today()))
)
(df
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", bronze_checkpoint_path)
    .option("path", bronze_path)
    .option("mergeSchema", "true")
    .trigger(processingTime="1 minute")  # or set this to whatever makes sense for the data source
    .start()
)
Appreciate any help.
Regards,
Sanjay