Data Engineering
How can I get date when autoloader processes the file

sanjay
Valued Contributor II

Hi,

I am running Auto Loader continuously; it checks for new files every minute. I need to store the date each file was received/processed, but my code gives me the date the stream started.

Here is my code.

from datetime import date
from pyspark.sql.functions import lit

df = (spark
    .readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("cloudFiles.validateOptions", "true")
    .option("cloudFiles.region", "us-east-1")
    .option("cloudFiles.backfillInterval", "1 day")
    .option("cloudFiles.fetchParallelism", 100)
    .option("cloudFiles.useNotifications", "true")
    .schema(streamSchema)
    .load(raw_path)
    .withColumn("process_date", lit(date.today()))
)

(df
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", bronze_checkpoint_path)
    .option("path", bronze_path)
    .option("mergeSchema", True)
    .trigger(processingTime="1 minute")  # or set this to whatever makes sense for the data source
    .start()
)
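For context on the symptom: `lit(date.today())` is evaluated once on the driver when the DataFrame is defined, so every row gets the date the stream started. An expression such as `current_timestamp()` is instead resolved when each micro-batch runs. A minimal sketch (the helper name is hypothetical):

```python
def add_process_time(df):
    """Hypothetical helper: stamp each row with the processing time.

    current_timestamp() is evaluated per micro-batch, unlike
    lit(date.today()), which is frozen when the plan is first built.
    """
    # import deferred so the sketch loads even without Spark installed
    from pyspark.sql.functions import current_timestamp
    return df.withColumn("process_date", current_timestamp())
```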

Appreciate any help.

Regards,

Sanjay

1 ACCEPTED SOLUTION

Lakshay
Esteemed Contributor

Hi @Sanjay Jain​, currently we don't have a way to delete the files automatically. However, we are working on a feature called "CleanSource" which will do this; it is currently in private preview, and you can explore that option.

Alternatively, you can write a small job that uses the file metadata column information to delete the files periodically.


4 REPLIES

Lakshay
Esteemed Contributor

Hi @Sanjay Jain​, you can use the file metadata column functionality to collect that information.

Ref doc: https://docs.databricks.com/ingestion/file-metadata-column.html
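Following the linked doc, Auto Loader exposes a hidden `_metadata` struct with per-file fields such as `file_path` and `file_modification_time`; selecting them explicitly persists them with each row. A minimal sketch of wiring this into the reader (the function name and column aliases are illustrative, not from the thread):

```python
def read_with_file_metadata(spark, raw_path, stream_schema):
    """Sketch: attach Auto Loader file metadata to each ingested row."""
    # import deferred so the sketch loads even without Spark installed
    from pyspark.sql.functions import col
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(stream_schema)
        .load(raw_path)
        # _metadata is hidden by default; select it explicitly to keep it
        .select(
            "*",
            col("_metadata.file_path").alias("source_file"),
            col("_metadata.file_modification_time").alias("file_modification_time"),
        )
    )
```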

sanjay
Valued Contributor II

Thank you, Lakshay. That's helpful.

Another query related to Auto Loader:

  1. How can files be deleted automatically once they are processed successfully?

Regards,

Sanjay

Lakshay
Esteemed Contributor

Hi @Sanjay Jain​, currently we don't have a way to delete the files automatically. However, we are working on a feature called "CleanSource" which will do this; it is currently in private preview, and you can explore that option.

Alternatively, you can write a small job that uses the file metadata column information to delete the files periodically.
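The periodic-cleanup idea above can be sketched as a pure function over rows read back from the bronze table. The `(file_path, processed_at)` shape is an assumption for illustration; in practice these pairs would come from the metadata columns persisted at ingest:

```python
from datetime import datetime, timedelta

def files_to_delete(processed_files, now, grace_period=timedelta(days=1)):
    """Return source paths safe to delete: processed more than grace_period ago.

    processed_files: iterable of (file_path, processed_at) pairs, e.g. the
    distinct metadata-column values stored in the bronze table (assumed shape).
    """
    return [path for path, processed_at in processed_files
            if now - processed_at > grace_period]
```

A scheduled job could then pass this list to the storage API (or `dbutils.fs.rm` on Databricks) after confirming the rows actually landed in the bronze table.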

sanjay
Valued Contributor II

Thank you Lakshay.
