How can I get the date when Auto Loader processes a file?

sanjay
Valued Contributor II

Hi,

I am running Auto Loader continuously, and it checks for new files every minute. I need to store when each file was received/processed, but process_date is giving me the date when Auto Loader started.

Here is my code.

df = (spark
   .readStream
   .format("cloudFiles")
   .option("cloudFiles.format", "json")
   .option("cloudFiles.includeExistingFiles", "true")
   .option("cloudFiles.validateOptions", "true")
   .option("cloudFiles.region", "us-east-1")
   .option("cloudFiles.backfillInterval", "1 day")
   .option("cloudFiles.fetchParallelism", 100)
   .option("cloudFiles.useNotifications", "true")
   .schema(streamSchema)
   .load(raw_path)
   .withColumn('process_date', lit(date.today()))
 )

(df
 .writeStream
 .format("delta")
 .outputMode("append")
 .option("checkpointLocation", bronze_checkpoint_path)
 .option("path", bronze_path)
 .option("mergeSchema", True)
 .trigger(processingTime="1 minute") # or set this to whatever makes sense to the data source
 .start()
)

Appreciate any help.

Regards,

Sanjay

4 REPLIES

Lakshay
Databricks Employee

Hi @Sanjay Jain, you can use the file metadata column (_metadata) to collect that information.

Reference: https://docs.databricks.com/ingestion/file-metadata-column.html
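
For anyone landing here later, here is a minimal sketch of what this could look like. In the question, lit(date.today()) is evaluated once in Python when the stream is defined, so every row gets the start date. The sketch below asks Spark for the values instead: _metadata.file_modification_time (when the file was last modified in storage) and current_timestamp() (which Structured Streaming evaluates per micro-batch rather than once at startup). The helper name with_ingest_metadata and the output column names source_file, file_modified_at, and processed_at are my own choices for illustration.

```python
# SQL expressions resolved by Spark at run time, not by Python at definition time.
# Output column names (source_file, file_modified_at, processed_at) are illustrative.
INGEST_COLUMNS = [
    "*",                                                     # keep all data columns
    "_metadata.file_path AS source_file",                    # full path of the input file
    "_metadata.file_modification_time AS file_modified_at",  # last-modified time in storage
    "current_timestamp() AS processed_at",                   # timestamp of the micro-batch
]


def with_ingest_metadata(df):
    """Return df with file-level and processing-time columns appended.

    `df` is expected to be the DataFrame returned by the Auto Loader
    `.load(raw_path)` call; only `DataFrame.selectExpr` is used.
    """
    return df.selectExpr(*INGEST_COLUMNS)
```

In the pipeline from the question, this would wrap the result of .load(raw_path) in place of the process_date column. If only the date is needed, processed_at can be cast to a date afterwards.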

sanjay
Valued Contributor II

Thank you, Lakshay. It's helpful.

I have another question related to Auto Loader:

  1. How can I delete files automatically once they have been processed successfully?

Regards,

Sanjay

Lakshay
Databricks Employee (accepted solution)

Hi @Sanjay Jain, currently there is no way to delete the files automatically. However, we are working on a feature called "CleanSource" that will do this. It is currently in private preview, so you can explore that option.

Alternatively, you can write a small job that uses the file metadata column information to delete the files periodically.
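
A rough sketch of that second option, assuming the bronze table carries the source path and processing time for each row (for example, hypothetical source_file and processed_at columns derived from _metadata.file_path and current_timestamp()). The function names, the 7-day retention, and the delete_fn parameter are all illustrative. On Databricks, delete_fn could be something like dbutils.fs.rm, but the sketch itself is plain Python so the selection logic can be checked without a cluster.

```python
from datetime import datetime, timedelta


def select_expired(processed, now, retention=timedelta(days=7)):
    """Return sorted source paths last processed before `now - retention`.

    `processed` is an iterable of (source_file, processed_at) pairs, e.g.
    collected from the bronze table. `now` and every `processed_at` must be
    comparable (both naive or both timezone-aware datetimes).
    """
    cutoff = now - retention
    latest = {}
    for path, processed_at in processed:
        # Track the most recent processing time seen for each file.
        if path not in latest or processed_at > latest[path]:
            latest[path] = processed_at
    return sorted(path for path, ts in latest.items() if ts < cutoff)


def delete_expired(processed, delete_fn, now, retention=timedelta(days=7)):
    """Call `delete_fn(path)` for every expired file and return those paths."""
    expired = select_expired(processed, now, retention)
    for path in expired:
        delete_fn(path)
    return expired
```

Driving the cleanup from rows that already exist in the bronze table means a file is only considered for deletion after its data has been written, and the retention window leaves some time to catch problems before the source file is gone.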

sanjay
Valued Contributor II

Thank you, Lakshay.
