
How can I get the date when Auto Loader processes a file?

sanjay
Valued Contributor II

Hi,

I am running Auto Loader continuously, and it checks for new files every minute. I need to store when each file was received/processed, but the column I add gives me the date when Auto Loader started.

Here is my code.

df = (spark
   .readStream
   .format("cloudFiles")
   .option("cloudFiles.format", "json")
   .option("cloudFiles.includeExistingFiles", "true")
   .option("cloudFiles.validateOptions", "true")
   .option("cloudFiles.region", "us-east-1")
   .option("cloudFiles.backfillInterval", "1 day")
   .option("cloudFiles.fetchParallelism", 100)
   .option("cloudFiles.useNotifications", "true")
   .schema(streamSchema)
   .load(raw_path)
   .withColumn('process_date',lit(date.today()))
 )

(df
 .writeStream
 .format("delta")
 .outputMode("append")
 .option("checkpointLocation", bronze_checkpoint_path)
 .option("path", bronze_path)
 .option("mergeSchema", True)
 .trigger(processingTime="1 minute") # or set this to whatever makes sense to the data source
 .start()
)

Appreciate any help.

Regards,

Sanjay


4 REPLIES

Lakshay
Databricks Employee

Hi @Sanjay Jain, you can use the file metadata column (_metadata) to collect that information.

Reference doc: https://docs.databricks.com/ingestion/file-metadata-column.html
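
For reference, here is a minimal sketch of that approach in PySpark. It assumes the stream from the question and a Databricks runtime where the _metadata column is available. The helper with_ingest_columns and the aliases source_file, file_modified_at, and processed_at are hypothetical names chosen for illustration, not part of any API. The lit(date.today()) in the original code is evaluated once on the driver when the query is defined, which is why every row carries the start date. current_timestamp() is instead evaluated for each micro-batch, and _metadata.file_modification_time reports when the file was last modified in storage.

```python
# Sketch only: the aliases on the right-hand side are hypothetical names.
INGEST_EXPRS = [
    "*",                                                     # keep all data columns
    "_metadata.file_path AS source_file",                    # which file the row came from
    "_metadata.file_modification_time AS file_modified_at",  # when the file landed/changed
    "current_timestamp() AS processed_at",                   # when the micro-batch ran
]


def with_ingest_columns(df):
    """Append file-level and processing-time columns to a (streaming) DataFrame.

    Uses SQL expression strings so nothing is evaluated on the driver
    at query-definition time, unlike lit(date.today()).
    """
    return df.selectExpr(*INGEST_EXPRS)


# Assumed usage with the stream from the question (not executed here):
#   df = (spark.readStream.format("cloudFiles")
#         .option("cloudFiles.format", "json")
#         .schema(streamSchema)
#         .load(raw_path))
#   df = with_ingest_columns(df)
```

If only a date is needed, the last expression could be written as "to_date(current_timestamp()) AS process_date" instead.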

sanjay
Valued Contributor II

Thank you Lakshay. It's helpful.

I have another question about Auto Loader:

  1. How can I delete files automatically once they have been processed successfully?

Regards,

Sanjay

Lakshay
Databricks Employee

Hi @Sanjay Jain, there is currently no way to delete the files automatically. However, we are working on a feature called "CleanSource" that will do this. It is currently in private preview, so you can explore that option.

Alternatively, you can write a small job that uses the file metadata column information to delete the files periodically.
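
One possible shape for such a cleanup job, as a rough sketch rather than a recommended implementation. It assumes the bronze table stores the source path and a processing timestamp for each row (for example from _metadata.file_path and current_timestamp()), and that these have been collected into (path, processed_at) pairs. The function names, the 7-day retention, and the delete_fn callback are all hypothetical; on Databricks, delete_fn might be something like lambda p: dbutils.fs.rm(p).

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def files_past_retention(
    processed: Iterable[Tuple[str, datetime]],
    now: datetime,
    retention: timedelta,
) -> List[str]:
    """Return paths whose most recent processing time is at least `retention` old."""
    latest: Dict[str, datetime] = {}
    for path, processed_at in processed:
        # A file yields many rows; keep only its newest timestamp.
        if path not in latest or processed_at > latest[path]:
            latest[path] = processed_at
    return sorted(p for p, ts in latest.items() if now - ts >= retention)


def delete_processed_files(
    processed: Iterable[Tuple[str, datetime]],
    delete_fn: Callable[[str], object],
    retention: timedelta = timedelta(days=7),  # hypothetical default
    now: Optional[datetime] = None,
) -> List[str]:
    """Call `delete_fn` for every file past retention and return the deleted paths.

    Timestamps must all be timezone-aware (or all naive); mixing them raises TypeError.
    """
    now = now or datetime.now(timezone.utc)
    deleted = []
    for path in files_past_retention(processed, now, retention):
        delete_fn(path)
        deleted.append(path)
    return deleted
```

Keeping a retention window rather than deleting immediately leaves room to reprocess files if a downstream problem is found later.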

sanjay
Valued Contributor II

Thank you Lakshay.
