Data Engineering

load files filtered by last_modified in PySpark

az38
New Contributor II

Hi, community!

What do you think is the best way to load into a DataFrame only the files from Azure ADLS (actually, the filesystem doesn't matter) that were modified after some point in time?

Is there any function like input_file_name() but for last_modified, so it could be used like this?

df = spark.read.json("abfss://container@storageaccount.dfs.core.windows.net/*/*/*/*/*.json").withColumn("filename", input_file_name()).where("filename == '******'")

2 REPLIES

pvignesh92
Honored Contributor

Hi @Aleksei Zhukov, I don't think there is a built-in function for capturing the timestamp of source files. However, if you want to perform incremental ingestion using Databricks, there are a few different approaches:

  1. One simple way would be to use Databricks Auto Loader.
  2. Another approach would be to maintain a control table that tracks the last load timestamp and compare it against the modification timestamps of your files to identify and load only the new ones. This would need to be done in Python, as Spark has no direct function for it (see the sketch after this list).
  3. Move the processed files to an archive path so that your input path contains only the new files you need to process.
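
For option 2, here is a minimal sketch using the Hadoop FileSystem API that ships with Spark (it goes through Spark's internal JVM gateway; the path and cutoff timestamp below are placeholders):

from datetime import datetime, timezone

# Placeholder cutoff and input path -- adjust to your storage account.
cutoff_ms = int(datetime(2023, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
base_path = "abfss://container@storageaccount.dfs.core.windows.net/data"

# Use the Hadoop FileSystem API bundled with Spark to list the directory.
jvm = spark.sparkContext._jvm
path = jvm.org.apache.hadoop.fs.Path(base_path)
fs = path.getFileSystem(spark._jsc.hadoopConfiguration())

# Keep only files modified after the cutoff; getModificationTime() returns
# epoch milliseconds. Note that listStatus() is not recursive.
new_files = [
    f.getPath().toString()
    for f in fs.listStatus(path)
    if f.isFile() and f.getModificationTime() > cutoff_ms
]

if new_files:
    df = spark.read.json(new_files)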

This is exactly what I have explored in my recent Medium blog. Please see if it helps.

--

Databricks Auto Loader is an interesting feature that can be used to load data incrementally.

✳ It can process new data files as they arrive in cloud object stores

✳ It can be used to ingest JSON, CSV, Parquet, Avro, ORC, text, and even binary file formats

✳ Auto Loader can scale to millions of files per hour. It maintains state information at a checkpoint location in a key-value store called RocksDB. Because the state is maintained in the checkpoint, it can resume from where it left off even after a failure and can guarantee exactly-once semantics.
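
As a rough sketch, a minimal Auto Loader pipeline could look like this (all paths and the table name below are placeholders):

# Minimal Auto Loader sketch; paths and the target table are placeholders.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation",
            "abfss://container@storageaccount.dfs.core.windows.net/_schemas/events")
    .load("abfss://container@storageaccount.dfs.core.windows.net/data")
)

(
    df.writeStream
    .option("checkpointLocation",
            "abfss://container@storageaccount.dfs.core.windows.net/_checkpoints/events")
    .trigger(availableNow=True)  # process all new files once, then stop
    .toTable("bronze.events")
)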

Please find my write-up on Databricks Auto Loader on Medium below. Happy for any feedback 🙂

🔅 Databricks Autoloader Series - Accelerating Incremental Data Ingestion: https://lnkd.in/ew3vaPmp

🔅 Databricks Auto Loader Series - The basics: https://lnkd.in/e2zanWfc

Thanks,

Vignesh

venkatcrc
New Contributor III

The _metadata column will provide the file modification timestamp. I tried it on DBFS but am not sure about ADLS.

https://docs.databricks.com/ingestion/file-metadata-column.html
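
For example, something along these lines (assuming a runtime where the file metadata column is available, DBR 10.5+; the cutoff timestamp is a placeholder):

from pyspark.sql import functions as F

# The hidden _metadata column must be selected explicitly to appear in the result.
df = (
    spark.read.json("abfss://container@storageaccount.dfs.core.windows.net/*/*/*/*/*.json")
    .select("*", "_metadata.file_modification_time")
    .where(F.col("file_modification_time") > F.lit("2023-01-01").cast("timestamp"))
)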
