Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Self Dependency TumblingWindowTrigger in adf

fjrodriguez
New Contributor III

Hey!

I would like to migrate an ADF batch ingestion that has a TumblingWindowTrigger on top of the pipeline. It checks every 15 minutes whether a file has landed; the files normally land on a daily basis, so it processes them once a day. The trigger's self-dependency guarantees that file N + 1 is only processed if the previous file was ingested correctly.

I see that Databricks Workflows offers three kinds of triggers: Scheduled, File arrival, and Continuous. Which one is the homologue of the TumblingWindowTrigger, and how do I set up a self-dependency to maintain the same approach?

1 ACCEPTED SOLUTION

Accepted Solutions

szymon_dybczak
Esteemed Contributor III

Hi @fjrodriguez ,

What about using Databricks Auto Loader and triggering the workflow every 15 minutes? Auto Loader automatically detects which new files have arrived since the last run of the job and loads only the new files into the target table. You can use the available-now trigger option, which consumes all available records as an incremental batch.


So, let's say you prepare a notebook that uses Auto Loader. You then schedule this notebook with Databricks Workflows, setting Max concurrent runs = 1. This ensures the job runs every 15 minutes, consumes all new files that appeared within that period, and, if processing takes longer than 15 minutes, waits for the previous run to finish.
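A minimal sketch of such a notebook, assuming hypothetical paths and a target table name (`/mnt/landing`, `/mnt/checkpoints/events`, and `main.bronze.events` are placeholders for your own):

```python
# Sketch of an Auto Loader notebook run with an available-now trigger.
# All paths and the table name below are placeholders -- substitute your own.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

(spark.readStream
    .format("cloudFiles")                          # Auto Loader source
    .option("cloudFiles.format", "json")           # adjust to your file format
    .load("/mnt/landing")
 .writeStream
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .trigger(availableNow=True)                    # incremental batch, then stop
    .toTable("main.bronze.events"))
```

With `availableNow=True` the stream consumes everything that arrived since the last checkpoint and then stops, so each scheduled run behaves like one tumbling-window batch.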


5 REPLIES


@szymon_dybczak - so let's assume the file ingestion fails tomorrow morning. What will happen to the next one? I want the next ingestion to be held back until the stuck one is fixed.

With ADF this is straightforward: you re-trigger the run that failed, and the queued files are automatically ingested once the failed ingestion is fixed.

szymon_dybczak
Esteemed Contributor III

That's the beauty of Auto Loader. It records successfully processed files in the checkpoint location. If processing fails for whatever reason, Auto Loader will try to re-ingest all the files that weren't successfully loaded in the previous run, plus any new files that have appeared.
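A plain-Python simulation of that checkpoint behaviour (not Auto Loader's actual implementation, just the idea): files recorded in the checkpoint are skipped, and anything else, whether it failed last run or newly arrived, is picked up on the next run.

```python
# Hypothetical file names (day1.json etc.) are for illustration only.
def run_batch(landed_files, checkpoint, process):
    """Process every landed file not yet in the checkpoint; record successes."""
    for f in sorted(landed_files - checkpoint):
        process(f)            # may raise -> file stays out of the checkpoint
        checkpoint.add(f)

checkpoint = set()

# First run: day1 succeeds, day2 fails mid-batch.
def flaky(f):
    if f == "day2.json":
        raise IOError("bad file")

try:
    run_batch({"day1.json", "day2.json"}, checkpoint, flaky)
except IOError:
    pass
# day1 is now checkpointed; day2 is not.

# After the fix, the next run re-ingests day2 plus the newly arrived day3.
processed = []
run_batch({"day1.json", "day2.json", "day3.json"}, checkpoint, processed.append)
```

This mirrors the exactly-once guarantee the reply describes: a failed file is retried on the next run, and already-checkpointed files are never reprocessed.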

fjrodriguez
New Contributor III

Hi @szymon_dybczak ,

Sounds reasonable, I will propose this approach. Thanks 🙂