Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Table listener

maikel
Contributor II

Hello Community,

I would like to ask whether it's possible to define a job that checks for updates in a table at a specified frequency.

Here is my use case:

  • Data is uploaded to a table located in Catalog A, Schema B, Table C (a.b.c).

  • I need to transfer this data to another table, x.y.z (within the same workspace).

  • Before saving the final result, several transformation steps must be applied.

I assume that creating a scheduled job would be one approach. However, is it possible to configure the job to check, for example, every 10 minutes whether new data has appeared in a.b.c, and only run if it has? Or would you recommend a different solution?

 

1 ACCEPTED SOLUTION


szymon_dybczak
Esteemed Contributor III

Hi @maikel ,

I think a table update trigger will be a perfect solution for your scenario. Check the docs below:

https://docs.databricks.com/aws/en/jobs/trigger-table-update
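To make this concrete, here is a minimal sketch of what the job settings for a table update trigger could look like when creating the job through the Jobs API. Treat the exact field names as assumptions to verify against the docs linked above; the notebook path and job name are hypothetical placeholders.

```python
# Sketch of a Jobs API request body (assumed field names -- verify against the
# Databricks Jobs API docs) that runs a job when a.b.c receives new data,
# throttled to at most one run per 10 minutes.
job_settings = {
    "name": "copy-a-b-c-to-x-y-z",  # hypothetical job name
    "trigger": {
        "table_update": {
            # Fully qualified Unity Catalog table(s) to watch
            "table_names": ["a.b.c"],
            # Fire when any watched table is updated
            "condition": "ANY_UPDATED",
            # Throttle: at most one triggered run every 600 s (10 minutes)
            "min_time_between_triggers_seconds": 600,
        }
    },
    "tasks": [
        {
            "task_key": "transform_and_load",
            # Hypothetical notebook holding the transformation steps
            "notebook_task": {"notebook_path": "/Workspace/etl/transform_abc"},
        }
    ],
}
```

The same structure can be expressed in a Databricks asset bundle or in the job UI under "Triggers"; the key point is that the trigger watches a.b.c, so the job only runs when new data has actually arrived.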


4 REPLIES


maikel
Contributor II

Thank you @szymon_dybczak! This sounds very good! I have already tested it and it does exactly what I wanted to achieve!
As a second option I found pipelines and DLT. What do you think about that? Or is it too much for my use case?

maikel
Contributor II

Also, I have a question about failure handling for triggered jobs. Let's say new data arrives in the source table and the job fails for some reason. If I rerun it, or the next batch of data arrives in the source table, will the data from the failed run still be considered? Or do we just lose it?

szymon_dybczak
Esteemed Contributor III

Hi @maikel ,

It's just a trigger mechanism, so it depends entirely on how you implement the pipeline responsible for consuming the data.
For instance, if you use a Structured Streaming based approach (e.g. Auto Loader), then even if your pipeline fails, you can be sure that when you re-run it you won't miss any data.
If you want to implement an incremental approach yourself, then you need to find a proper key for the given table and an attribute that will help you discover only the records that appeared or changed since the last load.
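The hand-rolled incremental approach above is usually called the high-watermark pattern. Here is a toy, plain-Python sketch of the idea (the column name `updated_at` and the `Row` type are illustrative assumptions; on Databricks you would express the same filter in Spark SQL against a.b.c):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Row:
    key: int
    updated_at: int  # e.g. epoch seconds, or any monotonically increasing value

def incremental_load(source: List[Row], last_watermark: int) -> Tuple[List[Row], int]:
    """Return only rows newer than the stored watermark, plus the new watermark."""
    new_rows = [r for r in source if r.updated_at > last_watermark]
    # Advance the watermark only as far as the data we actually picked up
    new_watermark = max((r.updated_at for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

# Only persist the new watermark after the batch commits successfully:
# a failed run never advances it, so the next run re-reads the same rows,
# which is why this pattern doesn't lose data on failure.
source = [Row(1, 100), Row(2, 150), Row(3, 200)]
batch, wm = incremental_load(source, last_watermark=120)
# batch contains keys 2 and 3; wm == 200
```

The failure-handling guarantee lives in the order of operations: write the target table first, then store the watermark, never the other way around.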