
ADF logs into Databricks

8b1tz
Contributor

Hello, I would like to know the best way to insert Data Factory activity logs into my Databricks Delta table, so that I can build dashboards and set up monitoring in Databricks itself. Can you help me? I would like all activity logs in Data Factory to be inserted into the Databricks Delta table every 5 minutes; that is, if 10 pipelines complete, the logs of those 10 are inserted into the Delta table. Please note: no logs can be missing. I want a solution that is considered good practice, economical, and efficient. Can you help me with this?

24 REPLIES

Geez, thank you very much! So I'm going to do it like this: the job checks if the specific blob has been updated; if so, I trigger the notebook that picks up the events from the storage and uses the checkpoint to avoid picking them up again. What do you think?

I have a question: should I use only the job trigger and a notebook without Auto Loader, use only Auto Loader, or use the job trigger together with Auto Loader?

jacovangelder
Honored Contributor

I agree that consuming an event hub is not as straightforward, but it is doable by setting up a Kafka stream in Spark. To be honest, I find Auto Loader a bit cumbersome, especially for this use case, but hey, if it works, it works.
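For reference, here is a minimal sketch of consuming an Event Hub through its Kafka-compatible endpoint with Structured Streaming. The namespace, event hub name, secret scope/key, and column handling below are placeholder assumptions, not values from this thread.

```python
# Sketch: read an Event Hub via its Kafka endpoint with Spark Structured Streaming.
# Namespace, event hub name, and secret scope/key are hypothetical.
eh_namespace = "my-eh-namespace"                            # hypothetical Event Hubs namespace
eh_name = "adf-activity-logs"                               # hypothetical event hub (acts as the Kafka "topic")
conn_str = dbutils.secrets.get("my-scope", "eh-conn-str")   # hypothetical secret holding the connection string

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", f"{eh_namespace}.servicebus.windows.net:9093")
       .option("subscribe", eh_name)
       .option("kafka.security.protocol", "SASL_SSL")
       .option("kafka.sasl.mechanism", "PLAIN")
       .option("kafka.sasl.jaas.config",
               'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
               f'username="$ConnectionString" password="{conn_str}";')
       .load())

# The Kafka "value" column carries the diagnostic-log JSON emitted by Data Factory.
logs = raw.selectExpr("CAST(value AS STRING) AS body", "timestamp")
```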

Yes, in the end I just want to put the logs in the Delta table and that's it; I don't want to store anything else. I know that sending the logs to storage might not be the best option, but I tried the Event Hub approach a lot and couldn't get it to work :(. In the end I think I'm going to use the file arrival trigger on the job and use a notebook. I just don't know how I'm going to guarantee that there's no duplication. Would it be OK to delete the files I've already consumed? I don't know; I'm afraid of deleting a log that arrived right after the read...

Hi @8b1tz ,

Once again, if you use Auto Loader, it guarantees exactly-once semantics, so there shouldn't be any duplicates.
The same applies if you were to use Event Hub; it's just a different data source, but the same concept of Structured Streaming applies (Auto Loader is built on top of Structured Streaming).

Below is a snippet from the documentation:

As files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the checkpoint location of your Auto Loader pipeline. This key-value store ensures that data is processed exactly once.

In case of failures, Auto Loader can resume from where it left off by information stored in the checkpoint location and continue to provide exactly-once guarantees when writing data into Delta Lake. You don’t need to maintain or manage any state yourself to achieve fault tolerance or exactly-once semantics.

 

I highly recommend that you get to know how Auto Loader works (or, more generally, how Structured Streaming works). Read the documentation, watch some videos on YouTube.
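To make the exactly-once point concrete, here is a minimal Auto Loader sketch. The storage paths, trigger interval, and target table name are assumptions; the checkpoint location is where the RocksDB file state mentioned above is kept.

```python
# Minimal Auto Loader sketch (continuously running stream); paths and table name are assumptions.
input_path = "abfss://logs@<storage-account>.dfs.core.windows.net/adf-diagnostics/"
checkpoint_path = "abfss://logs@<storage-account>.dfs.core.windows.net/_checkpoints/adf_logs/"

df = (spark.readStream
      .format("cloudFiles")                                # Auto Loader source
      .option("cloudFiles.format", "json")                 # ADF diagnostic logs land as JSON
      .option("cloudFiles.schemaLocation", checkpoint_path)
      .load(input_path))

(df.writeStream
   .option("checkpointLocation", checkpoint_path)          # file state tracked here -> exactly-once writes
   .trigger(processingTime="10 minutes")                   # micro-batch every 10 minutes; cluster stays up
   .toTable("monitoring.adf_activity_logs"))               # hypothetical target Delta table
```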

 

My fear is just that the cost will be absurdly high. Can that happen? In the end, it does not need to be real time; it could be inserted every 10 minutes, for example.

Every ten minutes could be expensive, but it all depends on many factors like cluster size, the amount of data, etc. Couldn't you run it every hour, for example?

But isn't Auto Loader a streaming tool? Doesn't it need to run continuously? I can't understand how it can run this way; could you explain it to me better?

Auto Loader uses Spark Structured Streaming, but you can use it in a "batch" mode. In one of my earlier responses I mentioned that you can run it as a batch job with Trigger.AvailableNow. Once again, here is the link to the documentation:

Configure Auto Loader for production workloads | Databricks on AWS

How it works:

- you set up a diagnostic setting to send the logs to a storage directory (for the sake of example, let's call it "input_data")

- you configure Auto Loader and, in its configuration, point it to that path ("input_data")

- you configure the job to run once an hour

- your job starts and Auto Loader loads all the files that are in "input_data" into the target table; when the job ends, the job cluster is terminated

- in the meantime, more logs are written to the storage (to the "input_data" directory)

- an hour passes, so once again your job starts; this time Auto Loader loads only the new files that arrived since the last run (a minimal sketch of this hourly job is below)
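Here is a minimal sketch of that hourly batch variant. The storage account, container, and target table name are assumptions; Trigger.AvailableNow processes whatever is new in "input_data" and then stops, so the job cluster can shut down.

```python
# Batch-style Auto Loader run, meant to be scheduled (e.g. hourly) as a Databricks job.
# Storage account, container, and table name are assumptions.
input_path = "abfss://diagnostics@<storage-account>.dfs.core.windows.net/input_data/"
checkpoint_path = "abfss://diagnostics@<storage-account>.dfs.core.windows.net/_checkpoints/input_data/"

query = (spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", checkpoint_path)
         .load(input_path)
         .writeStream
         .option("checkpointLocation", checkpoint_path)    # same checkpoint -> only new files are loaded
         .trigger(availableNow=True)                       # process everything new, then stop
         .toTable("monitoring.adf_activity_logs"))         # hypothetical target Delta table

query.awaitTermination()   # once this returns, the job ends and its cluster can terminate
```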

 

Just to let you know: it worked! Thank you very much, really. 
