<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: ADF logs into Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/79883#M35870</link>
    <description>&lt;P&gt;How fancy do you want to go? You can send ADF diagnostic settings to an event hub and stream them into a delta table in Databricks. Or you can send them to a storage account and build a workflow on a 5-minute interval that loads the storage blobs into a delta table. The new &lt;A href="https://www.databricks.com/blog/introducing-open-variant-data-type-delta-lake-and-apache-spark" target="_self"&gt;&lt;EM&gt;Variant&lt;/EM&gt;&lt;/A&gt; datatype might be your friend here.&lt;/P&gt;</description>
    <pubDate>Mon, 22 Jul 2024 14:39:59 GMT</pubDate>
    <dc:creator>jacovangelder</dc:creator>
    <dc:date>2024-07-22T14:39:59Z</dc:date>
    <item>
      <title>ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/79700#M35829</link>
      <description>&lt;P&gt;Hello, I would like to know the best way to insert Data Factory activity logs into my Databricks delta table, so that I can build dashboards and create monitoring in Databricks itself. Can you help me? I would like all activity logs in the data factory to be inserted into the Databricks delta table every 5 minutes; that is, if 10 pipelines complete, the logs of those 10 are inserted into the delta table. Please note: no logs can be missing. I want a solution that is considered good practice, economical, and efficient. Can you help me with this?&lt;/P&gt;</description>
      <pubDate>Sun, 21 Jul 2024 22:22:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/79700#M35829</guid>
      <dc:creator>8b1tz</dc:creator>
      <dc:date>2024-07-21T22:22:36Z</dc:date>
    </item>
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/79868#M35867</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/112751"&gt;@8b1tz&lt;/a&gt;&amp;nbsp;&lt;BR /&gt;You can use the ADF REST API to read the logs.&lt;BR /&gt;For example:&amp;nbsp;&lt;A href="https://medium.com/creative-data/custom-logging-in-azure-data-factory-and-azure-synapse-analytics-f084643a5489" target="_blank"&gt;https://medium.com/creative-data/custom-logging-in-azure-data-factory-and-azure-synapse-analytics-f084643a5489&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jul 2024 12:46:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/79868#M35867</guid>
      <dc:creator>daniel_sahal</dc:creator>
      <dc:date>2024-07-22T12:46:48Z</dc:date>
    </item>
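    <!--
      A minimal sketch (not from the thread) of the REST API approach in the
      post above, in Python, assuming a service principal with read access to
      the factory; the tenant, subscription, resource group, and factory
      names are placeholders.

      import requests
      from datetime import datetime, timedelta, timezone

      TENANT, CLIENT_ID, CLIENT_SECRET = "<tenant-id>", "<sp-client-id>", "<sp-secret>"
      SUB, RG, FACTORY = "<subscription-id>", "<resource-group>", "<factory-name>"

      # Acquire an AAD token for the Azure management plane.
      token = requests.post(
          f"https://login.microsoftonline.com/{TENANT}/oauth2/v2.0/token",
          data={
              "grant_type": "client_credentials",
              "client_id": CLIENT_ID,
              "client_secret": CLIENT_SECRET,
              "scope": "https://management.azure.com/.default",
          },
      ).json()["access_token"]

      # Query pipeline runs updated in the last 5 minutes.
      now = datetime.now(timezone.utc)
      runs = requests.post(
          f"https://management.azure.com/subscriptions/{SUB}/resourceGroups/{RG}"
          f"/providers/Microsoft.DataFactory/factories/{FACTORY}"
          f"/queryPipelineRuns?api-version=2018-06-01",
          headers={"Authorization": f"Bearer {token}"},
          json={
              "lastUpdatedAfter": (now - timedelta(minutes=5)).isoformat(),
              "lastUpdatedBefore": now.isoformat(),
          },
      ).json()["value"]

      # Append the runs to a Delta table (spark is the notebook's SparkSession).
      spark.createDataFrame(runs).write.format("delta").mode("append").saveAsTable("adf_pipeline_runs")
    -->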
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/79871#M35868</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/112751"&gt;@8b1tz&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;You can also configure ADF diagnostic settings. You can send them to a storage location, Log Analytics, or Event Hubs.&lt;BR /&gt;If you send them to a storage location, you can then create, for example, an external location and directly query those logs in Databricks.&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/data-factory/monitor-configure-diagnostics" target="_blank"&gt;Configure diagnostic settings and a workspace - Azure Data Factory | Microsoft Learn&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jul 2024 12:58:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/79871#M35868</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-07-22T12:58:03Z</dc:date>
    </item>
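    <!--
      A minimal sketch of querying diagnostic logs landed in a storage account,
      as described above, assuming an external location is already configured;
      the container name follows the insights-logs convention for ADF activity
      runs, and the storage account name is a placeholder.

      # Read the JSON-lines diagnostic logs straight from storage; the files
      # sit in deeply nested resourceId/date folders, hence recursive lookup.
      logs = (
          spark.read.option("recursiveFileLookup", "true")
              .json("abfss://insights-logs-activityruns@<storageaccount>.dfs.core.windows.net/")
      )
      logs.createOrReplaceTempView("adf_activity_logs")

      # E.g. list failed activities; exact columns depend on the log category.
      spark.sql(
          "SELECT time, operationName, status FROM adf_activity_logs WHERE status = 'Failed'"
      ).show()
    -->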
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/79883#M35870</link>
      <description>&lt;P&gt;How fancy do you want to go? You can send ADF diagnostic settings to an event hub and stream them into a delta table in Databricks. Or you can send them to a storage account and build a workflow on a 5-minute interval that loads the storage blobs into a delta table. The new &lt;A href="https://www.databricks.com/blog/introducing-open-variant-data-type-delta-lake-and-apache-spark" target="_self"&gt;&lt;EM&gt;Variant&lt;/EM&gt;&lt;/A&gt; datatype might be your friend here.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jul 2024 14:39:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/79883#M35870</guid>
      <dc:creator>jacovangelder</dc:creator>
      <dc:date>2024-07-22T14:39:59Z</dc:date>
    </item>
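    <!--
      A minimal sketch of the Variant idea from the post above, assuming
      Databricks Runtime 15.3+ where parse_json is available; the raw_adf_logs
      source table with a `body` string column is hypothetical.

      from pyspark.sql.functions import parse_json, col

      # Store the raw diagnostic JSON as VARIANT so the log schema can drift
      # without breaking the table.
      raw = spark.table("raw_adf_logs")
      (
          raw.select(parse_json(col("body")).alias("log"))
             .write.format("delta").mode("append").saveAsTable("adf_logs_variant")
      )

      # Variant fields can then be queried with the path syntax:
      spark.sql(
          "SELECT log:pipelineName::string, log:status::string FROM adf_logs_variant"
      ).show()
    -->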
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/79900#M35873</link>
      <description>&lt;P&gt;I'm thinking about sending the logs to the Event Hub and leaving a job running continuously in Databricks that consumes the events and inserts them. What do you think?&amp;nbsp;Will it be too expensive?&amp;nbsp;Even if it is expensive, I believe it is at least the most scalable and robust solution.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jul 2024 15:44:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/79900#M35873</guid>
      <dc:creator>8b1tz</dc:creator>
      <dc:date>2024-07-22T15:44:05Z</dc:date>
    </item>
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80068#M35912</link>
      <description>&lt;P&gt;It depends on what you find costly. I would ask yourself whether you really need a 5-minute interval. If so, there won't be much difference in price between leaving a cheap cluster running and streaming, and having a (serverless) workflow run every 5 minutes.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 06:40:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80068#M35912</guid>
      <dc:creator>jacovangelder</dc:creator>
      <dc:date>2024-07-23T06:40:29Z</dc:date>
    </item>
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80136#M35942</link>
      <description>&lt;P&gt;So, yesterday I spent almost the whole day trying to get Databricks to consume the Event Hub, and it ended up not working. Should I try another way? Can you suggest something simpler to implement?&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 13:07:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80136#M35942</guid>
      <dc:creator>8b1tz</dc:creator>
      <dc:date>2024-07-23T13:07:32Z</dc:date>
    </item>
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80137#M35943</link>
      <description>&lt;P&gt;Simpler would be to just send the logs to a storage location and consume them from there, maybe with Auto Loader.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 13:08:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80137#M35943</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-07-23T13:08:58Z</dc:date>
    </item>
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80141#M35944</link>
      <description>&lt;P&gt;Send to storage and consume one by one? Would this be scalable? How would I fetch only the missing ones? Should I delete the ones that have already been processed? Would this be more costly? What do you think?&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 13:15:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80141#M35944</guid>
      <dc:creator>8b1tz</dc:creator>
      <dc:date>2024-07-23T13:15:15Z</dc:date>
    </item>
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80142#M35945</link>
      <description>&lt;P&gt;Not one by one; you need to configure a diagnostic setting to dump the logs into a storage account. Then you configure Databricks Auto Loader to point at this log location and it will handle loading those files for you. Under the hood Auto Loader uses Spark Structured Streaming, so with each run it will only load newly added log files.&lt;BR /&gt;&lt;BR /&gt;Read the documentation entry below and, as a best practice, use file notification mode:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/" target="_blank"&gt;What is Auto Loader? - Azure Databricks | Microsoft Learn&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 13:18:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80142#M35945</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-07-23T13:18:43Z</dc:date>
    </item>
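    <!--
      A minimal sketch of the Auto Loader flow described above; the paths,
      container, and table names are placeholders.

      # Incrementally load new diagnostic log files. The checkpoint records
      # which files were already processed, so each run only picks up new ones.
      chk = "/Volumes/main/default/checkpoints/adf_logs"  # hypothetical path

      stream = (
          spark.readStream.format("cloudFiles")
              .option("cloudFiles.format", "json")
              .option("cloudFiles.schemaLocation", chk + "/schema")
              .load("abfss://insights-logs-activityruns@<storageaccount>.dfs.core.windows.net/")
      )

      (
          stream.writeStream
              .option("checkpointLocation", chk)
              .toTable("adf_activity_logs")
      )
    -->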
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80144#M35946</link>
      <description>&lt;P&gt;Oh, can you give me a video that shows this better? I've never used Databricks Auto Loader (I'm new to this area). Does it need any new configuration on the cluster? Can you do it with a job?&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 13:25:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80144#M35946</guid>
      <dc:creator>8b1tz</dc:creator>
      <dc:date>2024-07-23T13:25:36Z</dc:date>
    </item>
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80147#M35947</link>
      <description>&lt;P&gt;Sure, here are a couple worth watching. And yes, you can use it with a job. The only configuration required is setting up a storage queue and Event Grid if you want to use file notification mode. Databricks can do it for you automatically if you give a service principal sufficient permissions. Watch the videos below and you will get the idea.&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.youtube.com/watch?v=8a38Fv9cpd8" target="_blank"&gt;Accelerating Data Ingestion with Databricks Autoloader (youtube.com)&lt;/A&gt;&lt;BR /&gt;&lt;A href="https://www.youtube.com/watch?v=TIju0uNKtkE" target="_blank"&gt;Autoloader in databricks (youtube.com)&lt;/A&gt;&lt;BR /&gt;&lt;A href="https://www.youtube.com/watch?v=Zm_cv0QMu1s" target="_blank"&gt;DP-203: 36 - Automating the process with Azure Databricks Autoloader (youtube.com)&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 13:42:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80147#M35947</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-07-23T13:42:22Z</dc:date>
    </item>
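    <!--
      A minimal sketch of the file notification setup mentioned above, assuming
      a service principal that is allowed to create the Event Grid subscription
      and storage queue; all IDs and paths are placeholders.

      # With useNotifications, Auto Loader provisions the Event Grid
      # subscription and storage queue itself instead of listing the directory.
      df = (
          spark.readStream.format("cloudFiles")
              .option("cloudFiles.format", "json")
              .option("cloudFiles.useNotifications", "true")
              .option("cloudFiles.clientId", "<sp-client-id>")
              .option("cloudFiles.clientSecret", "<sp-secret>")
              .option("cloudFiles.tenantId", "<tenant-id>")
              .option("cloudFiles.subscriptionId", "<subscription-id>")
              .option("cloudFiles.resourceGroup", "<resource-group>")
              .option("cloudFiles.schemaLocation", "<checkpoint-path>/schema")
              .load("abfss://insights-logs-activityruns@<storageaccount>.dfs.core.windows.net/")
      )
    -->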
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80150#M35949</link>
      <description>&lt;P&gt;Therefore, I insert the pipeline diagnostics into the Storage --&amp;gt; A Databricks notebook is automatically triggered --&amp;gt; I can process the data in this notebook using filters and then insert it into the Delta Table.&lt;/P&gt;&lt;P&gt;Is it something like this?&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 13:49:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80150#M35949</guid>
      <dc:creator>8b1tz</dc:creator>
      <dc:date>2024-07-23T13:49:26Z</dc:date>
    </item>
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80209#M35962</link>
      <description>&lt;P&gt;I did it! Thank you very much!&lt;/P&gt;&lt;P&gt;Just one question: do I need to create a task that runs continuously, or can I schedule it?&lt;/P&gt;&lt;P&gt;I didn't understand the Event Grid part. Could you send me a screenshot of it?&lt;/P&gt;&lt;P&gt;I want it to combine all the new files and insert them into the Delta table: for example, every 10 minutes it should insert the new ones (if there are any), without the risk of duplication. Or do I need to run it continuously? I'm concerned about the cost.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 20:14:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80209#M35962</guid>
      <dc:creator>8b1tz</dc:creator>
      <dc:date>2024-07-23T20:14:01Z</dc:date>
    </item>
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80214#M35963</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/112751"&gt;@8b1tz&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;Glad that it worked for you. You don't have to run it continuously; you can run it as a batch job with Trigger.AvailableNow (see the cost considerations section at the link below):&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/ingestion/auto-loader/production.html#cost-considerations" target="_blank" rel="noopener"&gt;Configure Auto Loader for production workloads | Databricks on AWS&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;As for the Event Grid part, read about file notification mode in Auto Loader (or watch the video below). In short, this mode is recommended for efficiently ingesting large amounts of data.&lt;BR /&gt;In file notification mode, Auto Loader automatically (you can set it up manually if you prefer) sets up a notification service (Event Grid) and a queue service (Storage Queue) that subscribes to file events from the input directory.&lt;BR /&gt;It works like this: a new file arrives on your storage, then Event Grid sends information about the new file to the storage queue. Auto Loader then checks the storage queue for new files to process. Once Auto Loader has successfully processed the data, it empties the queue and saves that information in the checkpoint.&lt;BR /&gt;&lt;BR /&gt;Auto Loader will combine all new data into the target table, so on each run it will load only new data.&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://www.youtube.com/watch?v=fasr08wJhJE&amp;amp;t=203s" target="_blank" rel="noopener"&gt;Az Databricks # 28:- Autoloader in Databricks || File Notification mode (youtube.com)&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 20:41:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80214#M35963</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-07-23T20:41:15Z</dc:date>
    </item>
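    <!--
      A minimal sketch of the Trigger.AvailableNow pattern from the post above:
      scheduled as a job (e.g. every 10 minutes), each run drains whatever
      arrived since the last checkpointed run and then stops, so no always-on
      cluster is needed. Paths and table names are placeholders.

      (
          spark.readStream.format("cloudFiles")
              .option("cloudFiles.format", "json")
              .option("cloudFiles.schemaLocation", "<checkpoint-path>/schema")
              .load("<log-path>")
              .writeStream
              .option("checkpointLocation", "<checkpoint-path>")
              .trigger(availableNow=True)  # process the backlog, then shut down
              .toTable("adf_activity_logs")
              .awaitTermination()
      )
    -->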
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80217#M35965</link>
      <description>&lt;P&gt;Geez, thank you very much! So I'm going to do it like this: the job checks whether the specific blob has been updated; if so, I trigger the notebook that picks up the events from storage and uses the checkpoint to avoid reprocessing them. What do you think?&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 21:02:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80217#M35965</guid>
      <dc:creator>8b1tz</dc:creator>
      <dc:date>2024-07-23T21:02:56Z</dc:date>
    </item>
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80220#M35966</link>
      <description>&lt;P&gt;I have a question: should I use only the job trigger and a notebook without Auto Loader, use only Auto Loader, or use the job trigger together with Auto Loader?&lt;/P&gt;</description>
      <pubDate>Tue, 23 Jul 2024 22:25:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80220#M35966</guid>
      <dc:creator>8b1tz</dc:creator>
      <dc:date>2024-07-23T22:25:30Z</dc:date>
    </item>
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80271#M35976</link>
      <description>&lt;P&gt;I agree that consuming an Event Hub is not as straightforward, but it is doable by setting up a Kafka stream in Spark. To be honest, I find Auto Loader a bit cumbersome, especially for this use case, but hey, if it works it works.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jul 2024 07:46:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80271#M35976</guid>
      <dc:creator>jacovangelder</dc:creator>
      <dc:date>2024-07-24T07:46:25Z</dc:date>
    </item>
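    <!--
      A minimal sketch of the Kafka-protocol approach mentioned above, assuming
      an Event Hubs namespace on a tier that exposes the Kafka endpoint on port
      9093; the namespace, hub name, connection string, and paths are
      placeholders.

      # Event Hubs speaks Kafka with SASL PLAIN: the username is the literal
      # string "$ConnectionString" and the password is the connection string.
      conn = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<name>;SharedAccessKey=<key>"

      df = (
          spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
              .option("subscribe", "<eventhub-name>")
              .option("kafka.security.protocol", "SASL_SSL")
              .option("kafka.sasl.mechanism", "PLAIN")
              .option(
                  "kafka.sasl.jaas.config",
                  'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
                  f'required username="$ConnectionString" password="{conn}";',
              )
              .load()
              .selectExpr("CAST(value AS STRING) AS body", "timestamp")
      )

      (
          df.writeStream
              .option("checkpointLocation", "<checkpoint-path>")
              .toTable("adf_logs_raw")
      )
    -->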
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80335#M35992</link>
      <description>&lt;P&gt;Yes, in the end I just want to put the logs into the delta table and that's it; I don't want to store anything else. I know that sending the logs to storage might not be the best option, but I tried the Event Hub a lot and couldn't get it working :(. In the end I think I'm going to use the job's file arrival trigger and a notebook. I just don't know how I'm going to guarantee that there's no duplication. Would it be OK to delete the files I've already consumed? I don't know; I'm afraid of deleting a log that arrived just after reading...&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jul 2024 11:38:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80335#M35992</guid>
      <dc:creator>8b1tz</dc:creator>
      <dc:date>2024-07-24T11:38:56Z</dc:date>
    </item>
    <item>
      <title>Re: ADF logs into Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80336#M35993</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/112751"&gt;@8b1tz&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;Once again, if you use Auto Loader it guarantees exactly-once semantics, so there shouldn't be any duplicates.&lt;BR /&gt;The same applies if you were to use Event Hub; it's just a different data source, but the same concept of structured streaming applies (Auto Loader is built upon Structured Streaming).&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Below is a snippet from the documentation:&lt;/P&gt;&lt;P&gt;As files are discovered, their metadata is persisted in a scalable key-value store (RocksDB) in the&amp;nbsp;&lt;EM&gt;checkpoint location&lt;/EM&gt;&amp;nbsp;of your Auto Loader pipeline. &lt;STRONG&gt;This key-value store ensures that data is processed exactly once.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;In case of failures, Auto Loader can resume from where it left off using information stored in the checkpoint location, and continues to provide exactly-once guarantees when writing data into Delta Lake. &lt;STRONG&gt;You don’t need to maintain or manage any state yourself to achieve fault tolerance or exactly-once semantics&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;I highly recommend getting to know how Auto Loader works (or, more generally, how Structured Streaming works). Read the documentation and watch some videos on YouTube.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Jul 2024 11:50:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adf-logs-into-databricks/m-p/80336#M35993</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-07-24T11:50:42Z</dc:date>
    </item>
  </channel>
</rss>

