07-21-2024 03:22 PM
Hello, I would like to know the best way to insert Data Factory activity logs into my Databricks Delta table, so that I can build dashboards and monitoring in Databricks itself. Can you help me? I would like all activity logs in the Data Factory to be inserted into the Databricks Delta table every 5 minutes; that is, if 10 pipelines complete, the logs of those 10 are inserted into the Delta table. Please note: no logs can be missing. I want a solution that is considered good practice, economical, and efficient. Can you help me with this?
07-24-2024 05:15 AM
Auto Loader uses Spark Structured Streaming, but you can use it in a "batch" mode. In one of my earlier responses I mentioned that you can run it as a batch job with Trigger.AvailableNow. And once again, a link to the documentation:
Configure Auto Loader for production workloads | Databricks on AWS
How it works (see the sketch after this list):
- you set up a diagnostic setting to write logs into a storage directory (for the sake of the example, let's call it -> "input_data")
- you configure Auto Loader and point it to that path -> "input_data"
- you configure a job to run once an hour
- your job starts and Auto Loader loads all files that are in "input_data" into the target table; when the job ends, the job cluster is terminated
- in the meantime, more logs are written to storage (to the "input_data" directory)
- an hour passes, so your job starts again; this time Auto Loader loads only the new files that arrived since the last run
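For reference, a minimal PySpark sketch of that flow, assuming JSON diagnostic logs; the storage paths and table name are placeholders, and `spark` is the SparkSession available in a Databricks notebook:

```python
# Minimal Auto Loader sketch for the flow above. Paths, container and table
# names are placeholders -- adjust them to your storage account and catalog.
input_path = "abfss://logs@<storage_account>.dfs.core.windows.net/input_data/"
checkpoint_path = "abfss://logs@<storage_account>.dfs.core.windows.net/_checkpoints/adf_logs/"

raw_logs = (
    spark.readStream
        .format("cloudFiles")                                  # Auto Loader source
        .option("cloudFiles.format", "json")                   # ADF diagnostic logs arrive as JSON
        .option("cloudFiles.schemaLocation", checkpoint_path)  # where Auto Loader tracks the inferred schema
        .load(input_path)
)

(
    raw_logs.writeStream
        .option("checkpointLocation", checkpoint_path)  # remembers which files were already ingested
        .trigger(availableNow=True)                     # process everything new, then stop ("batch" mode)
        .toTable("monitoring.adf_activity_logs")        # target Delta table (placeholder name)
)
```

Scheduling that notebook as a job once an hour (or every 5-10 minutes) gives exactly the behaviour described: the checkpoint guarantees each file is loaded once and only once.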
07-22-2024 05:46 AM
@8b1tz
You can use the ADF REST API to read the logs.
Ex: https://medium.com/creative-data/custom-logging-in-azure-data-factory-and-azure-synapse-analytics-f0...
07-22-2024 05:58 AM
Hi @8b1tz ,
You can also configure ADF diagnostic settings. You can send them to a storage location, Log Analytics, or Event Hubs.
If you send them to a storage location, you can create, for example, an external location and query those logs directly in Databricks (see the sketch after the link).
Configure diagnostic settings and a workspace - Azure Data Factory | Microsoft Learn
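As a quick illustration (an assumption about the layout, not something confirmed above): once the diagnostic setting writes JSON logs to a container you can reach from Databricks, an ad-hoc batch read is enough to inspect them before setting up anything incremental. The container and account names below are placeholders.

```python
# Ad-hoc look at ADF diagnostic logs landed in a storage account.
# The container/account names are placeholders for wherever your
# diagnostic setting writes.
logs_path = "abfss://<logs-container>@<storage_account>.dfs.core.windows.net/"

df = spark.read.json(logs_path)  # diagnostic logs are JSON lines
df.printSchema()                 # inspect which fields ADF actually emits
display(df.limit(10))
```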
07-22-2024 07:39 AM
How fancy do you want to go? You can send ADF diagnostic settings to an Event Hub and stream them into a Delta table in Databricks (see the sketch below). Or you can send them to a storage account and build a workflow with a 5-minute interval that loads the storage blobs into a Delta table. The new Variant datatype might be your friend here.
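If you do go the Event Hub route, one common pattern (an assumption here, not something confirmed in this thread) is to read the hub through its Kafka-compatible endpoint and append the raw events to a Delta table. The namespace, hub name, secret scope, checkpoint path and table name below are all placeholders.

```python
# Hedged sketch: stream ADF diagnostic events from Event Hubs into Delta via
# the Kafka-compatible endpoint. All names and the secret scope are placeholders.
from pyspark.sql.functions import col

connection_string = dbutils.secrets.get("adf-monitoring", "eventhub-connection-string")

kafka_options = {
    "kafka.bootstrap.servers": "<namespace>.servicebus.windows.net:9093",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.sasl.mechanism": "PLAIN",
    "kafka.sasl.jaas.config": (
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        f'username="$ConnectionString" password="{connection_string}";'
    ),
    "subscribe": "<event-hub-name>",
    "startingOffsets": "earliest",
}

events = spark.readStream.format("kafka").options(**kafka_options).load()

(
    events.select(col("value").cast("string").alias("body"), col("timestamp"))
          .writeStream
          .option("checkpointLocation", "<checkpoint-path>")   # placeholder
          .toTable("monitoring.adf_activity_logs_raw")         # placeholder
)
```

The `value` column arrives as JSON text; parsing it (for example with `from_json`, or the Variant type mentioned above on newer runtimes) can happen here or in a later step.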
07-22-2024 08:44 AM
I'm thinking about sending the logs to an Event Hub and leaving a job running continuously in Databricks that consumes the events and inserts them. What do you think? Will it be too expensive? Even if it is expensive, I believe it is the most scalable and robust solution.
07-22-2024 11:40 PM - edited 07-22-2024 11:40 PM
It depends on what you find costly. I would ask yourself whether you really need a 5-minute interval. If so, there won't be much difference in price between leaving a cheap cluster running and streaming, and having a (serverless) workflow run every 5 minutes.
07-23-2024 06:07 AM
So, yesterday I spent almost the whole day trying to get Databricks to consume from the Event Hub, and it ended up not working. Should I try another way? Do you suggest something simpler to implement?
07-23-2024 06:08 AM
Simpler would be to just send the logs to a storage location and consume them that way, maybe with Auto Loader.
07-23-2024 06:15 AM
Send to storage and consume one by one? Would this be scalable? How would I fetch only the missing ones? Should I delete the ones that have already been processed? Would this be more costly? What do you think?
07-23-2024 06:18 AM
Not one by one. You configure the diagnostic setting to dump logs into a storage account, then you configure Databricks Auto Loader to point to that log location and it handles loading those files for you. Under the hood, Auto Loader uses Spark Structured Streaming, so with each run it only loads newly added log files.
Read the documentation entry below and, as a best practice, use file notification mode (a minimal example follows the link):
What is Auto Loader? - Azure Databricks | Microsoft Learn
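For what it's worth, file notification mode is just one extra option on the same Auto Loader source; a minimal sketch with placeholder paths:

```python
# Same Auto Loader source, but discovering new files through the storage
# queue (file notification mode) instead of listing the directory.
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", "true")             # file notification mode
        .option("cloudFiles.schemaLocation", "<checkpoint-path>")  # placeholder
        .load("<path-to-input_data>")                              # placeholder
)
```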
07-23-2024 06:25 AM
Oh, can you give me a video that shows this better? I've never used Databricks Auto Loader (I'm new to this area). Does it need any extra configuration on the cluster? Can you do it with a job?
07-23-2024 06:42 AM
Sure, here are a couple worth watching. And yes, you can use it with a job. The only configuration required is setting up a storage queue and Event Grid if you want to use file notification mode; Databricks can do that for you automatically if you give the service principal sufficient permission (see the sketch after the videos). Watch the videos below and you will get the idea.
Accelerating Data Ingestion with Databricks Autoloader (youtube.com)
Autoloader in databricks (youtube.com)
DP-203: 36 - Automating the process with Azure Databricks Autoloader (youtube.com)
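On the "Databricks can do it for you automatically" point: in file notification mode Auto Loader can create the Event Grid subscription and storage queue itself if you pass it service principal credentials, roughly as sketched below. All IDs, secret names and paths are placeholders, and the service principal needs the roles described in the Auto Loader documentation.

```python
# Hedged sketch: Auto Loader creating the Event Grid subscription and storage
# queue on your behalf. All IDs, secret scope/keys and paths are placeholders.
notification_options = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",
    "cloudFiles.subscriptionId": "<azure-subscription-id>",
    "cloudFiles.tenantId": "<azure-tenant-id>",
    "cloudFiles.clientId": "<service-principal-client-id>",
    "cloudFiles.clientSecret": dbutils.secrets.get("adf-monitoring", "sp-client-secret"),
    "cloudFiles.resourceGroup": "<resource-group-of-the-storage-account>",
    "cloudFiles.schemaLocation": "<checkpoint-path>",
}

df = (
    spark.readStream
        .format("cloudFiles")
        .options(**notification_options)
        .load("<path-to-input_data>")
)
```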
07-23-2024 06:49 AM
So: I send the pipeline diagnostics to storage --> a Databricks notebook is automatically triggered --> in that notebook I can process the data with filters and then insert it into the Delta table.
Is it something like this?
07-23-2024 01:11 PM - edited 07-23-2024 01:14 PM
I did it! Thank you very much!
Just one question: do I need to create a task that runs continuously, or can I schedule it?
I didn't understand the event grid part. Could you send me a screenshot of it?
I want it to pick up all the new files and insert them into the Delta table; for example, every 10 minutes it should insert the new ones (if there are any), without the risk of duplication. Or do I need to run it continuously? I'm concerned about the cost.
07-23-2024 01:37 PM - edited 07-23-2024 01:41 PM
Hi @8b1tz ,
Glad that it worked for you. You don't have to run it continuously; you can run it as a batch job with Trigger.AvailableNow (look at the link below, the cost considerations section):
Configure Auto Loader for production workloads | Databricks on AWS
As for the Event Grid part, read about file notification mode in Auto Loader (or watch the video below). In short, this mode is recommended for efficiently ingesting large amounts of data.
In file notification mode, Auto Loader automatically sets up a notification service (Event Grid) and a queue service (Storage Queue) that subscribe to file events from the input directory (you can set them up manually if you prefer).
It works like this: a new file arrives in your storage, then Event Grid sends a message about the new file to the storage queue. Auto Loader then checks the storage queue for new files to process. Once it has successfully processed the data, it removes those messages from the queue and records that information in the checkpoint.
Auto Loader loads all new data into the target table, so each run ingests only what is new since the previous run (see the sketch after the video link below).
Az Databricks # 28:- Autoloader in Databricks || File Notification mode (youtube.com)
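Putting the pieces of this thread together, the scheduled version could look roughly like this (a sketch with placeholder names; the checkpoint is what makes a 10-minute schedule safe against duplicates):

```python
# Hedged end-to-end sketch: schedule this notebook as a Databricks job, e.g.
# every 10 minutes. Each run drains the queue of newly arrived log files,
# appends them to the Delta table and stops; the checkpoint guarantees no
# file is ingested twice. All paths and names are placeholders.
checkpoint = "abfss://logs@<storage_account>.dfs.core.windows.net/_checkpoints/adf_logs/"
input_path = "abfss://logs@<storage_account>.dfs.core.windows.net/input_data/"

query = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", "true")   # file notification mode
        .option("cloudFiles.schemaLocation", checkpoint)
        .load(input_path)
        .writeStream
        .option("checkpointLocation", checkpoint)
        .trigger(availableNow=True)                      # process what's new, then stop
        .toTable("monitoring.adf_activity_logs")
)
query.awaitTermination()  # keep the job alive until this run has finished
```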