Databricks job trigger in specific times

dbx_deltaSharin — Wed, 18 Sep 2024 13:53:41 GMT

Hello,

I have a Databricks notebook that processes data and generates a list of JSON objects called "list_json". Each JSON object contains an item called "time_to_send" (in UTC datetime format). I want to find the best way to send these JSON messages in a POST request within 1 hour before the "time_to_send". What is the best approach to achieve this?

Thank you.

Re: Databricks job trigger in specific times

szymon_dybczak — Wed, 18 Sep 2024 15:13:34 GMT

Hi @dbx_deltaSharin ,

You can write a Python function that takes list_json as an argument and sends a POST request for each object in the list. Since the requests need to go out within an hour, you can use Python's multiprocessing or asyncio library to speed this up.

But it depends on how many objects you have in your list, etc.
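
A minimal sketch of such a function, under a few assumptions: the endpoint URL is a placeholder, each object in list_json is JSON-serializable, and a thread pool from concurrent.futures stands in for multiprocessing or asyncio (the work is I/O-bound, and threads avoid event-loop issues inside a notebook). The names post_json and send_all are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

API_URL = "https://example.com/api/messages"  # hypothetical endpoint


def post_json(payload, url=API_URL, timeout=10):
    """POST one JSON object and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def send_all(list_json, send=post_json, max_workers=8):
    """Send every object concurrently.

    Returns (payload, result) pairs in input order, where result is the
    value returned by `send` or the exception it raised.
    """
    def safe_send(payload):
        try:
            return payload, send(payload)
        except Exception as exc:  # one failed request should not stop the rest
            return payload, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_send, list_json))
```

Calling send_all(list_json) returns one pair per object, so failed requests can be inspected or retried afterwards. The max_workers value is only a starting point and would need tuning to the size of the list and the limits of the API.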

Re: Databricks job trigger in specific times

filipniziol — Wed, 18 Sep 2024 20:03:45 GMT

Hi @dbx_deltaSharin ,

In addition to what @szymon_dybczak suggested: if you're using Azure, you might consider an architecture where, instead of sending the request directly to your API, you send a message to an Azure Queue or Service Bus. An Azure Function with a Queue Trigger can then pick up the message and send it to the API. This approach improves scalability and reliability because Azure Functions can process multiple requests concurrently and scale automatically based on demand. The same can be achieved with other cloud providers, as they offer similar services.
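
One possible way to sketch the producer side, assuming Azure Service Bus with the azure-servicebus Python package and assuming "time_to_send" is an ISO 8601 string. It uses Service Bus scheduled messages so that each message becomes visible to the consumer one hour before its "time_to_send"; that scheduling choice is an addition for illustration, and the names enqueue_time and schedule_all, the connection string, and the queue name are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

LEAD_TIME = timedelta(hours=1)  # release each message 1 hour before time_to_send


def enqueue_time(time_to_send, now=None):
    """Return when the queue should release the message (UTC).

    That is one hour before time_to_send, but never earlier than `now`.
    """
    now = now or datetime.now(timezone.utc)
    send_at = datetime.fromisoformat(time_to_send.replace("Z", "+00:00"))
    if send_at.tzinfo is None:  # assumption: naive values are already UTC
        send_at = send_at.replace(tzinfo=timezone.utc)
    return max(send_at - LEAD_TIME, now)


def schedule_all(list_json, connection_string, queue_name):
    """Schedule one Service Bus message per JSON object."""
    # Imported here so enqueue_time stays usable without the Azure SDK installed.
    from azure.servicebus import ServiceBusClient, ServiceBusMessage

    with ServiceBusClient.from_connection_string(connection_string) as client:
        with client.get_queue_sender(queue_name=queue_name) as sender:
            for item in list_json:
                message = ServiceBusMessage(json.dumps(item))
                sender.schedule_messages(message, enqueue_time(item["time_to_send"]))
```

With this layout the daily notebook would call schedule_all once, and the Azure Function would only see each message when its send window opens, so no cluster has to stay up in between.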

Re: Databricks job trigger in specific times

dbx_deltaSharin — Thu, 19 Sep 2024 06:30:52 GMT

Hi everyone,

Thank you for your responses to my question.

@szymon_dybczak, if I understood correctly, your suggestion is based on running the Databricks job in continuous mode. However, this might incur significant costs if the cluster is running every hour.

@filipniziol, your proposal seems like a viable solution. I would just like to get a clearer idea of the associated costs to be able to compare the two options.

For clarification, the initial notebook is designed to run once a day to update and compute the JSON list. Another notebook is needed to process this JSON data and handle the post-processing, starting one hour before the "time_to_send."
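
To make the requirement concrete, here is a rough sketch of how the second notebook could pick the messages whose send window is currently open, assuming "time_to_send" is stored as an ISO 8601 string in UTC and that the notebook runs on a schedule (for example every few minutes). The function names parse_utc and due_messages are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(value):
    """Parse an ISO 8601 string; naive values are assumed to be UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def due_messages(list_json, now=None, window=timedelta(hours=1)):
    """Return the objects whose time_to_send is less than `window` away.

    An object is due when time_to_send - window <= now < time_to_send.
    """
    now = now or datetime.now(timezone.utc)
    due = []
    for item in list_json:
        send_at = parse_utc(item["time_to_send"])
        if send_at - window <= now < send_at:
            due.append(item)
    return due
```

Because an object stays inside its window for a full hour, a notebook that runs more than once per hour would also need to record which objects were already sent to avoid duplicates.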