01-10-2024 02:05 AM
We aim to keep the number of notebooks we create to a minimum and instead make them fairly flexible. Therefore we have a Factory setup that takes in a parameter to vary the logic.
However, when it comes to Workflows, we are forced to create multiple workflows that do more or less the same thing:
- Run notebook with Parameter X
- Run notebook with Parameter Y
- Run notebook with Parameter Z
Is there any ongoing development towards supporting multiple schedules per Workflow, where each schedule could come with its own parameter input?
That way we would only have one Workflow with three different schedules carrying parameters X, Y and Z.
01-11-2024 09:18 AM
Hi @marcuskw , could you share more details on your use case? It would be helpful to know why you need multiple schedules per Workflow.
01-12-2024 06:30 AM - edited 01-12-2024 06:40 AM
Hi
We have a Factory logic that looks something like this:
class Factory:
    def __init__(self, job_parameter: str):
        self.job_parameter = job_parameter

    def set_objects(self):
        # Import the implementation that matches the job parameter
        if self.job_parameter == "A":
            from path.A import LogicClass
        elif self.job_parameter == "B":
            from path.B import LogicClass
        else:
            raise ValueError(f"Unknown job_parameter: {self.job_parameter}")
        return LogicClass
The aim is to have a generic notebook that would then look like this:
from pyspark.dbutils import DBUtils
from path.factory import Factory

dbutils = DBUtils(spark)  # already available in notebooks, shown here for completeness
job_parameter = dbutils.widgets.get("job_parameter")

LogicClass = Factory(job_parameter).set_objects()
LogicClass.run_business_logic()
When we want to use orchestration we are forced to create multiple jobs, one per parameter.
A solution where we only have one job, with each schedule supplying its own parameter, would help us here.
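To make the duplication concrete, here is a minimal sketch (assuming Databricks Jobs API 2.1-style payloads; the notebook path and cron expressions are illustrative, not our real setup) of the near-identical job definitions we end up maintaining:

def job_payload(parameter: str, cron: str) -> dict:
    """One Jobs API-style payload per parameter/schedule pair."""
    return {
        "name": f"generic-notebook-{parameter}",
        "tasks": [
            {
                "task_key": "run_generic_notebook",
                "notebook_task": {
                    "notebook_path": "/Repos/etl/generic_notebook",  # hypothetical path
                    "base_parameters": {"job_parameter": parameter},
                },
            }
        ],
        "schedule": {"quartz_cron_expression": cron, "timezone_id": "UTC"},
    }

# Three jobs that are identical apart from the parameter and the schedule
payloads = [
    job_payload("X", "0 0 2 * * ?"),
    job_payload("Y", "0 0 3 * * ?"),
    job_payload("Z", "0 0 4 * * ?"),
]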
01-12-2024 11:16 AM
As far as scheduling is concerned, you should be able to combine the two schedules into one cron schedule.
To pass different parameters, you can store them in a small table and fetch the values from there based on a condition, for example whether it is a delta load or a historical load.
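A minimal sketch of that lookup-table idea, assuming a hypothetical Delta table etl_config.schedule_parameters keyed by a run condition; the single scheduled notebook resolves its parameter from the table instead of relying on per-job widgets:

from datetime import date

# Hypothetical lookup table with one row per run condition, e.g.
#   condition       | job_parameter
#   first_of_month  | historical
#   default         | delta
condition = "first_of_month" if date.today().day == 1 else "default"

row = (
    spark.table("etl_config.schedule_parameters")  # hypothetical table name
    .where(f"condition = '{condition}'")
    .select("job_parameter")
    .first()
)
job_parameter = row["job_parameter"]

LogicClass = Factory(job_parameter).set_objects()
LogicClass.run_business_logic()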
01-18-2024 12:44 PM
Hi,
I don't think this really helps me in this use case. Optimally I was after having one workflow with the following settings:
We could have 5-6 different workflows, but all of them would run the very same notebook, differing only in their parameters and schedules.
A little picture to illustrate 🙂
02-20-2024 06:50 AM
Did you figure out if this was possible?
I too find that we have too many workflows, and I would rather have them combined, with different parts of the workflow running on different schedules.
02-20-2024 06:54 AM
Unfortunately not; our current solution is to have multiple workflows that run the same notebooks but with varying input parameters.
This results in a bit of workflow bloat, both in the UI and in the CI/CD process.
02-20-2024 06:58 AM
Thanks for the quick reply. Sorry to hear that. I think we will quickly grow tired of the workflow bloat. We're also considering starting to use the Databricks MLOps offering, where ML jobs etc. are workflows, which will further add noise to the Workflows tab. Kinda sad. The UI is already cluttered as it is.
I have an idea I haven't tried yet: you can do conditional tasks in workflows. Do you think one could use those so that only parts of a workflow run when the workflow is triggered?
I have also considered creating a run table and querying it to determine which parts of the workflow to run, but that quickly becomes work I don't want to do. I really feel like this would be an easy feature for Databricks to implement, since they already log workflow runs.
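A rough sketch of how those two ideas could combine, assuming a hypothetical run-log table etl_config.workflow_runs and a small first task that decides which branch is due; downstream condition tasks (or the notebooks themselves) could read the task value and skip branches that are not due:

from pyspark.sql import functions as F

# Hypothetical run-log table: last successful run timestamp per branch
last_runs = (
    spark.table("etl_config.workflow_runs")  # hypothetical table name
    .groupBy("branch")
    .agg(F.max("run_timestamp").alias("last_run"))
    .collect()
)
last_run_by_branch = {r["branch"]: r["last_run"] for r in last_runs}

# Pick the branch that has waited the longest since its last run
branch_to_run = min(last_run_by_branch, key=last_run_by_branch.get)

# Expose the decision to downstream tasks; a condition task or the next
# notebook can compare against this value and skip everything else
dbutils.jobs.taskValues.set(key="branch_to_run", value=branch_to_run)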
02-20-2024 07:15 AM - edited 02-20-2024 07:17 AM
I imagine Databricks would have to alter the schema of their Jobs API to implement this, so that the schedule also gets its own ID field instead of there being just a Job ID. It would probably be possible to keep a lookup table, append which parameters were run, and then infer what the next parameter should be, but that would increase ETL time.
Our team hasn't seen the need to implement more complicated workflows; all of our workflows have a single task, which is to run a notebook. That one notebook runs different endpoints/logic/methods using parallelism/async logic, so that is our way of implementing multiple "tasks" (sketched below).
We build solutions where ETL time is an important factor, and here multiple tasks also create an issue: if you create Task1 -> Task2 -> Task3 where each task does a simple print(1), you will see an overhead of approximately 7 seconds between tasks.
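As a rough illustration of that single-task pattern (the notebook paths, parameter and worker count are assumptions, not our actual setup), a driver notebook can fan out sub-notebooks concurrently with dbutils.notebook.run instead of splitting them into separate workflow tasks:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical sub-notebooks, each encapsulating one endpoint/piece of logic
sub_notebooks = [
    "/Repos/etl/ingest_endpoint_a",
    "/Repos/etl/ingest_endpoint_b",
    "/Repos/etl/ingest_endpoint_c",
]

def run_notebook(path: str) -> str:
    # 600-second timeout; per-notebook parameters could be passed here as well
    return dbutils.notebook.run(path, 600, {"job_parameter": "A"})

# Running the sub-notebooks concurrently avoids the per-task startup overhead
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_notebook, sub_notebooks))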
09-18-2024 04:49 PM
We're also running into this issue on my team, where having multiple cron schedules would be handy. We have some pipelines that we want to run on multiple schedules, say to refresh data "every Sunday at midnight" and "on the first day of the month at midnight". Right now we've ended up building our own concurrency-handling logic and placing the sub-workflow inside two different master workflows that run on those schedules (roughly as sketched below).
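A minimal sketch of that workaround, assuming Jobs API 2.1-style payloads with the run_job_task task type; the job ID and cron expressions are illustrative. Both thin master jobs simply trigger the same shared sub-job on their own schedule:

SHARED_SUB_JOB_ID = 123456  # hypothetical job ID of the shared sub-workflow

def master_job_payload(name: str, cron: str) -> dict:
    """Thin wrapper job that only triggers the shared sub-job on a schedule."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "trigger_shared_sub_job",
                "run_job_task": {"job_id": SHARED_SUB_JOB_ID},
            }
        ],
        "schedule": {"quartz_cron_expression": cron, "timezone_id": "UTC"},
    }

weekly_refresh = master_job_payload("refresh-weekly", "0 0 0 ? * SUN")   # every Sunday at midnight
monthly_refresh = master_job_payload("refresh-monthly", "0 0 0 1 * ?")   # first of the month at midnight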