Create a Workflow Schedule with varying Parameters

marcuskw
Contributor

We aim to keep the number of notebooks we create to a minimum and instead make them fairly flexible. Therefore we have a Factory setup that takes in a parameter to vary the logic.

However, when it comes to Workflows we are forced to create multiple workflows that do more or less the same thing:
- Run notebook with Parameter X
- Run notebook with Parameter Y
- Run notebook with Parameter Z

Is there any development ongoing to allow multiple schedules per Workflow, where each schedule could come with its own parameter input?

That way we would only have 1 Workflow, with 3 different schedules using parameters X, Y, and Z.

9 REPLIES

Lakshay
Esteemed Contributor

Hi @marcuskw, could you share more details on your use case? It would be helpful to know why you need multiple schedules per Workflow.

Hi  
We have a Factory logic that looks something like this:

 

class Factory:
    def __init__(self, job_parameter: str):
        self.job_parameter = job_parameter

    def set_objects(self):
        # Import the implementation that matches the job parameter
        if self.job_parameter == "A":
            from path.A import LogicClass
        elif self.job_parameter == "B":
            from path.B import LogicClass
        else:
            raise ValueError(f"Unknown job_parameter: {self.job_parameter}")
        return LogicClass
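
The same selection could also be expressed as a small registry instead of an if/elif chain. A minimal sketch, assuming the same module layout as above (path.A and path.B each exposing a LogicClass):

import importlib

# Hypothetical registry mapping job parameters to the modules from the example above
LOGIC_MODULES = {
    "A": "path.A",
    "B": "path.B",
}

def get_logic_class(job_parameter: str):
    # Resolve and import the module for the given parameter, then return its LogicClass
    try:
        module = importlib.import_module(LOGIC_MODULES[job_parameter])
    except KeyError:
        raise ValueError(f"Unknown job_parameter: {job_parameter}")
    return module.LogicClass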

 

The aim is to have a generic notebook that would then look like this:

 

from pyspark.dbutils import DBUtils  # only needed outside the notebook runtime, where dbutils is already defined
from path.factory import Factory

# "job_parameter" arrives as a widget value, set per job via the task's base_parameters
job_parameter = dbutils.widgets.get("job_parameter")
LogicClass = Factory(job_parameter).set_objects()
LogicClass.run_business_logic()

 

When we want to use orchestration we are forced to create multiple jobs:

  • Job "A" which runs the generic notebook with "job_parameter" = "A" with a schedule
  • Job "B" which runs the generic notebook with "job_parameter" = "B" with a schedule

A solution where we only have 1 job would help us here where we have:

  • Job "Run generic notebook"
    • Schedule 1 with "job_parameter" = "A"
    • Schedule 2 with "job_parameter" = "B"
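
For reference, this is roughly what the current two-job workaround looks like when the jobs are created programmatically. It is only a sketch against the Jobs API 2.1 jobs/create endpoint: the workspace URL, token, notebook path and cron expressions are placeholders, and compute configuration is omitted. Everything except the parameter value and the schedule is repeated per job:

import requests

HOST = "https://<workspace-url>"   # placeholder
TOKEN = "<personal-access-token>"  # placeholder

def create_job(name: str, job_parameter: str, cron: str) -> dict:
    # One job per parameter: identical apart from the parameter value and the schedule
    payload = {
        "name": name,
        "tasks": [
            {
                "task_key": "run_generic_notebook",
                "notebook_task": {
                    "notebook_path": "/Repos/etl/generic_notebook",  # hypothetical path
                    "base_parameters": {"job_parameter": job_parameter},
                },
                # compute configuration omitted for brevity
            }
        ],
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": "UTC",
            "pause_status": "UNPAUSED",
        },
    }
    response = requests.post(
        f"{HOST}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=payload,
    )
    response.raise_for_status()
    return response.json()

create_job("Run generic notebook - A", "A", "0 0 * * * ?")     # hourly
create_job("Run generic notebook - B", "B", "0 0/10 * * * ?")  # every 10 minutes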

 

Lakshay
Esteemed Contributor

As far as scheduling is concerned, you should be able to combine the two schedules into one cron schedule.

To pass different parameters, you can store them in a small table and fetch the values from there based on a condition, for example whether it is a delta load or a historical load.
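
A rough sketch of that idea, run inside a notebook where spark is available and reusing the Factory from earlier in the thread. The config table etl.job_config and its columns condition and job_parameter are hypothetical names used only for illustration:

# Look up the parameter for the current run from a small config table
condition = "delta_load"  # could itself come from a widget or be derived from the trigger time

row = (
    spark.table("etl.job_config")
    .where(f"condition = '{condition}'")
    .select("job_parameter")
    .first()
)

if row is None:
    raise ValueError(f"No job_parameter configured for condition '{condition}'")

LogicClass = Factory(row["job_parameter"]).set_objects()
LogicClass.run_business_logic()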

Hi,

I don't think this really helps me in this use case. Ideally, I was after having 1 workflow with the following settings:

  • Schedule 1:
    • Runs every hour
    • job_parameter = "SAP" to import the SAP class, which fetches data related to that ERP system
  • Schedule 2:
    • Runs every 10 minutes
    • job_parameter = "Workday" to import the Workday class, which fetches data related to that ERP system

We could have 5-6 different workflows with varying parameters/schedules, but all these workflows would run the very same Notebook, only with different parameters and schedules.

A little picture to illustrate 🙂

[Screenshot attachment: Skjermbilde 2024-01-18 214302.png]

Kaniz
Community Manager

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
 

AlexVB
New Contributor III

Did you figure out if this was possible?

I too find that we have too many workflows, and I would rather have them combined but have different parts of the workflow run on different schedules.

Unfortunately not; our current solution is to have multiple workflows that run the same notebooks but with varying input parameters.

This results in a bit of workflow bloat, both in the UI and in the CI/CD process.

AlexVB
New Contributor III

Thanks for the quick reply. Sorry to hear that. I think we will quickly grow tired of the workflow bloat. We've also been considering starting to use the Databricks MLOps offering, where ML jobs etc. are workflows, which will further add noise to the Workflows tab. Kinda sad. The UI is kinda shit as it is already.

I have an idea but haven't tried it yet: you can do conditional tasks in workflows. Do you think one could use those to have only parts of a workflow trigger when the workflow is triggered?
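
Untested as well, but a conditional branch might look roughly like this in the job's task list. This is only a sketch against the Jobs API task schema; the job-level parameter name "source", the notebook path and the branch values are placeholders. Note that a single job still has only one schedule, so something would still have to vary the parameter between runs:

# Sketch of a tasks array where a condition_task gates which branch runs,
# based on a hypothetical job-level parameter "source"
tasks = [
    {
        "task_key": "is_sap",
        "condition_task": {
            "op": "EQUAL_TO",
            "left": "{{job.parameters.source}}",
            "right": "SAP",
        },
    },
    {
        "task_key": "run_sap",
        "depends_on": [{"task_key": "is_sap", "outcome": "true"}],
        "notebook_task": {
            "notebook_path": "/Repos/etl/generic_notebook",
            "base_parameters": {"job_parameter": "SAP"},
        },
    },
    {
        "task_key": "run_workday",
        "depends_on": [{"task_key": "is_sap", "outcome": "false"}],
        "notebook_task": {
            "notebook_path": "/Repos/etl/generic_notebook",
            "base_parameters": {"job_parameter": "Workday"},
        },
    },
]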

I have also considered creating a run table and querying it to determine which parts of the workflow to run, but that quickly becomes work that I don't want to do. I really feel like this is an easy feature for Databricks to implement, since they already log workflow runs.

 

I imagine Databricks would have to alter the schema of their Jobs API to implement a solution where the schedule is also an ID field instead of just the Job ID. It would also be possible to have a lookup table, append which parameters have been run and then infer what the next parameter should be, but that would increase ETL time.

Our team hasn't seen the need to implement more complicated workflows; all our workflows have 1 task, which is to run a notebook. That one notebook runs different endpoints/logic/methods using parallelism/async logic, so that is our way of implementing multiple "tasks".
We build solutions where ETL time is an important factor, and here multiple tasks also create an issue. For example, if you create Task1 -> Task2 -> Task3 where each task does a simple print(1), you will see an overhead of approximately 7 seconds between tasks.
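
For illustration, the single-task parallelism described above could be sketched like this (a minimal example with concurrent.futures; the parameter list and the Factory import are assumed from earlier in the thread):

from concurrent.futures import ThreadPoolExecutor

from path.factory import Factory

# Hypothetical list of parameters handled inside one notebook task
job_parameters = ["SAP", "Workday"]

def run_one(job_parameter: str) -> None:
    LogicClass = Factory(job_parameter).set_objects()
    LogicClass.run_business_logic()

# Run the different endpoints/logic concurrently within a single task,
# avoiding the per-task startup overhead mentioned above
with ThreadPoolExecutor(max_workers=len(job_parameters)) as executor:
    list(executor.map(run_one, job_parameters))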