3 weeks ago
Hello community,
I'm planning to migrate some of the logic from our Airflow DAGs to Databricks Workflows, but I have some doubts about how to map the logic of my current DAG code onto the corresponding Workflow concepts.
There are two points:
3 weeks ago
Hi @jeremy98, you can leverage the Databricks dbutils.notebook.run() context to dynamically create jobs/tasks.
Refer to the following link for more details: https://docs.databricks.com/en/notebooks/notebook-workflows.html
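For illustration, here is a minimal sketch of that pattern; the notebook paths and parameter names are hypothetical, not from your project:

```python
# Driver notebook: orchestrates child notebooks from one entry point.
# `dbutils` is predefined inside a Databricks notebook.

# Run a child notebook with a 300-second timeout and a parameter dict.
result_1 = dbutils.notebook.run(
    "/Workspace/Repos/etl/task_1",   # hypothetical notebook path
    300,                             # timeout in seconds
    {"run_date": "2024-01-01"},      # arguments -> notebook widgets
)

# A child notebook can return a string via dbutils.notebook.exit(...),
# which comes back here and can feed the next call.
result_2 = dbutils.notebook.run(
    "/Workspace/Repos/etl/task_2",
    300,
    {"upstream_result": result_1},
)
```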
2 weeks ago
Hi Hari, thanks for your response! So, are you essentially suggesting moving forward by using notebooks to build these new workflows? In Airflow DAGs, we typically have a main DAG where we define tasks, and within each task, there are functionalities that call methods from Python files in our "custom libraries." Are you suggesting moving all these custom methods from the libraries I’ve defined into simple notebooks?
I would also like to ask our Databricks employees, such as @Alberto_Umana and @Walter_C, for suggestions. Thanks for your help, guys.
2 weeks ago
Databricks Workflows offers a similar concept to Airflow's dynamic task mapping (the .expand() function in Airflow) through the "For each" task type. For example, if you have a list of dates to process, you could set up a "For each" task that iterates over these dates and runs a notebook or Python wheel for each one.
Reference: https://docs.databricks.com/en/jobs/for-each.html
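As a rough sketch of the per-iteration side (the parameter and table names here are assumptions, not from the docs above), the notebook run by each "For each" iteration could read the current element as a base parameter and process just that slice:

```python
# Per-iteration notebook for a "For each" task (sketch).
# `dbutils` and `spark` are predefined in a Databricks notebook.
# Assumes the parent task passes the current list element as a
# base parameter named "process_date" (hypothetical name).
process_date = dbutils.widgets.get("process_date")

# Hypothetical downstream work: process only that date's rows.
df = spark.table("sales.orders").where(f"order_date = '{process_date}'")
print(f"Processing {df.count()} rows for {process_date}")
```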
In Databricks Workflows, there isn't a direct equivalent to Airflow's get_current_context() function. However, you can access similar information through different means:
- {{job.run_id}} gives you the current job run ID
- {{job.start_time}} provides the job start time
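For example, you could pass these dynamic value references into a notebook task as base_parameters (say run_id: {{job.run_id}} and start_time: {{job.start_time}} in the task settings; those parameter names are just an assumption for this sketch) and read them inside the notebook:

```python
# Inside the notebook task: read values the job resolved and passed in
# as base_parameters. The widget names "run_id" and "start_time" are
# assumptions made for this example.
run_id = dbutils.widgets.get("run_id")          # resolved from {{job.run_id}}
start_time = dbutils.widgets.get("start_time")  # resolved from {{job.start_time}}

print(f"Running as part of job run {run_id}, started at {start_time}")
```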
2 weeks ago
Hello,
Thank you for your amazing response! You provided all the information I needed at this moment, and I truly appreciate it.
However, I have a doubt regarding the following example code:
(Old) file.py - Based on Airflow DAGs:
with (...):
    @task
    def task_1():
        ...
        process_email()
        ...

    @task
    def task_2():
        ...
(New) file.py - Databricks Workflow:
Here, all task definitions are consolidated into a single notebook file, with callable functions that execute the required tasks based on the base_parameters passed in: one task calls the notebook with function: "task_1", another with function: "task_2" (see the sketch below).
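For context, a rough sketch of what that single dispatcher notebook looks like (the "function" parameter name and the task functions simply mirror the convention described above):

```python
# One notebook serving several workflow tasks. Each task calls this
# notebook with a base_parameter named "function" selecting the logic.

def task_1():
    # ... original task_1 logic, including the process_email() call ...
    pass

def task_2():
    # ... original task_2 logic ...
    pass

# Which function should this task execute?
function_name = dbutils.widgets.get("function")

dispatch = {"task_1": task_1, "task_2": task_2}
dispatch[function_name]()
```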
But I think I made a mistake in this setup. Instead of properly splitting the old file.py into individual tasks (where each task corresponds to a separate notebook file), we replicated the entire structure for every task inside a single notebook.
Now, regarding the process_email method inside task_1: would it be better to split it out into its own task, or keep it embedded in the task_1 logic?
Additionally, could you clarify which parts of the code are most suitable for inclusion in a wheel package file?
Thank you again for your help!
2 weeks ago
Yes, it is good practice to create a separate task for the process_email functionality. This will help in modularizing your code and make it easier to manage and debug. Each task should ideally perform a single responsibility, and separating process_email into its own task aligns with this principle.
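As an illustration (the package, module, and parameter names below are hypothetical), the dedicated process_email task could then be a very small notebook that only reads its parameters and delegates to the shared function:

```python
# Notebook backing a dedicated "process_email" task (sketch).
# Assumes the shared code ships as a wheel named my_company_utils
# (hypothetical) installed as a job/cluster library.
from my_company_utils.notifications import process_email

# The parameter name "recipient" is illustrative only.
recipient = dbutils.widgets.get("recipient")
process_email(recipient)
```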
The parts of the code that are best included in a wheel package are shared utility functions, such as process_email, that are used across multiple tasks or notebooks.
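For example, a minimal layout for such a wheel (all names hypothetical); each task then installs the built wheel as a library and imports the helper instead of redefining it in every notebook:

```python
# Suggested project layout for the shared wheel (illustrative names):
#
#   my_company_utils/
#       __init__.py
#       notifications.py   <- process_email lives here
#   pyproject.toml         <- packaging metadata; build with `python -m build`
#
# my_company_utils/notifications.py

def process_email(recipient: str) -> None:
    """Shared helper used by several workflow tasks (stub)."""
    # ... original email-processing logic goes here ...
    print(f"Processing email for {recipient}")
```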
2 weeks ago
Hello, thanks for your explanation, really helpful! One more question: for testing individual tasks, i.e. building unit tests or integration tests, how can I use the databricks.yml configuration to set this up?
2 weeks ago