Sharing Output between different tasks for MLOps pipeline as a Databricks Jobs

rahuja
New Contributor III

Hello everyone,

We are trying to build an ML pipeline on Databricks using the famous Databricks Workflows. Our pipeline currently has three major components: Data Ingestion, Model Training, and Model Testing. My question is whether it is possible to share the output of one task with another (i.e., to pass the data generated by the ingestion task to the model training task). Currently we save the data to DBFS volumes and read it back from there, but I believe this approach would fail if the dataset is too big. Is there a more elegant way to pass output from one task to another, perhaps similar to what we can do when creating an Azure ML pipeline?

#MachineLearning #DataScience #MLOps

5 REPLIES

Hkesharwani
Contributor II

Hi,
There is a way to share values from one task to another, but this only works when the pipeline is executed as a Databricks workflow (job).

# Code in the task from which you want to pass the value.
dbutils.jobs.taskValues.set(key="first_notebook_list", value=<value or variable you want to pass>)

# Code in the downstream notebook where you want to access that value.
list_object = dbutils.jobs.taskValues.get(
    taskKey="<task_name_from_which_value_to_be_fetched>",
    key="first_notebook_list",
    default=0,
    debugValue=0,
)

 

Harshit Kesharwani
Self-taught Data Engineer | Seeking Remote Full-time Opportunities

Kaniz_Fatma
Community Manager

Hi @rahuja, in Databricks you can use task values to pass arbitrary parameters between tasks within a job. This allows you to share information between the different components of your ML pipeline, such as Data Ingestion and Model Training.

  • Task values are a way to communicate data between tasks in a Databricks job.
  • You can use the taskValues subutility in Databricks Utilities to set and retrieve task values.
  • These values can be referenced in subsequent tasks, making it easier to create expressive workflows.
  • Suppose you have two notebook tasks: Get_user_data (for Data Ingestion) and Analyze_user_data (for Model Training).
  • In the Get_user_data task, you can set task values for the user’s name and age using Python commands.
  • You can now use dynamic value references in your notebooks to reference task values set in upstream tasks.
  • For example, to reference the value with the key name set by the Get_user_data task, use {{tasks.Get_user_data.values.name}} (see the sketch after this list).
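
Here is a minimal sketch of that pattern, assuming the two notebook tasks are named Get_user_data and Analyze_user_data as in the example above (the values "Alice" and 30 are made up):

# --- Notebook for the Get_user_data task (Data Ingestion) ---
# Set task values that downstream tasks in the same job run can read.
dbutils.jobs.taskValues.set(key="name", value="Alice")
dbutils.jobs.taskValues.set(key="age", value=30)

# --- Notebook for the Analyze_user_data task (Model Training) ---
# Retrieve the values set by the upstream task; debugValue is what you
# get back when the notebook runs interactively, outside of a job.
name = dbutils.jobs.taskValues.get(taskKey="Get_user_data", key="name", debugValue="debug-user")
age = dbutils.jobs.taskValues.get(taskKey="Get_user_data", key="age", debugValue=0)

Equivalently, you can leave the notebook code generic and instead pass {{tasks.Get_user_data.values.name}} as a parameter to the downstream task in the job configuration.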

If you need further assistance or have any other questions, feel free to ask! 😊

rahuja
New Contributor III

Hi @Kaniz_Fatma, thanks for your quick reply. I will test it out in our scenario and let you know. Just to confirm: if I have two scripts (e.g. ingest.py and train.py), and inside ingest.py (the task named "ingest") I run:

dbutils.jobs.taskValues.set(key="processed_data", value=data)

then, for train.py, should I pass {{tasks.ingest.values.processed_data}} as a task parameter in the pipeline?

rahuja
New Contributor III

@Kaniz_Fatma I looked into your solution, and it seems that the value you set or get needs to be JSON-serializable. This means I cannot pass e.g. a Spark or pandas DataFrame from one step to another directly; I would have to serialize and deserialize it. Is there a recommended way to pass big data between the various steps of a job?
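
For now I am planning to work around it like this (a rough sketch; the table name main.ml.processed_data is just a placeholder): persist the large dataset as a table and pass only its name through task values, so the task value itself stays small and JSON-serializable.

# --- ingest task: write the big dataset out, pass only a reference to it ---
table_name = "main.ml.processed_data"  # hypothetical Unity Catalog table
df.write.mode("overwrite").saveAsTable(table_name)  # df = DataFrame built by ingestion
dbutils.jobs.taskValues.set(key="processed_data_table", value=table_name)

# --- train task: read the small task value, then load the data from the table ---
table_name = dbutils.jobs.taskValues.get(
    taskKey="ingest",
    key="processed_data_table",
    debugValue="main.ml.processed_data",
)
df = spark.read.table(table_name)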

rahuja
New Contributor III

@Kaniz_Fatma @Hkesharwani  any updates?
