Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Job Clusters With Multiple Tasks

gyapar
New Contributor II

Hi all,

I'm trying to create one job cluster with a single configuration (specification) that runs a workflow, and this workflow needs to have 3 dependent tasks in a straight line, for example t1 -> t2 -> t3.

Databricks also has some constraints, such as a maximum of 100 tasks in a job's specification and a maximum of 1000 concurrent task runs per instance. Besides that, Databricks is said to have its own orchestration inside workflows.

My questions for the community are below.

How can I utilize my job cluster?

Does the orchestrator run 1000 concurrent instances even if the workflow has only 3 tasks in the job?

Does Databricks support a queue as an input for a workflow, or for a task inside a workflow, instead of passing parameters?

 

Actually, I don't know how Databricks runs things internally. Does it scale tasks up and down, and if so, which metrics does it look at? I would like to scale up a specific task, 't2' in my example, in response to events or inputs that are created dynamically in a queue or elsewhere.

Is this possible with the managed workflow powered by your orchestrator?

Or do we need to use a custom tool like Apache Airflow?

 

Thanks.

1 REPLY

Kaniz_Fatma
Community Manager

Hi @gyapar, Certainly! Let’s dive into your questions about Databricks job clusters, orchestration, and scaling. 🚀

 

Utilizing Databricks Job Clusters:

  • A job cluster in Databricks is a non-interactive way to run an application, such as an ETL job or data analysis task. You can configure a job cluster with specific settings (e.g., number of workers, instance types) to execute your tasks.
  • To utilize your job cluster effectively (a minimal example follows this list):
    • Define your job configuration or specification, including the desired cluster size, libraries, and environment settings.
    • Create a workflow (a directed acyclic graph or DAG) that represents the sequence of tasks you want to execute.
    • Submit your job to the cluster, and Databricks will manage the execution of your tasks.
    • Monitor job progress, logs, and performance using the Databricks UI or API.
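Here is a minimal sketch of that flow, using the Jobs API 2.1 over plain HTTP. The host, token, notebook paths, runtime version, and node type are placeholders you would replace with your own values:

```python
# Minimal sketch: one job whose three tasks t1 -> t2 -> t3 share a single job cluster.
# DATABRICKS_HOST, DATABRICKS_TOKEN, notebook paths, runtime and node type are placeholders.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

job_spec = {
    "name": "three-task-pipeline",
    "max_concurrent_runs": 1,           # at most one run of this job at a time
    "job_clusters": [{
        "job_cluster_key": "shared_cluster",
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",   # pick a supported runtime
            "node_type_id": "i3.xlarge",           # cloud-specific node type
            "num_workers": 2,
        },
    }],
    "tasks": [
        {"task_key": "t1", "job_cluster_key": "shared_cluster",
         "notebook_task": {"notebook_path": "/Workspace/pipeline/t1"}},
        {"task_key": "t2", "job_cluster_key": "shared_cluster",
         "depends_on": [{"task_key": "t1"}],
         "notebook_task": {"notebook_path": "/Workspace/pipeline/t2"}},
        {"task_key": "t3", "job_cluster_key": "shared_cluster",
         "depends_on": [{"task_key": "t2"}],
         "notebook_task": {"notebook_path": "/Workspace/pipeline/t3"}},
    ],
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```

Because all three tasks reference the same job_cluster_key, they run in order on one shared job cluster instead of spinning up a separate cluster per task.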

Orchestration and Concurrent Instances:

  • Databricks now supports task orchestration within jobs. You can run multiple tasks as a DAG, simplifying the creation and management of data and machine learning workflows.
  • The orchestrator handles task dependencies, ensuring that tasks run in the correct order.
  • Regarding concurrent instances:
    • Even if your workflow has only 3 tasks, the orchestrator only schedules those 3 task runs; the 1000 concurrent task runs figure is an upper limit on the platform, not something the orchestrator spawns on its own.
    • The orchestrator manages task execution in dependency order regardless of how many tasks your workflow contains (see the sketch after this list).
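To make this concrete, here is a small sketch (same placeholder host/token as above, plus a placeholder job_id) that triggers a run and polls it; you will only ever see the three defined task runs, whatever the platform-wide concurrency limit is:

```python
# Sketch: trigger one run of the job and watch its task states.
# Only the defined tasks (t1, t2, t3) get run instances; 1000 is just an upper limit.
import os
import time
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

run_id = requests.post(f"{host}/api/2.1/jobs/run-now",
                       headers=headers, json={"job_id": 123}).json()["run_id"]  # placeholder job_id

while True:
    run = requests.get(f"{host}/api/2.1/jobs/runs/get",
                       headers=headers, params={"run_id": run_id}).json()
    states = {t["task_key"]: t["state"]["life_cycle_state"] for t in run["tasks"]}
    print(states)   # e.g. {'t1': 'TERMINATED', 't2': 'RUNNING', 't3': 'PENDING'}
    if run["state"]["life_cycle_state"] in ("TERMINATED", "INTERNAL_ERROR", "SKIPPED"):
        break
    time.sleep(30)
```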

Input Queues and Parameters:

  • Databricks primarily uses parameters for task input. However, if you need to pull data from external sources dynamically (e.g., a queue), you can implement that logic within your tasks.
  • While Databricks doesn't directly support input queues, you can design your workflow so that each task fetches its work items from an external system as needed (a rough sketch follows this list).
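As an illustration of that "custom logic" idea, here is a hypothetical sketch of a task that drains its own work items from a queue at run time instead of receiving them as parameters. It assumes an AWS SQS queue with a placeholder URL; any messaging system (Kafka, Azure Storage Queues, etc.) could play the same role:

```python
# Hypothetical sketch: a task that pulls its own work items from a queue at run time.
import json
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/111111111111/pipeline-events"  # placeholder

def fetch_pending_events(max_messages: int = 10) -> list:
    """Drain up to max_messages events from the queue and return their payloads."""
    resp = sqs.receive_message(QueueUrl=queue_url,
                               MaxNumberOfMessages=max_messages,
                               WaitTimeSeconds=5)
    events = []
    for msg in resp.get("Messages", []):
        events.append(json.loads(msg["Body"]))
        # Remove the message once it has been read so it is not processed twice.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return events

events = fetch_pending_events()
print(f"Processing {len(events)} queued events in this task run")
```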

Scaling Specific Tasks (e.g., ‘t2’):

  • Databricks scales the cluster, not individual tasks, based on the cluster configuration; with autoscaling enabled, the number of workers grows and shrinks between the minimum and maximum you set, driven by the cluster's load.
  • If you want to give a specific task (like 't2') more capacity, consider the following approaches (a sketch of a per-task autoscaling cluster follows this list):
    • Custom Logic: Implement custom logic within your task to adjust its behaviour based on events or inputs.
    • Apache Airflow Integration: Databricks integrates well with Apache Airflow. You can use Airflow to manage complex workflows, including dynamic scaling based on external events.
    • Managed Workflow: While Databricks provides orchestration, for more advanced use cases, combining it with custom tools like Airflow might be beneficial.
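For instance, one hedged sketch of the cluster-level route: inside the same job specification as above, 't2' can point at its own autoscaling job cluster so the heavier middle task can scale out while 't1' and 't3' stay small. All values are placeholders:

```python
# Hedged sketch: give 't2' its own autoscaling job cluster inside the same job spec.
t2_cluster = {
    "job_cluster_key": "t2_autoscaling_cluster",
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "autoscale": {"min_workers": 2, "max_workers": 16},  # Databricks scales within this range
    },
}

t2_task = {
    "task_key": "t2",
    "job_cluster_key": "t2_autoscaling_cluster",   # t2 uses the autoscaling cluster
    "depends_on": [{"task_key": "t1"}],
    "notebook_task": {"notebook_path": "/Workspace/pipeline/t2"},
}
```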

Managed Workflow and Orchestrator:

  • Databricks’ managed workflow, powered by the orchestrator, simplifies task coordination and execution.
  • However, for highly customized scaling or intricate workflows, using additional tools like Apache Airflow allows greater flexibility (see the Airflow sketch below).
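If you go the Airflow route, a rough sketch might look like the following. It assumes Airflow with the apache-airflow-providers-databricks package installed, a configured databricks_default connection, and the job_id of the job created earlier (all placeholders):

```python
# Rough sketch: an Airflow DAG that triggers the existing Databricks job on demand.
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

with DAG(
    dag_id="trigger_databricks_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,          # trigger externally, e.g. from an event/queue consumer
    catchup=False,
) as dag:
    run_pipeline = DatabricksRunNowOperator(
        task_id="run_three_task_job",
        databricks_conn_id="databricks_default",
        job_id=123,                                # placeholder: the job created earlier
        notebook_params={"source": "airflow"},     # forwarded to the job's notebook tasks
    )
```

An upstream sensor or external event consumer could then decide when, and how often, to trigger this DAG, which is one way to layer event-driven behaviour on top of the Databricks orchestrator.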

In summary, Databricks offers robust orchestration capabilities, but for specialized scenarios, consider integrating with tools like Apache Airflow. Choose the approach that best aligns with your specific requirements and complexity. 🌟
