Get Started Discussions
DLT Compute: "Ephemeral" Job Compute vs. All-purpose compute 2.0 ... WHY?

ChristianRRL
Contributor III

Hi there, this is a follow-up from a discussion I started last month

Solved: Re: DLT Compute: "Ephemeral" Job Compute vs. All-p... - Databricks Community - 71661

Based on what was discussed, I understand that it's not possible to use All-Purpose Clusters with DLT pipelines. I would like to understand WHY this is the case. I don't follow why Databricks wouldn't allow this as a possible implementation, since the "ephemeral" Job Compute clusters effectively always cost more: they require spinning up new resources when we already have All-Purpose Clusters up and running.

Is there something I'm missing here?

4 REPLIES

Kaniz_Fatma
Community Manager

Hi @ChristianRRL, there are a few key reasons why DLT pipelines cannot use all-purpose clusters:

  1. All-purpose clusters are designed for interactive/collaborative usage in development, ad hoc analysis, and data exploration, while job clusters run to execute a specific, automated job after which they immediately release resources. DLT pipelines are automated jobs, not interactive workloads.
  2. Job clusters are "ephemeral" - they are created and terminated as needed for each pipeline. This allows each pipeline to run in a fully isolated environment. Using a shared all-purpose cluster would not provide the same isolation.
  3. Job clusters are scoped to a single job run and cannot be used by other jobs or runs of the same job. An all-purpose cluster is shared across multiple workloads.
  4. Library management differs: on job clusters, dependent libraries are declared in the task settings rather than installed on a shared cluster configuration, so each run starts with exactly the libraries it needs. This per-run library scoping is not possible with an all-purpose cluster.
  5. When running a task on an existing all-purpose cluster, it is treated as a data analytics (all-purpose) workload subject to different pricing than a data engineering (task) workload on a new job cluster.

So in summary, the ephemeral nature of job clusters, isolation requirements, library management, and pricing differences make them a better fit for DLT pipelines than using a shared all-purpose cluster. The cost of spinning up new job clusters is offset by the benefits of a dedicated, isolated environment optimized for the pipeline workload.
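The distinction also shows up in the API payloads themselves: a Jobs task can point at a running all-purpose cluster via `existing_cluster_id`, while a DLT pipeline's settings only accept inline cluster definitions that the service creates and tears down per update. A minimal sketch (illustrative payload fragments only, with hypothetical IDs and node types, not complete API calls):

```python
# Jobs API task: may reuse a running all-purpose cluster by ID.
job_task = {
    "task_key": "etl_task",
    "notebook_task": {"notebook_path": "/Repos/etl/main"},
    "existing_cluster_id": "0101-123456-abcdef",  # hypothetical cluster ID
}

# DLT pipeline settings: clusters are declared inline and are created
# fresh for each pipeline update - there is no existing_cluster_id field.
pipeline_settings = {
    "name": "my_dlt_pipeline",
    "clusters": [
        {
            "label": "default",
            "num_workers": 2,
            "node_type_id": "i3.xlarge",  # hypothetical node type
        }
    ],
    "libraries": [{"notebook": {"path": "/Repos/etl/dlt_main"}}],
}

# The pipeline spec has no way to reference a pre-existing cluster:
assert "existing_cluster_id" in job_task
assert "existing_cluster_id" not in pipeline_settings
```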

ChristianRRL
Contributor III

Good morning @Kaniz_Fatma, I think most of these points make sense, particularly running pipelines in a "fully isolated environment". I understand that this can be a best practice (or in this case, the only practice) allowed by Databricks, but I'm still somewhat confused as to why there isn't at least an option to leverage all-purpose clusters with DLT jobs (even if just as a non-default option). Out of curiosity, do you know if there's been any discussion in Databricks about making this possible in the future?

Additionally, regarding point (5) about data analytics (all-purpose) workloads being subject to "different pricing" than data engineering (task) workloads: how might I best compare pricing between the two? At the moment, DLT is effectively *only* adding costs for us, since our existing setup treats the all-purpose clusters as "set in stone", and any additional compute such as job clusters costs extra because it doesn't use those existing clusters. If we had a better idea of the cost savings DLT job clusters offer compared with all-purpose clusters, we could shift some compute load off all-purpose and concretely save on costs rather than just adding to them.

@Kaniz_Fatma / @raphaelblg quick follow-up on this one. Wondering if anyone can provide a bit more feedback on the last points I wrote.

raphaelblg
Honored Contributor

@ChristianRRL regarding why DLT doesn't allow you to use all-purpose clusters:

1. The DLT runtime is derived from the shared-compute Databricks Runtime (DBR), but it is not the same runtime and has different features than the standard all-purpose runtime. A DLT pipeline cannot execute on any of the all-purpose cluster runtimes.

2. DLT is a different product than all-purpose compute, with different prices. 

Feel free to use our Pricing Calculator to compare prices. Currently, if you run the exact same workload with the same driver and worker instance types (and the same number of workers) on DLT, it should bill fewer DBUs than on all-purpose compute.
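As a rough illustration of how the per-DBU difference plays out for an identical workload (the rates below are made-up placeholders, not real list prices - actual per-DBU pricing depends on cloud, region, edition, and DLT tier, so always check the Pricing Calculator):

```python
# Hypothetical per-DBU rates in USD - placeholders for illustration only.
ALL_PURPOSE_RATE = 0.50
DLT_RATE = 0.25

def workload_cost(dbu_per_hour: float, hours: float, rate: float) -> float:
    """Cost of a workload consuming dbu_per_hour DBUs for a given duration."""
    return dbu_per_hour * hours * rate

# Same workload either way: e.g. 10 DBU/hour for a 2-hour run.
all_purpose = workload_cost(10, 2, ALL_PURPOSE_RATE)
dlt = workload_cost(10, 2, DLT_RATE)

print(f"all-purpose: ${all_purpose:.2f}, DLT: ${dlt:.2f}")
```

The DBU consumption is the same in both cases; only the rate applied to it differs, which is why moving the workload off an all-purpose cluster can reduce cost even though a new job cluster is spun up.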

 

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks
