DLT Compute Resources - What Compute Is It???

ChristianRRL
Contributor II

Hi there, I'm wondering if someone can help me understand what compute resources DLT uses. It's not clear to me at all whether it uses the last compute cluster I was working on, or something else entirely.

Can someone please help clarify this?

(screenshot attached)

5 REPLIES

Rajeev45
New Contributor III

Hello,

When you create a DLT pipeline, you specify the compute configuration under the 'Create pipeline' section, and DLT creates a cluster for you based on that configuration:

Workflows ==> Delta Live Tables ==> Create pipeline ==> Compute

You can also check the details of the cluster in the Spark UI, or in the logs under the "Update details" section, once the cluster is initialized. Hope this clarifies your question.
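For reference, this is roughly what those compute settings look like in the pipeline's JSON view (a minimal sketch - the name, node type, and notebook path are placeholders):

```python
# Minimal sketch of a DLT pipeline's settings, as shown in the pipeline UI's
# JSON view. The name, node type, and paths below are placeholders.
pipeline_settings = {
    "name": "my_dlt_pipeline",
    "clusters": [
        {
            "label": "default",           # DLT creates this cluster itself
            "node_type_id": "i3.xlarge",  # placeholder worker node type
            "autoscale": {"min_workers": 1, "max_workers": 4},
        }
    ],
    "libraries": [
        {"notebook": {"path": "/Repos/me/dlt/my_pipeline_notebook"}}
    ],
    "development": True,   # development vs. production mode
    "continuous": False,   # triggered (not continuous) pipeline
}
```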

ChristianRRL
Contributor II

Sorry @Rajeev45, but this is still not clear to me. I see the `Compute` section in the DLT pipeline configuration, but we have multiple "all-purpose compute" clusters and a few "personal compute" clusters, and it's not clear to me at all which of these clusters (if any) are used when I run a particular DLT job.

When DLT jobs run, do they use some kind of "ephemeral" compute that is created and deleted every time a (production) DLT job runs? Is there a way to get a DLT job to run on an already-running standard cluster?

Does this make sense?

quakenbush
Contributor

Sounds pretty much like a job cluster to me.

As far as I know, there are only two types of clusters: all-purpose clusters (for interactive work) and job clusters, which execute jobs, just like the name implies. Personal clusters are non-shared all-purpose clusters that only one user (the owner) has access to. I hope that's right - I'm a beginner myself.
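One way to check this in your own workspace: clusters that DLT spins up show a different cluster_source than the ones created in the UI. A rough sketch with the Databricks SDK for Python (assuming databricks-sdk is installed and authentication is already configured):

```python
# Sketch: list clusters in the workspace and show where each one came from.
# Assumes the databricks-sdk package is installed and auth is configured
# (e.g. via a Databricks config profile or environment variables).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
for c in w.clusters.list():
    # cluster_source distinguishes UI-created all-purpose clusters from
    # clusters created by jobs or by DLT pipelines (source "PIPELINE").
    print(c.cluster_name, c.state, c.cluster_source)
```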

ChristianRRL
Contributor II

@quakenbush I think this makes sense. At the heart of my original question, I'm basically trying to get a better understanding of compute resource utilization, and whether our costs would be higher or lower using Delta Live Tables (and its respective job clusters).

At the moment, we have several all-purpose compute clusters with inactivity termination timeouts of 30, 60, or 90 minutes. But if we start using DLT more, I'm wondering how much costs will be impacted using DLT in development mode vs. DLT in production mode vs., for example, an Auto Loader notebook running on our existing clusters.
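To make the comparison concrete for myself, here's a back-of-envelope sketch - the DBU rates and hours below are made-up placeholders, not real prices:

```python
# Back-of-envelope cost comparison (illustrative only).
# The rates below are PLACEHOLDERS - look up the actual per-DBU prices for
# your cloud/region and tier; jobs/DLT compute is typically billed at a
# lower rate than all-purpose compute.
ALL_PURPOSE_RATE = 0.55  # $/DBU, placeholder
DLT_RATE = 0.30          # $/DBU, placeholder

dbus_per_hour = 4.0      # depends on node type and cluster size
run_hours = 1.0          # actual work per run
idle_hours = 1.0         # e.g. a 60-minute inactivity timeout on all-purpose

# An all-purpose cluster keeps billing until the inactivity timeout fires.
all_purpose_cost = ALL_PURPOSE_RATE * dbus_per_hour * (run_hours + idle_hours)
# A DLT/job cluster terminates shortly after the update finishes.
dlt_cost = DLT_RATE * dbus_per_hour * run_hours

print(f"all-purpose: ${all_purpose_cost:.2f} per run")
print(f"DLT/job:     ${dlt_cost:.2f} per run")
```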

quakenbush
Contributor

Well, one thing they emphasize in the 'Advanced Data Engineer' training is that job clusters terminate within 5 minutes after a job completes, so this supports your theory about lower costs. I think job clusters are designed to do just that: execute jobs, then terminate and free up resources. They also seem to be lightweight, e.g. not allowing for retries; that's why development/debugging should be done on all-purpose clusters. I'd say you're perfectly aligned with Databricks' architecture and recommendations here. Using DLT/job clusters also makes it easier to go 'serverless' in the future, should there be any need for it.
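Related to that: a DLT pipeline has a development/production toggle that controls exactly this behavior. A minimal sketch of the relevant setting (the pipeline name is a placeholder; the comments summarize the documented behavior):

```python
# The "development" flag in the pipeline settings controls cluster lifecycle:
#   development=True  -> the cluster is kept around and reused between runs
#                        (faster iteration), and automatic retries are off.
#   development=False -> production mode: the cluster is terminated shortly
#                        after each update, and failed updates are retried.
pipeline_settings = {
    "name": "my_dlt_pipeline",  # placeholder name
    "development": True,
}
```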

I don't have that much experience with DLT yet, tbh. The fact that we need to develop & test queries in a "normal" notebook and then copy them to a DLT pipeline sounds somewhat clunky...
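For what it's worth, the pipeline notebook itself is just Python with the dlt decorators. A minimal sketch (table names and the source path are placeholders) - the catch is that `import dlt` only resolves when the notebook runs as part of a pipeline, which is why prototyping happens in a normal notebook first:

```python
# Minimal DLT pipeline notebook (Python). Note: "import dlt" only works when
# this notebook is attached to a DLT pipeline, not when run interactively.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested with Auto Loader (placeholder path)")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/events")  # placeholder source path
    )

@dlt.table(comment="Cleaned events")
def clean_events():
    return dlt.read_stream("raw_events").where(col("event_type").isNotNull())
```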
