12-19-2023 02:47 PM
Hi there, can someone help me understand what compute resources DLT (Delta Live Tables) uses? It's not clear to me at all whether it uses the last compute cluster I was working on, or something else entirely.
Can someone please help clarify this?
12-20-2023 03:33 PM
Hello, when you create a DLT pipeline you specify the 'Compute' configuration under the 'Create pipeline' section. Based on that configuration, DLT creates a cluster for you:
Workflows ==> Delta Live Tables ==> Create pipeline ==> Compute
You can also check the details of that cluster by opening the Spark UI or the logs under the "Update details" section once the cluster has initialized. Hope this clarifies your question.
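For reference, here is a rough sketch of what those compute settings look like in the pipeline's JSON (viewable from the pipeline UI). The field names follow the documented pipeline settings, but the pipeline name, notebook path, worker counts, and target schema below are placeholders assumed for illustration:

```python
# Illustrative sketch of DLT pipeline settings, written as a Python dict here
# (the UI shows the equivalent JSON). DLT provisions its own cluster for each
# update based on the "clusters" block; it does not attach to one of your
# existing all-purpose clusters. Names and paths are hypothetical.
pipeline_settings = {
    "name": "my_dlt_pipeline",          # hypothetical pipeline name
    "development": True,                # development vs. production mode
    "continuous": False,                # triggered (batch) updates
    "clusters": [
        {
            "label": "default",         # the cluster DLT creates for updates
            "autoscale": {"min_workers": 1, "max_workers": 4},
        }
    ],
    "libraries": [
        {"notebook": {"path": "/Repos/team/dlt/ingest_pipeline"}}  # hypothetical path
    ],
    "target": "my_schema",              # hypothetical target schema
}
```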
12-21-2023 09:24 AM
Sorry @Rajeev45, but this is still not clear to me. I see the `Compute` section in the DLT pipeline configuration, but we have multiple "all-purpose compute" clusters and a few "personal compute" clusters, and it's not clear to me which of these (if any) are used when I run a particular DLT job.
When DLT jobs run, do they use some kind of "ephemeral" compute that is generated and deleted every time a (production) DLT job runs? Is there a way to get a DLT job to run on a currently running standard cluster?
Does this make sense?
12-21-2023 11:14 AM
Sounds pretty much like a job cluster to me.
As far as I know there are only two types of clusters: all-purpose clusters (for interactive work) and job clusters, which execute jobs, just like the name implies. Personal clusters are non-shared all-purpose clusters that only one user (the owner) has access to. I hope that's right - I'm a beginner myself.
12-21-2023 01:47 PM
@quakenbush I think this makes sense. At the heart of my original question, I'm basically trying to get a better understanding of compute resource utilization and whether our costs would be higher or lower with Delta Live Tables (and its respective job clusters).
At the moment, we have several all-purpose compute clusters with varying 30/60/90-minute inactivity termination timeouts. But if we start using DLT more, I'm wondering how costs will be impacted by DLT development mode vs. DLT production mode vs., for example, an Auto Loader notebook running on our existing clusters.
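For a concrete point of comparison, here is a minimal sketch of the "Auto Loader notebook on existing compute" option: a plain Structured Streaming query attached to one of the all-purpose clusters. The source path, schema/checkpoint locations, and table name are assumptions for illustration:

```python
# Sketch of Auto Loader running as a regular Structured Streaming query in a
# notebook attached to an existing all-purpose cluster. `spark` is the session
# the Databricks notebook provides; paths and table names are hypothetical.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/events")
    .load("/Volumes/main/raw/events")
    .writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/events")
    .trigger(availableNow=True)   # process available files, then stop
    .toTable("main.bronze.events")
)
```

With this option the cost is driven by how long the all-purpose cluster stays up (including its inactivity timeout), whereas a DLT production run creates its own cluster for the update and terminates it shortly after the update finishes.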
12-22-2023 12:04 AM
Well, one thing they emphasize in the 'Advanced Data Engineer' training is that job clusters terminate within 5 minutes after a job completes. So this supports your theory about lower costs. I think job clusters are designed to do just that: execute jobs, then terminate and free up resources. They seem to be lightweight, e.g. not allowing for retries. That's why development/debugging should be done on all-purpose clusters. I'd say you're perfectly aligned with Databricks' architecture and recommendations here. Using DLT/job clusters also makes it easier to go 'serverless' in the future, should there be any need for it.
I don't have that much experience with DLT yet, tbh; the fact that we need to develop & test queries in a "normal" notebook and then copy them into a DLT pipeline sounds somewhat clunky...
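If it helps, the DLT side of that workflow is largely the same query with a decorator on top. A minimal sketch, assuming a hypothetical table name and source path; this code only runs on the cluster the pipeline itself creates:

```python
# Minimal sketch of the equivalent DLT table definition. The notebook holding
# this code is referenced from the pipeline's settings and executes on the
# pipeline's own cluster, not on an existing all-purpose cluster.
import dlt

@dlt.table(comment="Raw events ingested with Auto Loader")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")   # `spark` is provided by the DLT runtime
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/raw/events")       # hypothetical source path
    )
```

DLT manages the checkpoint and the target table for you, which is part of what the pipeline's cluster time pays for.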