Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

SingleNode all-purpose cluster for small ETLs

RicksDB
Contributor II

Hi,

I have many "small" jobs that need to be executed quickly and at a predictable, low cost from several Azure Data Factory pipelines. For this reason, I configured a small single node cluster to execute those processes. So far, everything seems to run as expected, and each job takes approximately 30s after the first execution.

However, based on the documentation, it seems my use case is not officially supported. Am I understanding this correctly? Is this simply a warning, or will I run into issues with this solution?


1 ACCEPTED SOLUTION


BilalAslamDbrx
Honored Contributor III

Exactly what @Joseph Kambourakis said. Single node clusters are designed for single-user machine learning use cases. Think of them as a laptop in the sky.

@E H your use case is really good; we get this all the time. We are working hard to bring serverless clusters to the Data Science & Engineering Workspace. Once we have those, you will get super fast startup times. Is that the ideal solution in your mind for your use case?


6 REPLIES

Anonymous
Not applicable

Hello again! As before, if the community does not respond after a while, we'll get back to this.

Anonymous
Not applicable

In this sense they mean shared among many users. If you had 4 different users submitting jobs to a single node cluster you'd have some trouble with the resource balancing.

If what you're doing is currently working, keep doing it!


RicksDB
Contributor II

Hi @Bilal Aslam

Serverless clusters would definitely help with the speed required for the small jobs of most of my clients.

That being said, most of these clients require calculating the "worst case" for most technologies when presenting a business case.

Right now, I am able to do so using interactive clusters, since I can assume the worst (744 hours, i.e. a cluster running around the clock for a 31-day month), knowing that jobs will be queued if that happens and the budget is therefore respected.
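The worst-case estimate described above can be sketched as simple arithmetic. This is only an illustration: the function name and all rates below are placeholders rather than real Azure or Databricks prices, and it assumes the bill is the sum of a DBU charge and a VM charge per hour.

```python
# 744 hours = 31 days * 24 hours, the always-on worst case quoted above.
HOURS_IN_31_DAY_MONTH = 31 * 24


def worst_case_monthly_cost(dbu_per_hour, price_per_dbu, vm_price_per_hour,
                            hours=HOURS_IN_31_DAY_MONTH):
    """Upper bound on spend for one cluster that never shuts down.

    All arguments are hypothetical inputs; look up the real rates for
    your node type and pricing tier before using the result.
    """
    hourly_cost = dbu_per_hour * price_per_dbu + vm_price_per_hour
    return hours * hourly_cost


# Placeholder rates, not real prices.
estimate = worst_case_monthly_cost(dbu_per_hour=1.0,
                                   price_per_dbu=0.5,
                                   vm_price_per_hour=0.25)
print(f"Worst case for the month: ${estimate:.2f}")
```

Because jobs queue on a single fixed-size cluster instead of adding capacity, this bound holds no matter how many pipelines submit work.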

Will it be possible to set quotas to achieve the same thing? (I.e., ensure there are no unexpectedly high charges, such as from an infinite loop introduced by a user, rather than relying on email alerts and custom scripts to detect such errors.)

If this is achievable, this is exactly what we need.

Thanks

BilalAslamDbrx
Honored Contributor III

@E H we're definitely thinking about budgets and quotas for jobs. There are several things we can do, ranked roughly by complexity to implement:

  1. Display the DBU cost of each job in the Jobs UI.
  2. Alert on the DBU cost of a job (e.g. "Alert me if this job costs >20 DBUs")
  3. Alert on the $$ cost of a job (e.g. "Alert me if this job costs >$5")
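To make the two alert options concrete, here is a rough sketch of the check such an alert might perform. It is not a Databricks feature or API: `CostAlert`, `triggered_alerts`, and the price per DBU are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CostAlert:
    """Hypothetical per-job alert thresholds; None disables a check."""
    max_dbus: Optional[float] = None
    max_dollars: Optional[float] = None


def triggered_alerts(dbus_used: float, price_per_dbu: float,
                     alert: CostAlert) -> List[str]:
    """Return one message per threshold that the job run exceeded."""
    messages = []
    if alert.max_dbus is not None and dbus_used > alert.max_dbus:
        messages.append(f"job used {dbus_used} DBUs (limit {alert.max_dbus})")
    dollars = dbus_used * price_per_dbu
    if alert.max_dollars is not None and dollars > alert.max_dollars:
        messages.append(
            f"job cost ${dollars:.2f} (limit ${alert.max_dollars:.2f})")
    return messages


# Thresholds taken from the examples in the list above; the rate is a placeholder.
alert = CostAlert(max_dbus=20, max_dollars=5)
for message in triggered_alerts(dbus_used=25, price_per_dbu=0.5, alert=alert):
    print(message)
```

The dollar check needs a price per DBU on top of the DBU count, which is one reason option 3 is harder to implement than option 2.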

Thoughts on what you'd prefer?

RicksDB
Contributor II

@Bilal Aslam In my case, it usually depends on the customers and their SLA. Most of them do not have a "true" high SLA requirement, so they prefer jobs to be throttled when the actual cost comes within a certain range of the budget rather than scaling indefinitely.

In an ideal world, options 1 and 3 would be implemented. Option 3 would be configurable to optionally add throttling when required.

The throttling feature would be used to estimate the worst case.
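As a minimal sketch of the throttling idea, assuming spend so far and the budget are both known in the same currency: the function name and the 90% default threshold are hypothetical, not an existing Databricks setting.

```python
def should_throttle(spent: float, budget: float, threshold: float = 0.9) -> bool:
    """Return True once spend reaches the given fraction of the budget.

    `threshold` is a hypothetical knob: 0.9 means "start throttling when
    90% of the budget has been used".
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    return spent >= threshold * budget


# A scheduler could queue or delay new runs instead of adding capacity.
if should_throttle(spent=95.0, budget=100.0):
    print("Near budget: queue the job instead of scaling out")
```

With a rule like this in place, the budget itself becomes the worst-case figure for a business case.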
