SingleNode all-purpose cluster for small ETLs

RicksDB — Thu, 30 Dec 2021 01:43:46 GMT

Hi,

I have many "small" jobs that need to be executed quickly and at a predictable, low cost from several Azure Data Factory pipelines. For this reason, I configured a small single node cluster to execute those processes. For the moment, everything seems to run as expected, and I get approximately 30s of execution time for each job after the first execution.

However, based on the documentation, it seems as if my use case is not officially supported. Am I understanding this correctly? Is this simply a warning, or will I have potential issues with this solution?

Re: SingleNode all-purpose cluster for small ETLs

Anonymous — Thu, 30 Dec 2021 17:29:47 GMT

Hello again! As before, we'll give the community a while to respond, and if no one does, we'll get back to this.

Re: SingleNode all-purpose cluster for small ETLs

Anonymous — Fri, 31 Dec 2021 00:15:16 GMT

In this sense they mean shared among many users. If you had 4 different users submitting jobs to a single node cluster you'd have some trouble with the resource balancing.

If what you're doing is currently working, keep doing it!

Re: SingleNode all-purpose cluster for small ETLs

BilalAslamDbrx — Mon, 03 Jan 2022 14:55:03 GMT

Exactly what @Joseph Kambourakis said. Single node clusters are designed to be used for single-user machine learning use cases. Think of them as a laptop in the sky.

@E H your use case is really good, we get this all the time. We are working hard to bring serverless clusters to the Data Science & Engineering Workspace. Once we have those, you will get super fast startup time. Is that the ideal solution in your mind for your use case?

Re: SingleNode all-purpose cluster for small ETLs

RicksDB — Mon, 03 Jan 2022 17:39:16 GMT

Hi @Bilal Aslam

Serverless clusters would definitely help regarding the speed required for the small jobs of most of my clients.

That being said, most of these clients require calculating the "worst case" for most technologies when presenting a business case.

Right now, I am able to do so using interactive clusters, since I can assume the worst (744 hours, i.e. the cluster running for a full 31-day month), knowing that the jobs will be queued and the budget therefore respected if it happens.

Will it be possible to set quotas to achieve the same thing? (I.e., ensure there is no unexpected high charge, such as an infinite loop introduced by a user driving up the cost, rather than relying on email alerts and custom scripts to detect such errors.)
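To make the worst-case arithmetic concrete, here is a minimal sketch of the calculation. The function name and all three rates are hypothetical placeholders, not actual Azure or Databricks prices; only the 744-hour figure comes from the post.

```python
# Worst-case monthly cost for a cluster that never shuts down.
# All rates below are placeholder assumptions, not real prices.

HOURS_IN_31_DAY_MONTH = 31 * 24  # 744 hours, the figure quoted above


def worst_case_monthly_cost(dbu_per_hour, price_per_dbu, vm_price_per_hour,
                            hours=HOURS_IN_31_DAY_MONTH):
    """Upper bound on the monthly bill if the cluster runs the whole month."""
    dbu_cost = hours * dbu_per_hour * price_per_dbu  # platform (DBU) charge
    vm_cost = hours * vm_price_per_hour              # underlying VM charge
    return dbu_cost + vm_cost


# Hypothetical rates: 1 DBU/hour, 0.50 per DBU, 0.25 per VM hour.
example_cost = worst_case_monthly_cost(
    dbu_per_hour=1.0, price_per_dbu=0.5, vm_price_per_hour=0.25
)
print(example_cost)
```

Because jobs queue on a single fixed-size cluster rather than scaling out, this bound holds no matter how many jobs are submitted.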

If this is achievable, this is exactly what we need.

Thanks

Re: SingleNode all-purpose cluster for small ETLs

BilalAslamDbrx — Tue, 04 Jan 2022 10:36:44 GMT

@E H we're definitely thinking about budgets and quotas for jobs. There are several things we can do, ranked in order of rough complexity-to-implement:

1. Display the DBU cost of each job in the Jobs UI.
2. Alert on the DBU cost of a job (e.g. "Alert me if this job costs >20 DBUs").
3. Alert on the $$ cost of a job (e.g. "Alert me if this job costs >$5").

Thoughts on what you'd prefer?

Re: SingleNode all-purpose cluster for small ETLs

RicksDB — Tue, 04 Jan 2022 13:52:06 GMT

@Bilal Aslam In my case, it usually depends on the customers and their SLA. Most of them do not have a "true" high SLA requirement and thus prefer that jobs be throttled when the actual cost comes within a certain range of the budget, instead of scaling indefinitely.

In an ideal world, options 1 and 3 would be implemented. Option 3 would be configurable to optionally add throttling when required.

The throttling feature would be used to estimate the worst case.
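As an illustration of the throttling behaviour described above, here is a rough sketch of a budget guard. It is not a Databricks feature or API: the `BudgetGuard` class, its fields, and the 80% threshold are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class BudgetGuard:
    """Hypothetical guard that throttles or rejects jobs near a budget."""
    monthly_budget: float     # hard cap, in the customer's currency
    throttle_at: float = 0.8  # assumed fraction of budget where throttling starts
    spent: float = 0.0        # cost recorded so far this month

    def record(self, cost: float) -> None:
        """Add the actual cost of a finished job to the running total."""
        self.spent += cost

    def decision(self, estimated_cost: float) -> str:
        """Return 'run', 'throttle', or 'reject' for a job about to start."""
        projected = self.spent + estimated_cost
        if projected > self.monthly_budget:
            return "reject"    # would exceed the cap
        if projected >= self.throttle_at * self.monthly_budget:
            return "throttle"  # within range of the budget: slow down or queue
        return "run"


guard = BudgetGuard(monthly_budget=100.0)
print(guard.decision(10.0))  # far below the budget
guard.record(75.0)
print(guard.decision(10.0))  # projected 85 is past the 80% threshold
```

With a guard like this, the worst case presented in a business case is simply the configured budget, since nothing is started once the cap would be exceeded.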