topic Re: When to increase maximum bound vs when to increase cluster size? in Data Engineering

When to increase maximum bound vs when to increase cluster size?

jerry747847 — Wed, 12 Oct 2022 06:43:49 GMT

Hello experts,

For the below question, I am trying to understand why option C was selected instead of B? As B would also have resolved the issue

Question 40

A data analyst has noticed that their Databricks SQL queries are running too slowly. They claim that this issue is affecting all of their sequentially run queries. They ask the data engineering team for help. The data engineering team notices that each of the queries uses the same SQL endpoint, but the SQL endpoint is not used by any other user. Which of the following approaches can the data engineering team use to improve the latency of the data analyst’s queries?

A. They can turn on the Serverless feature for the SQL endpoint.

B. They can increase the maximum bound of the SQL endpoint’s scaling range.

C. They can increase the cluster size of the SQL endpoint.

D. They can turn on the Auto Stop feature for the SQL endpoint.

E. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to “Reliability Optimized.”

This is the link for the practice exam

Re: When to increase maximum bound vs when to increase cluster size?

Anonymous — Wed, 12 Oct 2022 17:53:21 GMT

I think the key is the word sequential. If the queries are sequential, they won't trigger autoscaling. For sequential, you'd want a larger VM. Autoscaling will be triggered by lots of cluster use/many queries.

Re: When to increase maximum bound vs when to increase cluster size?

jerry747847 — Thu, 13 Oct 2022 15:04:26 GMT

@Joseph Kambourakis Thanks a lot for your response. Yes, I agree with you. By autoscaling, you meant adding new clusters or adding more worker nodes to the existing cluster? Can you please share any references you have on this?

Re: When to increase maximum bound vs when to increase cluster size?

Anonymous — Thu, 13 Oct 2022 15:13:48 GMT

https://docs.databricks.com/clusters/configure.html#cluster-size-and-autoscaling

Autoscaling is automatic cluster sizing up or down by adding or removing VMs to an existing cluster.

Re: When to increase maximum bound vs when to increase cluster size?

jerry747847 — Thu, 13 Oct 2022 15:38:55 GMT

@Joseph Kambourakis great thanks.👍 Just to make things clear, here Query 1 is fired first followed by Query2, Query3, etc. like that in sequential order. So if Query 1 runs on say 4 worker nodes of the same VM instead of 2 worker nodes (option B) and similarly for Query 2, Query 3, etc. won't that address the performance issue highlighted in the question

Re: When to increase maximum bound vs when to increase cluster size?

Anonymous — Thu, 13 Oct 2022 15:46:55 GMT

The key is sequential. 2 has to wait for 1. The autoscaling is triggered by jobs sitting in the queue. 2 won't be in the queue yet because it's waiting for 1 to finish and then it gets added.

If all the queries were submitted at the same time, then more autoscaling would help.

Re: When to increase maximum bound vs when to increase cluster size?

JRL — Tue, 07 Feb 2023 16:52:56 GMT

On a sql server, there are wait states. Wait states occur when several processors (vCPUs) are processing and several threads are working through the processors. A longer running thread that has dependencies, can cause the thread that may have begun on a different process, but is dependent on the one that has the longer running thread to complete, will cause a wait state and slow down the dependent query. From the sysadmin POV, this could also be seen as hard faults in memory. The issue above from a SQL perspective is to increase the maximum bound which add endpoints (threads) and logically create additional flows - more streams, faster flow. From the sysadmin POV, adding additional clusters to increase the size of the endpoint would also solve the problem because the amount of memory allows for swifter execution. In either case, it should resolve the issue. It would make more sense to allow Azure or AWS to sort this out with a Serverless feature that is Reliably Optimizied. However, that feature will simply perform the feat of scaling the cluster. Assuming AWS/Azure have a handle on it, and their methodology is to scale the cluster, the better answer..., based on their handling of the situation, is to scale upward with the cluster. I suspect the answer here is the Databricks people aren't always thinking like SQL Server database administrators. They have lots of cluster power. Frankly, if it were my issue, I would try the maximum bound increase first. If that didn't work then I'd go for the increased cluster size. But my thinking is a cost optimization. There is not a part of the question that asks how to do this in a cost effective way. So - final answer - C: boost the cluster.