โ10-11-2022 11:43 PM
Hello experts,
For the below question, I am trying to understand why option C was selected instead of B? As B would also have resolved the issue
Question 40
A data analyst has noticed that their Databricks SQL queries are running too slowly. They claim that this issue is affecting all of their sequentially run queries. They ask the data engineering team for help. The data engineering team notices that each of the queries uses the same SQL endpoint, but the SQL endpoint is not used by any other user. Which of the following approaches can the data engineering team use to improve the latency of the data analystโs queries?
A. They can turn on the Serverless feature for the SQL endpoint.
B. They can increase the maximum bound of the SQL endpointโs scaling range.
C. They can increase the cluster size of the SQL endpoint.
D. They can turn on the Auto Stop feature for the SQL endpoint.
E. They can turn on the Serverless feature for the SQL endpoint and change the Spot Instance Policy to โReliability Optimized.โ
โ10-13-2022 08:46 AM
The key is sequential. 2 has to wait for 1. The autoscaling is triggered by jobs sitting in the queue. 2 won't be in the queue yet because it's waiting for 1 to finish and then it gets added.
If all the queries were submitted at the same time, then more autoscaling would help.
โ10-12-2022 10:53 AM
I think the key is the word sequential. If the queries are sequential, they won't trigger autoscaling. For sequential, you'd want a larger VM. Autoscaling will be triggered by lots of cluster use/many queries.
โ10-13-2022 08:04 AM
@Joseph Kambourakisโ Thanks a lot for your response. Yes, I agree with you. By autoscaling, you meant adding new clusters or adding more worker nodes to the existing cluster? Can you please share any references you have on this?
โ10-13-2022 08:13 AM
https://docs.databricks.com/clusters/configure.html#cluster-size-and-autoscaling
Autoscaling is automatic cluster sizing up or down by adding or removing VMs to an existing cluster.
โ10-13-2022 08:38 AM
@Joseph Kambourakisโ great thanks.๐ Just to make things clear, here Query 1 is fired first followed by Query2, Query3, etc. like that in sequential order. So if Query 1 runs on say 4 worker nodes of the same VM instead of 2 worker nodes (option B) and similarly for Query 2, Query 3, etc. won't that address the performance issue highlighted in the question
โ10-13-2022 08:46 AM
The key is sequential. 2 has to wait for 1. The autoscaling is triggered by jobs sitting in the queue. 2 won't be in the queue yet because it's waiting for 1 to finish and then it gets added.
If all the queries were submitted at the same time, then more autoscaling would help.
โ02-07-2023 08:52 AM
On a sql server, there are wait states. Wait states occur when several processors (vCPUs) are processing and several threads are working through the processors. A longer running thread that has dependencies, can cause the thread that may have begun on a different process, but is dependent on the one that has the longer running thread to complete, will cause a wait state and slow down the dependent query. From the sysadmin POV, this could also be seen as hard faults in memory. The issue above from a SQL perspective is to increase the maximum bound which add endpoints (threads) and logically create additional flows - more streams, faster flow. From the sysadmin POV, adding additional clusters to increase the size of the endpoint would also solve the problem because the amount of memory allows for swifter execution. In either case, it should resolve the issue. It would make more sense to allow Azure or AWS to sort this out with a Serverless feature that is Reliably Optimizied. However, that feature will simply perform the feat of scaling the cluster. Assuming AWS/Azure have a handle on it, and their methodology is to scale the cluster, the better answer..., based on their handling of the situation, is to scale upward with the cluster. I suspect the answer here is the Databricks people aren't always thinking like SQL Server database administrators. They have lots of cluster power. Frankly, if it were my issue, I would try the maximum bound increase first. If that didn't work then I'd go for the increased cluster size. But my thinking is a cost optimization. There is not a part of the question that asks how to do this in a cost effective way. So - final answer - C: boost the cluster.
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโt want to miss the chance to attend and share knowledge.
If there isnโt a group near you, start one and help create a community that brings people together.
Request a New Group