Warehousing & Analytics
SQL Warehouse high number of concurrent queries

marko
New Contributor II

We are going to become a Databricks customer and ran some PoC tests. One test uses a dataset in a single partitioned table (15 columns) of roughly 250M rows; each partition holds ~50K-150K rows.

Occasionally we have hundreds (up to one thousand) of concurrent users, each querying (SQL SELECT) a different partition. According to

https://community.databricks.com/s/question/0D58Y00009WgtBiSAJ/data-bricks-sql-is-allowing-10-querie...

https://community.databricks.com/s/question/0D58Y00009ZuIHUSA3/sql-queries-on-databricks

I can only run 10 concurrent SQL queries per cluster. So if I need to run more queries concurrently, I can increase the number of clusters per SQL Warehouse (e.g., with 10 clusters per SQL Warehouse I can run 10x10 = 100 concurrent queries), right? I think similar behaviour can be achieved with more 'High Concurrency clusters'.
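The scaling arithmetic above can be sketched as a quick back-of-the-envelope check. The 10-queries-per-cluster figure is taken from the linked threads; this is an illustration of the math, not a Databricks API:

```python
import math

QUERIES_PER_CLUSTER = 10  # default per-cluster concurrency limit, per the linked threads

def clusters_needed(target_concurrency: int, per_cluster: int = QUERIES_PER_CLUSTER) -> int:
    """Minimum number of warehouse clusters needed to serve the target concurrency."""
    return math.ceil(target_concurrency / per_cluster)

print(clusters_needed(100))   # 10 clusters -> 10 x 10 = 100 concurrent queries
print(clusters_needed(1000))  # 100 clusters for one thousand concurrent users
```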

Or are there other ways to meet the requirement of one thousand concurrent queries? Thx.

M.

5 REPLIES

Anonymous
Not applicable

@Marian Kovac​ :

Yes, you are correct that the maximum number of concurrent queries per cluster in Databricks is 10 by default. However, you can increase this limit by adjusting the spark.databricks.maxConcurrentRuns configuration parameter, either per cluster by modifying the cluster configuration, or globally by modifying the cluster policy.

However, even with this increased limit, running 1000 concurrent queries on a single cluster might not be practical due to resource limitations. In this case, you can indeed increase the number of clusters per SQL Warehouse to increase the overall concurrency limit. Additionally, you can consider using autoscaling clusters, which can dynamically adjust the number of nodes based on the workload, to ensure that you have enough resources to handle the load.
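The multi-cluster scaling described above can be sketched as a SQL Warehouse configuration payload. This is a minimal sketch assuming the Databricks SQL Warehouses REST API (`POST /api/2.0/sql/warehouses`); the field names `min_num_clusters` and `max_num_clusters` should be verified against the API reference for your workspace before use, and no request is actually sent here:

```python
import json

# Hypothetical warehouse config: autoscaling between 1 and 10 clusters.
# Field names assume the Databricks SQL Warehouses REST API; verify before use.
warehouse_config = {
    "name": "high-concurrency-poc",
    "cluster_size": "Medium",
    "min_num_clusters": 1,    # scale down when idle
    "max_num_clusters": 10,   # ~10 clusters x ~10 queries each ~= 100 concurrent queries
    "auto_stop_mins": 30,     # stop the warehouse after 30 idle minutes
}

print(json.dumps(warehouse_config, indent=2))
```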

Another approach is to use query caching to reduce the number of concurrent queries. If many users are querying the same partition, you can cache the results of the query and serve subsequent requests from the cache. This can reduce the number of queries hitting the database and increase overall performance.
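The caching idea above can be illustrated with a tiny application-side sketch. This is not Databricks' built-in query result cache; it is a generic TTL cache (all names hypothetical) showing how repeated requests for the same partition can be served without re-querying the warehouse:

```python
import time
from typing import Any, Callable

class PartitionResultCache:
    """Minimal TTL cache keyed by partition, illustrating application-side result caching."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Any]] = {}  # key -> (stored_at, result)

    def get_or_query(self, partition_key: str, run_query: Callable[[], Any]) -> Any:
        now = time.monotonic()
        hit = self._store.get(partition_key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]              # cache hit: no query reaches the warehouse
        result = run_query()           # cache miss: run the SQL SELECT once
        self._store[partition_key] = (now, result)
        return result

# Usage: 100 users requesting the same partition trigger only one query.
cache = PartitionResultCache(ttl_seconds=300)
calls = 0

def fake_query():
    global calls
    calls += 1
    return ["row1", "row2"]

for _ in range(100):
    rows = cache.get_or_query("partition=2023-01", fake_query)

print(calls)  # 1 -- only the first request ran the query
```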

marko
New Contributor II

Thank you for your reply. To summarize, I can either (1) for Spark clusters, create multiple clusters and/or set spark.databricks.maxConcurrentRuns to increase concurrency per cluster, or (2) for SQL Warehouse, configure multiple clusters in one SQL Warehouse and/or create more SQL Warehouses.

Maybe one more question: is the hive_metastore somehow a bottleneck during concurrent SQL SELECTs? When I run a SQL query, is there a request to the hive_metastore on the read path that could cause a bottleneck? And if so, when I create more Spark clusters, do these clusters share one hive_metastore? Or is the hive_metastore cached per cluster, or even per worker?

And in the case of SQL Warehouse, how are concurrent SQL queries affected by the hive_metastore?

Or could you please point me to the documentation where these details are covered? Thank you very much.

M.

marko
New Contributor II

Please, is there any progress on this? Thx

Anonymous
Not applicable

@Marian Kovac​ :

Regarding your additional question, the Hive metastore is used by Databricks to store metadata about tables, views, and partitions in a central repository. It does not typically become a bottleneck during concurrent SQL queries. The reason is that the metastore is not involved in the query execution itself, but rather in the query planning phase, where the SQL engine needs to resolve the schema of the tables being queried.

However, if you have a large number of concurrent queries, it is possible that the metastore could become a bottleneck if it is not properly configured or if the number of concurrent queries is so high that it causes contention for access to the metastore database. In such cases, you may need to consider increasing the resources of the metastore database, or even sharding it across multiple instances.

When you create more Spark clusters, they share the same metastore database by default. The metastore is cached per cluster, which means that each Spark cluster maintains a local cache of the metadata that it needs to execute its queries. This cache is updated periodically by querying the metastore database, but the frequency of updates can be configured to balance performance and consistency requirements.

Similarly, the concurrent SQL queries in SQL Warehouse are not affected by the Hive metastore. SQL Warehouse uses its own metadata store to manage the schema of the tables and views it creates, and this metadata store is optimized for high concurrency access.

Anonymous
Not applicable

Hi @Marian Kovac​ 

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. Would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
