Facing an issue while executing DDL and DML queries on the 12.0 cluster runtime version

YSDPrasad
New Contributor III

Hi all,

Currently we are using a cluster with Driver: Standard_D32s_v3 · Workers: Standard_D32_v3 · 2-8 workers · 6.4 Extended Support (includes Apache Spark 2.4.5, Scala 2.11). On this cluster we run a 24/7 streaming notebook triggered every minute, plus five more pipelines on different schedules. After starting all the triggers, execution is normal for about 24 hours; after that, the streaming and pipeline notebooks start failing: notebook execution time keeps increasing and every command takes longer to execute.

[Image: cluster metrics after 24 hours of execution]

After going through https://dbricks.co/3VWeuRJ, I started upgrading to the 12.0 runtime version. Since then I have been facing an issue while executing DDL and DML queries in the format below.

// NOTE: com.microsoft.azure.sqldb.spark is the legacy Azure SQL connector,
// built for Spark 2.4 / Scala 2.11; it is not available on Spark 3.x runtimes
// such as 12.0, which is the likely cause of the failure after upgrading.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
import com.microsoft.azure.sqldb.spark.query._

val query = "Truncate table tablename"

// Connection details come from Azure Key Vault-backed secrets.
val config = Config(Map(
  "url"          -> dbutils.secrets.get(scope = "azurekeyvault-scope", key = "DW-URL"),
  "databaseName" -> dbutils.secrets.get(scope = "azurekeyvault-scope", key = "DW-DBName"),
  "user"         -> dbutils.secrets.get(scope = "azurekeyvault-scope", key = "DW-Username"),
  "password"     -> dbutils.secrets.get(scope = "azurekeyvault-scope", key = "DW-Password"),
  "queryCustom"  -> query
))

// Executes the custom query through the legacy connector.
sqlContext.sqlDBQuery(config)

Can anyone suggest the best approach to resolve this issue?

Thanks and Regards.


5 REPLIES

Anonymous
Not applicable

@Someswara Durga Prasad Yaralgadda: I am giving you some ideas on how to think about this. Please test and implement what best suits you.

There are a few things you can do to optimize the performance of your notebooks:

  1. Consider using a higher performance cluster with more powerful worker nodes if your current cluster is struggling to keep up with your workloads.
  2. Optimize your queries to improve their performance. For example, you can try using more efficient SQL queries or optimizing your data pipeline to reduce the amount of data being processed.
  3. If you are experiencing slow query performance, consider using caching or optimizing your data storage to improve query speed (a short caching sketch follows this list).
  4. Avoid running unnecessary code in your notebooks. This can slow down the overall execution time and consume resources unnecessarily.
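
To illustrate the caching suggestion in item 3, here is a minimal Scala sketch; the Delta path and DataFrame name are hypothetical placeholders:

// Cache a DataFrame that is queried repeatedly so later reads hit memory
// instead of re-reading storage (the path below is a hypothetical example).
val events = spark.read.format("delta").load("/mnt/datalake/events")
events.cache()
events.count() // an action is needed to actually materialize the cache

// ... run the repeated queries against `events` here ...

events.unpersist()          // release the memory when finished
spark.catalog.clearCache()  // or drop everything cached in one call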

Regarding the specific issue you are facing, truncating a table can be a resource-intensive operation and may take longer on larger tables. You may want to consider optimizing the table schema or partitioning your data to improve query performance (a partitioned-write sketch follows below). Additionally, you can try using the DROP TABLE statement instead of TRUNCATE TABLE, which can sometimes be faster.
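
To make the partitioning suggestion concrete, a minimal Scala sketch; the table and column names are hypothetical:

// Rewrite a table partitioned by a commonly filtered column, so queries
// that filter on event_date can prune partitions instead of scanning all data.
val df = spark.table("raw_events")     // hypothetical source table
df.write
  .format("delta")
  .mode("overwrite")
  .partitionBy("event_date")           // hypothetical partition column
  .saveAsTable("raw_events_partitioned")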

Lastly, if your notebook is running 24/7, it may be a good idea to periodically restart the kernel to free up any resources that may have been consumed over time.

YSDPrasad (Accepted Solution)
New Contributor III

Hi Suteja,

We verified the code and removed the unwanted commands, but we are still facing the issue. We also tried clearing the cache with spark.catalog.clearCache(), but there was no change in cluster performance.

TRUNCATE TABLE is now working fine. I found a different way of executing the query on the 12.0 runtime version:

import java.sql.{Connection, DriverManager, Statement}

// Connection details come from Azure Key Vault-backed secrets.
val jdbcUrl  = dbutils.secrets.get(scope = "azurekeyvault-scope", key = "nycsqlserver-DW-ConnectionString")
val user     = dbutils.secrets.get(scope = "azurekeyvault-scope", key = "DW-Username")
val password = dbutils.secrets.get(scope = "azurekeyvault-scope", key = "DW-Password")

val updateQuery = "Truncate table tablename"

var conn: Connection = null
var stmt: Statement = null
try {
  // Plain JDBC works on the 12.0 runtime, unlike the legacy azure-sqldb-spark connector.
  conn = DriverManager.getConnection(jdbcUrl, user, password)
  stmt = conn.createStatement()
  stmt.executeUpdate(updateQuery)
} catch {
  case e: Exception => e.printStackTrace()
} finally {
  // Always release the statement and connection, even on failure.
  if (stmt != null) stmt.close()
  if (conn != null) conn.close()
}

Using the above code, I can execute the queries.
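
If the same pattern is needed in several notebooks, it can be wrapped in a small helper. A minimal sketch assuming the same secret-backed credentials; the function name is hypothetical:

import java.sql.DriverManager

// Hypothetical helper: runs one DDL/DML statement and always closes resources.
def runStatement(jdbcUrl: String, user: String, password: String, sql: String): Int = {
  val conn = DriverManager.getConnection(jdbcUrl, user, password)
  try {
    val stmt = conn.createStatement()
    try stmt.executeUpdate(sql) // affected row count (0 for DDL such as TRUNCATE)
    finally stmt.close()
  } finally conn.close()
}

// Example call with the secrets loaded above:
runStatement(jdbcUrl, user, password, "Truncate table tablename")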

On the periodic restart part: is there any way to improve performance without restarting the cluster? Otherwise, is there an automatic cluster restart option available in Databricks to avoid manual restarts?

Anonymous
Not applicable

@Someswara Durga Prasad Yaralgadda:

I'm glad to hear that you were able to resolve the issue with executing DDL queries in the new runtime version.

Regarding your question about periodically restarting the cluster to improve performance, this is a common practice to prevent long-running clusters from becoming unstable and to ensure that the cluster resources are being utilized effectively. By periodically restarting the cluster, you can release any accumulated resources and refresh the runtime environment, which can help to optimize performance.

However, restarting the cluster manually can be time-consuming and may cause disruptions to ongoing workloads. Databricks provides an automatic cluster termination feature that allows you to specify a time or duration for your cluster to be active, after which the cluster will automatically terminate. This feature can help you to save costs by ensuring that your clusters are only running when they are needed, and it can also help to ensure that the cluster is refreshed periodically to optimize performance.

To enable automatic termination, edit the cluster configuration in the Databricks workspace and set the auto-termination option ("Terminate after N minutes of inactivity"). Once the cluster has been idle for that long, it is terminated automatically. If you want a scheduled restart rather than termination, see the sketch below.
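
For a scheduled restart, one option is a small scheduled job that calls the Clusters REST API restart endpoint. A rough Scala sketch (JDK 8 compatible); the workspace URL, secret key, and cluster ID below are placeholders you would substitute:

import java.net.{HttpURLConnection, URL}

// All three values are hypothetical placeholders.
val workspaceUrl = "https://adb-1234567890123456.7.azuredatabricks.net"
val token        = dbutils.secrets.get(scope = "azurekeyvault-scope", key = "databricks-pat")
val clusterId    = "0123-456789-abcdef"

// POST /api/2.0/clusters/restart restarts a running cluster by ID.
val conn = new URL(s"$workspaceUrl/api/2.0/clusters/restart")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Authorization", s"Bearer $token")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
conn.getOutputStream.write(s"""{"cluster_id": "$clusterId"}""".getBytes("UTF-8"))
println(s"Restart request returned HTTP ${conn.getResponseCode}")
conn.disconnect()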

In addition to automatic cluster termination, consider optimizing your workloads to reduce the memory and compute resources the cluster requires. This can include techniques such as data pruning, caching, and partitioning, as well as tuning your code and queries to reduce the amount of data that needs to be processed. Optimizing your workloads helps ensure that your clusters run efficiently and cost-effectively.

Kaniz
Community Manager

Hi @Someswara Durga Prasad Yaralgadda, we haven't heard from you since the last response from @Suteja Kanuri, and I was checking back to see if her suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

Anonymous
Not applicable

Hi @Someswara Durga Prasad Yaralgadda,

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? If not, please let us know if you need more help.

We'd love to hear from you.

Thanks!
