Thursday
Hi,
I am using the JDBC driver to execute an insert statement with several thousand rows (~4 MB). It takes several seconds to complete and, for some reason, consumes one full CPU core while doing so.
It seems like a lot of the time is spent in this method:
com.databricks.client.hivecommon.utils.HiveCommonQueryTranslationUtils.stripCatalogName
Sample stack trace:
void java.util.regex.Pattern.compile()
void java.util.regex.Pattern.<init>(String, int)
Pattern java.util.regex.Pattern.compile(String, int)
String com.databricks.client.hivecommon.utils.HiveCommonQueryTranslationUtils.RemoveCatalogFromQueryStringInternal(String, String, ILogger)
String com.databricks.client.hivecommon.utils.HiveCommonQueryTranslationUtils.stripCatalogName(String, ILogger, HiveJDBCSettings, IWarningListener)
void com.databricks.client.hivecommon.dataengine.HiveJDBCNativeQueryExecutor.<init>(ILogger, IHiveClient, HiveJDBCStatement, String, HiveJDBCCommonConnection, boolean, ConnSettingRequestMap, boolean, boolean)
IQueryExecutor com.databricks.client.hivecommon.dataengine.HiveJDBCDataEngine.prepare(String)
void com.databricks.client.jdbc.common.SPreparedStatement.<init>(String, IStatement, SConnection, int)
void com.databricks.client.jdbc.jdbc41.S41PreparedStatement.<init>(String, IStatement, SConnection, int)
void com.databricks.client.jdbc.jdbc42.S42PreparedStatement.<init>(String, IStatement, SConnection, int)
void com.databricks.client.hivecommon.jdbc42.Hive42PreparedStatement.<init>(String, HiveJDBCStatement, SConnection, int)
SPreparedStatement com.databricks.client.spark.jdbc.SparkJDBCObjectFactory.createPreparedStatement(String, IStatement, SConnection, int)
IJDBCPreparedStatement com.databricks.client.jdbc.common.JDBCObjectFactory.newPreparedStatement(String, IStatement, SConnection, int)
IJDBCPreparedStatement com.databricks.client.jdbc.common.SConnection$6.create(IStatement)
IJDBCStatement com.databricks.client.jdbc.common.SConnection$6.create(IStatement)
IJDBCStatement com.databricks.client.jdbc.common.SConnection$StatementCreator.create()
IJDBCPreparedStatement com.databricks.client.jdbc.common.SConnection.prepareStatement(String, int, int)
PreparedStatement com.databricks.client.jdbc.common.SConnection.prepareStatement(String, int, int)
How can this be fixed so that it is not CPU-bound?
Driver version:
com.databricks:databricks-jdbc:2.6.40
Thursday
Hi @ivni ,
Yes, that method can be CPU-intensive. According to the driver's docs, it removes the catalog name from the query statement. But it does this via regex patterns, which is a heavy operation from a CPU perspective, especially if you have a lot of complex queries.
What you can try to do is to add useNativeQuery=1 to your connection string. With that setting, the driver passes the SQL queries verbatim to Databricks.
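As a rough illustration of the regex cost mentioned above, here is a minimal sketch contrasting compiling a pattern on every call with compiling it once and reusing it. This is not the driver's actual implementation: the class name, the method names, and the pattern itself are hypothetical, chosen only to show the general technique.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; NOT the Databricks driver's code.
class RegexReuseSketch {

    // Builds a pattern matching "<catalog>." as a whole-word prefix (assumed shape).
    static Pattern catalogPrefix(String catalog) {
        return Pattern.compile("\\b" + Pattern.quote(catalog) + "\\.",
                Pattern.CASE_INSENSITIVE);
    }

    // Pays the Pattern.compile cost on every call.
    static String stripCompilingEachCall(String sql, String catalog) {
        return catalogPrefix(catalog).matcher(sql).replaceAll("");
    }

    // Reuses a pattern that was compiled once by the caller.
    static String stripWithCached(Pattern cached, String sql) {
        return cached.matcher(sql).replaceAll("");
    }

    public static void main(String[] args) {
        String sql = "INSERT INTO data.schema.t VALUES (1)";
        Pattern cached = catalogPrefix("data");
        System.out.println(stripCompilingEachCall(sql, "data"));
        System.out.println(stripWithCached(cached, sql));
    }
}
```

Both methods produce the same string; the difference is only in how often the pattern is compiled. Since the compilation here happens inside the driver, this cannot be changed from application code, which is why the connection settings are the first thing to try.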
Friday - last edited Friday
Thank you for the suggestion, but useNativeQuery=1 doesn't seem to reduce CPU usage. Usage example:
String sql = Files.readString(Path.of("insert.sql"));
String url = "jdbc:databricks://host.cloud.databricks.com:443/data;connschema=schema;transportMode=http;ssl=1;AuthMech=3;httpPath=/path;useNativeQuery=1";
Properties props = new Properties();
props.setProperty("user", "token");
props.setProperty("password", "<token>");
props.setProperty("useNativeQuery", "1");
Driver driver = DriverManager.getDriver(url);
try (Connection conn = driver.connect(url, props);
     Statement st = conn.createStatement()) {
    st.execute(sql);
}
Any other suggestions?
Friday
Hi,
You can also try disabling this behavior by adding StripCatalogName=0 to your JDBC connection string.
Friday
StripCatalogName=0 doesn't seem to have any effect either.
Friday
OK, one last thing. Try explicitly adding the catalog and schema to the JDBC connection string:
ConnCatalog=your_catalog;ConnSchema=your_schema;
Friday
So I guess something like this?
jdbc:databricks://host.cloud.databricks.com:443;httpPath=/path;ConnCatalog=data;ConnSchema=schema;transportMode=http;ssl=1;AuthMech=3;useNativeQuery=1;StripCatalogName=0
These measures don't seem to influence CPU consumption.
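To compare these settings one at a time, a small helper along the following lines could assemble the URL from a map of settings, so each variant is easy to toggle and re-profile. This is only a sketch: the class and method names are made up for illustration, and the setting names are simply the ones mentioned in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for assembling the connection URL; not part of the driver.
class JdbcUrlSketch {

    // Appends each setting as ";key=value" in insertion order.
    static String buildUrl(String host, int port, Map<String, String> settings) {
        StringBuilder sb = new StringBuilder("jdbc:databricks://")
                .append(host).append(':').append(port);
        for (Map.Entry<String, String> e : settings.entrySet()) {
            sb.append(';').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("httpPath", "/path");
        settings.put("ConnCatalog", "data");
        settings.put("ConnSchema", "schema");
        settings.put("transportMode", "http");
        settings.put("ssl", "1");
        settings.put("AuthMech", "3");
        // Settings under test; remove or change one at a time between profiling runs.
        settings.put("useNativeQuery", "1");
        settings.put("StripCatalogName", "0");
        System.out.println(buildUrl("host.cloud.databricks.com", 443, settings));
    }
}
```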
Friday
Could you check the stack trace once again, then? In a previous message you wrote that most of the time is spent in the method below:
com.databricks.client.hivecommon.utils.HiveCommonQueryTranslationUtils.stripCatalogName
What does it look like now?
Friday
It is still there: