topic Databrick JDBC Driver making "List Column SQL" Query Everytime in Data Engineering

Databrick JDBC Driver making "List Column SQL" Query Everytime

varunjaincse — Wed, 04 Dec 2024 12:28:22 GMT

I am trying to use the Databricks JDBC Spark Driver to run sql queries on the SQL Warehouse

Sample connection String

String TOKEN = "<token>";
String HTTP_PATH = "/sql/1.0/warehouses/<sql-warehouse-id>";

final String connStr = "jdbc:spark://discover.cloud.databricks.com:443/default;" +
        "TransportMode=http;SSL=1;" +
        "HTTPPath=" + HTTP_PATH + ";" +
        "AuthMech=11;" +
        "Auth_Flow=0;" +
        "Auth_AccessToken=" + TOKEN + ";";
try {
    final Connection connection = DriverManager.getConnection(connStr);
    final Statement statement = connection.createStatement();

    final ResultSet rset = statement.executeQuery("SELECT count(*) FROM table_name limit 10");
    int columnCount = rset.getMetaData().getColumnCount();

    while (rset.next()) {
        for (int i = 1; i <= columnCount; i++) {
            System.out.print(rset.getString(i) + "\t");
        }
        System.out.println();
    }

    connection.close();
} catch (SQLException e) {
    System.err.println(e);
}

Above code is always running two SQL on the Warehouse
1. Listing columns 'catalog : Spark, schemaPattern : <schema_name>, tablePattern : <table_name>, columnName : null' -- Take 15+ sec everytime
2. Actual query

I tried reusing the connection object it didn't work. Any idea what going on here?
Using Driver Version: 2.6.18

Re: Databrick JDBC Driver making "List Column SQL" Query Everytime

VZLA — Wed, 04 Dec 2024 13:12:57 GMT

Hello, thank you for your question.

The initial metadata query (the "Listing Columns" query) is tied to the SparkGetColumnsOperation class, which is part of the Apache Hive ThriftServer and Spark's handling of JDBC metadata operations.

Can you please confirm the SELECT COUNT LIMIT statement is triggering this first 15+ seconds query, and not the rset.getMetaData().getColumnCount(); ?

With respect to the SELECT statement, at least initially, it is ambiguous which catalog or schema the table resides in, so the driver needs to validate metadata to ensure the table and columns exist. But, the rset.getMetaData().getColumnCount(), explicitly fetches metadata for the query's result set. Internally, it uses DatabaseMetaData.getColumns() or similar operations which could be invoking the SparkGetColumnsOperation class. The metadata operation retrieves schema information about the columns in the result set (names, types, nullability, etc.), even if the query result is a single column like count(*). I would expect this to be more probable trigger for the metadata query as it ensures the ResultSet is correctly structured.

Please try some of these and compare the results:

Skip the metadata call and directly process the ResultSet:
- final ResultSet rset = statement.executeQuery("SELECT count(*) FROM table_name LIMIT 10"); while (rset.next()) { System.out.println(rset.getInt(1)); // Directly fetch the first column }
Use the FQN:
- final ResultSet rset = statement.executeQuery("SELECT count(*) FROM my_catalog.my_schema.table_name LIMIT 10");
Use PreparedStatements:
- String query = "SELECT count(*) FROM table_name LIMIT 10";
  PreparedStatement preparedStatement = connection.prepareStatement(query);
  ResultSet rset = preparedStatement.executeQuery();
  
  while (rset.next()) {
  System.out.println(rset.getInt(1));
  }

So, ultimately something like this:

String TOKEN = "<token>"; String HTTP_PATH = "/sql/1.0/warehouses/<sql-warehouse-id>"; final String connStr = "jdbc:spark://discover.cloud.databricks.com:443/default;" + "TransportMode=http;SSL=1;" + "HTTPPath=" + HTTP_PATH + ";" + "AuthMech=11;" + "Auth_Flow=0;" + "Auth_AccessToken=" + TOKEN + ";"; try (Connection connection = DriverManager.getConnection(connStr); Statement statement = connection.createStatement()) { final ResultSet rset = statement.executeQuery("SELECT count(*) FROM my_catalog.my_schema.table_name LIMIT 10"); while (rset.next()) { System.out.println(rset.getInt(1)); // Directly fetch the result } } catch (SQLException e) { e.printStackTrace(); }

Re: Databrick JDBC Driver making "List Column SQL" Query Everytime

varunjaincse — Thu, 05 Dec 2024 08:09:21 GMT

Thank you for your time. I got this fix by adding Param UseNativeQuery=1, in my jdbc connection string. By default it take value 2. As per the driver documentation

0: The connector transforms the queries emitted by applications and converts them into an equivalent form in HiveQL.
1: The connector does not transform the queries emitted by applications, so the native query is used.
2: The connector automatically sets this property to either 0 or 1, depending on the server capabilities.