10-09-2024 01:41 AM - edited 10-09-2024 01:47 AM
so i was trying to integrate databricks in my java code above is code snippet used for jdbc connection while trying to read data from databricks table i am facing issue where df is returning data as header only have provided expected and actual outputs of df after investigating alittle on google tried changing dbtable name with and without schema name but still same issue is there
11-01-2024 07:02 AM
Can you please try it this way:
package com.example.databricks;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class DatabricksJDBCApp {
public static void main(String[] args) {
// Initialize Spark Session
SparkSession spark = SparkSession.builder()
.appName("Databricks JDBC Example")
.master("local")
.getOrCreate();
String pwd = "XXXXXXXXXXXXXXXXXXXXXXXX";
String warehouseId = "XXXXXXXXXXXXXXX";
String host = "XXXXXXXXXXXXXXXXXXX";
// JDBC URL to connect to Databricks
String url = "jdbc:databricks://" + host + ":443;" +
"transportMode=http;ssl=1;" +
"HttpPath=/sql/1.0/warehouses/" + warehouseId + ";" +
"UID=token;PWD=" + pwd;
// Use a SQL query to fetch the data
String query = "(SELECT * FROM framework_databricks.test_table3) AS temp";
// JDBC Driver Class
String driver = "com.databricks.client.jdbc.Driver";
Dataset<Row> databricksDF = spark.read()
.format("jdbc")
.option("url", url)
.option("dbtable", query) // Use SQL query as dbtable
.option("driver", driver)
.load();
// Show schema and data
databricksDF.printSchema();
System.out.println("Row Count: " + databricksDF.count());
databricksDF.show();
// Stop Spark session
spark.stop();
}
}
11-05-2024 05:32 AM
For me also same issue, can you please help?
from pyspark.sql import SparkSession
import pyspark
import os
from datetime import datetime
from subprocess import PIPE, run
jdbc_url = "jdbc:databricks://adb-xxxxx.1.azuredatabricks.net:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/xxxx;"
username = "token"
password = "dapxxxxxx15f12"
# Use a SQL query to fetch the data
query = "(SELECT * FROM test_catalog.ams.sample) AS temp"
driver = "com.databricks.client.jdbc.Driver"
spark = SparkSession \
.builder \
.appName("Databricks JDBC Read") \
.config("spark.jars", "/home/spark/shared/user-libs/spark/DatabricksJDBC42.jar")\
.config("spark.sql.sources.jdbc.useNativeQuery", "false") \
.getOrCreate()
# # Read data from Databricks SQL endpoint
df = spark.read \
.format("jdbc") \
.option("url", jdbc_url) \
.option("dbtable", query) \
.option("user", username) \
.option("password", password) \
.option("driver", driver) \
.load()
df.printSchema()
df.show()
# stop the session
spark.stop()
Result:-
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
root
|-- first_name: string (nullable = true)
+----------+
|first_name|
+----------+
|first_name|
|first_name|
|first_name|
|first_name|
|first_name|
|first_name|
|first_name|
|first_name|
|first_name|
+----------+
(python) bash-5.1$
11-05-2024 07:50 AM
@Pingleinferyx @dixonantony My apologies I misread the problem, can you try setting the .option("header", "true") in the spark.read or explicitly mention the schema and see if that helps?
11-29-2024 07:37 AM - edited 11-29-2024 07:42 AM
When I attempt either of those options I get the following errors:
defining the .schema(explicit_schema) returns the same result of having the headers as the data for every row
OR
[Databricks][JDBCDriver](500051) ERROR processing query/statement. Error Code: 0, SQL state: null, Query: SELECT * FROM edl.test_table WHERE 1=0, Error message from Server: Configuration header is not available..
11-13-2024 03:04 AM
Any update?
11-29-2024 06:54 AM
I'm encountering the same issue - has there been an update on this?
11-29-2024 09:45 AM - edited 11-29-2024 09:46 AM
After reading through the Driver documentation I've finally found a solution that appears to work for me. I've added .option("UseNativeQuery", 0) to my JDBC connection. The query that was being passed from the Databricks Driver to the Databricks Cluster was being altered to select the column names from my subquery, as opposed to the data values. By passing .option("UseNativeQuery", 0) the driver now alters my query to select the correct values (by transforming into a HiveQL syntax). My understanding is that the overhead is slightly higher to transform the query - but at least it's actually working. Unlike with "UseNativeQuery"=1 or "UseNativeQuery"=2 (which is what it defaults to if not specified in the .options()).
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group