05-28-2024 11:13 AM
Hello,
I have an Azure Databricks serverless SQL warehouse instance that I can connect to using databricks-sql-connector. But when I try to use PySpark with the JDBC driver URL, I can't read or write.
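For comparison, the connector-based access that works looks roughly like this (get_connection_configs() is my own helper that returns the hostname, HTTP path, and access token):

from databricks import sql

def read_with_connector():
    # This path works: query the warehouse through databricks-sql-connector
    server_hostname, http_path, access_token = get_connection_configs()
    with sql.connect(server_hostname=server_hostname,
                     http_path=http_path,
                     access_token=access_token) as conn:
        with conn.cursor() as cursor:
            cursor.execute("SELECT * FROM researchers.dev.test_transcripts LIMIT 5")
            print(cursor.fetchall())

Here is the PySpark/JDBC code that fails: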
from pyspark import SparkConf
from pyspark.sql import SparkSession

def get_jdbc_url():
    # Define your Databricks parameters
    server_hostname, http_path, access_token = get_connection_configs()
    default_catalog = "researchers"
    # Build the Spark JDBC URL
    jdbc_url = (f"jdbc:databricks://{server_hostname}:443;httpPath={http_path};"
                f"AuthMech=3;UID=token;PWD={access_token};ConnCatalog={default_catalog}")
    return jdbc_url

def get_spark() -> SparkSession:
    # Initialize SparkSession
    conf = SparkConf()
    # unfortunately, there is no automatic way for pyspark to download the jar itself
    # TODO set to your own jar path
    jar_path = "/home/username/repos/uwcip-research/code-examples/DatabricksJDBC42-2.6.38.1068/DatabricksJDBC42.jar"
    conf.set("spark.jars", jar_path)
    conf.set("spark.driver.extraClassPath", jar_path)
    conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    spark = SparkSession.builder \
        .appName("connecting to databricks") \
        .config(conf=conf) \
        .getOrCreate()
    return spark
def read_table():
    spark = get_spark()
    jdbc_url = get_jdbc_url()
    dbtable = "researchers.dev.test_transcripts"
    df = spark.read \
        .format("jdbc") \
        .option("url", jdbc_url) \
        .option("dbtable", dbtable) \
        .load()
    top5 = df.head(5)
    print(top5)
    return
Below are the error messages and read results I got when calling read_table(). Notice that it simply returned the column names five times. When I call df.printSchema(), it works fine.
ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [n]
ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern.
[Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript')]
Additionally, there is a probably unrelated logger error:
ERROR StatusLogger Unable to create Lookup for bundle
java.lang.ClassCastException: class org.apache.logging.log4j.core.lookup.ResourceBundleLookup
at java.base/java.lang.Class.asSubclass(Class.java:3640)
at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:84)
at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:105)
at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.config.AbstractConfiguration.<init>(AbstractConfiguration.java:135)
at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.config.NullConfiguration.<init>(NullConfiguration.java:32)
at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.LoggerContext.<clinit>(LoggerContext.java:74)
...
ERROR StatusLogger Unable to create Lookup for ctx
java.lang.ClassCastException: class org.apache.logging.log4j.core.lookup.ContextMapLookup
at java.base/java.lang.Class.asSubclass(Class.java:3640)
at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:84)
05-29-2024 03:17 AM
Hi @amelia1, the error messages you provided indicate that there might be a problem with the log4j configuration and formatting. Additionally, the repeated column names suggest an issue with how the data is being retrieved.
Here are some steps to help you resolve this issue:
Log4j Issue:
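Judging by the package names in the stack traces you posted (com.databricks.client.jdbc42.internal.apache.logging.log4j), these StatusLogger messages come from the log4j copy shaded inside the JDBC driver, not from your application's logging setup. As a sketch, assuming the LogLevel and LogPath connection properties described in the driver's configuration guide, you can steer the driver's own logging through the JDBC URL:

# Assumed driver properties (LogLevel, LogPath); check your driver version's docs.
jdbc_url = (
    f"jdbc:databricks://{server_hostname}:443;httpPath={http_path};"
    f"AuthMech=3;UID=token;PWD={access_token};ConnCatalog={default_catalog};"
    f"LogLevel=3;LogPath=/tmp/databricks-jdbc-logs"
)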
Column Names Issue: Use backticks (`) around column names in your SQL query to avoid issues, for example:
sqlQuery = "SELECT `column1`, `column2`, `column3` FROM tableschema.tablename WHERE start_date >= '2023-11-11' AND end_date < '2023-11-15' LIMIT 3"
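With Spark's JDBC reader, such a statement can be passed through the query option instead of dbtable (the two are mutually exclusive); a minimal sketch reusing your jdbc_url and the columns from your output:

df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("query",
            "SELECT `video_name`, `video_transcript` "
            "FROM researchers.dev.test_transcripts") \
    .load()
df.show(5, truncate=False)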
Check Your JDBC URL: Double-check the dbtable parameter in your read_table() function. Make sure it corresponds to the correct table name in your Azure SQL warehouse.
Spark Configuration:
Legacy Column Naming (Databricks Runtime 7.0+): In Databricks Runtime 7.0 and above, the SHOW DATABASES command returns the column name as namespace rather than databaseName; setting spark.sql.legacy.keepCommandOutputSchema to true keeps the legacy databaseName output schema.
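For example, the flag can be set when building the session (a sketch; verify the setting name against your runtime's documentation):

spark = SparkSession.builder \
    .appName("connecting to databricks") \
    .config("spark.sql.legacy.keepCommandOutputSchema", "true") \
    .getOrCreate()
# With the legacy schema kept, SHOW DATABASES reports databaseName again
spark.sql("SHOW DATABASES").printSchema()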
...
Test with a Simple Query: Try running a basic query (e.g., SELECT * FROM your_table) and see if the problem persists; a sketch follows.
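A minimal version of that check against your table, reusing the jdbc_url from your get_jdbc_url() helper:

test_df = spark.read \
    .format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "researchers.dev.test_transcripts") \
    .load()
# If the JDBC path is healthy, this prints real rows rather than repeated column names
test_df.show(5, truncate=False)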
Hopefully, this will help you resolve the issue and retrieve the data correctly. If you encounter any further issues or need additional assistance, feel free to ask! 😊