PySpark read via JDBC URL returns column names only
05-28-2024 11:26 AM
Hello,
I have a remote Azure serverless SQL warehouse instance that I can access using databricks-sql-connector. I can read/write/update tables no problem.
But I'm also trying to read/write/update tables using local PySpark + the JDBC driver. When I run df = spark.read.format("jdbc")... followed by df.head(5), only the column names are returned, five times. See my code below.
def get_jdbc_url():
    # Define your Databricks parameters
    server_hostname, http_path, access_token = get_connection_configs()
    default_catalog = "researchers"
    # Build the Spark JDBC URL
    jdbc_url = f"jdbc:databricks://{server_hostname}:443;httpPath={http_path};AuthMech=3;UID=token;PWD={access_token};ConnCatalog={default_catalog}"
    return jdbc_url
def get_spark() -> SparkSession:
    # Initialize SparkSession
    conf = SparkConf()
    # unfortunately, there is no automatic way for pyspark to download the jar itself
    # TODO set to your own jar path
    conf.set("spark.jars",
             "/home/username/repos/uwcip-research/code-examples/DatabricksJDBC42-2.6.38.1068/DatabricksJDBC42.jar")
    conf.set("spark.driver.extraClassPath",
             "/home/username/repos/uwcip-research/code-examples/DatabricksJDBC42-2.6.38.1068/DatabricksJDBC42.jar")
    # enable Arrow-based columnar transfers for faster pandas conversion
    conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    spark = SparkSession.builder \
        .appName("connecting to databricks") \
        .config(conf=conf) \
        .getOrCreate()
    return spark
def read_table():
    spark = get_spark()
    jdbc_url = get_jdbc_url()
    dbtable = "researchers.dev.test_transcripts"
    df = spark.read \
        .format("jdbc") \
        .option("url", jdbc_url) \
        .option("dbtable", dbtable) \
        .load()
    top5 = df.head(5)
    print(top5)
The error I got:
ERROR StatusLogger Unable to create Lookup for date
java.lang.ClassCastException: class org.apache.logging.log4j.core.lookup.DateLookup
at java.base/java.lang.Class.asSubclass(Class.java:3640)
at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:84)
at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:105)
....
ERROR StatusLogger Unable to create Lookup for env
java.lang.ClassCastException: class org.apache.logging.log4j.core.lookup.EnvironmentLookup
at java.base/java.lang.Class.asSubclass(Class.java:3640)
....
ERROR StatusLogger Unrecognized format specifier [d]
ERROR StatusLogger Unrecognized conversion specifier [d] starting at position 16 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [thread]
ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [level]
ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [logger]
ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [msg]
ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [n]
ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern.
[Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript')]
So it looks to be some sort of log4j issue (not sure if relevant), and then only the column names are returned five times. Any help is greatly appreciated.
05-29-2024 03:16 PM
Hi @amelia1, how are you?
What you got was indeed the top 5 rows (note that they are Row objects). What does it show when you run display(df)?
I'm thinking it might be something related to your schema: since you did not define one, the data may be read in an unstructured manner, replacing some of the fields.
Can you please try defining it and passing it to spark.read?
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Create a SparkSession
spark = SparkSession.builder.appName("Example").getOrCreate()

# Define the schema
schema = StructType([
    StructField("video_name", StringType(), True),
    StructField("video_transcript", StringType(), True)
])

df = spark.read \
    .format("jdbc") \
    .schema(schema) \
    .option("url", jdbc_url) \
    .option("dbtable", dbtable) \
    .load()

display(df)
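(One note if you are running this locally rather than in a Databricks notebook: display() is a notebook helper and is not defined in a plain PySpark session, so df.show(5) would be the local equivalent.)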
Let me know if that solves your problem.
Best,
Alessandro
10-23-2024 01:19 PM
I added `UseNativeQuery=0` to the JDBC URL. It works for me.
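For context on why this likely helps (my understanding, so treat it as a plausible explanation rather than an official one): Spark's generic JDBC source generates queries that quote column identifiers with double quotes, e.g. SELECT "video_name", "video_transcript" FROM .... If the driver passes that SQL through natively, Databricks SQL can parse the double-quoted tokens as string literals, which matches the symptom above: every row contains the column names themselves. With UseNativeQuery=0 the driver transforms the query before sending it. A minimal sketch of the adjusted URL builder from the original post (get_connection_configs is the original poster's own helper):

def get_jdbc_url():
    server_hostname, http_path, access_token = get_connection_configs()
    default_catalog = "researchers"
    # UseNativeQuery=0: have the driver transform the SQL that Spark's
    # generic JDBC source generates instead of passing it through verbatim,
    # so double-quoted identifiers are not read as string literals.
    jdbc_url = (
        f"jdbc:databricks://{server_hostname}:443;"
        f"httpPath={http_path};AuthMech=3;UID=token;PWD={access_token};"
        f"ConnCatalog={default_catalog};UseNativeQuery=0"
    )
    return jdbc_url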

