Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

pyspark read data using jdbc url returns column names only

amelia1
New Contributor II

Hello,

I have a remote serverless SQL warehouse on Azure Databricks that I can access using databricks-sql-connector. I can read, write, and update tables with no problem.

But I'm also trying to read/write/update tables using local PySpark plus the Databricks JDBC driver. When I run df = spark.read.format("jdbc")... followed by df.head(5), only the column names are returned, repeated 5 times. See my code below.
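For contrast, the connector path that works is roughly the following (a minimal sketch, assuming databricks-sql-connector is installed and using the same helper and table as the JDBC code below):

from databricks import sql

# Working path for comparison: databricks-sql-connector, no JDBC involved.
server_hostname, http_path, access_token = get_connection_configs()
with sql.connect(
    server_hostname=server_hostname,
    http_path=http_path,
    access_token=access_token,
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM researchers.dev.test_transcripts LIMIT 5")
        for row in cursor.fetchall():
            print(row)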



from pyspark import SparkConf
from pyspark.sql import SparkSession


def get_jdbc_url():
    # Define your Databricks parameters
    server_hostname, http_path, access_token = get_connection_configs()
    default_catalog = "researchers"
    # Build the Spark JDBC URL
    jdbc_url = f"jdbc:databricks://{server_hostname}:443;httpPath={http_path};AuthMech=3;UID=token;PWD={access_token};ConnCatalog={default_catalog}"
    return jdbc_url
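get_connection_configs() is not shown in the post; for completeness, a hypothetical sketch that reads the values from environment variables (the variable names are assumptions, not from the original):

import os

def get_connection_configs():
    # Hypothetical helper: pull the workspace hostname, the SQL warehouse
    # HTTP path, and a personal access token from environment variables.
    server_hostname = os.environ["DATABRICKS_SERVER_HOSTNAME"]  # e.g. adb-<id>.azuredatabricks.net
    http_path = os.environ["DATABRICKS_HTTP_PATH"]              # e.g. /sql/1.0/warehouses/<warehouse-id>
    access_token = os.environ["DATABRICKS_TOKEN"]
    return server_hostname, http_path, access_token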


def get_spark() -> SparkSession:
    # Initialize SparkSession
    conf = SparkConf()
    # Unfortunately, there is no automatic way for pyspark to download the jar itself.
    # TODO: set to your own jar path
    conf.set("spark.jars",
             "/home/username/repos/uwcip-research/code-examples/DatabricksJDBC42-2.6.38.1068/DatabricksJDBC42.jar")
    conf.set("spark.driver.extraClassPath",
             "/home/username/repos/uwcip-research/code-examples/DatabricksJDBC42-2.6.38.1068/DatabricksJDBC42.jar")
    conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    spark = SparkSession.builder \
        .appName("connecting to databricks") \
        .config(conf=conf) \
        .getOrCreate()
    return spark
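As an aside on the jar comment above: Spark can also resolve driver jars from Maven at session startup via spark.jars.packages, which avoids hard-coding a local path. Whether the Databricks JDBC driver is published under these exact coordinates is an assumption worth verifying:

# Assumed Maven coordinates -- verify the artifact name and version before relying on this.
conf.set("spark.jars.packages", "com.databricks:databricks-jdbc:2.6.38")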
def read_table():
    spark = get_spark()
    jdbc_url = get_jdbc_url()
    dbtable = "researchers.dev.test_transcripts"

    df = spark.read \
        .format("jdbc") \
        .option("url", jdbc_url) \
        .option("dbtable", dbtable) \
        .load()

    top5 = df.head(5)
    print(top5)
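For completeness, an entry point that ties the helpers above together and runs the reproduction end to end:

if __name__ == "__main__":
    # Build the session, connect over JDBC, and print the first rows.
    read_table()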

 

The errors and output I got:

ERROR StatusLogger Unable to create Lookup for date
java.lang.ClassCastException: class org.apache.logging.log4j.core.lookup.DateLookup
at java.base/java.lang.Class.asSubclass(Class.java:3640)
at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:84)
at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:105)

....

ERROR StatusLogger Unable to create Lookup for env
java.lang.ClassCastException: class org.apache.logging.log4j.core.lookup.EnvironmentLookup
at java.base/java.lang.Class.asSubclass(Class.java:3640)

....

ERROR StatusLogger Unrecognized format specifier [d]
ERROR StatusLogger Unrecognized conversion specifier [d] starting at position 16 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [thread]
ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [level]
ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [logger]
ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [msg]
ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [n]
ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern.
[Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript')]
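One quick check worth adding to the repro (a sketch): printing the schema Spark derived from the JDBC driver's metadata, to confirm the columns and types were resolved as expected.

# Inspect the schema Spark built from the JDBC driver's metadata.
df.printSchema()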

 

So it looks to be some sort of log4j issue (not sure if relevant), and then only the column names are returned 5 times. Any help is greatly appreciated.

 

 

1 REPLY

anardinelli
New Contributor III

Hi @amelia1, how are you?

What you got back was indeed the top 5 rows (note that they are Row objects). What does it show when you run display(df)?

I'm thinking it might be something related to your schema: since you did not define one, the source may read the data in an unstructured manner, replacing some of the fields.

Can you please try defining the schema and passing it to spark.read?

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Create a SparkSession
spark = SparkSession.builder.appName("Example").getOrCreate()

# Define the schema
schema = StructType([
    StructField("video_name", StringType(), True),
    StructField("video_transcript", StringType(), True)
])

df = spark.read \
    .format("jdbc") \
    .schema(schema) \
    .option("url", jdbc_url) \
    .option("dbtable", dbtable) \
    .load()

display(df)
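One caveat when running this locally rather than in a Databricks notebook: display() is a notebook helper and is not defined in a plain PySpark session. A local stand-in:

# Outside a Databricks notebook, use show() instead of display().
df.show(5, truncate=False)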

Let me know if that solves your problem.

Best,

Alessandro
