Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

pyspark read data using jdbc url returns column names only

amelia1
New Contributor II

Hello,

I have a remote serverless SQL warehouse on Azure Databricks that I can access using databricks-sql-connector. I can read, write, and update tables with no problem.

But I'm also trying to read/write/update tables using local PySpark plus the Databricks JDBC driver. When I run df = spark.read.format("jdbc")... followed by df.head(5), only the column names are returned, repeated 5 times. See my code below.
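For contrast, the connector path that works is roughly the following (a minimal sketch, assuming databricks-sql-connector is installed and using the same helper and table as the JDBC code below):

from databricks import sql

# Working path for comparison: databricks-sql-connector, no JDBC involved.
server_hostname, http_path, access_token = get_connection_configs()
with sql.connect(
    server_hostname=server_hostname,
    http_path=http_path,
    access_token=access_token,
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT * FROM researchers.dev.test_transcripts LIMIT 5")
        for row in cursor.fetchall():
            print(row)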



from pyspark import SparkConf
from pyspark.sql import SparkSession


def get_jdbc_url():
    # Define your Databricks parameters
    server_hostname, http_path, access_token = get_connection_configs()
    default_catalog = "researchers"
    # Build the Spark JDBC URL
    jdbc_url = f"jdbc:databricks://{server_hostname}:443;httpPath={http_path};AuthMech=3;UID=token;PWD={access_token};ConnCatalog={default_catalog}"
    return jdbc_url
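get_connection_configs() is not shown in the post; for completeness, a hypothetical sketch that reads the values from environment variables (the variable names are assumptions, not from the original):

import os

def get_connection_configs():
    # Hypothetical helper: pull the workspace hostname, the SQL warehouse
    # HTTP path, and a personal access token from environment variables.
    server_hostname = os.environ["DATABRICKS_SERVER_HOSTNAME"]  # e.g. adb-<id>.azuredatabricks.net
    http_path = os.environ["DATABRICKS_HTTP_PATH"]              # e.g. /sql/1.0/warehouses/<warehouse-id>
    access_token = os.environ["DATABRICKS_TOKEN"]
    return server_hostname, http_path, access_token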


def get_spark() -> SparkSession:
    # Initialize SparkSession
    conf = SparkConf()
    # Unfortunately, there is no automatic way for pyspark to download the jar itself.
    # TODO: set to your own jar path
    conf.set("spark.jars",
             "/home/username/repos/uwcip-research/code-examples/DatabricksJDBC42-2.6.38.1068/DatabricksJDBC42.jar")
    conf.set("spark.driver.extraClassPath",
             "/home/username/repos/uwcip-research/code-examples/DatabricksJDBC42-2.6.38.1068/DatabricksJDBC42.jar")
    conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    spark = SparkSession.builder \
        .appName("connecting to databricks") \
        .config(conf=conf) \
        .getOrCreate()
    return spark
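As an aside on the jar comment above: Spark can also resolve driver jars from Maven at session startup via spark.jars.packages, which avoids hard-coding a local path. Whether the Databricks JDBC driver is published under these exact coordinates is an assumption worth verifying:

# Assumed Maven coordinates -- verify the artifact name and version before relying on this.
conf.set("spark.jars.packages", "com.databricks:databricks-jdbc:2.6.38")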
def read_table():
    spark = get_spark()
    jdbc_url = get_jdbc_url()
    dbtable = "researchers.dev.test_transcripts"

    df = spark.read \
        .format("jdbc") \
        .option("url", jdbc_url) \
        .option("dbtable", dbtable) \
        .load()

    top5 = df.head(5)
    print(top5)
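For completeness, an entry point that ties the helpers above together and runs the reproduction end to end:

if __name__ == "__main__":
    # Build the session, connect over JDBC, and print the first rows.
    read_table()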

 

The errors and output I got:

ERROR StatusLogger Unable to create Lookup for date
java.lang.ClassCastException: class org.apache.logging.log4j.core.lookup.DateLookup
at java.base/java.lang.Class.asSubclass(Class.java:3640)
at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:84)
at com.databricks.client.jdbc42.internal.apache.logging.log4j.core.lookup.Interpolator.<init>(Interpolator.java:105)

....

ERROR StatusLogger Unable to create Lookup for env
java.lang.ClassCastException: class org.apache.logging.log4j.core.lookup.EnvironmentLookup
at java.base/java.lang.Class.asSubclass(Class.java:3640)

....

ERROR StatusLogger Unrecognized format specifier [d]
ERROR StatusLogger Unrecognized conversion specifier [d] starting at position 16 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [thread]
ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [level]
ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [logger]
ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [msg]
ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [n]
ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern.
[Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript'), Row(video_name='video_name', video_transcript='video_transcript')]
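One quick check worth adding to the repro (a sketch): printing the schema Spark derived from the JDBC driver's metadata, to confirm the columns and types were resolved as expected.

# Inspect the schema Spark built from the JDBC driver's metadata.
df.printSchema()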

 

So it looks to be some sort of log4j issue (not sure if relevant), and then only the column names are returned 5 times. Any help is greatly appreciated.

 

 

1 REPLY

anardinelli
New Contributor III

Hi @amelia1, how are you?

What you got back was indeed the top 5 rows (note that they are Row objects). What does it show when you run display(df)?

I'm thinking it might be something related to your schema: since you did not define one, the source may read the data in an unstructured manner, replacing some of the fields.

Can you please try defining the schema and passing it to spark.read?

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Create a SparkSession
spark = SparkSession.builder.appName("Example").getOrCreate()

# Define the schema
schema = StructType([
    StructField("video_name", StringType(), True),
    StructField("video_transcript", StringType(), True)
])

df = spark.read \
    .format("jdbc") \
    .schema(schema) \
    .option("url", jdbc_url) \
    .option("dbtable", dbtable) \
    .load()

display(df)
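One caveat when running this locally rather than in a Databricks notebook: display() is a notebook helper and is not defined in a plain PySpark session. A local stand-in:

# Outside a Databricks notebook, use show() instead of display().
df.show(5, truncate=False)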

Let me know if that solves your problem.

Best,

Alessandro
