Use Python notebook to read data from Databricks
03-04-2025 11:11 PM
I'm very new to Databricks, so I hope this is the right place to ask this question.
I want to use PySpark in a notebook to read data from a Databricks database with the code below.
databricks_host = "adb-xxxx.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/xxxx"
access_token = "dapixxxx"
jdbc_url = f"jdbc:databricks://{databricks_host}:443/default;transportMode=http;ssl=1;httpPath={http_path};AuthMech=3"
query = "(SELECT DISTINCT building_code, city FROM wphub_poc.gold.v_d_building) as subquery"
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("query", query)
      .option("user", "token")
      .option("password", access_token)
      .option("driver", "com.databricks.client.jdbc.Driver")
      .option("fetchsize", "10000")
      .load())
df.printSchema()
df.show()
However, the df result only contains the literal text from the SELECT query, like this. Am I doing anything wrong, and how can I fix it? Thank you!
Labels: Read data
03-07-2025 03:29 AM
Hello, @cony2025!
It seems the issue is due to the query parameter. When using .option("query"), Spark automatically wraps your SQL inside an outer SELECT * FROM ( ), which leads to invalid nested subqueries if your original query already includes an alias.
To avoid this, try using .option("dbtable") instead. It prevents the automatic wrapping and should resolve the issue.
Let me know if this doesn't help!
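For reference, passing SQL through dbtable means wrapping it yourself as a derived table with an alias, rather than letting Spark add an outer SELECT. A small helper sketch (the alias name is arbitrary):

```python
def as_dbtable(sql_text: str, alias: str = "subquery") -> str:
    """Wrap a SQL statement as a derived table so it can be passed to
    Spark's .option("dbtable") without any automatic outer wrapping."""
    return f"({sql_text.strip().rstrip(';')}) AS {alias}"

query = as_dbtable(
    "SELECT DISTINCT building_code, city FROM wphub_poc.gold.v_d_building"
)
# query is now a parenthesized derived table with an alias, ready for
# .option("dbtable", query) in the read from your original post.
```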
03-13-2025 09:24 PM
Hi @Advika_
Thanks for your answer. I modified the code as below, following your recommendation. However, I still received the column names, instead of the data, as the result.
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", query)
      .option("user", "token")
      .option("password", access_token)
      .option("driver", "com.databricks.client.jdbc.Driver")
      .option("fetchsize", "10000")
      .load())
I saw a solution recommending registering a custom JDBC dialect (dbsqlDialectClass) from a JAR file, but I can't figure out how to do it properly. I'd appreciate any guidance!
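From what I've read, the dialect matters because of identifier quoting: Spark's generic JDBC dialect wraps column names in double quotes, and Databricks SQL parses double-quoted tokens as string literals, so every row just echoes the column name. A Databricks-aware dialect switches the quoting to backticks. A rough illustration of the difference (the table name here is made up):

```python
# What Spark's default JdbcDialect emits: double-quoted identifiers, which
# Databricks SQL reads as string literals ("city" becomes the string 'city').
def generic_quote(identifier: str) -> str:
    return '"' + identifier + '"'

# What Databricks SQL expects: backtick-quoted identifiers.
def backtick_quote(identifier: str) -> str:
    return '`' + identifier + '`'

cols = ["building_code", "city"]
broken = f"SELECT {', '.join(generic_quote(c) for c in cols)} FROM v_d_building"
fixed = f"SELECT {', '.join(backtick_quote(c) for c in cols)} FROM v_d_building"
# broken: SELECT "building_code", "city" FROM v_d_building  -> literal strings
# fixed:  SELECT `building_code`, `city` FROM v_d_building  -> actual data
```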
03-14-2025 02:20 AM
Hi @cony2025 Could you load the whole table, register it as a temp view, and see what happens?
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "wphub_poc.gold.v_d_building")
      .option("user", "token")
      .option("password", access_token)
      .option("driver", "com.databricks.client.jdbc.Driver")
      .option("fetchsize", "10000")
      .load())
df.createOrReplaceTempView("building_data")
spark.sql("SELECT DISTINCT building_code, city FROM building_data").show(truncate=False)
03-14-2025 03:21 AM
I would try changing the query to something like the following; it should return every column in the table, so you can see whether the JDBC call is actually returning data correctly:
SELECT * FROM wphub_poc.gold.v_d_building LIMIT 10
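If it helps, that diagnostic can be run from the notebook by wrapping the SQL as a derived table for dbtable. A sketch, where jdbc_url and access_token are assumed to be the variables from the earlier posts:

```python
def jdbc_read_options(jdbc_url: str, access_token: str, sql_text: str) -> dict:
    """Build the option map for a Spark JDBC read against a Databricks
    SQL warehouse, wrapping the SQL as a derived table for dbtable."""
    return {
        "url": jdbc_url,
        "dbtable": f"({sql_text}) AS t",  # derived table, no extra wrapping
        "user": "token",
        "password": access_token,
        "driver": "com.databricks.client.jdbc.Driver",
        "fetchsize": "10000",
    }

# In the notebook:
# opts = jdbc_read_options(jdbc_url, access_token,
#                          "SELECT * FROM wphub_poc.gold.v_d_building LIMIT 10")
# spark.read.format("jdbc").options(**opts).load().show(truncate=False)
```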