Use Python notebook to read data from Databricks
03-04-2025 11:11 PM
I'm very new to Databricks, so I hope this is the right place to ask this question.
I want to use PySpark in a notebook to read data from a Databricks database with the code below.
databricks_host = "adb-xxxx.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/xxxx"
access_token = "dapixxxx"
jdbc_url = f"jdbc:databricks://{databricks_host}:443/default;transportMode=http;ssl=1;httpPath={http_path};AuthMech=3"
query = "(SELECT DISTINCT building_code, city FROM wphub_poc.gold.v_d_building) as subquery"
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("query", query)
      .option("user", "token")
      .option("password", access_token)
      .option("driver", "com.databricks.client.jdbc.Driver")
      .option("fetchsize", "10000")
      .load())
df.printSchema()
df.show()
However, the df result only contains the literal text from the SELECT query, like this. Am I doing anything wrong, and how can I fix it? Thank you!
Labels: Read data
03-07-2025 03:29 AM
Hello, @cony2025!
It seems the issue is due to the query parameter. When using .option("query"), Spark automatically wraps your SQL inside an outer SELECT * FROM ( ), which leads to invalid nested subqueries if your original query already includes an alias.
To avoid this, try using .option("dbtable") instead. It prevents the automatic wrapping and should resolve the issue.
Let me know if this doesn't help!
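For reference, passing SQL through dbtable means wrapping it yourself as a derived table with an alias, rather than letting Spark add an outer SELECT. A small helper sketch (the alias name is arbitrary):

```python
def as_dbtable(sql_text: str, alias: str = "subquery") -> str:
    """Wrap a SQL statement as a derived table so it can be passed to
    Spark's .option("dbtable") without any automatic outer wrapping."""
    return f"({sql_text.strip().rstrip(';')}) AS {alias}"

query = as_dbtable(
    "SELECT DISTINCT building_code, city FROM wphub_poc.gold.v_d_building"
)
# query is now a parenthesized derived table with an alias, ready for
# .option("dbtable", query) in the read from your original post.
```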
03-13-2025 09:24 PM
Hi @Advika_
Thanks for your answer. I modified the code as below, following your recommendation. However, I still received the column names, instead of the data, as the result.
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", query)
      .option("user", "token")
      .option("password", access_token)
      .option("driver", "com.databricks.client.jdbc.Driver")
      .option("fetchsize", "10000")
      .load())
I saw a solution recommending registering a custom JDBC dialect (dbsqlDialectClass) from a JAR file, but I can't figure out how to do it properly. I'd appreciate any guidance!
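From what I've read, the dialect matters because of identifier quoting: Spark's generic JDBC dialect wraps column names in double quotes, and Databricks SQL parses double-quoted tokens as string literals, so every row just echoes the column name. A Databricks-aware dialect switches the quoting to backticks. A rough illustration of the difference (the table name here is made up):

```python
# What Spark's default JdbcDialect emits: double-quoted identifiers, which
# Databricks SQL reads as string literals ("city" becomes the string 'city').
def generic_quote(identifier: str) -> str:
    return '"' + identifier + '"'

# What Databricks SQL expects: backtick-quoted identifiers.
def backtick_quote(identifier: str) -> str:
    return '`' + identifier + '`'

cols = ["building_code", "city"]
broken = f"SELECT {', '.join(generic_quote(c) for c in cols)} FROM v_d_building"
fixed = f"SELECT {', '.join(backtick_quote(c) for c in cols)} FROM v_d_building"
# broken: SELECT "building_code", "city" FROM v_d_building  -> literal strings
# fixed:  SELECT `building_code`, `city` FROM v_d_building  -> actual data
```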
03-14-2025 02:20 AM
Hi @cony2025 Could you load the whole table, register it as a temp view, and see what happens?
df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "wphub_poc.gold.v_d_building")
      .option("user", "token")
      .option("password", access_token)
      .option("driver", "com.databricks.client.jdbc.Driver")
      .option("fetchsize", "10000")
      .load())
df.createOrReplaceTempView("building_data")
spark.sql("SELECT DISTINCT building_code, city FROM building_data").show(truncate=False)
03-14-2025 03:21 AM
I would try changing the query to something like the following; it should return every column in the table, so you can see whether the JDBC call is actually returning data correctly:
SELECT * FROM wphub_poc.gold.v_d_building LIMIT 10
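If it helps, that diagnostic can be run from the notebook by wrapping the SQL as a derived table for dbtable. A sketch, where jdbc_url and access_token are assumed to be the variables from the earlier posts:

```python
def jdbc_read_options(jdbc_url: str, access_token: str, sql_text: str) -> dict:
    """Build the option map for a Spark JDBC read against a Databricks
    SQL warehouse, wrapping the SQL as a derived table for dbtable."""
    return {
        "url": jdbc_url,
        "dbtable": f"({sql_text}) AS t",  # derived table, no extra wrapping
        "user": "token",
        "password": access_token,
        "driver": "com.databricks.client.jdbc.Driver",
        "fetchsize": "10000",
    }

# In the notebook:
# opts = jdbc_read_options(jdbc_url, access_token,
#                          "SELECT * FROM wphub_poc.gold.v_d_building LIMIT 10")
# spark.read.format("jdbc").options(**opts).load().show(truncate=False)
```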