cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Use JDBC connect to databrick default cluster and read table into pyspark dataframe. All the column turned into same as column name

brian_0305
New Contributor II

I used code like below to Use JDBC connect to databrick default cluster and read table into pyspark dataframe

url = 'jdbc:databricks://[workspace domain]:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=[path];AuthMech=3;UID=token;PWD=[your_access_token]'
 
driver_class = 'com.databricks.client.jdbc.Driver'
# create a DataFrame by querying the MySQL database
df_1 = spark.read.format("jdbc")\
.option("url", url)\
.option("driver", driver_class)\
.option("dbtable", 'my_table2')\
.load()
 
df_1.show()

The final df_1 dataframe become

+----+
|name|
+----+
|name|
|name|
|name|
+----+

the final result should be

+----+
|name|
+----+
|John|
|Jane|
|Bob |
+----+

I also tested with code like

df_2 =  spark.sql('select * from my_table2')
df_2.show()

the result dataframe is correct. please advise. thank you!

error

3 REPLIES 3

Ajay-Pandey
Esteemed Contributor III

Hi @yu zhangโ€‹ 

Just try to give your database name also into the code.

I am using same code with database name and it's working fine for me

Ajay Kumar Pandey

does pyspark.read.jdbc function has database name option? could you tell me how to give database name in this function. I think database name included in jdbc url option.

Anonymous
Not applicable

@yu zhangโ€‹ :

It looks like the issue with the first code snippet you provided is that it is not specifying the correct query to retrieve the data from your database.

When using the load() method with the jdbc data source, you need to provide a SQL query in the

dbtable option that retrieves the data you want to load into the DataFrame.

Assuming that your my_table2 table contains a column named name with the values you listed, you can modify your code as follows to retrieve the correct data:

url = 'jdbc:databricks://[workspace domain]:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=[path];AuthMech=3;UID=token;PWD=[your_access_token]'
 
driver_class = 'com.databricks.client.jdbc.Driver'
# create a DataFrame by querying the MySQL database
df_1 = spark.read.format("jdbc")\
.option("url", url)\
.option("driver", driver_class)\
.option("dbtable", '(select name from my_table2) as tmp')\
.load()
 
df_1.show()

This code passes a SQL query to the dbtable option that selects only the name column from your

my_table2 table. The resulting DataFrame should contain the data you expect.

Alternatively, you can use the spark.sql method to directly execute a SQL query against your database and load the results into a DataFrame, as you showed in your second code snippet:

df_2 = spark.sql('select * from my_table2')
df_2.show()

This code should also retrieve the correct data from your database.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group