topic Re: Use JDBC connect to databrick default cluster and read table into pyspark dataframe. All the column turned into same as column name in Data Engineering

Use JDBC connect to databrick default cluster and read table into pyspark dataframe. All the column turned into same as column name

brian_0305 — Wed, 22 Feb 2023 19:45:58 GMT

I used code like below to Use JDBC connect to databrick default cluster and read table into pyspark dataframe

url = 'jdbc:databricks://[workspace domain]:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=[path];AuthMech=3;UID=token;PWD=[your_access_token]'
 
driver_class = 'com.databricks.client.jdbc.Driver'
# create a DataFrame by querying the MySQL database
df_1 = spark.read.format("jdbc")\
.option("url", url)\
.option("driver", driver_class)\
.option("dbtable", 'my_table2')\
.load()
 
df_1.show()

The final df_1 dataframe become

+----+
|name|
+----+
|name|
|name|
|name|
+----+

the final result should be

+----+
|name|
+----+
|John|
|Jane|
|Bob |
+----+

I also tested with code like

df_2 =  spark.sql('select * from my_table2')
df_2.show()

the result dataframe is correct. please advise. thank you!

Re: Use JDBC connect to databrick default cluster and read table into pyspark dataframe. All the column turned into same as column name

Ajay-Pandey — Thu, 23 Feb 2023 04:24:00 GMT

Hi @yu zhang

Just try to give your database name also into the code.

I am using same code with database name and it's working fine for me

Re: Use JDBC connect to databrick default cluster and read table into pyspark dataframe. All the column turned into same as column name

brian_0305 — Mon, 27 Feb 2023 15:57:41 GMT

does pyspark.read.jdbc function has database name option? could you tell me how to give database name in this function. I think database name included in jdbc url option.

Re: Use JDBC connect to databrick default cluster and read table into pyspark dataframe. All the column turned into same as column name

Anonymous — Tue, 25 Apr 2023 04:02:41 GMT

@yu zhang :

It looks like the issue with the first code snippet you provided is that it is not specifying the correct query to retrieve the data from your database.

When using the load() method with the jdbc data source, you need to provide a SQL query in the

dbtable option that retrieves the data you want to load into the DataFrame.

Assuming that your my_table2 table contains a column named name with the values you listed, you can modify your code as follows to retrieve the correct data:

url = 'jdbc:databricks://[workspace domain]:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=[path];AuthMech=3;UID=token;PWD=[your_access_token]'
 
driver_class = 'com.databricks.client.jdbc.Driver'
# create a DataFrame by querying the MySQL database
df_1 = spark.read.format("jdbc")\
.option("url", url)\
.option("driver", driver_class)\
.option("dbtable", '(select name from my_table2) as tmp')\
.load()
 
df_1.show()

This code passes a SQL query to the dbtable option that selects only the name column from your

my_table2 table. The resulting DataFrame should contain the data you expect.

Alternatively, you can use the spark.sql method to directly execute a SQL query against your database and load the results into a DataFrame, as you showed in your second code snippet:

df_2 = spark.sql('select * from my_table2')
df_2.show()

This code should also retrieve the correct data from your database.