Issue with quotes in struct type columns when using ODBC

dcrezee
New Contributor III

I'm trying to connect to Databricks using pyodbc and I'm running into an issue with struct columns.

As far as I understand, struct and array columns are not supported by pyodbc, but they are converted to JSON. However, when nested columns contain quotes ("), the JSON that is returned is invalid, i.e. the quotes are not escaped.

Dataset I used:

from pyspark.sql import Row
 
data = [
  Row(nested=Row(nested_column="nested with a \" character")),
]
df = spark.sparkContext.parallelize(data).toDF()
 
(
  df.write
  .format("delta")
  .mode("overwrite")
  .option("path", "/delta/quotes_issue/")
  .saveAsTable("default.quotes_issue")
)

Querying the data afterwards through pyodbc (using docs here)

import pyodbc
 
conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
 
cursor = conn.cursor()
cursor.execute("SELECT * FROM default.quotes_issue")
rows = cursor.fetchall()
for r in rows:
    print(r)

This gives me the following invalid JSON output.

'{"nested_column":"nested with a " character"}'

I've tried with both the Databricks compute cluster endpoint and the newer SQL endpoint. I also tried the Databricks SQL Connector for Python, but they all have the same problem.

Is there a way to configure Databricks, or the Simba driver somehow to escape quotes correctly?

11 REPLIES

Kaniz
Community Manager

Hi @Derk Crezee​! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise, I will get back to you soon. Thanks.

dcrezee
New Contributor III

Hi @Kaniz Fatma​ ,

Thanks for the reply. I'm still curious about the answer to the question above, but I did manage to find a workaround.

Workaround

Using the to_json function, you can let Spark convert all the complex columns to JSON, instead of relying on the SQL endpoint / Simba driver to do this for you. If you want to fully automate the use of to_json, the DESCRIBE TABLE statement can help you find out which data types your columns have.
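Roughly, that automation could look like the sketch below (illustrative only; the handling of the DESCRIBE TABLE output and the query construction are assumptions, reusing the DSN and table from the earlier snippets):

import pyodbc
 
conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
cursor = conn.cursor()
 
table_name = "default.quotes_issue"
 
# DESCRIBE TABLE returns one row per column: col_name, data_type, comment.
cursor.execute(f"DESCRIBE TABLE {table_name}")
select_exprs = []
for col_name, data_type, _comment in cursor.fetchall():
    if not col_name or col_name.startswith("#"):
        continue  # skip section markers such as "# Partition Information"
    if data_type.startswith(("struct", "array", "map")):
        # Let Spark serialize complex columns to valid JSON strings.
        select_exprs.append(f"to_json({col_name}) AS {col_name}")
    else:
        select_exprs.append(col_name)
 
cursor.execute(f"SELECT {', '.join(select_exprs)} FROM {table_name}")
for row in cursor.fetchall():
    print(row)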

Kaniz
Community Manager

Hi @Derk Crezee​, here you'll find a detailed, comprehensive explanation of connecting to SQL Server through Databricks.

https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark

Please let me know if that helps.

dcrezee
New Contributor III

Hi @Kaniz Fatma​ ,

My question wasn't related to connecting Databricks to SQL Server, but on an error I'm getting while connecting to Databricks using pyodbc.

Thanks,

Derk

Kaniz
Community Manager

Hi @Derk Crezee​, yes, I got your question. The code snippet below is from the link I shared above; it lets you connect to Databricks using pyodbc. You can put it in an init script.

%sh
# Add the Microsoft package repository, then install the ODBC driver and pyodbc
curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
curl https://packages.microsoft.com/config/ubuntu/16.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
apt-get update
ACCEPT_EULA=Y apt-get -y install msodbcsql17
apt-get -y install unixodbc-dev
sudo apt-get install python3-pip -y
pip3 install --upgrade pyodbc

dcrezee
New Contributor III

Hi @Kaniz Fatma​, thanks for the reply. I think I misunderstood. So the suggestion is to install a number of packages and upgrade pyodbc on the Databricks cluster I'm using? I'll give it a try.

Kaniz
Community Manager

Hi @Derk Crezee​, yes, that's what I meant. Please try the above code.

dcrezee
New Contributor III

Thanks @Kaniz Fatma (Databricks)​. I have tried adding the above script as an init script to the cluster configuration I'm using within Databricks, but the issue remains.

BilalAslamDbrx
Honored Contributor II
Accepted Solution

@Derk Crezee​ - I learned something today. Apparently ODBC does not convert complex types to JSON. There is no defined spec for how to return complex types; in fact, that was only added in SQL 2016. That's exactly what you are running into!

End of history lesson 😄 Here's a fix:

SELECT to_json(nested)
FROM default.quotes_issue

This will return:

{"nested_column":"nested with a \" character"}	

which is valid JSON.
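For completeness, a minimal client-side sketch (not from the original reply) that parses the returned value back into a Python dict, assuming the DSN from the earlier snippets:

import json
import pyodbc
 
conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT to_json(nested) AS nested FROM default.quotes_issue")
for (nested_json,) in cursor.fetchall():
    nested = json.loads(nested_json)  # parses cleanly now that quotes are escaped
    print(nested["nested_column"])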

dcrezee
New Contributor III

Hey Bilal,

Thanks for your reply. This indeed works and is what I'm currently using as a workaround (see one of the previous posts).

Thanks,

Derk

BilalAslamDbrx
Honored Contributor II

@Derk Crezee​ - sorry, I missed that you already had a fix for this.
