02-23-2022 03:32 AM
Hello.
I want to know how to run an UPDATE against an Azure SQL Database from Azure Databricks using PySpark.
I know how to run a query such as a SELECT and turn the result into a DataFrame, but how do I send data back (e.g., an UPDATE on existing rows)?
I want to use built-in PySpark instead of pyodbc or something similar.
Best Regards,
Łukasz
02-23-2022 11:42 PM
This is discussed on Stack Overflow. As you can see, there is a way for Azure Synapse, but for a plain Azure SQL database you will have to use some kind of driver such as ODBC/JDBC.
02-25-2022 12:07 AM
Hello werners,
Thank you for your reply.
Following your link I found another link containing this piece of code:
# Load data from an Azure Synapse query.
df = spark.read \
    [...]

# Apply some transformations to the data, then use the
# Data Source API to write the data back to another table in Azure Synapse.
df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "<your-table-name>") \
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .save()
However, it is not a good solution for me.
First of all, what if I want to update one row in a table that has a few million records?
Do I have to load the entire table into a DataFrame, update one row, and write everything back to Azure SQL? That looks really inefficient.
Moreover, what if I want to update a table that has an auto-incrementing identity column? I cannot load the table into a DataFrame, update it with my own id values, and write it back to Azure SQL.
That is why I am using pyodbc now, because it lets me issue a real SQL INSERT / UPDATE.
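Roughly what I do today looks like this (a minimal sketch; the server, database, table, and credentials are placeholders):

import pyodbc

# Placeholders only: substitute your own server, database, and credentials.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;"
    "DATABASE=<database>;"
    "UID=<user>;PWD=<password>"
)
cursor = conn.cursor()
# A parameterized UPDATE that touches a single row by its key
cursor.execute("UPDATE dbo.my_table SET status = ? WHERE id = ?", "done", 42)
conn.commit()
conn.close()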
Is there any more elegant solution in Databricks?
Best regards,
Łukasz
02-25-2022 12:39 AM
The code you mention is for Azure Synapse. If you want to use plain Azure SQL, your options are limited.
You can use JDBC, but Spark's JDBC data source has no UPDATE functionality (writes are append or overwrite only).
Or you can use the optimized Spark connector, but that lacks DML support; from the connector docs:
Executing custom SQL through the connector
The previous Azure SQL Connector for Spark provided the ability to execute custom SQL code like DML or DDL statements through the connector. This functionality is out-of-scope of this connector since it is based on the DataSource APIs. This functionality is readily provided by libraries like pyodbc or you can use the standard java sql interfaces as well.
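If you go the java.sql route from a Databricks notebook, a minimal sketch could look like the following (assuming the Microsoft SQL Server JDBC driver is available on the cluster and spark is the notebook's SparkSession; the URL, table, and credentials are placeholders):

# Drive the JDBC driver directly through Spark's JVM gateway (py4j),
# so the UPDATE runs on the database without pulling any data into Spark.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"

driver_manager = spark.sparkContext._jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbc_url, "<user>", "<password>")
try:
    statement = connection.createStatement()
    # executeUpdate runs DML and returns the number of affected rows
    affected = statement.executeUpdate(
        "UPDATE dbo.my_table SET status = 'done' WHERE id = 42"
    )
    statement.close()
finally:
    connection.close()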
02-26-2022 12:38 AM
This is the information I needed.
Thank you for your help.
Regards,
Łukasz