
Databricks JDBC & Remote Write

Optum
New Contributor III

Hello,

I'm trying to write to a Delta Table in my Databricks instance from a remote Spark session on a different cluster with the Simba Spark driver. I can do reads, but when I attempt to do a write, I get the following error:

import org.apache.spark.sql.SaveMode

df.write.format("jdbc").mode(SaveMode.Append).options(Map(
  "url" -> "jdbc:spark://adb-<host_id>.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=<http_path>;AuthMech=3;UID=token;PWD=<token>",
  "dbtable" -> "testtable",
  "driver" -> "com.simba.spark.jdbc.Driver"
)).save()

java.sql.SQLFeatureNotSupportedException: [Simba][JDBC](10220) Driver does not support this optional feature.
    at com.simba.spark.exceptions.ExceptionConverter.toSQLException(Unknown Source)
    at com.simba.spark.jdbc.common.SPreparedStatement.checkTypeSupported(Unknown Source)
    at com.simba.spark.jdbc.common.SPreparedStatement.setNull(Unknown Source)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:677)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:856)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:854)
    at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1020)
    ...

I'm currently connecting to a cluster from the "Data Science & Engineering" section, using the JDBC/ODBC details shown under its Advanced options. Some guides suggest using a SQL Endpoint for this, and that might be my problem, but I don't have permission to create one at this time. Some posts elsewhere, such as on Stack Overflow, suggest the issue is that the Simba Spark driver doesn't support the autocommit feature, but I'm not sure, and I couldn't find a Spark or driver option to turn it off.

Also, all the documentation I can find for Spark writes assumes you're working in a notebook on the Databricks cluster itself; I haven't found any remote-connection examples using the driver. Am I missing a documentation page for that?
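For comparison, a read along these lines does work from the same remote session (sketch only, with the same placeholder URL and token as above):

// Reading over the same Simba JDBC connection succeeds.
val readDf = spark.read.format("jdbc")
  .option("url", "jdbc:spark://adb-<host_id>.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=<http_path>;AuthMech=3;UID=token;PWD=<token>")
  .option("dbtable", "testtable")
  .option("driver", "com.simba.spark.jdbc.Driver")
  .load()

readDf.show()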

Remote Spark Instance

-------------------------------

Spark Version: 3.1.1

Scala Version: 2.12.10

Spark Simba JDBC Driver from Databricks: 2.6.22

Databricks Cluster Settings

---------------------

Cloud System: Azure

Policy: Unrestricted

Cluster Mode: Standard

Autoscaling: Enabled

Databricks Runtime Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)

Worker & Driver Type: Standard_DS3_v2

Please let me know if you need any other information to help me address my issue.

Thank you,

Kai

1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @Kai McNeely and @Daniel Blevins, the error occurs because this particular Simba driver doesn't support auto-commit. The resolution is to add the JDBC driver's class name to the list of configured drivers that have autoCommit turned off.

This article describes how to configure the Databricks ODBC and JDBC drivers to connect your tools or clients to Databricks. For tool- or client-specific connection instructions, see the Databricks integrations.


10 REPLIES

Kaniz
Community Manager

Hi @Kai McNeely! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first; otherwise I will get back to you soon. Thanks.

dblevins
New Contributor II

Hi Kaniz,

Any update on a resolution? I am experiencing the same issue: I can read from my Databricks table with the JDBC driver, but I am not able to write to it.


pulkitm
New Contributor III

@Kaniz Fatma "The resolution is to add the JDBC driver's class name to the list of configured drivers which have autoCommit turned off" -- how can we manage this list on a Windows machine? Could you summarize the steps? It would be really appreciated.

Optum
New Contributor III

It's not clear to me how to set autocommit off/false. Is this a setting I'm supposed to make in my Spark configuration when I launch spark-shell/spark-submit? Or is there a standard JDBC config file somewhere that gets referenced for these kinds of settings? I ask because the only documentation I can find on turning autocommit off uses the java.sql.Connection class directly, whereas I am trying to write with df.write.format("jdbc").options(Map( ... )).save(), which throws an error if you try to set an autocommit option. I also don't see an autocommit-related option in the Simba JDBC documentation's list of driver options that can be set via the options(Map( ... )) call. I can't use a Connection object, since that only passes direct SQL commands and won't write out a DataFrame; I could loop over the rows and inject the inserts myself, but that would be too clunky and slow.
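For reference, this is roughly what the java.sql.Connection workaround described above would look like. It's only a sketch (the two string columns of testtable are assumed for illustration, and df is the DataFrame I'm trying to write), and it's exactly the row-by-row approach I'd rather avoid:

import java.sql.DriverManager

// Same placeholder URL as in the original post.
val url = "jdbc:spark://adb-<host_id>.azuredatabricks.net:443/default;" +
  "transportMode=http;ssl=1;httpPath=<http_path>;AuthMech=3;UID=token;PWD=<token>"

val conn = DriverManager.getConnection(url)
try {
  // setAutoCommit is only exposed at this level, per the java.sql.Connection docs;
  // whether this driver accepts that call is exactly what's unclear to me.
  val stmt = conn.createStatement()
  df.collect().foreach { row =>
    // Assumes two string columns; values are pulled to the driver and
    // inserted one row at a time, which is the clunky and slow part.
    val c1 = row.getString(0).replace("'", "''")
    val c2 = row.getString(1).replace("'", "''")
    stmt.executeUpdate(s"INSERT INTO testtable VALUES ('$c1', '$c2')")
  }
} finally {
  conn.close()
}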

User16752239289
Valued Contributor

@Kai McNeely Could you provide more details on your client? For example, which Spark version are you running the code on, and do you provide any extra Spark conf beyond the defaults? I tested with Spark 3.x and the default Spark conf, and it works for me.

Optum
New Contributor III

I am using Spark 3.1.2 (I listed 3.1.1 in the initial post because the deployment originally reported 3.1.1), initializing with: spark-shell -target:jvm-1.8

I am bringing in the following packages (some used, some unused at the moment):

--conf "spark.driver.extraClassPath=~/jars/SparkJDBC42.jar" \
--conf "spark.jars.repositories=path/to/repos" \
--packages "com.microsoft.sqlserver:mssql-jdbc:6.4.0.jre8,net.sourceforge.jtds:jtds:1.3.1,com.oracle.database.jdbc:ojdbc8:21.1.0.0,com.databricks:spark-avro_2.10:4.0.0"
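As a quick sanity check that the SparkJDBC42.jar on spark.driver.extraClassPath is visible to the session, I can load the driver class and open a connection directly from the shell (sketch, same placeholder URL as before):

// Confirms the Simba driver class resolves and a connection can be opened.
Class.forName("com.simba.spark.jdbc.Driver")
val probe = java.sql.DriverManager.getConnection(
  "jdbc:spark://adb-<host_id>.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=<http_path>;AuthMech=3;UID=token;PWD=<token>")
probe.close()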

The following are the de-identified settings that are applied implicitly from my /opt/spark/conf/spark-defaults.conf:

spark.master=k8s://https://kubernetes.default.svc.cluster.local:443
spark.app.name=sp-worker
spark.eventLog.enabled true
spark.eventLog.dir file:///spark/logs
spark.kubernetes.driver.limit.cores=2
spark.kubernetes.executor.limit.cores=4
spark.executor.instances=2
spark.executor.memory=4G
spark.kubernetes.executor.limit.cores=4
spark.kubernetes.executor.request.cores=250m
spark.kubernetes.driver.label.app=spark-jlab
spark.kubernetes.executor.label.app=spark-jlab
spark.kubernetes.executor.label.deployment=jlab
spark.kubernetes.local.dirs.tmpfs=true
spark.kubernetes.container.image.pullPolicy=IfNotPresent
spark.kubernetes.container.image=docker.someurl.com/path/to/image/repo
# Home for Python libraries
spark.kubernetes.driver.volumes.persistentVolumeClaim.home.mount.path=/home
spark.kubernetes.driver.volumes.persistentVolumeClaim.home.mount.readOnly=false
spark.kubernetes.driver.volumes.persistentVolumeClaim.home.options.claimName=jlab
spark.kubernetes.executor.volumes.persistentVolumeClaim.home.mount.path=/home
spark.kubernetes.executor.volumes.persistentVolumeClaim.home.mount.readOnly=false
spark.kubernetes.executor.volumes.persistentVolumeClaim.home.options.claimName=jlab
# Logs
spark.kubernetes.driver.volumes.persistentVolumeClaim.logs.mount.path=/logs
spark.kubernetes.driver.volumes.persistentVolumeClaim.logs.mount.readOnly=false
spark.kubernetes.driver.volumes.persistentVolumeClaim.logs.options.claimName=k8s-sparkhistorylogs
spark.kubernetes.executor.volumes.persistentVolumeClaim.logs.mount.path=/logs
spark.kubernetes.executor.volumes.persistentVolumeClaim.logs.mount.readOnly=false
spark.kubernetes.executor.volumes.persistentVolumeClaim.logs.options.claimName=k8s-sparkhistorylogs
spark.kubernetes.namespace=namespace-00000
spark.kubernetes.authenticate.driver.serviceAccountName=svcactnm-00000
spark.kubernetes.authenticate.oauthToken=token
spark.kubernetes.authenticate.caCertFile=/opt/osficerts/ctc.crt
spark.ui.xContentTypeOptions.enabled=true
spark.kubernetes.pyspark.pythonVersion=3

pulkitm
New Contributor III

Hi @Kai McNeely, were you able to resolve this? I am facing the same issue with exactly the same Spark and JDBC driver versions.

Atanu
Esteemed Contributor

Could you try setting the flag to ignore transactions? I'm not sure of the exact flag name, but the JDBC driver manual should have more details on how to set it.
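If it is the driver's IgnoreTransactions connection property (the one tried in the reply below), my understanding is that it would be appended to the JDBC URL, roughly like this (sketch only, placeholders as in the original post):

import org.apache.spark.sql.SaveMode

// Sketch: same write as before, with IgnoreTransactions=1 appended to the URL.
val url = "jdbc:spark://adb-<host_id>.azuredatabricks.net:443/default;" +
  "transportMode=http;ssl=1;httpPath=<http_path>;AuthMech=3;UID=token;PWD=<token>;IgnoreTransactions=1"

df.write.format("jdbc")
  .mode(SaveMode.Append)
  .option("url", url)
  .option("dbtable", "testtable")
  .option("driver", "com.simba.spark.jdbc.Driver")
  .save()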

pulkitm
New Contributor III

I tried setting the IgnoreTransactions flag to 1, but it still didn't resolve the issue.