Databricks 13.3LTS to 15.4 LTS Migration - Spark job with source DB2 database not working

sahil_s_jain
New Contributor III

I'm trying to migrate a spark job from Databricks 13.3 LTS to 15.4 LTS. The spark job uses db2jcc4.jar for DB2 database connection. 

Below is the spark code:

%scala
// Import Spark SQL
import org.apache.spark.sql.SparkSession

// Create a Spark session
val spark = SparkSession.builder
  .appName("Read DB2 Data")
  .getOrCreate()


// Tried below properties but didn't work.
// spark.conf.set("spark.sql.caseSensitive", "true")
// spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
// spark.conf.set("spark.sql.parquet.inferTimestampNTZ.enabled", "false")
// spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")

// JDBC connection properties
val sqlQuery = """(SELECT * FROM UITC5Schema.AccountDetail_UITest5 order by ID ASC) AS finaldata"""
// Connection URL
val jdbcUrl = s"jdbc:db2://<IP>:50000/db0911"

// Define connection properties
val connectionProperties = new java.util.Properties()
connectionProperties.put("user", "userName1")
connectionProperties.put("password", "pwd123")
connectionProperties.put("driver", "com.ibm.db2.jcc.DB2Driver")
// Tried below property but didn't work.
//connectionProperties.put("nullCatalogMeansCurrent", "false")

// Read data from MSSQL into a DataFrame
val db2DF = spark.read
  .jdbc(jdbcUrl, sqlQuery, connectionProperties)

// Show the data
db2DF.show()

// Perform any transformations or actions on the DataFrame
db2DF.printSchema()

Database schema:

CREATE TABLE "DB2INST1"."CARS"  (
		  "ID" INTEGER , 
		  "CAR" VARCHAR(16 OCTETS) , 
		  "DATE" TIMESTAMP )   
		 IN "USERSPACE1"  
		 ORGANIZE BY ROW;

 For camel case columns, getting below error:

SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-206, SQLSTATE=42703, SQLERRMC=columnName1, DRIVER=4.9.78
	at com.ibm.db2.jcc.am.fd.a(fd.java:676)

When providing columnName1 in double quotes, it is working as expected.

For timestamp columns, getting below error irrespective of column name/casing:

[jcc][t4][10144][10862][4.9.78] Invalid parameter calendar: Parameter cannot be null. ERRORCODE=-4461, SQLSTATE=42815 Caused by: SqlSyntaxErrorException: [jcc][t4][10144][10862][4.9.78] Invalid parameter calendar: Parameter cannot be null. ERRORCODE=-4461, SQLSTATE=42815
	at com.ibm.db2.jcc.am.fd.a(fd.java:676) 	at com.ibm.db2.jcc.am.fd.a(fd.java:60) 	at com.ibm.db2.jcc.am.fd.a(fd.java:103) 	at com.ibm.db2.jcc.am.tm.getTimestamp(tm.java:1043) 	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$15(JdbcUtils.scala:512) 	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$15$adapted(JdbcUtils.scala:511) 	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:376) 	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:357) 	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) 	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:26) 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) 	at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50) 	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$5(UnsafeRowBatchUtils.scala:88) 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110) 	at org.apache.spark.sql.execution.collect.UnsafeRowBatchUtils$.$anonfun$encodeUnsafeRows$3(UnsafeRowBatchUtils.scala:88) 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)

For Databricks compute 13.3 LTS, the provided code is working as expected irrespective of the column name and column data type.

Ismael-K
Databricks Employee
Databricks Employee

@sahil_s_jain Please try setting this spark property on the 15.4 cluster:

 

sparkConf.set("spark.sql.legacy.jdbc.useNullCalendar", "false")

 

After the property is set, Spark will not use a Null calendar instance for JDBC operations.

You can also add the below property to set the spark session timezone:

 

spark.conf.set("spark.sql.session.timeZone", "America/New_York")

 

 

 

View solution in original post

sahil_s_jain
New Contributor III

Thanks for the solution.