I am reading data from IBM DB2 and saving it into MS SQL Server (the first step is moving the code itself to Databricks; after that we will move the databases to Databricks as well).
The problem I'm running into: chaining the read and the write as shown below runs for more than an hour before I give up and stop it, but doing each step individually (with a pandas DataFrame in the middle) finishes the same job in maybe 15-20 minutes. Why is that, and what can I do to avoid going through pandas?
Code that never finishes / takes forever:
(
    (
        spark.read.format("jdbc")
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .option("url", connection_url)
        .option("query", query)
        .load()
    )
    .write.format("jdbc")
    .option("url", sqlsUrl)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table_name)
    .option("user", username)
    .option("password", password)
    .save(mode=mode)
)
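
For comparison, the two-step version that finishes in 15-20 minutes looks roughly like the sketch below. The exact calls are an assumption for illustration (toPandas() to pull the DB2 result onto the driver, then spark.createDataFrame() to turn it back into a Spark DataFrame for the write), and the helper name copy_via_pandas is made up; the connection variables are the same ones used above.

```python
def copy_via_pandas(spark, connection_url, query, sqlsUrl, table_name,
                    username, password, mode="append"):
    """Hypothetical helper: copy a DB2 query result to SQL Server in two steps."""
    # Step 1: read from DB2 and materialize the full result on the driver
    # as a pandas DataFrame (assumes the result fits in driver memory).
    pdf = (
        spark.read.format("jdbc")
        .option("driver", "com.ibm.db2.jcc.DB2Driver")
        .option("url", connection_url)
        .option("query", query)
        .load()
        .toPandas()
    )

    # Step 2: convert back to a Spark DataFrame and write it to SQL Server.
    (
        spark.createDataFrame(pdf)
        .write.format("jdbc")
        .option("url", sqlsUrl)
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .option("dbtable", table_name)
        .option("user", username)
        .option("password", password)
        .save(mode=mode)
    )
```

It would be called as copy_via_pandas(spark, connection_url, query, sqlsUrl, table_name, username, password, mode=mode). This is the intermediate pandas step I would like to get rid of.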