Can't write big DataFrame into MSSQL server by usi...

ninjadev999 · ‎02-11-2022

I'm reading a huge csv file including 39,795,158 records and writing into MSSQL server, on Azure Databricks. The Databricks(notebook) is running on a cluster node with 56 GB Memory, 16 Cores, and 12 workers.

This is my code in Python and PySpark:

from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
 from time import sleep
 
url = "jdbc:sqlserver://{0}:{1};database={2}".format(server, port, database)
spark.conf.set("spark.databricks.io.cache.enabled", True)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
 
# Read csv file.
df_lake = spark.read \
    .option('header', 'false') \
    .schema(s) \
    .option('delimiter', ',') \
    .csv('wasbs://...')
 
 
batch_size = 60000
rows = df_lake.count()
org_pts = df_lake.rdd.getNumPartitions() # 566
new_pts = 1990
 
# Re-partition the DataFrame
df_repartitioned = df_lake.repartition(new_pts)
 
# Write the DataFrame into MSSQL server, by using JDBC driver
df_repartitioned.write \
            .format("jdbc") \
            .mode("overwrite") \
            .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
            .option("url", url) \
            .option("dbtable", tablename) \
            .option("user", username) \
            .option("password", password) \
            .option("batchsize", batch_size) \
            .save()
sleep(10)

Then I got the logs and errors as following as:

```

rows: 39795158

org_pts: 566

new_pts: 1990

Copy error: An error occurred while calling o9647.save.

: org.apache.spark.SparkException: Job aborted due to stage failure: Task 62 in stage 462.0 failed 4 times, most recent failure: Lost task 62.3 in stage 462.0 (TID 46609) (10.139.64.12 executor 27): com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.

at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:234)

at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(SQLServerConnection.java:1217)

at com.microsoft.sqlserver.jdbc.SQLServerConnection.rollback(SQLServerConnection.java:3508)

at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:728)

at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:857)

at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:855)

at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1025)

at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1025)

at org.apache.spark.SparkContext.$anonfun$runJob$2(SparkContext.scala:2517)

at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)

at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)

at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)

at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)

at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)

at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)

at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)

at org.apache.spark.scheduler.Task.run(Task.scala:91)

at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)

at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1620)

at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)

at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)

at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

at java.lang.Thread.run(Thread.java:748)

...

```

For 3 - 6 millions records, it was no problem. But for 10 millions or above records, it was failed. I'm not sure why it was happened on 10 millions or above records.

Are there any solutions for huge DataFrame process on Azure Databricks?

I posted this error on StackOverflow, too. You can check the error on https://stackoverflow.com/questions/71076648/cant-write-big-dataframe-into-mssql-server-by-using-jdb...

Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks