<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28728#M20505</link>
    <description>&lt;P&gt;you can try to land the transformed data on some storage in Azure/AWS,&lt;/P&gt;&lt;P&gt;then copy those files to the db using Data Factory or Glue (AWS)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 11 Feb 2022 15:10:38 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2022-02-11T15:10:38Z</dc:date>
    <item>
      <title>Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28723#M20500</link>
      <description>&lt;P&gt;I'm reading a huge csv file containing 39,795,158 records and writing it into MSSQL Server, on Azure Databricks. The Databricks notebook runs on a cluster with 56 GB memory, 16 cores, and 12 workers.&lt;/P&gt;&lt;P&gt;This is my code in Python and PySpark:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from time import sleep

url = "jdbc:sqlserver://{0}:{1};database={2}".format(server, port, database)
spark.conf.set("spark.databricks.io.cache.enabled", True)
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Read the csv file.
df_lake = spark.read \
    .option('header', 'false') \
    .schema(s) \
    .option('delimiter', ',') \
    .csv('wasbs://...')

batch_size = 60000
rows = df_lake.count()
org_pts = df_lake.rdd.getNumPartitions() # 566
new_pts = 1990

# Repartition the DataFrame
df_repartitioned = df_lake.repartition(new_pts)

# Write the DataFrame into MSSQL Server, using the JDBC driver
df_repartitioned.write \
            .format("jdbc") \
            .mode("overwrite") \
            .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
            .option("url", url) \
            .option("dbtable", tablename) \
            .option("user", username) \
            .option("password", password) \
            .option("batchsize", batch_size) \
            .save()
sleep(10)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Then I got the following logs and errors:&lt;/P&gt;&lt;P&gt;rows: 39795158&lt;/P&gt;&lt;P&gt;org_pts: 566&lt;/P&gt;&lt;P&gt;new_pts: 1990&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Copy error: An error occurred while calling o9647.save.&lt;/P&gt;&lt;P&gt;: org.apache.spark.SparkException: Job aborted due to stage failure: Task 62 in stage 462.0 failed 4 times, most recent failure: Lost task 62.3 in stage 462.0 (TID 46609) (10.139.64.12 executor 27): com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.&lt;/P&gt;&lt;P&gt; at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:234)&lt;/P&gt;&lt;P&gt; at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(SQLServerConnection.java:1217)&lt;/P&gt;&lt;P&gt; at com.microsoft.sqlserver.jdbc.SQLServerConnection.rollback(SQLServerConnection.java:3508)&lt;/P&gt;&lt;P&gt; at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:728)&lt;/P&gt;&lt;P&gt; at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:857)&lt;/P&gt;&lt;P&gt; at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:855)&lt;/P&gt;&lt;P&gt; at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1025)&lt;/P&gt;&lt;P&gt; at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1025)&lt;/P&gt;&lt;P&gt; at org.apache.spark.SparkContext.$anonfun$runJob$2(SparkContext.scala:2517)&lt;/P&gt;&lt;P&gt; at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)&lt;/P&gt;&lt;P&gt; at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)&lt;/P&gt;&lt;P&gt; at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)&lt;/P&gt;&lt;P&gt; at 
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)&lt;/P&gt;&lt;P&gt; at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)&lt;/P&gt;&lt;P&gt; at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)&lt;/P&gt;&lt;P&gt; at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)&lt;/P&gt;&lt;P&gt; at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)&lt;/P&gt;&lt;P&gt; at org.apache.spark.scheduler.Task.run(Task.scala:91)&lt;/P&gt;&lt;P&gt; at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)&lt;/P&gt;&lt;P&gt; at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1620)&lt;/P&gt;&lt;P&gt; at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)&lt;/P&gt;&lt;P&gt; at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)&lt;/P&gt;&lt;P&gt; at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)&lt;/P&gt;&lt;P&gt; at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)&lt;/P&gt;&lt;P&gt; at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)&lt;/P&gt;&lt;P&gt; at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)&lt;/P&gt;&lt;P&gt; at java.lang.Thread.run(Thread.java:748)&lt;/P&gt;&lt;P&gt; ...&lt;/P&gt;&lt;P&gt;For 3 to 6 million records there was no problem, but for 10 million or more records the job failed. I'm not sure why it fails at that size.&lt;/P&gt;&lt;P&gt;Are there any solutions for processing a huge DataFrame on Azure Databricks?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I posted this error on StackOverflow, too. 
You can check the error on &lt;A href="https://stackoverflow.com/questions/71076648/cant-write-big-dataframe-into-mssql-server-by-using-jdbc-driver-on-azure-databr" alt="https://stackoverflow.com/questions/71076648/cant-write-big-dataframe-into-mssql-server-by-using-jdbc-driver-on-azure-databr" target="_blank"&gt;https://stackoverflow.com/questions/71076648/cant-write-big-dataframe-into-mssql-server-by-using-jdbc-driver-on-azure-databr&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Feb 2022 08:15:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28723#M20500</guid>
      <dc:creator>ninjadev999</dc:creator>
      <dc:date>2022-02-11T08:15:11Z</dc:date>
    </item>
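A note on the failure mode above: Spark's generic JDBC writer opens one connection per partition, so 1990 partitions can mean a very large number of concurrent connections to SQL Server, which may drop them. A minimal sketch of a more conservative option set, assuming placeholder server and credential names; `numPartitions` is the documented Spark JDBC option that caps write parallelism, and `queryTimeout` in the connection string is the parameter suggested later in this thread:

```python
# Sketch only: server, database, and credential values are placeholders.
def build_jdbc_options(server, port, database, table, user, password,
                       batch_size=60000, max_connections=64,
                       query_timeout=7200):
    """Assemble options for Spark's JDBC writer.

    queryTimeout (seconds) asks SQL Server not to kill long-running inserts;
    numPartitions caps how many concurrent JDBC connections the write uses.
    The max_connections default of 64 is an arbitrary starting point, not a
    recommendation from the thread.
    """
    url = ("jdbc:sqlserver://{0}:{1};database={2};queryTimeout={3}"
           .format(server, port, database, query_timeout))
    return {
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "batchsize": str(batch_size),
        "numPartitions": str(max_connections),
    }

opts = build_jdbc_options("myserver.example.net", 1433, "mydb",
                          "dbo.mytable", "user", "secret")
# In the notebook this would feed the writer, e.g.:
# df_repartitioned.write.format("jdbc").options(**opts).mode("overwrite").save()
print(opts["url"])
```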
    <item>
      <title>Re: Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28724#M20501</link>
      <description>&lt;P&gt;I don't see why you use Databricks for this. There are no data transformations, only moving data from csv to a database.&lt;/P&gt;&lt;P&gt;Why don't you do this with Data Factory?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The thing is: Databricks will not have issues processing this csv file, but writing it to a database will become the bottleneck. You also have a cluster running that is basically just waiting.&lt;/P&gt;</description>
      <pubDate>Fri, 11 Feb 2022 10:58:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28724#M20501</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-02-11T10:58:42Z</dc:date>
    </item>
    <item>
      <title>Re: Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28725#M20502</link>
      <description>&lt;P&gt;Thanks for your response!&lt;/P&gt;&lt;P&gt;Actually, I don't know much about Data Factory.&lt;/P&gt;&lt;P&gt;To speed up processing of the big DataFrame, I repartitioned it, following this blog: &lt;A href="https://devblogs.microsoft.com/azure-sql/partitioning-on-spark-fast-loading-clustered-columnstore-index/" target="test_blank"&gt;https://devblogs.microsoft.com/azure-sql/partitioning-on-spark-fast-loading-clustered-columnstore-index/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Could you let me know how I can use it to overwrite all the data in the csv file into MSSQL? Is it easy?&lt;/P&gt;</description>
      <pubDate>Fri, 11 Feb 2022 14:19:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28725#M20502</guid>
      <dc:creator>ninjadev999</dc:creator>
      <dc:date>2022-02-11T14:19:42Z</dc:date>
    </item>
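One way to pick the partition count in the spirit of the blog linked above is to size partitions so each holds roughly one JDBC batch. This is a sketch of that heuristic, not the blog's exact recipe; the 39,795,158-row count and 60,000-row batch size come from the original post, while the helper name is made up here:

```python
import math

def partitions_for_batched_write(row_count, batch_size):
    """Partition count so each partition holds roughly one JDBC batch.

    With ~one batch per partition, each writer task issues about one
    bulk insert, instead of many small ones per connection.
    """
    return max(1, math.ceil(row_count / batch_size))

# Figures from the original post: ~39.8M rows, 60k-row batches.
print(partitions_for_batched_write(39795158, 60000))  # -> 664
```

Note this yields roughly 664 partitions, far fewer than the 1990 used in the original code, so it also reduces the number of simultaneous connections to the database.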
    <item>
      <title>Re: Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28726#M20503</link>
      <description>&lt;P&gt;Data Factory is pretty easy. Basically, you define a source where the data resides using a UI, define a sink where the data has to land, also using a UI, and then execute a job.&lt;/P&gt;&lt;P&gt;For plain data movement it is ideal. But if you have to add transformations etc., I prefer Databricks (you can also do this in Data Factory, but I do not like the no/low-code tools).&lt;/P&gt;</description>
      <pubDate>Fri, 11 Feb 2022 14:32:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28726#M20503</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-02-11T14:32:48Z</dc:date>
    </item>
    <item>
      <title>Re: Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28727#M20504</link>
      <description>&lt;P&gt;I need to make some changes to each record, and after that I write them into MSSQL. So it looks like Databricks is the better fit here.&lt;/P&gt;&lt;P&gt;I just needed to speed up writing the big DataFrame into the MSSQL db and fix the error.&lt;/P&gt;</description>
      <pubDate>Fri, 11 Feb 2022 14:57:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28727#M20504</guid>
      <dc:creator>ninjadev999</dc:creator>
      <dc:date>2022-02-11T14:57:31Z</dc:date>
    </item>
    <item>
      <title>Re: Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28728#M20505</link>
      <description>&lt;P&gt;you can try to land the transformed data on some storage in Azure/AWS,&lt;/P&gt;&lt;P&gt;then copy those files to the db using Data Factory or Glue (AWS)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Feb 2022 15:10:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28728#M20505</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-02-11T15:10:38Z</dc:date>
    </item>
    <item>
      <title>Re: Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28729#M20506</link>
      <description>&lt;P&gt;Sorry, that's not the solution I'm looking for.&lt;/P&gt;&lt;P&gt;I want to know what's wrong in my code and why it failed, and I want to improve the performance of big DataFrame processing.&lt;/P&gt;</description>
      <pubDate>Sat, 12 Feb 2022 01:31:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28729#M20506</guid>
      <dc:creator>ninjadev999</dc:creator>
      <dc:date>2022-02-12T01:31:01Z</dc:date>
    </item>
    <item>
      <title>Re: Can't write big DataFrame into MSSQL server by using jdbc driver on Azure Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28730#M20507</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;If you are using an Azure SQL DB managed instance, could you please file a support request with the Azure team? This is to review any timeouts or perf issues on the backend.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Also, it seems the timeout is coming from SQL Server, which is closing the connection after some time.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;You can possibly add the below to the JDBC connection string as a param:&lt;/P&gt;&lt;P&gt;;queryTimeout=7200&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The connector below, developed by the Microsoft SQL team, has some bulk-load options to speed up large data loads. Please give it a try:&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/microsoft/sql-spark-connector" target="test_blank"&gt;https://github.com/microsoft/sql-spark-connector&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 15 Feb 2022 14:29:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/can-t-write-big-dataframe-into-mssql-server-by-using-jdbc-driver/m-p/28730#M20507</guid>
      <dc:creator>User16764241763</dc:creator>
      <dc:date>2022-02-15T14:29:06Z</dc:date>
    </item>
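For reference, a write through the Microsoft connector suggested above might look roughly like this. The `tableLock` and `reliabilityLevel` option names are taken from the connector's README (this is a sketch under that assumption, not a verified configuration), and the server and table names are placeholders:

```python
# Hypothetical option set for the com.microsoft.sqlserver.jdbc.spark
# connector; option names follow the connector's README, values here are
# placeholders, not tested settings.
def bulk_copy_options(url, table, user, password, batch_size=100000):
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "batchsize": str(batch_size),
        "tableLock": "true",                # bulk-load under a table lock
        "reliabilityLevel": "BEST_EFFORT",  # fastest mode per the README
    }

opts = bulk_copy_options("jdbc:sqlserver://myserver:1433;database=mydb",
                         "dbo.mytable", "user", "secret")
# In the notebook, the only change from the generic JDBC write would be
# the format string:
# df.write.format("com.microsoft.sqlserver.jdbc.spark") \
#     .options(**opts).mode("overwrite").save()
```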
  </channel>
</rss>

