<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: data frame takes unusually long time to write for small data sets in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27238#M19115</link>
    <description>&lt;P&gt;Thanks, Hubert, for your input. I checked the Spark UI; writing takes the longer time.&lt;/P&gt;&lt;P&gt;Is there a link to read about increasing parallelism?&lt;/P&gt;</description>
    <pubDate>Wed, 23 Feb 2022 10:19:29 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2022-02-23T10:19:29Z</dc:date>
    <item>
      <title>data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27236#M19113</link>
      <description>&lt;P&gt;We have configured a workspace with our own VPC. We need to extract data from DB2 and write it in Delta format. For 550k records with 230 columns, it took &lt;B&gt;50 minutes&lt;/B&gt; to complete the task; 15 million records take more than 18 hours. We are not sure why the write takes so long and would appreciate a solution.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Code:&lt;/P&gt;&lt;P&gt;df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)&lt;/P&gt;&lt;P&gt;df.write.mode("append").format("delta").partitionBy("YEAR", "MONTH", "DAY").save(delta_path)&lt;/P&gt;</description>
      <pubDate>Wed, 23 Feb 2022 09:47:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27236#M19113</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-02-23T09:47:24Z</dc:date>
    </item>
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27237#M19114</link>
      <description>&lt;P&gt;Please increase parallelism by adjusting the JDBC settings:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt; columnName="key",&lt;/P&gt;&lt;P&gt; lowerBound=1L,&lt;/P&gt;&lt;P&gt; upperBound=100000L,&lt;/P&gt;&lt;P&gt; numPartitions=100,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;These are example values. Ideally, the key column is unique and continuous so the range is divided equally, without data skew.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please also analyze the Spark UI to see which stage takes the most time (reading or writing?).&lt;/P&gt;</description>
      <pubDate>Wed, 23 Feb 2022 09:56:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27237#M19114</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-02-23T09:56:01Z</dc:date>
    </item>
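The four options above map to the partitioned-read form of spark.read.jdbc: the driver splits the range between lowerBound and upperBound into numPartitions sub-ranges and issues one query per partition. As a rough pure-Python illustration (a simplified sketch, not Spark's exact stride algorithm), using BETWEEN so the ranges stay non-overlapping:

```python
# Illustrative sketch (NOT Spark's exact algorithm) of how a numeric key
# range is split into per-partition predicates for a parallel JDBC read.
# Spark derives similar clauses from columnName / lowerBound / upperBound /
# numPartitions; this only shows the idea and why an uneven or sparse key
# column yields partitions of very different sizes.
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    stride = (upper_bound - lower_bound) // num_partitions
    preds = []
    start = lower_bound
    for i in range(num_partitions):
        # the last partition absorbs the remainder of the range
        end = upper_bound if i == num_partitions - 1 else start + stride - 1
        preds.append(f"{column} BETWEEN {start} AND {end}")
        start = end + 1
    return preds

preds = partition_predicates("key", 1, 100000, 100)
print(len(preds))   # 100
print(preds[0])     # key BETWEEN 1 AND 999
```

Each predicate becomes one task, so rows clustered in a few sub-ranges still serialize onto a few tasks; that is why a unique, continuous key divides the work evenly.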
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27238#M19115</link>
      <description>&lt;P&gt;Thanks, Hubert, for your input. I checked the Spark UI; writing takes the longer time.&lt;/P&gt;&lt;P&gt;Is there a link to read about increasing parallelism?&lt;/P&gt;</description>
      <pubDate>Wed, 23 Feb 2022 10:19:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27238#M19115</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-02-23T10:19:29Z</dc:date>
    </item>
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27239#M19116</link>
      <description>&lt;P&gt;Hi @Hubert Dudek​&amp;nbsp;, I think the unique column should be an integer, not alphanumeric / string, right?&lt;/P&gt;</description>
      <pubDate>Wed, 23 Feb 2022 12:09:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27239#M19116</guid>
      <dc:creator>RKNutalapati</dc:creator>
      <dc:date>2022-02-23T12:09:48Z</dc:date>
    </item>
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27240#M19117</link>
      <description>&lt;P&gt;@Dhusanth Thangavadivel​&amp;nbsp;, in general, if we plan to import data with 100 partitions, we need to make sure the cluster can spin up 100 threads.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It will also depend on the database, and whether it allows 100 connections at a time.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What I have observed is that if any column holds huge text or BLOB data, the read/write will be a little slow.&lt;/P&gt;</description>
      <pubDate>Wed, 23 Feb 2022 12:21:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27240#M19117</guid>
      <dc:creator>RKNutalapati</dc:creator>
      <dc:date>2022-02-23T12:21:03Z</dc:date>
    </item>
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27241#M19118</link>
      <description>&lt;P&gt;Hi @Hubert Dudek​&amp;nbsp;, if we don't have a unique column that is integer/continuous, how can this be done?&lt;/P&gt;</description>
      <pubDate>Wed, 23 Feb 2022 12:53:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27241#M19118</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-02-23T12:53:37Z</dc:date>
    </item>
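For the no-integer-key case, one common workaround (a sketch, not an answer given in this thread) is that spark.read.jdbc also accepts an explicit predicates list, so you can build non-overlapping WHERE clauses yourself, e.g. from a MOD() over a numeric expression or a database-side hash of a string key. The column name below is hypothetical:

```python
# Hypothetical sketch: build non-overlapping predicates for
# spark.read.jdbc(url, table, predicates=..., properties=...)
# when no continuous integer key exists. MOD over a numeric
# expression (or a DB-side hash of a string key) buckets rows
# into num_partitions disjoint groups.
def mod_predicates(numeric_expr, num_partitions):
    return [
        f"MOD({numeric_expr}, {num_partitions}) = {bucket}"
        for bucket in range(num_partitions)
    ]

preds = mod_predicates("ACCOUNT_NO", 8)
print(len(preds))  # 8
print(preds[0])    # MOD(ACCOUNT_NO, 8) = 0
```

Each predicate becomes one read task, so the buckets should be roughly equal in size for even parallelism.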
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27242#M19119</link>
      <description>&lt;P&gt;Just try with numPartitions=100.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Feb 2022 13:41:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27242#M19119</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-02-25T13:41:48Z</dc:date>
    </item>
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27243#M19120</link>
      <description>&lt;P&gt;Each CPU core processes one partition at a time; the rest wait. With autoscaling of, say, 2-8 executors, each with 4 CPUs, at most 32 (4x8) partitions are processed concurrently.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please also check the network configuration; a private link to ADLS is recommended.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;After &lt;I&gt;df = spark.read.jdbc&lt;/I&gt;, please verify the partition count with df.rdd.getNumPartitions().&lt;/P&gt;</description>
      <pubDate>Fri, 25 Feb 2022 13:46:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27243#M19120</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-02-25T13:46:00Z</dc:date>
    </item>
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27245#M19122</link>
      <description>&lt;P&gt;Hello. We face exactly the same issue: reading is quick, but writing takes a long time. To clarify, this is a table with only 700k rows. Any suggestions, please? Thank you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;remote_table = spark.read.format("jdbc") \&lt;/P&gt;&lt;P&gt;.option("driver", "com.ibm.as400.access.AS400JDBCDriver") \&lt;/P&gt;&lt;P&gt;.option("url", "url") \&lt;/P&gt;&lt;P&gt;.option("dbtable", "table_name") \&lt;/P&gt;&lt;P&gt;.option("partitionColumn", "ID") \&lt;/P&gt;&lt;P&gt;.option("lowerBound", "0") \&lt;/P&gt;&lt;P&gt;.option("upperBound", "700000") \&lt;/P&gt;&lt;P&gt;.option("numPartitions", "1000") \&lt;/P&gt;&lt;P&gt;.option("user", "user") \&lt;/P&gt;&lt;P&gt;.option("password", "pass") \&lt;/P&gt;&lt;P&gt;.load()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;remote_table.write.format("delta").mode("overwrite") \&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.option("overwriteSchema", "true") \&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.partitionBy("ID") \&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.saveAsTable("table_name")&lt;/P&gt;</description>
      <pubDate>Thu, 10 Nov 2022 11:14:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27245#M19122</guid>
      <dc:creator>elgeo</dc:creator>
      <dc:date>2022-11-10T11:14:42Z</dc:date>
    </item>
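A likely culprit in the snippet above is partitionBy("ID") on the Delta write: Hive-style partitioning creates at least one directory and one file per distinct partition value, so if "ID" is (near-)unique, a 700k-row table turns into on the order of 700k tiny files. A back-of-envelope check in plain Python, using the numbers from the post (the assumption that "ID" is unique is ours, not stated by the poster):

```python
# Back-of-envelope: why partitionBy on a (near-)unique column makes a
# small write slow. Hive-style partitioning emits at least one file
# per distinct partition-column value.
rows = 700_000
distinct_ids = 700_000        # assumption: "ID" is the table's unique key
min_files = distinct_ids      # at least one file (and directory) per value
rows_per_file = rows // distinct_ids
print(min_files)      # 700000
print(rows_per_file)  # 1
```

Partitioning by a low-cardinality column such as a date (or not partitioning at all for a table this small) keeps the file count manageable.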
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/138322#M50912</link>
      <description>&lt;P&gt;Facing the same issue: I have ~700k rows, and writing the table takes forever. One time it took only about 5 seconds to write, but whenever we update the analysis and rewrite the table it takes very long and sometimes seems stuck.&lt;/P&gt;&lt;P&gt;We have about 500 columns, and about 250 hold null values. We do a fillna because we don't want to remove these columns.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Kindly advise.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Below is the code we use.&lt;/P&gt;&lt;P&gt;df.write.mode("overwrite").partitionBy("c1").option("numPartitions", 1000).saveAsTable("catalog.schema.table")&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2025 03:11:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/138322#M50912</guid>
      <dc:creator>Sown7</dc:creator>
      <dc:date>2025-11-10T03:11:50Z</dc:date>
    </item>
  </channel>
</rss>

