<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: data frame takes unusually long time to write for small data sets in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27238#M19115</link>
    <description>&lt;P&gt;Thanks, Hubert, for your input. I checked the Spark UI; writing takes the longer time.&lt;/P&gt;&lt;P&gt;Is there a link to read about increasing parallelism?&lt;/P&gt;</description>
    <pubDate>Wed, 23 Feb 2022 10:19:29 GMT</pubDate>
    <dc:creator>Anonymous</dc:creator>
    <dc:date>2022-02-23T10:19:29Z</dc:date>
    <item>
      <title>data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27236#M19113</link>
      <description>&lt;P&gt;We have configured a workspace with our own VPC. We need to extract data from DB2 and write it in Delta format. For 550k records with 230 columns, it took &lt;B&gt;50 minutes&lt;/B&gt; to complete the task; 15 million records take more than 18 hours. We are not sure why the write takes so long and would appreciate a solution.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Code:&lt;/P&gt;&lt;P&gt;df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)&lt;/P&gt;&lt;P&gt;df.write.mode("append").format("delta").partitionBy("YEAR", "MONTH", "DAY").save(delta_path)&lt;/P&gt;</description>
      <pubDate>Wed, 23 Feb 2022 09:47:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27236#M19113</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-02-23T09:47:24Z</dc:date>
    </item>
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27237#M19114</link>
      <description>&lt;P&gt;Please increase parallelism by adjusting the JDBC settings:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt; columnName="key",&lt;/P&gt;&lt;P&gt; lowerBound=1L,&lt;/P&gt;&lt;P&gt; upperBound=100000L,&lt;/P&gt;&lt;P&gt; numPartitions=100,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;These are example values. Ideally, the key column is unique and continuous so the range is divided equally, without data skew.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please also analyze the Spark UI to see which stage takes the most time (reading or writing?).&lt;/P&gt;</description>
      <pubDate>Wed, 23 Feb 2022 09:56:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27237#M19114</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-02-23T09:56:01Z</dc:date>
    </item>
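The four options above map to the partitioned-read form of spark.read.jdbc: the driver splits the range between lowerBound and upperBound into numPartitions sub-ranges and issues one query per partition. As a rough pure-Python illustration (a simplified sketch, not Spark's exact stride algorithm), using BETWEEN so the ranges stay non-overlapping:

```python
# Illustrative sketch (NOT Spark's exact algorithm) of how a numeric key
# range is split into per-partition predicates for a parallel JDBC read.
# Spark derives similar clauses from columnName / lowerBound / upperBound /
# numPartitions; this only shows the idea and why an uneven or sparse key
# column yields partitions of very different sizes.
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    stride = (upper_bound - lower_bound) // num_partitions
    preds = []
    start = lower_bound
    for i in range(num_partitions):
        # the last partition absorbs the remainder of the range
        end = upper_bound if i == num_partitions - 1 else start + stride - 1
        preds.append(f"{column} BETWEEN {start} AND {end}")
        start = end + 1
    return preds

preds = partition_predicates("key", 1, 100000, 100)
print(len(preds))   # 100
print(preds[0])     # key BETWEEN 1 AND 999
```

Each predicate becomes one task, so rows clustered in a few sub-ranges still serialize onto a few tasks; that is why a unique, continuous key divides the work evenly.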
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27238#M19115</link>
      <description>&lt;P&gt;Thanks, Hubert, for your input. I checked the Spark UI; writing takes the longer time.&lt;/P&gt;&lt;P&gt;Is there a link to read about increasing parallelism?&lt;/P&gt;</description>
      <pubDate>Wed, 23 Feb 2022 10:19:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27238#M19115</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-02-23T10:19:29Z</dc:date>
    </item>
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27239#M19116</link>
      <description>&lt;P&gt;Hi @Hubert Dudek​&amp;nbsp;, I think the unique column should be an integer, not alphanumeric / string, right?&lt;/P&gt;</description>
      <pubDate>Wed, 23 Feb 2022 12:09:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27239#M19116</guid>
      <dc:creator>RKNutalapati</dc:creator>
      <dc:date>2022-02-23T12:09:48Z</dc:date>
    </item>
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27240#M19117</link>
      <description>&lt;P&gt;@Dhusanth Thangavadivel​&amp;nbsp;, in general, if we plan to import data with 100 partitions, we need to make sure the cluster can spin up 100 threads.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;It will also depend on the database, and whether it allows 100 connections at a time.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What I have observed is that if any column holds huge text or BLOB data, the read/write will be a little slow.&lt;/P&gt;</description>
      <pubDate>Wed, 23 Feb 2022 12:21:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27240#M19117</guid>
      <dc:creator>RKNutalapati</dc:creator>
      <dc:date>2022-02-23T12:21:03Z</dc:date>
    </item>
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27241#M19118</link>
      <description>&lt;P&gt;Hi @Hubert Dudek​&amp;nbsp;, if we don't have a unique column that is integer/continuous, how can this be done?&lt;/P&gt;</description>
      <pubDate>Wed, 23 Feb 2022 12:53:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27241#M19118</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-02-23T12:53:37Z</dc:date>
    </item>
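For the no-integer-key case, one common workaround (a sketch, not an answer given in this thread) is that spark.read.jdbc also accepts an explicit predicates list, so you can build non-overlapping WHERE clauses yourself, e.g. from a MOD() over a numeric expression or a database-side hash of a string key. The column name below is hypothetical:

```python
# Hypothetical sketch: build non-overlapping predicates for
# spark.read.jdbc(url, table, predicates=..., properties=...)
# when no continuous integer key exists. MOD over a numeric
# expression (or a DB-side hash of a string key) buckets rows
# into num_partitions disjoint groups.
def mod_predicates(numeric_expr, num_partitions):
    return [
        f"MOD({numeric_expr}, {num_partitions}) = {bucket}"
        for bucket in range(num_partitions)
    ]

preds = mod_predicates("ACCOUNT_NO", 8)
print(len(preds))  # 8
print(preds[0])    # MOD(ACCOUNT_NO, 8) = 0
```

Each predicate becomes one read task, so the buckets should be roughly equal in size for even parallelism.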
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27242#M19119</link>
      <description>&lt;P&gt;Just try with numPartitions=100.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Feb 2022 13:41:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27242#M19119</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-02-25T13:41:48Z</dc:date>
    </item>
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27243#M19120</link>
      <description>&lt;P&gt;Each CPU core processes one partition at a time; the rest wait. With autoscaling of, say, 2-8 executors, each with 4 CPUs, at most 32 (4x8) partitions are processed concurrently.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please also check the network configuration; a private link to ADLS is recommended.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;After &lt;I&gt;df = spark.read.jdbc&lt;/I&gt;, please verify the partition count with df.rdd.getNumPartitions().&lt;/P&gt;</description>
      <pubDate>Fri, 25 Feb 2022 13:46:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27243#M19120</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-02-25T13:46:00Z</dc:date>
    </item>
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27245#M19122</link>
      <description>&lt;P&gt;Hello. We face exactly the same issue: reading is quick, but writing takes a long time. To clarify, this is a table with only 700k rows. Any suggestions, please? Thank you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;remote_table = spark.read.format("jdbc") \&lt;/P&gt;&lt;P&gt;.option("driver", "com.ibm.as400.access.AS400JDBCDriver") \&lt;/P&gt;&lt;P&gt;.option("url", "url") \&lt;/P&gt;&lt;P&gt;.option("dbtable", "table_name") \&lt;/P&gt;&lt;P&gt;.option("partitionColumn", "ID") \&lt;/P&gt;&lt;P&gt;.option("lowerBound", "0") \&lt;/P&gt;&lt;P&gt;.option("upperBound", "700000") \&lt;/P&gt;&lt;P&gt;.option("numPartitions", "1000") \&lt;/P&gt;&lt;P&gt;.option("user", "user") \&lt;/P&gt;&lt;P&gt;.option("password", "pass") \&lt;/P&gt;&lt;P&gt;.load()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;remote_table.write.format("delta").mode("overwrite") \&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.option("overwriteSchema", "true") \&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.partitionBy("ID") \&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;.saveAsTable("table_name")&lt;/P&gt;</description>
      <pubDate>Thu, 10 Nov 2022 11:14:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/27245#M19122</guid>
      <dc:creator>elgeo</dc:creator>
      <dc:date>2022-11-10T11:14:42Z</dc:date>
    </item>
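A likely culprit in the snippet above is partitionBy("ID") on the Delta write: Hive-style partitioning creates at least one directory and one file per distinct partition value, so if "ID" is (near-)unique, a 700k-row table turns into on the order of 700k tiny files. A back-of-envelope check in plain Python, using the numbers from the post (the assumption that "ID" is unique is ours, not stated by the poster):

```python
# Back-of-envelope: why partitionBy on a (near-)unique column makes a
# small write slow. Hive-style partitioning emits at least one file
# per distinct partition-column value.
rows = 700_000
distinct_ids = 700_000        # assumption: "ID" is the table's unique key
min_files = distinct_ids      # at least one file (and directory) per value
rows_per_file = rows // distinct_ids
print(min_files)      # 700000
print(rows_per_file)  # 1
```

Partitioning by a low-cardinality column such as a date (or not partitioning at all for a table this small) keeps the file count manageable.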
    <item>
      <title>Re: data frame takes unusually long time to write for small data sets</title>
      <link>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/138322#M50912</link>
      <description>&lt;P&gt;Facing the same issue: I have ~700k rows, and writing the table takes forever. One time it took only about 5 seconds to write, but whenever we update the analysis and rewrite the table it takes very long and sometimes seems stuck.&lt;/P&gt;&lt;P&gt;We have about 500 columns, and about 250 hold null values. We do a fillna because we don't want to remove these columns.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Kindly advise.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Below is the code we use.&lt;/P&gt;&lt;P&gt;df.write.mode("overwrite").partitionBy("c1").option("numPartitions", 1000).saveAsTable("catalog.schema.table")&lt;/P&gt;</description>
      <pubDate>Mon, 10 Nov 2025 03:11:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/data-frame-takes-unusually-long-time-to-write-for-small-data/m-p/138322#M50912</guid>
      <dc:creator>Sown7</dc:creator>
      <dc:date>2025-11-10T03:11:50Z</dc:date>
    </item>
  </channel>
</rss>

