cancel
Showing results for 
Search instead for 
Did you mean: 
Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.
cancel
Showing results for 
Search instead for 
Did you mean: 

Writing a single huge dataframe into Azure SQL Database using JDBC

yeungcase
New Contributor III

Hi All,

I am currently trying to read data from a materialized view as a single dataframe which contains around 10M of rows and then write it into an Azure SQL database. However, I don't see the spark job moving a bit even an hour is passed. I have already implemented several optimization measures, but it doesn't help at all.

 

 

 

numberCores = 4
df = spark.read.table(source_table)
df.count()

 

yeungcase_0-1719991764991.png

 

 

 

 

# Write spark dataframe into target SQL database
try:
    df.coalesce(numberCores).write \
    .format("jdbc") \
    .option("url", sql_db_url) \
    .option("batchsize", "10000") \
    .option("rewriteBatchedStatements", "true") \
    .option("compression", "snappy") \
    .option("numPartitions", numberCores) \
    .option("dbtable", target_table) \
    .option("user", username) \
    .option("password", password) \
    .option("tableLock", "true") \
    .mode("overwrite") \
    .save()

    print("Successfully write data into target SQL database")
except Exception as error:
    print("An exception occurred:", error)

 

 

 

Cluster config:

yeungcase_0-1719991355253.png

 

Spark UI:

yeungcase_1-1719991508190.png

 

Appreciate for any advice. Thank you.

 

7 REPLIES 7

dbrx_user
New Contributor III

Hey, a few suggestions to try below!

  • Is there a reason you're using coalesce instead of repartition? Coalesce can result in uneven sized partitions which are slower to work with (even though the initial function might be faster).
  • Also, why are you specifying 4 partitions? It is typically recommended to use partitions 2–4X number of cores of cluster. I think ideal partition size is usually 200MB

yeungcase
New Contributor III

Hi @dbrx_user ,

Thank you for your suggestion. Actually, I also notice that I should use repartition instead. However, I am now struggling on the optimal repartition number. I've tried make it as 16 but it seems the spill of memory and disk issue is very serious. And the spark job is not moving at all.

dbrx_user
New Contributor III

To reduce memory spill, I'd suggest increasing number of partitions. Try playing around with 32 or 48 to see if any difference.

On the jdbc part, I wonder if there is something spark doesn't like re your syntax. Might help to re-order your options so the jdbc ones go together. Could also try re-writing to use jdbc writer explicitly.

df.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .save()

df.write \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})

Also worth checking your target_table variable conforms to the "schema.table" format, in case spark can't find the table. Sometimes these things can get mixed up with the string formatting.

yeungcase
New Contributor III

Hi @dbrx_user ,

I've tried to rewrite the statement, but it does not help. However, I discover that it could successfully write 100k rows without having any issues. But once I further boost it to 1M rows, it just stuck there.

dbrx_user
New Contributor III

Hmmm, what's your ELT setup like? Are you using ADF/ Synapse to trigger databricks notebooks? If the databricks tables are external with underlying ADLS storage, might be worth using an ADF copy activity instead?

Slash
Contributor

Maybe you should try different connector like one below. According to their benchmark they've managed to save 143,9M rows in one dataframe in 72 seconds with tablock=true.

Apache Spark connector for SQL Server - Spark connector for SQL Server | Microsoft Learn

Rishabh_Tiwari
Community Manager
Community Manager

Hi @yeungcase ,

Thank you for reaching out to our community! We're here to help you.

To ensure we provide you with the best support, could you please take a moment to review the response and choose the one that best answers your question? Your feedback not only helps us assist you better but also benefits other community members who may have similar questions in the future.If you found the answer helpful, consider giving it a kudo. If the response fully addresses your question, please mark it as the accepted solution. This will help us close the thread and ensure your question is resolved.

We appreciate your participation and are here to assist you further if you need it!

Thanks,
Rishabh

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group