07-03-2024 12:25 AM - edited 07-03-2024 12:29 AM
Hi All,
I am currently trying to read data from a materialized view into a single dataframe (around 10M rows) and then write it into an Azure SQL database. However, the Spark job doesn't seem to make any progress even after an hour. I have already implemented several optimization measures, but they don't help at all.
numberCores = 4

df = spark.read.table(source_table)
df.count()

# Write Spark dataframe into target SQL database
try:
    df.coalesce(numberCores).write \
        .format("jdbc") \
        .option("url", sql_db_url) \
        .option("batchsize", "10000") \
        .option("rewriteBatchedStatements", "true") \
        .option("compression", "snappy") \
        .option("numPartitions", numberCores) \
        .option("dbtable", target_table) \
        .option("user", username) \
        .option("password", password) \
        .option("tableLock", "true") \
        .mode("overwrite") \
        .save()
    print("Successfully wrote data into target SQL database")
except Exception as error:
    print("An exception occurred:", error)
Cluster config: [screenshot attached]
Spark UI: [screenshot attached]
Appreciate any advice. Thank you.
07-03-2024 05:15 AM
Hey, a few suggestions to try below!
07-03-2024 06:18 AM
Hi @dbrx_user ,
Thank you for your suggestion. Actually, I also noticed that I should use repartition instead. However, I am now struggling to find the optimal repartition number. I've tried setting it to 16, but the memory and disk spill is very serious and the Spark job is not moving at all.
07-03-2024 07:58 AM
To reduce memory spill, I'd suggest increasing the number of partitions. Try playing around with 32 or 48 to see if it makes any difference.
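For reference, a minimal sketch of what that could look like, reusing the df, sql_db_url, target_table, username and password variables from the original post (the partition count of 32 is just a starting point to experiment with):

# Minimal sketch: repartition before the JDBC write so each partition stays small.
# Assumes df, sql_db_url, target_table, username and password exist as in the original post.
numPartitions = 32  # try 32 or 48 as suggested above

df.repartition(numPartitions).write \
    .format("jdbc") \
    .option("url", sql_db_url) \
    .option("dbtable", target_table) \
    .option("user", username) \
    .option("password", password) \
    .option("batchsize", "10000") \
    .option("numPartitions", numPartitions) \
    .mode("overwrite") \
    .save()

Keeping the numPartitions option equal to the repartition count means roughly one JDBC connection per partition, since that option caps the parallelism of the write.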
On the JDBC part, I wonder if there's something Spark doesn't like about your syntax. Might help to re-order your options so the JDBC ones go together. Could also try re-writing it to use the jdbc writer explicitly:
df.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .save()

df.write \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})
Also worth checking that your target_table variable conforms to the "schema.table" format, in case Spark can't find the table. Sometimes these things can get mixed up with the string formatting.
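For example, a quick sanity check (assuming target_table is a plain string built elsewhere in your notebook) could be as simple as:

# Hypothetical sanity check on the target table name
print(repr(target_table))  # expect something like 'dbo.my_target_table'
assert "." in target_table, "Expected a schema-qualified name, e.g. dbo.my_target_table"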
07-03-2024 07:40 PM
Hi @dbrx_user ,
I've tried rewriting the statement, but it does not help. However, I've discovered that it can successfully write 100k rows without any issues. But once I increase it to 1M rows, it just gets stuck.
07-05-2024 02:41 AM
Hmmm, what's your ELT setup like? Are you using ADF/Synapse to trigger Databricks notebooks? If the Databricks tables are external with underlying ADLS storage, might be worth using an ADF Copy activity instead?
07-08-2024 01:53 AM - edited 07-08-2024 01:54 AM
Maybe you should try a different connector, like the one below. According to their benchmark, they managed to save 143.9M rows from one dataframe in 72 seconds with tableLock=true.
Apache Spark connector for SQL Server - Spark connector for SQL Server | Microsoft Learn
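For illustration, a minimal sketch of a write through that connector, assuming the com.microsoft.sqlserver.jdbc.spark library is installed on the cluster and reusing the variables from the original post:

# Minimal sketch using the Apache Spark connector for SQL Server
# (assumes the connector library is installed on the cluster and
# reuses sql_db_url, target_table, username and password from the original post)
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("url", sql_db_url) \
        .option("dbtable", target_table) \
        .option("user", username) \
        .option("password", password) \
        .option("tableLock", "true") \
        .option("batchsize", "10000") \
        .save()
except Exception as error:
    print("An exception occurred:", error)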
07-10-2024 10:03 AM
Hi @yeungcase ,
Thank you for reaching out to our community! We're here to help you.
To ensure we provide you with the best support, could you please take a moment to review the responses and choose the one that best answers your question? Your feedback not only helps us assist you better but also benefits other community members who may have similar questions in the future.

If you found an answer helpful, consider giving it a kudo. If a response fully addresses your question, please mark it as the accepted solution. This will help us close the thread and ensure your question is resolved.
We appreciate your participation and are here to assist you further if you need it!
Thanks,
Rishabh