Wednesday - last edited Wednesday
Hi All,
I am currently trying to read data from a materialized view into a single dataframe (around 10 million rows) and then write it to an Azure SQL database. However, the Spark job doesn't appear to make any progress even after an hour. I have already implemented several optimization measures, but they don't help at all.
numberCores = 4

df = spark.read.table(source_table)
df.count()

# Write spark dataframe into target SQL database
try:
    df.coalesce(numberCores).write \
        .format("jdbc") \
        .option("url", sql_db_url) \
        .option("batchsize", "10000") \
        .option("rewriteBatchedStatements", "true") \
        .option("compression", "snappy") \
        .option("numPartitions", numberCores) \
        .option("dbtable", target_table) \
        .option("user", username) \
        .option("password", password) \
        .option("tableLock", "true") \
        .mode("overwrite") \
        .save()
    print("Successfully wrote data into target SQL database")
except Exception as error:
    print("An exception occurred:", error)
Cluster config: [screenshot attached]
Spark UI: [screenshot attached]
Appreciate any advice. Thank you.
Wednesday
Hey, a few suggestions to try below!
Wednesday
Hi @dbrx_user ,
Thank you for your suggestion. I also noticed that I should use repartition instead. However, I'm now struggling to find the optimal number of partitions. I tried 16, but the memory and disk spill is very serious and the Spark job is still not moving at all.
Wednesday
To reduce memory spill, I'd suggest increasing the number of partitions. Try playing around with 32 or 48 to see if it makes any difference.
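A minimal sketch reusing the variables from your snippet (the partition count of 32 is just a starting point, not a recommendation):

# Sketch: repartition before the JDBC write instead of coalescing down to 4
df.repartition(32).write \
    .format("jdbc") \
    .option("url", sql_db_url) \
    .option("dbtable", target_table) \
    .option("user", username) \
    .option("password", password) \
    .option("batchsize", "10000") \
    .option("numPartitions", "32") \
    .mode("overwrite") \
    .save()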
On the jdbc part, I wonder if there is something Spark doesn't like about your syntax. It might help to re-order your options so the jdbc ones go together. You could also try re-writing it to use the jdbc writer explicitly, as in the two variants below.
df.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .save()

df.write \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})
Also worth checking your target_table variable conforms to the "schema.table" format, in case spark can't find the table. Sometimes these things can get mixed up with the string formatting.
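For example, something along these lines (the schema/table variable names here are hypothetical):

# Hypothetical check: make sure the dbtable value really is "schema.table"
target_table = f"{target_schema}.{target_table_name}"  # e.g. "dbo.my_target_table"
print(target_table)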
Wednesday
Hi @dbrx_user ,
I've tried rewriting the statement, but it doesn't help. However, I discovered that it can successfully write 100k rows without any issues. Once I increase it to 1M rows, it just gets stuck.
Friday
Hmmm, what's your ELT setup like? Are you using ADF/Synapse to trigger Databricks notebooks? If the Databricks tables are external with underlying ADLS storage, it might be worth using an ADF Copy activity instead?
8 hours ago - last edited 8 hours ago
Maybe you should try a different connector, like the one below. According to their benchmark, they managed to save 143.9M rows from one dataframe in 72 seconds with tablock=true.
Apache Spark connector for SQL Server - Spark connector for SQL Server | Microsoft Learn
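For reference, a write through that connector would look roughly like this. This is only a sketch: it assumes the connector library is installed on your cluster and reuses the variable names from your original snippet; the batchsize value is illustrative.

# Sketch: write via the Apache Spark connector for SQL Server
# (assumes the com.microsoft.sqlserver.jdbc.spark library is installed on the cluster)
try:
    df.repartition(numberCores).write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", sql_db_url) \
        .option("dbtable", target_table) \
        .option("user", username) \
        .option("password", password) \
        .option("tableLock", "true") \
        .option("batchsize", "100000") \
        .mode("overwrite") \
        .save()
    print("Successfully wrote data into target SQL database")
except Exception as error:
    print("An exception occurred:", error)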