Wednesday - last edited Wednesday
Hi All,
I am currently trying to read data from a materialized view into a single dataframe (around 10 million rows) and then write it to an Azure SQL database. However, the Spark job doesn't appear to make any progress even after an hour. I have already implemented several optimization measures, but they don't help at all.
numberCores = 4

df = spark.read.table(source_table)
df.count()

# Write spark dataframe into target SQL database
try:
    df.coalesce(numberCores).write \
        .format("jdbc") \
        .option("url", sql_db_url) \
        .option("batchsize", "10000") \
        .option("rewriteBatchedStatements", "true") \
        .option("compression", "snappy") \
        .option("numPartitions", numberCores) \
        .option("dbtable", target_table) \
        .option("user", username) \
        .option("password", password) \
        .option("tableLock", "true") \
        .mode("overwrite") \
        .save()
    print("Successfully wrote data into target SQL database")
except Exception as error:
    print("An exception occurred:", error)
Cluster config: [screenshot attached]
Spark UI: [screenshot attached]
Appreciate any advice. Thank you.
Wednesday
Hey, a few suggestions to try below!
Wednesday
Hi @dbrx_user ,
Thank you for your suggestion. I also noticed that I should use repartition instead. However, I'm now struggling to find the optimal number of partitions. I tried 16, but the memory and disk spill is very serious and the Spark job is still not moving at all.
Wednesday
To reduce memory spill, I'd suggest increasing the number of partitions. Try playing around with 32 or 48 to see if it makes any difference.
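A minimal sketch reusing the variables from your snippet (the partition count of 32 is just a starting point, not a recommendation):

# Sketch: repartition before the JDBC write instead of coalescing down to 4
df.repartition(32).write \
    .format("jdbc") \
    .option("url", sql_db_url) \
    .option("dbtable", target_table) \
    .option("user", username) \
    .option("password", password) \
    .option("batchsize", "10000") \
    .option("numPartitions", "32") \
    .mode("overwrite") \
    .save()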
On the jdbc part, I wonder if there is something Spark doesn't like about your syntax. It might help to re-order your options so the jdbc ones go together. You could also try re-writing it to use the jdbc writer explicitly, as in the two variants below.
df.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .save()

df.write \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})
Also worth checking your target_table variable conforms to the "schema.table" format, in case spark can't find the table. Sometimes these things can get mixed up with the string formatting.
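For example, something along these lines (the schema/table variable names here are hypothetical):

# Hypothetical check: make sure the dbtable value really is "schema.table"
target_table = f"{target_schema}.{target_table_name}"  # e.g. "dbo.my_target_table"
print(target_table)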
Wednesday
Hi @dbrx_user ,
I've tried rewriting the statement, but it doesn't help. However, I discovered that it can successfully write 100k rows without any issues. Once I increase it to 1M rows, it just gets stuck.
Friday
Hmmm, what's your ELT setup like? Are you using ADF/Synapse to trigger Databricks notebooks? If the Databricks tables are external with underlying ADLS storage, it might be worth using an ADF Copy activity instead?
8 hours ago - last edited 8 hours ago
Maybe you should try a different connector, like the one below. According to their benchmark, they managed to save 143.9M rows from one dataframe in 72 seconds with tablock=true.
Apache Spark connector for SQL Server - Spark connector for SQL Server | Microsoft Learn
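For reference, a write through that connector would look roughly like this. This is only a sketch: it assumes the connector library is installed on your cluster and reuses the variable names from your original snippet; the batchsize value is illustrative.

# Sketch: write via the Apache Spark connector for SQL Server
# (assumes the com.microsoft.sqlserver.jdbc.spark library is installed on the cluster)
try:
    df.repartition(numberCores).write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .option("url", sql_db_url) \
        .option("dbtable", target_table) \
        .option("user", username) \
        .option("password", password) \
        .option("tableLock", "true") \
        .option("batchsize", "100000") \
        .mode("overwrite") \
        .save()
    print("Successfully wrote data into target SQL database")
except Exception as error:
    print("An exception occurred:", error)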