Env: Azure Databricks
Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
Worker type: 56 GB memory, 2-8 nodes (Standard_D13_v2)
Number of rows: 2,470,350, with 115 columns
Size: 2.2 GB
Time taken: approx. 9 min
The code is in Python.
- What is the best approach for a bulk load?
- What partition size would you consider best? (see the rough sizing sketch below)
- What is the optimal batch size? (a tuning sketch follows the write code further down)
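For the partition question, instead of hard-coding 10 I was thinking of deriving the partition count from the data volume, roughly as below. The ~128 MB target per partition is just a common rule of thumb I am assuming, not something I have validated for this workload:

# Rough partition-count estimate: aim for ~128 MB per partition (assumed rule of thumb).
TARGET_PARTITION_BYTES = 128 * 1024 * 1024           # assumed target size per partition
data_size_bytes = int(2.2 * 1024 * 1024 * 1024)      # ~2.2 GB, from the stats above

num_partitions = max(1, data_size_bytes // TARGET_PARTITION_BYTES)
print(num_partitions)                                # -> 17, i.e. roughly 16-20 partitions

df_gl_repartitioned = f5.repartition(int(num_partitions))

Current code: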
df_gl_repartitioned = f5.repartition(10)
write_data_to_db(df_gl_repartitioned,"myserver.database.windows.net", "XXXX.onmicrosoft.com", "DBNAME" , "dbo.stag", dbutils,"1004857" ,"overwrite")
# Inside write_data_to_db:
try:
    (df.write.format("com.microsoft.sqlserver.jdbc.spark").mode(mode)
        .option("url", f"jdbc:sqlserver://{server}").option("databaseName", database)
        .option("dbtable", dbtable).option("accessToken", access_token)
        .option("encrypt", "true").option("hostNameInCertificate", "*.database.windows.net")
        .option("schemaCheckEnabled", "false").save())
    print(f"Successfully wrote df to {dbtable}")
except ValueError as error:
    print(error)
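For the batch-size question, I assume the connector's bulk-copy options (batchsize, tableLock, reliabilityLevel) are what I should be tuning. Below is a sketch of the same write with those options added; it reuses the same variables as above, and the values 100000 / true / BEST_EFFORT are just starting points I would test, not settings I have confirmed:

# Same write as above, with bulk-copy tuning options added (values are assumptions to test).
try:
    (df.write.format("com.microsoft.sqlserver.jdbc.spark").mode(mode)
        .option("url", f"jdbc:sqlserver://{server}").option("databaseName", database)
        .option("dbtable", dbtable).option("accessToken", access_token)
        .option("encrypt", "true").option("hostNameInCertificate", "*.database.windows.net")
        .option("schemaCheckEnabled", "false")
        .option("batchsize", "100000")              # rows per bulk-insert batch (trial value)
        .option("tableLock", "true")                # table lock during load; staging table only
        .option("reliabilityLevel", "BEST_EFFORT")  # connector default reliability mode
        .save())
    print(f"Successfully wrote df to {dbtable}")
except ValueError as error:
    print(error)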
I have checked the links below:
https://techcommunity.microsoft.com/t5/azure-sql-blog/considerations-of-data-partitioning-on-spark-d...
https://techcommunity.microsoft.com/t5/datacat/data-loading-performance-considerations-with-clustere...