Azure Databrick SQL bulk insert to AZ SQL

AvijitDey
New Contributor III

Env: Azure Databrick :

version : 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)

Work Type : 56 GB Memory 2-8 node ( standard D13_V2)

No of rows : 2470350 and 115 Column

Size : 2.2 GB

Time taken approx. 9 min

Python Code .

  1. What will be best approach for bulk load ?
  2. What is best partition size you consider?
  3. optimal batch size ?

df_gl_repartitioned = f5.repartition(10) 

write_data_to_db(df_gl_repartitioned,"myserver.database.windows.net", "XXXX.onmicrosoft.com", "DBNAME" , "dbo.stag", dbutils,"1004857" ,"overwrite")

try:
        df.write.format("com.microsoft.sqlserver.jdbc.spark").mode(mode).option("url", f"jdbc:sqlserver://{server}").option("databaseName", database).option("dbtable", dbtable).option("accessToken", access_token).option("encrypt", "true").option("hostNameInCertificate", "*.database.windows.net").option("schemaCheckEnabled", "false").save()
        print(f"Successfully wrote df to {dbtable} ")
except ValueError as error:
        print(error)

I check below link

https://techcommunity.microsoft.com/t5/azure-sql-blog/considerations-of-data-partitioning-on-spark-d...

https://techcommunity.microsoft.com/t5/datacat/data-loading-performance-considerations-with-clustere...

Hubert-Dudek
Databricks MVP

I would avoid repartition as it is additionally not necessary cost, and you usually already have data partitioned. (check that with df.rdd.getNumParitions() ).

2.2 GB is not so extensive so I would go with a basic machine, one driver, and auto-scaling between 1 to 2 workers.


My blog: https://databrickster.medium.com/

View solution in original post

AvijitDey
New Contributor III

Thanks for your response .

What the time line you expect to insert 2.2 GB data into SQL DB ?

Any time line ?

Now with repartition 5-10 - Time taken 9 min

With out repartition - Time took 13.16 minutes

Looking Process less than 9 min

AvijitDey
New Contributor III

Any further suggestion