
Azure Databricks SQL bulk insert to Azure SQL

AvijitDey
New Contributor III

Env: Azure Databricks

Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)

Worker type: 56 GB memory, 2-8 nodes (Standard_D13_v2)

No. of rows: 2,470,350; columns: 115

Size: 2.2 GB

Time taken: approx. 9 minutes

Questions:

  1. What is the best approach for a bulk load?
  2. What partition size would you consider best?
  3. What is the optimal batch size?

Python code:

df_gl_repartitioned = f5.repartition(10)

write_data_to_db(df_gl_repartitioned, "myserver.database.windows.net", "XXXX.onmicrosoft.com",
                 "DBNAME", "dbo.stag", dbutils, "1004857", "overwrite")

# Body of write_data_to_db:
try:
    (df.write.format("com.microsoft.sqlserver.jdbc.spark").mode(mode)
        .option("url", f"jdbc:sqlserver://{server}")
        .option("databaseName", database).option("dbtable", dbtable)
        .option("accessToken", access_token).option("encrypt", "true")
        .option("hostNameInCertificate", "*.database.windows.net")
        .option("schemaCheckEnabled", "false")
        .save())
    print(f"Successfully wrote df to {dbtable}")
except ValueError as error:
    print(error)
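For reference, a minimal sketch of the same write with the connector's bulk-copy tuning options added; the tableLock, batchsize, and reliabilityLevel values below are illustrative assumptions, not tested recommendations:

# Same connector write as above, with bulk-copy tuning options (assumed values)
(df.write.format("com.microsoft.sqlserver.jdbc.spark").mode(mode)
    .option("url", f"jdbc:sqlserver://{server}")
    .option("databaseName", database).option("dbtable", dbtable)
    .option("accessToken", access_token)
    .option("tableLock", "true")                # bulk-copy table lock; can help when loading a heap staging table
    .option("batchsize", "100000")              # rows per bulk-copy batch; assumed starting point, tune per workload
    .option("reliabilityLevel", "BEST_EFFORT")  # connector default; NO_DUPLICATES is the alternative
    .save())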

I checked the links below:

https://techcommunity.microsoft.com/t5/azure-sql-blog/considerations-of-data-partitioning-on-spark-d...

https://techcommunity.microsoft.com/t5/datacat/data-loading-performance-considerations-with-clustere...

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

I would avoid repartition, as it adds unnecessary cost and your data is usually already partitioned (check with df.rdd.getNumPartitions()).

2.2 GB is not that large, so I would go with a basic setup: one driver and auto-scaling between 1 and 2 workers.
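A minimal sketch of that check (f5 is the source DataFrame from the original post; the coalesce value is only an illustration):

# How many partitions does the DataFrame already have?
print(f5.rdd.getNumPartitions())

# If fewer partitions are wanted, coalesce avoids the full shuffle that repartition triggers
df_fewer = f5.coalesce(8)  # 8 is an assumed value, not a recommendation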

AvijitDey
New Contributor III

Thanks for your response.

What time would you expect for inserting 2.2 GB of data into the SQL DB?

With repartition(5-10): time taken 9 minutes.

Without repartition: time taken 13.16 minutes.

I am looking for a process that takes less than 9 minutes.
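One simple way to compare the variants is to wrap the call in a wall-clock timer (a minimal sketch; write_data_to_db and its arguments are taken from the original post):

import time

start = time.time()
write_data_to_db(f5, "myserver.database.windows.net", "XXXX.onmicrosoft.com",
                 "DBNAME", "dbo.stag", dbutils, "1004857", "overwrite")
print(f"Elapsed: {(time.time() - start) / 60:.2f} minutes")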

AvijitDey
New Contributor III

Any further suggestions?
