05-02-2022 11:35 AM
Env: Azure Databricks
Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)
Worker type: 56 GB memory, 2-8 nodes (Standard_D13_v2)
No. of rows: 2,470,350, with 115 columns
Size: 2.2 GB
Time taken: approx. 9 min
- What is the best approach for a bulk load?
- What partition size would you consider best?
- What is the optimal batch size?
Python code:
df_gl_repartitioned = f5.repartition(10)
write_data_to_db(df_gl_repartitioned, "myserver.database.windows.net", "XXXX.onmicrosoft.com",
                 "DBNAME", "dbo.stag", dbutils, "1004857", "overwrite")

# Body of write_data_to_db:
try:
    (df.write.format("com.microsoft.sqlserver.jdbc.spark").mode(mode)
        .option("url", f"jdbc:sqlserver://{server}").option("databaseName", database)
        .option("dbtable", dbtable).option("accessToken", access_token)
        .option("encrypt", "true").option("hostNameInCertificate", "*.database.windows.net")
        .option("schemaCheckEnabled", "false").save())
    print(f"Successfully wrote df to {dbtable}")
except ValueError as error:
    print(error)
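For reference, batch size and bulk-insert behaviour for this connector are normally passed as write options. Below is a minimal sketch, assuming the "tableLock", "batchsize", and "reliabilityLevel" options of the com.microsoft.sqlserver.jdbc.spark connector apply to the connector version in use; the values are placeholders to experiment with, not measured optima, and the variables are reused from the function above.

# Sketch: same write as above, with bulk-copy tuning options surfaced.
# "tableLock", "batchsize", and "reliabilityLevel" are assumed connector options;
# df, mode, server, database, dbtable, and access_token come from write_data_to_db.
(df.write.format("com.microsoft.sqlserver.jdbc.spark")
    .mode(mode)
    .option("url", f"jdbc:sqlserver://{server}")
    .option("databaseName", database)
    .option("dbtable", dbtable)
    .option("accessToken", access_token)
    .option("tableLock", "true")                # bulk insert under a table lock
    .option("batchsize", 100000)                # rows per bulk-copy batch (illustrative)
    .option("reliabilityLevel", "BEST_EFFORT")
    .option("schemaCheckEnabled", "false")
    .save())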
I checked the link below.
Labels: AZ, Azure, Azure databricks, LTS, SQL
Accepted Solutions
05-03-2022 07:34 AM
I would avoid repartition(), as it adds unnecessary cost and your data is usually already partitioned (check with df.rdd.getNumPartitions()).
2.2 GB is not that large, so I would go with a basic machine: one driver and autoscaling between 1 and 2 workers.
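As a quick illustration (variable names reused from the post above; the threshold of 16 is only an example), checking the existing partitioning before deciding whether to repartition might look like:

# Check how many partitions the DataFrame already has before forcing a shuffle.
current_parts = f5.rdd.getNumPartitions()
print(f"Current partitions: {current_parts}")

# If there are far more partitions than needed, coalesce() reduces them
# without the full shuffle that repartition() triggers.
if current_parts > 16:
    f5 = f5.coalesce(16)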
05-03-2022 08:40 AM
Thanks for your response.
What timeline would you expect for inserting 2.2 GB of data into the SQL DB?
With repartition (5-10): time taken was 9 minutes.
Without repartition: time taken was 13.16 minutes.
I am looking to get the process under 9 minutes.
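For what it's worth, a simple way to time each configuration when comparing runs (a sketch reusing the call from the original post):

import time

start = time.time()
write_data_to_db(df_gl_repartitioned, "myserver.database.windows.net", "XXXX.onmicrosoft.com",
                 "DBNAME", "dbo.stag", dbutils, "1004857", "overwrite")
print(f"Write took {(time.time() - start) / 60:.2f} minutes")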
05-09-2022 06:59 AM
Any further suggestions?

