
Azure Databricks SQL bulk insert to Azure SQL

AvijitDey
New Contributor III

Env: Azure Databricks

Version: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12)

Worker type: 56 GB memory, 2-8 nodes (Standard_D13_v2)

No. of rows: 2,470,350; columns: 115

Size: 2.2 GB

Time taken: approx. 9 minutes

Questions:

  1. What is the best approach for a bulk load?
  2. What partition size would you consider best?
  3. What is the optimal batch size?

Python code:

df_gl_repartitioned = f5.repartition(10)

write_data_to_db(df_gl_repartitioned, "myserver.database.windows.net", "XXXX.onmicrosoft.com",
                 "DBNAME", "dbo.stag", dbutils, "1004857", "overwrite")

# Body of write_data_to_db:
try:
    (df.write.format("com.microsoft.sqlserver.jdbc.spark").mode(mode)
        .option("url", f"jdbc:sqlserver://{server}")
        .option("databaseName", database).option("dbtable", dbtable)
        .option("accessToken", access_token).option("encrypt", "true")
        .option("hostNameInCertificate", "*.database.windows.net")
        .option("schemaCheckEnabled", "false")
        .save())
    print(f"Successfully wrote df to {dbtable}")
except ValueError as error:
    print(error)
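For reference, a minimal sketch of the same write with the connector's bulk-copy tuning options added; the tableLock, batchsize, and reliabilityLevel values below are illustrative assumptions, not tested recommendations:

# Same connector write as above, with bulk-copy tuning options (assumed values)
(df.write.format("com.microsoft.sqlserver.jdbc.spark").mode(mode)
    .option("url", f"jdbc:sqlserver://{server}")
    .option("databaseName", database).option("dbtable", dbtable)
    .option("accessToken", access_token)
    .option("tableLock", "true")                # bulk-copy table lock; can help when loading a heap staging table
    .option("batchsize", "100000")              # rows per bulk-copy batch; assumed starting point, tune per workload
    .option("reliabilityLevel", "BEST_EFFORT")  # connector default; NO_DUPLICATES is the alternative
    .save())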

I checked the links below:

https://techcommunity.microsoft.com/t5/azure-sql-blog/considerations-of-data-partitioning-on-spark-d...

https://techcommunity.microsoft.com/t5/datacat/data-loading-performance-considerations-with-clustere...

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

I would avoid repartition, as it adds unnecessary cost and your data is usually already partitioned (check with df.rdd.getNumPartitions()).

2.2 GB is not that large, so I would go with a basic setup: one driver and auto-scaling between 1 and 2 workers.
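A minimal sketch of that check (f5 is the source DataFrame from the original post; the coalesce value is only an illustration):

# How many partitions does the DataFrame already have?
print(f5.rdd.getNumPartitions())

# If fewer partitions are wanted, coalesce avoids the full shuffle that repartition triggers
df_fewer = f5.coalesce(8)  # 8 is an assumed value, not a recommendation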

AvijitDey
New Contributor III

Thanks for your response.

What time would you expect for inserting 2.2 GB of data into the SQL DB?

With repartition(5-10): time taken 9 minutes.

Without repartition: time taken 13.16 minutes.

I am looking for a process that takes less than 9 minutes.
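One simple way to compare the variants is to wrap the call in a wall-clock timer (a minimal sketch; write_data_to_db and its arguments are taken from the original post):

import time

start = time.time()
write_data_to_db(f5, "myserver.database.windows.net", "XXXX.onmicrosoft.com",
                 "DBNAME", "dbo.stag", dbutils, "1004857", "overwrite")
print(f"Elapsed: {(time.time() - start) / 60:.2f} minutes")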

AvijitDey
New Contributor III

Any further suggestions?
