Hi All,
Hope everyone is doing well.
We are currently validating Databricks on GCP and Azure.
We have a Python notebook that does some ETL (copying files, extracting zip archives, and processing the files inside them).
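For context, the flow is roughly the following (a minimal, simplified sketch; the paths and the process_file step are placeholders, not our actual code):

import shutil
import zipfile
from pathlib import Path

def process_file(path: Path) -> None:
    # Hypothetical placeholder for the real per-file processing logic.
    print("processing", path)

def run_etl(source_zip: str, work_dir: str) -> None:
    # 1. Copy the zip from the source location into a working directory.
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    local_zip = work / Path(source_zip).name
    shutil.copy(source_zip, local_zip)

    # 2. Extract the archive.
    extract_dir = work / "extracted"
    with zipfile.ZipFile(local_zip) as zf:
        zf.extractall(extract_dir)

    # 3. Process each extracted file.
    for f in extract_dir.rglob("*"):
        if f.is_file():
            process_file(f)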
Our cluster config on Azure:
DBX Runtime 10.4, Driver: Standard_DS4_v2, Workers: 4x Standard_D8_v3 (40 cores, 156 GB total)
We tried a similar config on GCP:
DBX Runtime 11.3, Driver: n2-highmem-4, Workers: 4x n2-standard-8 (36 cores, 160 GB total)
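As a side note, before comparing machine types we also confirm what each cluster actually exposes to Spark with a quick check on both sides (assuming the default spark session a Databricks notebook provides):

# Quick sanity check, run on both clusters: what Spark actually sees.
# `spark` is the SparkSession Databricks notebooks provide by default.
print("Default parallelism:", spark.sparkContext.defaultParallelism)
print("Executor memory:    ", spark.conf.get("spark.executor.memory", "not set"))
print("Shuffle partitions: ", spark.conf.get("spark.sql.shuffle.partitions", "not set"))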
For the same notebook, with only minor path changes, the runtime on GCP is much higher than on Azure: roughly 1h grew to about 3h.
Since the notebook has not changed much (some large functions were split into smaller ones, plus the path changes), I was wondering whether the slowdown might be due to the runtime version change or the machine types.
- Are there any documented slowdowns in DBX Runtime 11.3 compared to 10.4?
- Is there a table mapping equivalent machine types between Azure and GCP? A Google search shows similar groupings (e.g., compute optimized on Azure vs. compute optimized on GCP), but no one-to-one mapping.
- Can splitting a single function into multiple functions cause such a huge difference in runtime?
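On that last point, my understanding is that plain Python function-call overhead is on the order of microseconds per call, so it shouldn't account for hours. A minimal timing sketch we could use to check (hypothetical functions, not our actual ETL code):

import timeit

def monolithic(data):
    # All steps inlined in one function.
    step1 = [x * 2 for x in data]
    step2 = [x + 1 for x in step1]
    return sum(step2)

def _double(data):
    return [x * 2 for x in data]

def _increment(data):
    return [x + 1 for x in data]

def split(data):
    # Same work, split across small helper functions.
    return sum(_increment(_double(data)))

data = list(range(100_000))
print("monolithic:", timeit.timeit(lambda: monolithic(data), number=100))
print("split:     ", timeit.timeit(lambda: split(data), number=100))

If the split version showed a meaningful gap at this scale, the refactor would be worth a closer look; otherwise the slowdown is likely elsewhere.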
Thanks for all the help.
Cheers...