Databricks Community

jenshumrich · ‎09-17-2024

Within a script, I noticed that the network connection between the driver and the mounted network drives is often a huge bottleneck. It seems that the network throughput is unreasonably low for being an Azure

Single node: Standard_DS12_v2 · DBR: 14.3.x-photon-scala2.12

Are there ways to speed up writing a result to Azure Blob Storage? My current code looks like this:

joined_df.write.partitionBy("IdStation").mode("overwrite").parquet("/mnt/temp_folder")

Especially the IO wait of the CPU is more than just weird.

filipniziol · ‎09-17-2024

Hi @jenshumrich ,

You are partitioning by IdStation. How many partitions are created? Could the problem be too many files?
The partition size should be around 1 GB, and the file size should be at or around 128 MB.

I see a lot of IO wait, which is in line with my suspicion that too many files are being created.
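
One common way to cut the file count while keeping the IdStation layout is to shuffle by the partition column before writing, so that each IdStation value is handled by a single task instead of every task writing its own small file into every station folder. The following is a minimal sketch, not a tested fix for this workload: the helper name `write_one_file_per_station` is hypothetical, and it assumes `df` is a PySpark DataFrame such as `joined_df` above. The shuffle has its own cost, and heavily skewed stations may still produce oversized files.

```python
def write_one_file_per_station(df, path, partition_col="IdStation"):
    """Write df as Parquet, partitioned by partition_col.

    Hypothetical helper. df is assumed to be a PySpark DataFrame.
    repartition(partition_col) hash-shuffles the rows so that all rows
    with the same partition value end up in the same task, which then
    writes them into that value's directory.
    """
    (
        df.repartition(partition_col)
        .write.partitionBy(partition_col)
        .mode("overwrite")
        .parquet(path)
    )
```

With the names from the question, the call would be `write_one_file_per_station(joined_df, "/mnt/temp_folder")`.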

jenshumrich · ‎09-17-2024

Here you can see the really slow network traffic, causing iowait on the CPU

ZoeCole · ‎10-09-2024

Thank you.

jenshumrich · ‎09-18-2024

You are right. I am creating 200 small files of roughly 6 MB each (in the quality system) and a few hundred thousand files in production. The partitioning is motivated by the original business need and further processing. Let me test with a different partitioning.
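
As a rough sanity check of these numbers against the 128 MB guideline from the answer, the arithmetic can be sketched as follows. The helper name `target_file_count` is made up for illustration, and the 200 x 6 MB total is only the approximate figure quoted for the quality system, not a measurement.

```python
import math


def target_file_count(total_mb, target_file_mb=128):
    """Number of files needed if each file holds about target_file_mb."""
    return max(1, math.ceil(total_mb / target_file_mb))


# Roughly 200 files of about 6 MB each, as reported for the quality system.
total_mb = 200 * 6
print(target_file_count(total_mb))  # prints 10
```

So about 1.2 GB of output would fit in roughly 10 files of the suggested size instead of 200.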

Network bottleneck
