09-17-2024 04:02 AM
Within a script, I noticed that the network connection between the driver and the mounted network drives is often a huge bottleneck. The network throughput seems unreasonably low for being an Azure
Are there ways to speed up writing a result to Azure Blob Storage? My current code looks like this:
09-17-2024 10:19 AM - edited 09-17-2024 10:20 AM
Hi @jenshumrich ,
There is partitioning by IdStation. How many partitions are created? Isn't it a problem with too many files?
The partition size should be around 1 GB, and the file size should be at or around 128 MB.
I see a lot of IO wait, so this would go in line with my suspicion that too many files are created.
09-17-2024 04:03 AM
Here you can see the really slow network traffic, causing iowait on the CPU.
10-09-2024 11:54 PM
Thank you.
09-18-2024 04:04 AM
You are right. I am creating 200 small files of roughly 6 MB each (in the quality system) and a few hundred thousand files in production. The partitioning is motivated by the original business need and further processing. Let me test with a different partitioning.
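For anyone following along, here is a rough sketch of the arithmetic behind the 128 MB suggestion, using the quality-system figures from this thread. The helper name `target_file_count` is made up for illustration, and the commented-out write call assumes a Spark DataFrame `df`, a placeholder `output_path`, and Parquet output, none of which are confirmed by the original post.

```python
import math

TARGET_FILE_MB = 128  # file size suggested earlier in this thread


def target_file_count(num_files: int, avg_file_mb: float,
                      target_mb: float = TARGET_FILE_MB) -> int:
    """Estimate how many files of about target_mb would hold the same data."""
    total_mb = num_files * avg_file_mb
    # Always write at least one file, even for tiny or empty inputs.
    return max(1, math.ceil(total_mb / target_mb))


# Quality-system figures from the post: about 200 files of roughly 6 MB each.
n_files = target_file_count(200, 6)
print(n_files)

# Hypothetical PySpark usage (df, output_path and the Parquet format are
# assumptions, not taken from the original code):
# df.repartition(n_files).write.mode("overwrite").parquet(output_path)
```

Note that writing without `partitionBy("IdStation")` changes the folder layout, so whether this is acceptable depends on the downstream processing mentioned above.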