topic Re: Network bottleneck in Data Engineering

Network bottleneck

jenshumrich — Tue, 17 Sep 2024 11:02:14 GMT

Within a script, I noticed that the network connection between the driver and the mounted network drives is often a huge bottleneck. It seems that the network throughput is unreasonably low for being an Azure

Single node: Standard_DS12_v2 · DBR: 14.3.x-photon-scala2.12

Are there ways to speed up storing a result in Azure Blob Storage? My current code looks like this:

joined_df.write.partitionBy("IdStation").mode("overwrite").parquet("/mnt/temp_folder")

The CPU's IO wait in particular is more than just weird.

Re: Network bottleneck

jenshumrich — Tue, 17 Sep 2024 11:03:23 GMT

Here you can see the really slow network traffic, causing iowait on the CPU

Re: Network bottleneck

filipniziol — Tue, 17 Sep 2024 17:20:08 GMT

Hi @jenshumrich ,

The write is partitioned by IdStation. How many partitions are created? Could the problem be too many files?
The partition size should be around 1 GB, and the file size should be at or around 128 MB.

I see a lot of IO wait, which would be in line with my suspicion that too many files are created.
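
One way to check the too-many-files suspicion is to count the Parquet files that the write produced and look at their total size. The following is a minimal sketch, not a definitive diagnostic: the helper name summarize_parquet_files is hypothetical, and it assumes the mount is reachable as a local path on the driver (on Databricks that would typically be something like /dbfs/mnt/temp_folder, which is an assumption about this setup).

```python
import os


def summarize_parquet_files(root):
    """Return (file_count, total_bytes) for all .parquet files under root.

    Walks the directory tree, so partition folders such as
    IdStation=<value>/ are included. Marker files like _SUCCESS are skipped.
    """
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".parquet"):
                file_count += 1
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return file_count, total_bytes
```

Calling summarize_parquet_files on the output folder and dividing the total size by the file count gives the average file size, which can then be compared with the 128 MB guideline above.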

Re: Network bottleneck

jenshumrich — Wed, 18 Sep 2024 11:04:55 GMT

You are right. I am creating 200 small files of roughly 6 MB each (in the quality system) and a few hundred thousand files in production. The partitioning is motivated by the original business need and further processing. Let me test with a different partitioning.
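
As a rough illustration of what a different partitioning could look like, the sketch below derives a file count from the 128 MB target mentioned earlier in the thread and writes the result without partitionBy. The helper names target_file_count and write_compacted are hypothetical, total_bytes is an estimate the caller has to supply, and dropping the IdStation directory layout is an assumption that may not fit the downstream processing.

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # ~128 MB per file, the target from the thread


def target_file_count(total_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Number of output files so that each is roughly target_file_bytes."""
    return max(1, math.ceil(total_bytes / target_file_bytes))


def write_compacted(df, path, total_bytes):
    """Write df as a few larger Parquet files instead of one folder per IdStation.

    df is expected to be a pyspark.sql.DataFrame. total_bytes is a
    caller-supplied estimate of the output size (an assumption of this sketch).
    """
    num_files = target_file_count(total_bytes)
    # repartition(n) shuffles rows into n partitions; each non-empty
    # partition is written as roughly one Parquet file.
    df.repartition(num_files).write.mode("overwrite").parquet(path)
    return num_files
```

For the quality system, 200 files of about 6 MB each is about 1200 MB, so target_file_count(200 * 6 * 1024 * 1024) suggests 10 files. If the per-station folders must stay, this sketch does not apply as written.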

Re: Network bottleneck

ZoeCole — Thu, 10 Oct 2024 06:54:27 GMT

Thank you.