Do Spark nodes read data from storage in a sequence?

narek_margaryan — Wed, 06 Oct 2021 19:51:06 GMT

I'm new to Spark and trying to understand how some of its components work.

I understand that once the data is loaded into the memory of separate nodes, they process partitions in parallel, within their own memory (RAM).

But I'm wondering whether the initial loading of partitions into memory is done in parallel as well. AFAIK some SSDs allow concurrent reads, but I'm not sure whether that applies here.

Also, what exactly is partitioning in the context of Spark? Does the original file get split into several smaller files, or does each node read from a certain begin_byte to end_byte?

Re: Do Spark nodes read data from storage in a sequence?

-werners- — Fri, 08 Oct 2021 07:11:36 GMT

@Narek Margaryan, normally the reading is done in parallel because the underlying file system is already distributed (if you use HDFS-based storage or something similar, such as a data lake).

The number of partitions (part files) the data itself is stored in also matters.

This leads me to your second question:

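Partitioning in the context of Spark is indeed closely tied to the number of files being read/written. On write, each partition typically ends up as its own file. On read, each file is picked up by a task, and a large file in a splittable format can additionally be divided into byte ranges (your begin_byte to end_byte idea) so that several tasks read the same file at once. The original file is not rewritten into smaller files for this.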

There is a lot more to it, like shuffling, file format, and system parameters you can set, ...
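
To make the begin_byte/end_byte idea concrete, here is a minimal plain-Python sketch (not Spark code, and not how Spark is implemented internally). It plans byte ranges for a single file and reads them concurrently, each worker with its own file handle. The names plan_splits, read_split, read_in_parallel and demo are made up for this illustration. In Spark itself the maximum split size for file sources is controlled by spark.sql.files.maxPartitionBytes (128 MB by default), and real readers also adjust the ranges so that records are not cut in half, which this sketch ignores.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def plan_splits(file_size, max_split_bytes):
    """Return (start, end) byte ranges covering the file; end is exclusive."""
    if max_split_bytes <= 0:
        raise ValueError("max_split_bytes must be positive")
    return [
        (start, min(start + max_split_bytes, file_size))
        for start in range(0, file_size, max_split_bytes)
    ]


def read_split(path, start, end):
    """Read one byte range. Each call opens its own handle, like an independent task."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)


def read_in_parallel(path, max_split_bytes, workers=4):
    """Read all splits concurrently and return the chunks in split order."""
    splits = plan_splits(os.path.getsize(path), max_split_bytes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: read_split(path, *s), splits))


def demo():
    """Write a small temporary file, read it back in 300-byte splits."""
    data = bytes(range(256)) * 4  # 1024 bytes of sample content
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        chunks = read_in_parallel(path, max_split_bytes=300)
    finally:
        os.remove(path)
    return data, chunks
```

Calling demo() returns the original bytes together with the chunks that were read; joining the chunks in order gives back the original content, which shows that no data has to be copied into smaller files before parallel reading can happen. On a distributed file system the same ranges would be served by different storage nodes, which is why the reads scale beyond what a single disk allows.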
