Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Do Spark nodes read data from storage in a sequence?

narek_margaryan
New Contributor II

I'm new to Spark and trying to understand how some of its components work.

I understand that once the data is loaded into the memory of separate nodes, they process partitions in parallel, within their own memory (RAM).

But I'm wondering whether the initial loads of partitions into memory are done in parallel as well. AFAIK some SSDs allow concurrent reads, but I'm not sure whether that applies here.

Also, what exactly is partitioning in the context of Spark? Does the original file get split into different smaller files, or does each node read from a certain begin_byte to end_byte?

1 ACCEPTED SOLUTION


-werners-
Esteemed Contributor III

@Narek Margaryan, normally the reading is done in parallel because the underlying file system is already distributed (if you use HDFS-based storage or something similar, such as a data lake).
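To make the parallel-read idea concrete, here is a minimal sketch in plain Python, not Spark itself. It assumes a single local file and uses threads to stand in for Spark tasks; the function names `read_range` and `read_ranges_in_parallel` are made up for this illustration. Each worker opens its own file handle, seeks to its begin byte, and reads up to its end byte, so no worker waits for another to finish.

```python
from concurrent.futures import ThreadPoolExecutor


def read_range(path, start, end):
    """Read bytes [start, end) of the file; each call opens its own handle."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)


def read_ranges_in_parallel(path, ranges, max_workers=4):
    """Read every (start, end) range concurrently and return the chunks in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one read per byte range; the pool runs them concurrently.
        futures = [pool.submit(read_range, path, s, e) for s, e in ranges]
        # Collect results in submission order, so chunk i matches ranges[i].
        return [f.result() for f in futures]
```

On distributed storage the same pattern applies, except that the ranges are usually served by different machines, which is why the reads scale with the number of executors rather than with one disk.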

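How the data is laid out in storage also matters: the number of files, and whether the file format can be split, limit how many tasks can read at the same time.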

This leads me to your second question:

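Partitioning in the context of Spark is closely tied to the number of files being read or written, but a partition is really a chunk of the data that one task processes. On read, the original file is not rewritten into smaller files: for a splittable format each task reads its own byte range of the file (begin_byte to end_byte, as you put it), while a non-splittable file such as a gzipped CSV is read by a single task. On write, each partition typically produces its own output file.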
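As a rough illustration of the begin_byte/end_byte idea from the question, the sketch below cuts one file into fixed-size byte ranges. It is a simplification and not Spark's actual planning code: `plan_splits` is a hypothetical helper, and the real planner also weighs the number of files, the available cores, and settings such as `spark.sql.files.maxPartitionBytes` (128 MB by default).

```python
def plan_splits(file_size, max_split_bytes):
    """Cut a file of file_size bytes into (start, end) ranges of at most max_split_bytes."""
    if max_split_bytes <= 0:
        raise ValueError("max_split_bytes must be positive")
    splits = []
    start = 0
    while start < file_size:
        # The last range is shorter when the size is not an exact multiple.
        end = min(start + max_split_bytes, file_size)
        splits.append((start, end))
        start = end
    return splits


# A 300-byte file with a 128-byte limit yields three ranges.
print(plan_splits(300, 128))  # [(0, 128), (128, 256), (256, 300)]
```

In Spark itself, `df.rdd.getNumPartitions()` shows how many partitions a read actually produced, which is a quick way to check how a given file was split.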

There is a lot more to it, such as shuffling, file formats, and the configuration parameters you can set.


3 REPLIES 3

Kaniz_Fatma
Community Manager

Hi @narek_margaryan! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first. If not, I will get back to you soon. Thanks.

Kaniz_Fatma
Community Manager

Hi @Narek Margaryan, just a friendly follow-up. Do you still need help, or did the above response help you find the solution? Please let us know.
