<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Do Spark nodes read data from storage in a sequence? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/do-spark-nodes-read-data-from-storage-in-a-sequence/m-p/13960#M8535</link>
    <description>&lt;P&gt;I'm new to Spark and trying to understand how some of its components work.&lt;/P&gt;&lt;P&gt;I understand that once the data is loaded into the memory of separate nodes, they process partitions in parallel, each within its own memory (RAM).&lt;/P&gt;&lt;P&gt;But I'm wondering whether the initial loads of partitions into memory are done in parallel as well. AFAIK some SSDs allow concurrent reads, but I'm not sure whether that applies here.&lt;/P&gt;&lt;P&gt;Also, what exactly is partitioning in the context of Spark? Does the original file get split into several smaller files, or does each node read from a certain begin_byte to end_byte?&lt;/P&gt;</description>
    <pubDate>Wed, 06 Oct 2021 19:51:06 GMT</pubDate>
    <dc:creator>narek_margaryan</dc:creator>
    <dc:date>2021-10-06T19:51:06Z</dc:date>
    <item>
      <title>Do Spark nodes read data from storage in a sequence?</title>
      <link>https://community.databricks.com/t5/data-engineering/do-spark-nodes-read-data-from-storage-in-a-sequence/m-p/13960#M8535</link>
      <description>&lt;P&gt;I'm new to Spark and trying to understand how some of its components work.&lt;/P&gt;&lt;P&gt;I understand that once the data is loaded into the memory of separate nodes, they process partitions in parallel, each within its own memory (RAM).&lt;/P&gt;&lt;P&gt;But I'm wondering whether the initial loads of partitions into memory are done in parallel as well. AFAIK some SSDs allow concurrent reads, but I'm not sure whether that applies here.&lt;/P&gt;&lt;P&gt;Also, what exactly is partitioning in the context of Spark? Does the original file get split into several smaller files, or does each node read from a certain begin_byte to end_byte?&lt;/P&gt;</description>
      <pubDate>Wed, 06 Oct 2021 19:51:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/do-spark-nodes-read-data-from-storage-in-a-sequence/m-p/13960#M8535</guid>
      <dc:creator>narek_margaryan</dc:creator>
      <dc:date>2021-10-06T19:51:06Z</dc:date>
    </item>
    <item>
      <title>Re: Do Spark nodes read data from storage in a sequence?</title>
      <link>https://community.databricks.com/t5/data-engineering/do-spark-nodes-read-data-from-storage-in-a-sequence/m-p/13962#M8537</link>
      <description>&lt;P&gt;@Narek Margaryan, reading is normally done in parallel because the underlying file system is already distributed (if you use HDFS-based storage or something similar, such as a data lake).&lt;/P&gt;&lt;P&gt;How the data itself is divided into files also matters.&lt;/P&gt;&lt;P&gt;This leads me to your second question:&lt;/P&gt;&lt;P&gt;Partitioning in the context of Spark is indeed tied to the number of files being read or written. On write, each partition is typically saved as its own file. On read, every file yields at least one partition, and a large file in a splittable format can be divided into byte ranges that separate tasks read at the same time.&lt;/P&gt;&lt;P&gt;There is a lot more to it, such as shuffling, file formats, and the configuration parameters you can set.&lt;/P&gt;</description>
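The begin_byte to end_byte idea from the question can be illustrated with a small sketch. This is plain Python, not Spark's API: the function names (plan_splits, read_split, read_in_parallel) are hypothetical, threads stand in for executor tasks, and a local file stands in for distributed storage. It assumes newline-delimited data and follows the convention commonly used by Hadoop-style line readers, where a line belongs to the split in which it starts.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def plan_splits(file_size, split_size):
    """Return (start, end) byte ranges that cover the file without overlap."""
    return [(start, min(start + split_size, file_size))
            for start in range(0, file_size, split_size)]


def read_split(path, start, end):
    """Read the lines owned by one byte range of the file.

    Every split except the first discards the bytes up to and including
    the first newline, because that line is read by the previous split.
    A split keeps reading while the next line starts at or before `end`,
    so a line that crosses the boundary is read in full exactly once.
    """
    lines = []
    with open(path, "rb") as f:
        f.seek(start)
        if start > 0:
            f.readline()  # owned by the previous split
        while end >= f.tell():
            line = f.readline()
            if not line:  # end of file
                break
            lines.append(line)
    return lines


def read_in_parallel(path, split_size, workers=4):
    """Read all splits concurrently; returns one list of lines per split."""
    splits = plan_splits(os.path.getsize(path), split_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: read_split(path, r[0], r[1]), splits))
```

Calling read_in_parallel(path, 16) on a small text file returns one list of lines per 16-byte range; concatenating the lists in order reproduces the file. The original file is never rewritten into smaller files, each reader simply seeks to its own offset.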
      <pubDate>Fri, 08 Oct 2021 07:11:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/do-spark-nodes-read-data-from-storage-in-a-sequence/m-p/13962#M8537</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-08T07:11:36Z</dc:date>
    </item>
  </channel>
</rss>

