<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Why does .collect() cause a shuffle while .show() does not? in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/why-does-collect-cause-a-shuffle-while-show-does-not/m-p/115627#M9401</link>
    <description>&lt;P&gt;Q1: collect() moves all data to the driver, hence a shuffle. show() just shows x records from the DataFrame, taken from one partition (or from more partitions if x &amp;gt; partition size).&amp;nbsp; No shuffling is needed.&lt;BR /&gt;For display purposes the results are of course gathered on the driver, but this is not a Spark shuffle.&lt;BR /&gt;Q2: I'd say that with collect() the file is read twice, perhaps in multiple stages.&amp;nbsp; Spark can read a file multiple times if necessary.&lt;BR /&gt;Q3: The data is not written to disk, so the path is worker RAM -&amp;gt; network -&amp;gt; driver RAM.&lt;/P&gt;</description>
    <pubDate>Wed, 16 Apr 2025 09:16:32 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2025-04-16T09:16:32Z</dc:date>
    <item>
      <title>Why does .collect() cause a shuffle while .show() does not?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/why-does-collect-cause-a-shuffle-while-show-does-not/m-p/114790#M9400</link>
      <description>&lt;P&gt;&lt;STRONG&gt;I’m learning Spark using the book Spark: The Definitive Guide and came across some behavior I’m trying to understand.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;SPAN&gt;I am reading a csv_file&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;which has 3 columns:&amp;nbsp;&lt;/SPAN&gt;DEST_COUNTRY_NAME&lt;SPAN&gt;,&amp;nbsp;&lt;/SPAN&gt;ORIGIN_COUNTRY_NAME&lt;SPAN&gt;,&amp;nbsp;&lt;/SPAN&gt;count&lt;SPAN&gt;. The dataset has a total of&amp;nbsp;&lt;/SPAN&gt;256&lt;SPAN&gt;&amp;nbsp;rows.&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Here’s the code I'm running:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;data = (spark
.read
.format('csv')
.option('inferSchema', 'true')
.option('header', 'true')
.option('path', 'dbfs:/FileStore/tables/spark_definitive_guide/data/data/csv/2015_summary.csv')
.load())

# Set the number of shuffle partitions to 5

spark.conf.set("spark.sql.shuffle.partitions", "5")&lt;/LI-CODE&gt;&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;Now, I’m trying to sort the DataFrame by the count column. The physical plan looks like this:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;data.sort("count").explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#337 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(count#337 ASC NULLS FIRST, 5), ENSURE_REQUIREMENTS, [plan_id=255]
      +- FileScan csv [DEST_COUNTRY_NAME#335,ORIGIN_COUNTRY_NAME#336,count#337] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[dbfs:/FileStore/tables/spark_definitive_guide/data/data/csv/201..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct&amp;lt;DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int&amp;gt;&lt;/LI-CODE&gt;&lt;P&gt;&lt;STRONG&gt;Now here's where I’m confused:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Question 1&lt;/STRONG&gt;&lt;BR /&gt;Why does .collect() cause a shuffle, but .show(1000) does not?&lt;/P&gt;&lt;P&gt;When I run:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;data.sort("count").show(1000)&lt;/LI-CODE&gt;&lt;P&gt;I don’t see any shuffle in the Spark UI DAG; the data just gets scanned and displayed.&lt;/P&gt;&lt;P&gt;But when I run:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;data.sort("count").collect()&lt;/LI-CODE&gt;&lt;P&gt;&lt;BR /&gt;I do see a shuffle in the DAG and execution plan.&lt;/P&gt;&lt;P&gt;Both commands are retrieving all 256 rows, so why the difference? 
&lt;STRONG&gt;Why does .collect() trigger a shuffle, while .show(1000) does not?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;DAG of .show&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="show.png" style="width: 645px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/15858i3404CA16E24F5FFA/image-size/large?v=v2&amp;amp;px=999" role="button" title="show.png" alt="show.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;DAG of .collect&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="collect.png" style="width: 492px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/15859i45440B3400C56A61/image-size/large?v=v2&amp;amp;px=999" role="button" title="collect.png" alt="collect.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Question 2&lt;/STRONG&gt;&lt;BR /&gt;Why does &lt;STRONG&gt;Scan csv&lt;/STRONG&gt; in the DAG for &lt;STRONG&gt;.collect()&lt;/STRONG&gt; return &lt;STRONG&gt;512&lt;/STRONG&gt; rows when the file only has &lt;STRONG&gt;256&lt;/STRONG&gt;, as shown above in the &lt;STRONG&gt;.collect&lt;/STRONG&gt; DAG?&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Question 3&lt;/STRONG&gt;&lt;BR /&gt;When we do&amp;nbsp;&lt;STRONG&gt;.collect()&lt;/STRONG&gt;&amp;nbsp;and the data is loaded into driver memory, how does this happen under the hood? Is the data sent directly from the executors to driver memory over the network, or is it first written to disk and then read by the driver into memory?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Any help in understanding this behavior would be much appreciated. Thank you!&lt;/P&gt;</description>
      <pubDate>Tue, 08 Apr 2025 07:36:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/why-does-collect-cause-a-shuffle-while-show-does-not/m-p/114790#M9400</guid>
      <dc:creator>VaderK</dc:creator>
      <dc:date>2025-04-08T07:36:21Z</dc:date>
    </item>
    <item>
      <title>Re: Why does .collect() cause a shuffle while .show() does not?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/why-does-collect-cause-a-shuffle-while-show-does-not/m-p/115627#M9401</link>
      <description>&lt;P&gt;Q1: collect() moves all data to the driver, hence a shuffle. show() just shows x records from the DataFrame, taken from one partition (or from more partitions if x &amp;gt; partition size).&amp;nbsp; No shuffling is needed.&lt;BR /&gt;For display purposes the results are of course gathered on the driver, but this is not a Spark shuffle.&lt;BR /&gt;Q2: I'd say that with collect() the file is read twice, perhaps in multiple stages.&amp;nbsp; Spark can read a file multiple times if necessary.&lt;BR /&gt;Q3: The data is not written to disk, so the path is worker RAM -&amp;gt; network -&amp;gt; driver RAM.&lt;/P&gt;</description>
      <pubDate>Wed, 16 Apr 2025 09:16:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/why-does-collect-cause-a-shuffle-while-show-does-not/m-p/115627#M9401</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-04-16T09:16:32Z</dc:date>
    </item>
  </channel>
</rss>

