<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Pandas API on Spark, Does it run on a multi-node cluster? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pandas-api-on-spark-does-it-run-on-a-multi-node-cluster/m-p/26895#M18888</link>
    <description>&lt;P&gt;@Debayan Mukherjee&lt;/P&gt;&lt;P&gt;Thanks for your help.&lt;/P&gt;&lt;P&gt;I have a question about the terms "pandas dataset" and "pandas-on-Spark dataset".&lt;/P&gt;&lt;P&gt;When you say "dataset", do you mean "DataFrame"?&lt;/P&gt;&lt;P&gt;If I create a pandas-on-Spark dataset, can I apply pandas functions to it directly, or should I convert it to a pandas dataset before the computation?&lt;/P&gt;&lt;P&gt;If I need to convert it to a pandas dataset, I think the computation will run on a single node. Is that correct?&lt;/P&gt;</description>
    <pubDate>Tue, 18 Oct 2022 21:21:17 GMT</pubDate>
    <dc:creator>Mado</dc:creator>
    <dc:date>2022-10-18T21:21:17Z</dc:date>
    <item>
      <title>Pandas API on Spark, Does it run on a multi-node cluster?</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-api-on-spark-does-it-run-on-a-multi-node-cluster/m-p/26893#M18886</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have a few questions about "Pandas API on Spark". Thanks for taking the time to read them.&lt;/P&gt;&lt;P&gt;1) Is the input to these functions a pandas DataFrame or a PySpark DataFrame?&lt;/P&gt;&lt;P&gt;2) When I use a pandas function (such as isna, size, apply, or where), does it run on only one node or across multiple nodes?&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Mon, 17 Oct 2022 22:11:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-api-on-spark-does-it-run-on-a-multi-node-cluster/m-p/26893#M18886</guid>
      <dc:creator>Mado</dc:creator>
      <dc:date>2022-10-17T22:11:09Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas API on Spark, Does it run on a multi-node cluster?</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-api-on-spark-does-it-run-on-a-multi-node-cluster/m-p/26895#M18888</link>
      <description>&lt;P&gt;@Debayan Mukherjee&lt;/P&gt;&lt;P&gt;Thanks for your help.&lt;/P&gt;&lt;P&gt;I have a question about the terms "pandas dataset" and "pandas-on-Spark dataset".&lt;/P&gt;&lt;P&gt;When you say "dataset", do you mean "DataFrame"?&lt;/P&gt;&lt;P&gt;If I create a pandas-on-Spark dataset, can I apply pandas functions to it directly, or should I convert it to a pandas dataset before the computation?&lt;/P&gt;&lt;P&gt;If I need to convert it to a pandas dataset, I think the computation will run on a single node. Is that correct?&lt;/P&gt;</description>
      <pubDate>Tue, 18 Oct 2022 21:21:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-api-on-spark-does-it-run-on-a-multi-node-cluster/m-p/26895#M18888</guid>
      <dc:creator>Mado</dc:creator>
      <dc:date>2022-10-18T21:21:17Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas API on Spark, Does it run on a multi-node cluster?</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-api-on-spark-does-it-run-on-a-multi-node-cluster/m-p/26896#M18889</link>
      <description>&lt;P&gt;I would like to share the following information, which might help you.&lt;/P&gt;&lt;P&gt;pandas itself runs on a single machine. Pandas API on Spark fills this gap by providing pandas-equivalent APIs that work on Apache Spark. It is useful not only for pandas users but also for PySpark users, because it supports many tasks that are difficult to do with PySpark alone, for example plotting data directly from a PySpark DataFrame. Doc: &lt;A href="https://docs.databricks.com/_static/notebooks/pandas-to-pandas-api-on-spark-in-10-minutes.html" target="test_blank"&gt;https://docs.databricks.com/_static/notebooks/pandas-to-pandas-api-on-spark-in-10-minutes.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 24 Oct 2022 18:49:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-api-on-spark-does-it-run-on-a-multi-node-cluster/m-p/26896#M18889</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-10-24T18:49:13Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas API on Spark, Does it run on a multi-node cluster?</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-api-on-spark-does-it-run-on-a-multi-node-cluster/m-p/26897#M18890</link>
      <description>&lt;P&gt;Thanks for your reply.&lt;/P&gt;&lt;P&gt;I just want to confirm that Pandas API on Spark uses Spark's parallelism, that is, that computations run across multiple nodes.&lt;/P&gt;</description>
      <pubDate>Tue, 25 Oct 2022 09:02:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-api-on-spark-does-it-run-on-a-multi-node-cluster/m-p/26897#M18890</guid>
      <dc:creator>Mado</dc:creator>
      <dc:date>2022-10-25T09:02:02Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas API on Spark, Does it run on a multi-node cluster?</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-api-on-spark-does-it-run-on-a-multi-node-cluster/m-p/26894#M18887</link>
      <description>&lt;P&gt;Hi @Mohammad Saber,&lt;/P&gt;&lt;P&gt;A pandas dataset lives on a single machine and is naturally iterable locally within that machine. A pandas-on-Spark dataset, however, lives across multiple machines and is computed in a distributed manner. It is difficult to iterate over locally, and users are very likely to collect the entire dataset to the client side without realizing it. Therefore, it is best to stick to the pandas-on-Spark APIs.&lt;/P&gt;&lt;P&gt;Please refer to:&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#use-pandas-api-on-spark-directly-whenever-possible" alt="https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#use-pandas-api-on-spark-directly-whenever-possible" target="_blank"&gt;https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#use-pandas-api-on-spark-directly-whenever-possible&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html" alt="https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html" target="_blank"&gt;https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/languages/pandas-spark.html" alt="https://docs.databricks.com/languages/pandas-spark.html" target="_blank"&gt;https://docs.databricks.com/languages/pandas-spark.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Please let us know if you need further clarification. We are more than happy to assist you further.&lt;/P&gt;</description>
      <pubDate>Tue, 18 Oct 2022 12:46:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-api-on-spark-does-it-run-on-a-multi-node-cluster/m-p/26894#M18887</guid>
      <dc:creator>Debayan</dc:creator>
      <dc:date>2022-10-18T12:46:39Z</dc:date>
    </item>
  </channel>
</rss>

