<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Kernel switches to unknown using pyspark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18676#M12433</link>
    <description>&lt;P&gt;Thank you very much, I will try to do that, as it seems that is the problem! In the meantime, I managed to save the DataFrame to CSV and load it into pandas from there (converting directly from a Spark DataFrame to pandas did not work for me). Pandas works great with this dataset, as it is not very big. However, I am aware that it is not suitable for big data. So for big data, next time, I will try to connect to an existing Spark cluster.&lt;/P&gt;</description>
    <pubDate>Tue, 07 Jun 2022 13:42:30 GMT</pubDate>
    <dc:creator>SusuTheSeeker</dc:creator>
    <dc:date>2022-06-07T13:42:30Z</dc:date>
    <item>
      <title>Kernel switches to unknown using pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18671#M12428</link>
      <description>&lt;P&gt;I am working in a JupyterHub notebook, using PySpark DataFrames to analyze text. More precisely, I am doing sentiment analysis of newspaper articles. The code works up to a point where the kernel becomes busy and, after approximately 10 minutes, switches to "Unknown". The operations that trigger this include .drop() and .groupBy(). The dataset has only about 25k rows. The logs show this message:&lt;/P&gt;&lt;P&gt;[Stage 1:&amp;gt; (0 + 0) / 1] 22/06/02 09:30:17 WARN TaskSetManager: Stage 1 contains a task of very large size (234399 KiB). The maximum recommended task size is 1000 KiB.&lt;/P&gt;&lt;P&gt;After some research I found that this is probably caused by memory filling up. However, I am not sure how to increase the available memory.&lt;/P&gt;&lt;P&gt;To build the Spark application I use this code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder \
        .master("local") \
        .appName("x") \
        .config("spark.driver.memory", "2g") \
        .config("spark.executor.memory", "12g") \
        .getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Any ideas on how to stop the kernel from switching to "Unknown", or how to free memory? Note: I am not using RDDs, only Spark DataFrames.&lt;/P&gt;&lt;P&gt;I am sharing my notebook. This project is for my thesis and I am desperate to get the code working. I would be extremely thankful for any help!&lt;/P&gt;</description>
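A possible adjustment to the session setup in the question, offered as a sketch rather than a confirmed fix: with `.master("local")` Spark runs the driver and the executor work inside a single JVM on one thread, so `spark.executor.memory` has no effect there and `spark.driver.memory` is the setting that bounds the heap. The helper names `local_spark_conf` and `build_local_session` and the `8g` value are assumptions for illustration and do not come from the thread. The driver memory setting only applies if no Spark JVM is already running in the kernel, so the kernel would need a restart first.

```python
def local_spark_conf(driver_memory="8g", cores="*"):
    """Return settings for a single-machine Spark session.

    The defaults are illustrative assumptions; pick a driver_memory
    that fits the RAM actually available to the notebook server.
    """
    return {
        # local[*] uses all available cores instead of the single thread of "local"
        "spark.master": f"local[{cores}]",
        "spark.app.name": "x",
        # in local mode everything runs in the driver JVM, so this is the heap that matters
        "spark.driver.memory": driver_memory,
    }


def build_local_session(conf=None):
    """Create (or reuse) a SparkSession from a dict of settings."""
    # Imported lazily so the settings can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in (conf or local_spark_conf()).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_local_session()` in a fresh kernel would then replace the builder chain from the question; `spark.executor.memory` is deliberately left out because local mode does not start separate executors.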
      <pubDate>Mon, 06 Jun 2022 10:45:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18671#M12428</guid>
      <dc:creator>SusuTheSeeker</dc:creator>
      <dc:date>2022-06-06T10:45:09Z</dc:date>
    </item>
    <item>
      <title>Re: Kernel switches to unknown using pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18672#M12429</link>
      <description>&lt;P&gt;Do you actually run the code in a distributed environment (meaning a driver and multiple workers)?&lt;/P&gt;&lt;P&gt;If not, there is no benefit in using PySpark, as all code will be executed locally.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2022 10:14:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18672#M12429</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-06-07T10:14:25Z</dc:date>
    </item>
    <item>
      <title>Re: Kernel switches to unknown using pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18673#M12430</link>
      <description>&lt;P&gt;No, I do not. How could I do that?&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2022 10:33:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18673#M12430</guid>
      <dc:creator>SusuTheSeeker</dc:creator>
      <dc:date>2022-06-07T10:33:38Z</dc:date>
    </item>
    <item>
      <title>Re: Kernel switches to unknown using pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18674#M12431</link>
      <description>&lt;P&gt;Spark is a distributed data processing framework. For it to shine, you need multiple machines (VMs or physical). In local mode on a single node it is no better than pandas and similar tools.&lt;/P&gt;&lt;P&gt;So to start using Spark, you should either connect to an existing Spark cluster (if one is available to you) or, and this might be the easiest way, sign up for Databricks Community Edition and start using Databricks.&lt;/P&gt;&lt;P&gt;Mind that Community Edition is limited in functionality, but it is still very useful.&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/getting-started/quick-start.html" alt="https://docs.databricks.com/getting-started/quick-start.html" target="_blank"&gt;https://docs.databricks.com/getting-started/quick-start.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;If you cannot do either, stop using PySpark and focus on pure Python code.&lt;/P&gt;&lt;P&gt;You can still run into memory issues, though, because the code runs locally.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2022 10:38:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18674#M12431</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-06-07T10:38:47Z</dc:date>
    </item>
    <item>
      <title>Re: Kernel switches to unknown using pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18675#M12432</link>
      <description>&lt;P&gt;Are you a Databricks customer? You can use a notebook in the web UI and spin up a cluster very easily.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2022 12:21:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18675#M12432</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-06-07T12:21:59Z</dc:date>
    </item>
    <item>
      <title>Re: Kernel switches to unknown using pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18676#M12433</link>
      <description>&lt;P&gt;Thank you very much, I will try to do that, as it seems that is the problem! In the meantime, I managed to save the DataFrame to CSV and load it into pandas from there (converting directly from a Spark DataFrame to pandas did not work for me). Pandas works great with this dataset, as it is not very big. However, I am aware that it is not suitable for big data. So for big data, next time, I will try to connect to an existing Spark cluster.&lt;/P&gt;</description>
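The CSV detour described in this reply can be sketched as follows, assuming the file was produced by Spark's own writer (for example `df.write.csv(path, header=True)`), which creates a directory of `part-*.csv` files rather than a single file. The function name `read_spark_csv_dir` is hypothetical; the thread does not show the code that was actually used.

```python
import glob
import os

import pandas as pd


def read_spark_csv_dir(path):
    """Load every part file Spark wrote under `path` into one pandas DataFrame.

    Assumes each part file starts with a header row (header=True on write).
    """
    parts = sorted(glob.glob(os.path.join(path, "part-*.csv")))
    if not parts:
        raise FileNotFoundError(f"no part-*.csv files under {path}")
    # Read each part separately, then stack them with a fresh row index.
    frames = [pd.read_csv(part) for part in parts]
    return pd.concat(frames, ignore_index=True)
```

With roughly 25k rows the combined frame should fit comfortably in memory, and the operations that stalled in Spark have direct pandas counterparts (`DataFrame.drop` and `DataFrame.groupby`).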
      <pubDate>Tue, 07 Jun 2022 13:42:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18676#M12433</guid>
      <dc:creator>SusuTheSeeker</dc:creator>
      <dc:date>2022-06-07T13:42:30Z</dc:date>
    </item>
    <item>
      <title>Re: Kernel switches to unknown using pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18677#M12434</link>
      <description>&lt;P&gt;Yes, I think I am just a customer. I will try to do that, thank you!&lt;/P&gt;</description>
      <pubDate>Tue, 07 Jun 2022 13:43:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18677#M12434</guid>
      <dc:creator>SusuTheSeeker</dc:creator>
      <dc:date>2022-06-07T13:43:01Z</dc:date>
    </item>
    <item>
      <title>Re: Kernel switches to unknown using pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18679#M12436</link>
      <description>&lt;P&gt;Hi, unfortunately I do not have a solution. The solution would be to run the analysis on an existing Spark cluster. It seems that I was running Spark only locally, so all the computations were done on one machine, and that is why the kernel was failing.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Jun 2022 14:51:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/kernel-switches-to-unknown-using-pyspark/m-p/18679#M12436</guid>
      <dc:creator>SusuTheSeeker</dc:creator>
      <dc:date>2022-06-13T14:51:03Z</dc:date>
    </item>
  </channel>
</rss>

