<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cannot display DataFrame when I filter by length in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11690#M6634</link>
    <description>&lt;P&gt;@Werner Stinckens It worked after writing it out and reading it back in. So I guess your theory that the query plan is too complex is correct.&lt;/P&gt;</description>
    <pubDate>Wed, 03 Aug 2022 10:53:12 GMT</pubDate>
    <dc:creator>cralle</dc:creator>
    <dc:date>2022-08-03T10:53:12Z</dc:date>
    <item>
      <title>Cannot display DataFrame when I filter by length</title>
      <link>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11684#M6628</link>
      <description>&lt;P&gt;I have a DataFrame that I created from a couple of datasets and multiple operations. The DataFrame has multiple columns, one of which is an array of strings. But when I try to filter the DataFrame on the size of this array column and execute the command, nothing happens: I get no Spark jobs, no stages, and the Ganglia cluster report shows no computation happening.&lt;/P&gt;&lt;P&gt;All I see in my cell is "Running command...". I am sure the filter is the cause, because I can display and count the DataFrame, but as soon as I add a filter Databricks gets stuck. I also know that the size function is not the cause, because I can add a new column with the size of the array and still display and count fine afterwards.&lt;/P&gt;&lt;P&gt;The DataFrame that I want to filter is called &lt;I&gt;df_merged_mapped&lt;/I&gt;. I can do this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;print(df_merged_mapped.count())
&amp;gt; 50414&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I can also do:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;print(df_merged_mapped.withColumn("len", F.size("new_topics")).count())
&amp;gt; 50414&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;But if I do this, nothing happens:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;print(df_merged_mapped.withColumn("len", F.size("new_topics")).filter("len &amp;gt; 0").count())&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;It just says "Running command...", no stages or jobs ever appear, and the cluster terminates once the idle timeout (30 minutes) triggers.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1671i036378DF15ED2F41/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;And this is what the Ganglia cluster report looks like (I run a single-node cluster):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1676i4A409ED4DD8AD9A8/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have also tried the following, and each of them stalls and never completes:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df_merged_mapped.createOrReplaceTempView("tmp")
df_new = spark.sql("SELECT *, size(new_topics) AS len FROM tmp WHERE size(new_topics) &amp;gt; 0")&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;print(df_merged_mapped.filter("size(new_topics) &amp;gt; 0").count())&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Does anybody have any ideas why this is happening, and what I can do to solve the problem?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;EDIT:&lt;/P&gt;&lt;P&gt;Run on cluster using DBR 10.4 LTS&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 02 Aug 2022 06:12:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11684#M6628</guid>
      <dc:creator>cralle</dc:creator>
      <dc:date>2022-08-02T06:12:09Z</dc:date>
    </item>
    <item>
      <title>Re: Cannot display DataFrame when I filter by length</title>
      <link>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11685#M6629</link>
      <description>&lt;P&gt;Strange, it works fine here. What version of Databricks are you on?&lt;/P&gt;&lt;P&gt;To identify the issue, you could output the query plan (.explain).&lt;/P&gt;&lt;P&gt;Creating a new DataFrame for each transformation could also help. That way you can check step by step where things go wrong.&lt;/P&gt;</description>
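A minimal sketch of the plan inspection suggested in this reply, assuming a PySpark DataFrame on Spark 3.0 or later (where explain() accepts a mode argument) and assuming explain() prints through Python's stdout. The helper name plan_line_count is hypothetical, not part of the thread or of Spark. Comparing the count after each transformation gives a rough idea of where the plan grows. If planning itself is what stalls, the explain() call may also take a long time, which would be a hint in the same direction.

```python
import io
from contextlib import redirect_stdout


def plan_line_count(df, mode="simple"):
    """Return how many lines df.explain(mode=...) prints.

    Assumption: `df` is a PySpark DataFrame and explain() writes the plan
    to Python's stdout, so the text can be captured with redirect_stdout.
    """
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        df.explain(mode=mode)
    return len(buffer.getvalue().splitlines())


# Hypothetical usage with the names from the question:
#   from pyspark.sql import functions as F
#   step = df_merged_mapped.withColumn("len", F.size("new_topics"))
#   print(plan_line_count(df_merged_mapped), plan_line_count(step))
```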
      <pubDate>Tue, 02 Aug 2022 09:57:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11685#M6629</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-08-02T09:57:34Z</dc:date>
    </item>
    <item>
      <title>Re: Cannot display DataFrame when I filter by length</title>
      <link>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11686#M6630</link>
      <description>&lt;P&gt;DBR 10.4 LTS.&lt;/P&gt;&lt;P&gt;I run a lot of transformations and operations before this filter. But since I can display/count the DataFrame before the filter, I assume the filter is what causes the problem.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Aug 2022 10:45:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11686#M6630</guid>
      <dc:creator>cralle</dc:creator>
      <dc:date>2022-08-02T10:45:37Z</dc:date>
    </item>
    <item>
      <title>Re: Cannot display DataFrame when I filter by length</title>
      <link>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11687#M6631</link>
      <description>&lt;P&gt;I don't see why it would generate an error, since it works fine here.&lt;/P&gt;&lt;P&gt;Perhaps, as a test, serialize the DataFrame (write it out and read it back in) and then try the filter + count.&lt;/P&gt;&lt;P&gt;If that works, the cause is probably a query plan that is too complex, or adaptive query execution (AQE) misbehaving.&lt;/P&gt;</description>
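The write-and-read-again test described in this reply could look roughly like the sketch below. It assumes an active SparkSession named spark and a writable storage location. The helper name write_and_reload and the example path are hypothetical. The reloaded DataFrame's plan starts from a plain Parquet scan, so the earlier transformations are no longer part of it.

```python
def write_and_reload(spark, df, path):
    """Write `df` to Parquet at `path`, then read it back.

    Assumptions: `spark` is an active SparkSession, `df` is a PySpark
    DataFrame, and `path` is a location the cluster can write to.
    """
    # Materialize the DataFrame; this runs all transformations up to here.
    df.write.mode("overwrite").parquet(path)
    # The returned DataFrame's plan is just a scan of the written files.
    return spark.read.parquet(path)


# Hypothetical usage with the names from the question (the path is made up):
#   from pyspark.sql import functions as F
#   df_flat = write_and_reload(spark, df_merged_mapped, "/tmp/df_merged_mapped")
#   print(df_flat.filter(F.size("new_topics") > 0).count())
```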
      <pubDate>Tue, 02 Aug 2022 10:49:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11687#M6631</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-08-02T10:49:11Z</dc:date>
    </item>
    <item>
      <title>Re: Cannot display DataFrame when I filter by length</title>
      <link>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11688#M6632</link>
      <description>&lt;P&gt;If the query plan is that complex, try checkpointing to break the plan in two.&lt;/P&gt;&lt;P&gt;Also debug by calling .show() or display() on the DataFrame one step before running .filter("len &amp;gt; 0") in the notebook.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Aug 2022 13:37:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11688#M6632</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-08-02T13:37:59Z</dc:date>
    </item>
    <item>
      <title>Re: Cannot display DataFrame when I filter by length</title>
      <link>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11689#M6633</link>
      <description>&lt;P&gt;What do you mean by checkpointing?&lt;/P&gt;&lt;P&gt;Also, I have used display as I described in the question: I can do display/count directly before the filter and it works fine, but if I do it directly after the filter, Spark just gets stuck and does nothing.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Aug 2022 05:34:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11689#M6633</guid>
      <dc:creator>cralle</dc:creator>
      <dc:date>2022-08-03T05:34:12Z</dc:date>
    </item>
    <item>
      <title>Re: Cannot display DataFrame when I filter by length</title>
      <link>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11690#M6634</link>
      <description>&lt;P&gt;@Werner Stinckens It worked after writing it out and reading it back in. So I guess your theory that the query plan is too complex is correct.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Aug 2022 10:53:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11690#M6634</guid>
      <dc:creator>cralle</dc:creator>
      <dc:date>2022-08-03T10:53:12Z</dc:date>
    </item>
    <item>
      <title>Re: Cannot display DataFrame when I filter by length</title>
      <link>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11691#M6635</link>
      <description>&lt;P&gt;Checkpointing writes an intermediate DataFrame to disk. That way the whole logic leading up to that DataFrame can be forgotten.&lt;/P&gt;&lt;P&gt;Basically it is the same as writing to Parquet and reading it back.&lt;/P&gt;&lt;P&gt;There are some technical differences, though; you can find more info &lt;A href="https://dzone.com/articles/what-are-spark-checkpoints-on-dataframes" alt="https://dzone.com/articles/what-are-spark-checkpoints-on-dataframes" target="_blank"&gt;here&lt;/A&gt; (and on other sites too).&lt;/P&gt;&lt;P&gt;So applying a checkpoint somewhere in your code creates two or more smaller query plans instead of one huge one.&lt;/P&gt;&lt;P&gt;That could be a solution, or you could figure out how to make the query simpler...&lt;/P&gt;</description>
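As a rough illustration of the checkpointing described in this reply, the sketch below uses PySpark's SparkContext.setCheckpointDir and DataFrame.checkpoint. The helper name checkpoint_dataframe and the directory in the usage comment are hypothetical, and the sketch assumes the cluster can write to that directory. The returned DataFrame carries a truncated logical plan, so later steps such as the filter are planned against the checkpointed data rather than the full chain of transformations.

```python
def checkpoint_dataframe(spark, df, checkpoint_dir):
    """Checkpoint `df` and return the DataFrame with its plan truncated.

    Assumptions: `spark` is an active SparkSession, `df` is a PySpark
    DataFrame, and `checkpoint_dir` is a writable directory.
    """
    # Spark needs a checkpoint directory before checkpoint() can be used.
    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    # eager=True materializes the data immediately instead of on first action.
    return df.checkpoint(eager=True)


# Hypothetical usage with the names from the question (the path is made up):
#   from pyspark.sql import functions as F
#   df_cp = checkpoint_dataframe(spark, df_merged_mapped, "/tmp/checkpoints")
#   print(df_cp.filter(F.size("new_topics") > 0).count())
```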
      <pubDate>Wed, 03 Aug 2022 13:09:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/cannot-display-dataframe-when-i-filter-by-length/m-p/11691#M6635</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-08-03T13:09:06Z</dc:date>
    </item>
  </channel>
</rss>

