Cannot display DataFrame when I filter by length

cralle
New Contributor II

I have a DataFrame that I have created based on a couple of datasets and multiple operations. The DataFrame has multiple columns, one of which is a array of strings. But when I take the DataFrame and try to filter based upon the size of this array column, and execute the command nothing happens, I get no Spark Jobs, no Stages and in the Ganglia cluster Report there is no computation happening.

All I see in my cell is "Running Command". I am sure it happens because of the filter, because I can display and do a count on the DataFrame, but as soon as I do a filter Databricks gets stuck. I also know that it is not the size function that causes it, because I can make a new colum with the size of the array and I can display and count fine after adding.

So I have a DataFrame I call df_merged_mapped. This is the DataFrame that I want to filter. So I can do this:

print(df_merged_mapped.count())
> 50414

I can also do:

print(df_merged_mapped.withColumn("len", F.size("new_topics")).count())
> 50414

But if I do this, then nothing will happen:

print(df_merged_mapped.withColumn("len", F.size("new_topics")).filter("len > 0").count())

It will just say "Running command...", no Stages or Jobs will ever appear, and the cluster just terminates after the idle period (30 minutes) triggers.

image 

And this is how Gangle cluster Report looks like (I run single node):

image 

I have also tried the following, and each of them stalls and never completes:

df_merged_mapped.createOrReplaceTempView("tmp")
df_new = spark.sql("SELECT *, size(new_topics) AS len FROM tmp WHERE size(new_topics) > 0")
print(df_merged_mapped.filter("size(new_topics) > 0").count())

Does anybody have any ideas why this is happening, and what I can do to solve the problem?

EDIT:

Run on cluster using DBR 10.4 LTS