I'm seeing some weird behavior with Apache Spark, which I run from a Python notebook on Azure Databricks. I have a DataFrame with two columns of interest: name and ftime.
I sometimes get duplicated rows and sometimes not, depending on how I fetch the data:
df.where(col('name') == 'test').where(col('ftime') == '2022-07-18').count()
# Result is 1
But when I run
len(df.where(col('name') == 'test').where(col('ftime') == '2022-07-18').collect())
# Result is 2
I get a result of 2 rows, which are exactly identical. The two cells are run one after the other, and the order doesn't change anything.
I tried creating a temp view in spark with
df.createOrReplaceTempView('df_referential')
but I run into the same problem:
SELECT name, ftime, COUNT(*)
FROM df_referential
GROUP BY name, ftime
HAVING COUNT(*) > 1
returns no result, while
SELECT *
FROM df_referential
WHERE name = 'test' AND ftime = '2022-07-18'
returns two rows, perfectly identical.
And if I try to
df.filter((col('name') == 'test') & (col('ftime') == '2022-07-18')).show()
I get 2 rows, exactly identical, but
df.filter((col('name') == 'test') & (col('ftime') == '2022-07-18')).select('name', 'ftime').show()
gives only one row.
I'm having a hard time understanding why this happens. I expect all of these to return only one row, and the JSON file the data is read from contains only one occurrence of the row.
If someone can point me to what I'm doing wrong, that would be a great help.