<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Inconsistent duplicated row with Spark (Databricks on MS Azure) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/inconsistent-duplicated-row-with-spark-databricks-on-ms-azure/m-p/12548#M7348</link>
    <description>&lt;P&gt;I'm having a weird behavior with Apache Spark, which I run in a Python Notebook on Azure Databricks. I have a dataframe with some data, with 2 columns of interest: name and ftime&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I found that I sometime have duplicated values, sometime not, depending on how I fetch the data:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.where(col('name') == 'test').where(col('ftime') == '2022-07-18').count()
# Result is 1&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;But when I run&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;len(df.where(col('name') == 'test').where(col('ftime') == '2022-07-18').collect())
&amp;nbsp;
# Result is 2&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;, I now have a result of 2 rows, which are exactly the same. Those two cells are ran one after the other, the order doesn't change anything.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried creating a temp view in spark with&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.createOrReplaceTempView('df_referential')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;but I run in the same problem:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT name, ftime, COUNT(*)
FROM df_referential
GROUP BY name, ftime
HAVING COUNT(*) &amp;gt; 1&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;returns no result, while&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT *
FROM df_referential
WHERE name = 'test' AND ftime = '2022-07-18'&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;returns two rows, perfectly identical.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;And if I try to&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.filter((col('name') == 'test') &amp;amp; (col('ftime') == '2022-07-18')).show()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have 2 rows, exactly identical, but&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.filter((col('name') == 'test') &amp;amp; (col('ftime') == '2022-07-18')).select('name', 'ftime').show()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;gives only one row&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm having a hard time understanding why this happens. I expect these to returns only one row, and the JSON file that the data is read from contains only one occurrence of the data.&lt;/P&gt;&lt;P&gt;If someone can point me at what I'm doing wrong, this would be of great help&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 26 Jul 2022 08:41:24 GMT</pubDate>
    <dc:creator>whatthespark</dc:creator>
    <dc:date>2022-07-26T08:41:24Z</dc:date>
    <item>
      <title>Inconsistent duplicated row with Spark (Databricks on MS Azure)</title>
      <link>https://community.databricks.com/t5/data-engineering/inconsistent-duplicated-row-with-spark-databricks-on-ms-azure/m-p/12548#M7348</link>
      <description>&lt;P&gt;I'm having a weird behavior with Apache Spark, which I run in a Python Notebook on Azure Databricks. I have a dataframe with some data, with 2 columns of interest: name and ftime&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I found that I sometime have duplicated values, sometime not, depending on how I fetch the data:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.where(col('name') == 'test').where(col('ftime') == '2022-07-18').count()
# Result is 1&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;But when I run&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;len(df.where(col('name') == 'test').where(col('ftime') == '2022-07-18').collect())
&amp;nbsp;
# Result is 2&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;, I now have a result of 2 rows, which are exactly the same. Those two cells are ran one after the other, the order doesn't change anything.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried creating a temp view in spark with&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.createOrReplaceTempView('df_referential')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;but I run in the same problem:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT name, ftime, COUNT(*)
FROM df_referential
GROUP BY name, ftime
HAVING COUNT(*) &amp;gt; 1&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;returns no result, while&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT *
FROM df_referential
WHERE name = 'test' AND ftime = '2022-07-18'&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;returns two rows, perfectly identical.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;And if I try to&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.filter((col('name') == 'test') &amp;amp; (col('ftime') == '2022-07-18')).show()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have 2 rows, exactly identical, but&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.filter((col('name') == 'test') &amp;amp; (col('ftime') == '2022-07-18')).select('name', 'ftime').show()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;gives only one row&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm having a hard time understanding why this happens. I expect these to returns only one row, and the JSON file that the data is read from contains only one occurrence of the data.&lt;/P&gt;&lt;P&gt;If someone can point me at what I'm doing wrong, this would be of great help&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2022 08:41:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/inconsistent-duplicated-row-with-spark-databricks-on-ms-azure/m-p/12548#M7348</guid>
      <dc:creator>whatthespark</dc:creator>
      <dc:date>2022-07-26T08:41:24Z</dc:date>
    </item>
    <item>
      <title>Re: Inconsistent duplicated row with Spark (Databricks on MS Azure)</title>
      <link>https://community.databricks.com/t5/data-engineering/inconsistent-duplicated-row-with-spark-databricks-on-ms-azure/m-p/12549#M7349</link>
      <description>&lt;P&gt;I would like to see how you create the df dataframe.&lt;/P&gt;&lt;P&gt;In pyspark you can get weird results if you do not clear state, or when you reuse dataframe names.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2022 09:55:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/inconsistent-duplicated-row-with-spark-databricks-on-ms-azure/m-p/12549#M7349</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-07-26T09:55:40Z</dc:date>
    </item>
    <item>
      <title>Re: Inconsistent duplicated row with Spark (Databricks on MS Azure)</title>
      <link>https://community.databricks.com/t5/data-engineering/inconsistent-duplicated-row-with-spark-databricks-on-ms-azure/m-p/12550#M7350</link>
      <description>&lt;P&gt;I mount an AZ blob storage with this&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;try:
  dbutils.fs.mount(
    source = "wasbs://client@client.blob.core.windows.net",
    mount_point = "/mnt/client",
    extra_configs = {"fs.azure.account.key."+storage_account_name+".blob.core.windows.net": storage_account_access_key})
except Exception as e:
  pass&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I then load the data JSON files with:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import types
&amp;nbsp;
def extra_cat_name(categories):
    """Extract the categories name, handling the fact that different types co-exist because of data migration"""
    true = True
    false = False
    null = None
    if isinstance(categories, str):
        categories = eval(categories)
    try:
        return [categories['translatedNames'][0]['translation']]
    except Exception as e:
        return categories
&amp;nbsp;
cat_extract = spark.udf.register("cat_extract", extra_cat_name, types.ArrayType(types.StringType()))
&amp;nbsp;
def add_time_from_filename(df, indicator):
    """Add a timestamp column extracted from the filename
    :param df: the dataframe to work on
    :param indicator: the last word before the date in the file name"""
    return df\
            .withColumn('file_name', F.input_file_name())\
            .withColumn('regtime', F.regexp_extract('file_name', f'(_{indicator}_)(.*)(\.json)', 2))\
            .withColumn('ftime', F.to_date(col("regtime"),"yyyy_MM_dd"))\
&amp;nbsp;
def load_data():
  referential_files = f"/mnt/client/{GIVEN_CLIENT_NAME}/internal_api/referential_data/*.json"
  user_files = f"/mnt/client/{GIVEN_CLIENT_NAME}/internal_api/users_data/*.json"
&amp;nbsp;
  
  df_users = spark.read.json(user_files)
  df = spark.read.json(referential_files)\
       .withColumn("data", F.explode("skills")).select("data.*")\
       .withColumn('categories', cat_extract('categories'))
  df = add_time_from_filename(df, 'referential')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2022 10:24:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/inconsistent-duplicated-row-with-spark-databricks-on-ms-azure/m-p/12550#M7350</guid>
      <dc:creator>whatthespark</dc:creator>
      <dc:date>2022-07-26T10:24:48Z</dc:date>
    </item>
    <item>
      <title>Re: Inconsistent duplicated row with Spark (Databricks on MS Azure)</title>
      <link>https://community.databricks.com/t5/data-engineering/inconsistent-duplicated-row-with-spark-databricks-on-ms-azure/m-p/12551#M7351</link>
      <description>&lt;P&gt;ok so each time you do an action, the code for your df is executed.&lt;/P&gt;&lt;P&gt;I suspect if you clear state between the two cells that your issue is gone.&lt;/P&gt;&lt;P&gt;Or try to assign to a new dataframe like&lt;/P&gt;&lt;P&gt;  df2 = add_time_from_filename(df, 'referential')&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2022 11:06:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/inconsistent-duplicated-row-with-spark-databricks-on-ms-azure/m-p/12551#M7351</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-07-26T11:06:24Z</dc:date>
    </item>
    <item>
      <title>Re: Inconsistent duplicated row with Spark (Databricks on MS Azure)</title>
      <link>https://community.databricks.com/t5/data-engineering/inconsistent-duplicated-row-with-spark-databricks-on-ms-azure/m-p/12552#M7352</link>
      <description>&lt;P&gt;Thanks a lot, it seems that disabling the extra_cat_name(categories) udf fixes the duplicates, I however don't understant why, if you have an idea&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jul 2022 12:43:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/inconsistent-duplicated-row-with-spark-databricks-on-ms-azure/m-p/12552#M7352</guid>
      <dc:creator>whatthespark</dc:creator>
      <dc:date>2022-07-26T12:43:57Z</dc:date>
    </item>
  </channel>
</rss>

