<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Delta Live Tables has duplicates created by multiple workers in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27064#M18967</link>
    <description>&lt;P&gt;I use 'toLocalIterator' to iterate over a dataframe and build a dictionary, which I then loop over. For each key in the dictionary, a dataframe is returned after processing and appended to a list of dataframes. Using reduce, I create one DF from the list of DFs, and the DLT table is created from that.&lt;/P&gt;&lt;P&gt;Based on your answer, I removed the append and reduce functions and instead call union directly as each dataframe is returned. This solved the duplicate issue.&lt;/P&gt;&lt;P&gt;Thank you for the hint.&lt;/P&gt;</description>
    <pubDate>Mon, 28 Feb 2022 14:22:08 GMT</pubDate>
    <dc:creator>SM</dc:creator>
    <dc:date>2022-02-28T14:22:08Z</dc:date>
    <item>
      <title>Delta Live Tables has duplicates created by multiple workers</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27058#M18961</link>
      <description>&lt;P&gt;Hello, I am working with Delta Live Tables. I am trying to create a DLT table from a combination of dataframes produced in a 'for loop'; the dataframes are unioned, and the DLT table is then created over the unioned dataframe. However, I noticed that the Delta table has duplicates, and the number of duplicates per unique row equals the number of workers in the pool cluster. How can I avoid this duplication?&lt;/P&gt;</description>
      <pubDate>Fri, 25 Feb 2022 07:40:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27058#M18961</guid>
      <dc:creator>SM</dc:creator>
      <dc:date>2022-02-25T07:40:54Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Tables has duplicates created by multiple workers</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27059#M18962</link>
      <description>&lt;P&gt;Are you sure it is not because of your loop?&lt;/P&gt;</description>
      <pubDate>Fri, 25 Feb 2022 09:54:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27059#M18962</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-02-25T09:54:14Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Tables has duplicates created by multiple workers</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27060#M18963</link>
      <description>&lt;P&gt;Please share your code. As @Werner Stinckens said, it is likely a different issue.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Feb 2022 12:59:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27060#M18963</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-02-25T12:59:56Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Tables has duplicates created by multiple workers</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27061#M18964</link>
      <description>&lt;PRE&gt;&lt;CODE&gt;from functools import reduce
from itertools import chain

import dlt
from pyspark.sql import DataFrame
from pyspark.sql.functions import (col, create_map, explode, input_file_name,
                                   lit, regexp_extract)
from pyspark.sql.types import StringType

# asset_str, aspect_str, fact_columns and filepath are defined elsewhere (not shown).

def map_explode(df):
  # Turn every column except 'timestamp' and 'path' into (metric, value) rows.
  metric = create_map(list(chain(*(
    (lit(name), col(name).cast(StringType())) for name in df.columns if name not in ['timestamp', 'path']
  ))))
  df = df.withColumn('mapCol', metric)
  result = df.select('timestamp', 'path', explode(df.mapCol).alias('metric', 'value'))
  return result

def load_data(path):
  data = (spark.read
          .format("parquet")
          .option("recursiveFileLookup", "true")
          .load(path))
  columns = data.columns
  sft_df = (data.withColumn("timestamp", (col("_time") / 1000).cast("timestamp"))
            .withColumn("path", input_file_name())
            .select('timestamp', 'path', *columns))

  df_transpose = map_explode(sft_df)
  df_transpose = (df_transpose.withColumn("asset", regexp_extract(col("path"), asset_str, 0))
                  .withColumn("aspect", regexp_extract(col("path"), aspect_str, 0))
                  .select(*fact_columns))
  return df_transpose

# Module-level state: both objects persist between calls to fact_table().
aspect_paths = {}
df_append = []

@dlt.table()
def fact_table():
  aspect_master = spark.read.table("default.master").select("name")
  for var in aspect_master.toLocalIterator():
    aspect_name = str(var["name"])
    # Collect the paths under each 'entityId' folder whose name contains the aspect name.
    aspect_paths.update({aspect_name: [
      aspect.path
      for asset in dbutils.fs.ls(filepath) if 'entityId' in asset.name
      for aspect in dbutils.fs.ls(asset.path) if aspect_name in aspect.name
    ]})

  for asp in aspect_paths.keys():
    if len(aspect_paths[asp]) != 0:
      data = load_data(aspect_paths[asp])
      df_append.append(data)

  fact_df = reduce(DataFrame.unionAll, df_append)
  return fact_df&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This is the code. When I use it to write directly to a Delta table, without DLT, I don't see any duplicates.&lt;/P&gt;</description>
      <pubDate>Mon, 28 Feb 2022 06:53:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27061#M18964</guid>
      <dc:creator>SM</dc:creator>
      <dc:date>2022-02-28T06:53:12Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Tables has duplicates created by multiple workers</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27062#M18965</link>
      <description>&lt;P&gt;Hm, hard to tell. You use a mix of PySpark and plain Python objects; perhaps that is the reason, as some parts will be executed on the driver and others on the workers.&lt;/P&gt;&lt;P&gt;Can I ask why you use toLocalIterator and append to a list (df_append), which you then reduce with functools?&lt;/P&gt;</description>
      <pubDate>Mon, 28 Feb 2022 07:34:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27062#M18965</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-02-28T07:34:24Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Tables has duplicates created by multiple workers</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27063#M18966</link>
      <description>&lt;P&gt;The shared code is not suitable for Delta Live Tables. DLT tables behave somewhat like streaming tables, and here every run loops over the Hive metastore and loads data for all tables, so the results can be unexpected. Just use a normal job notebook without Delta Live Tables.&lt;/P&gt;</description>
      <pubDate>Mon, 28 Feb 2022 10:20:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27063#M18966</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-02-28T10:20:30Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Tables has duplicates created by multiple workers</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27064#M18967</link>
      <description>&lt;P&gt;I use 'toLocalIterator' to iterate over a dataframe and build a dictionary, which I then loop over. For each key in the dictionary, a dataframe is returned after processing and appended to a list of dataframes. Using reduce, I create one DF from the list of DFs, and the DLT table is created from that.&lt;/P&gt;&lt;P&gt;Based on your answer, I removed the append and reduce functions and instead call union directly as each dataframe is returned. This solved the duplicate issue.&lt;/P&gt;&lt;P&gt;Thank you for the hint.&lt;/P&gt;</description>
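The thread does not establish why the duplicates appeared. One possible mechanism, offered here only as an assumption, is that the table function ran more than once while the module-level list df_append kept its earlier entries. The sketch below uses plain Python lists as hypothetical stand-ins for DataFrames (load_rows, build_with_shared_list and build_with_running_union are illustrative names, not from the thread) to contrast that pattern with a running union held in a local variable.

```python
from functools import reduce


def load_rows(aspect):
    # Hypothetical stand-in for load_data(): two rows per aspect.
    return [f"{aspect}-row1", f"{aspect}-row2"]


# Module-level accumulator, comparable to df_append in the thread.
shared_parts = []


def build_with_shared_list(aspects):
    # Appends to the module-level list, so entries from earlier calls remain.
    for aspect in aspects:
        shared_parts.append(load_rows(aspect))
    return reduce(lambda left, right: left + right, shared_parts)


def build_with_running_union(aspects):
    # Keeps the accumulator local and unions each part as it is produced.
    result = None
    for aspect in aspects:
        rows = load_rows(aspect)
        result = rows if result is None else result + rows
    return result


aspects = ["a", "b"]

first = build_with_shared_list(aspects)
second = build_with_shared_list(aspects)   # a second evaluation of the same function

clean_first = build_with_running_union(aspects)
clean_second = build_with_running_union(aspects)

print(len(first), len(second))              # the second call also returns the first call's rows
print(len(clean_first), len(clean_second))  # both calls return the same rows
```

With real DataFrames, the list concatenation would correspond to DataFrame.union. The only point of the sketch is that the accumulator is created inside the function, so repeated evaluations cannot carry rows over.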
      <pubDate>Mon, 28 Feb 2022 14:22:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27064#M18967</guid>
      <dc:creator>SM</dc:creator>
      <dc:date>2022-02-28T14:22:08Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Tables has duplicates created by multiple workers</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27065#M18968</link>
      <description>&lt;P&gt;Mixing Python and PySpark often causes issues.&lt;/P&gt;&lt;P&gt;It is better to go all-in on pandas and then convert to a dataframe, or to go all-in on PySpark (as you have done now).&lt;/P&gt;</description>
      <pubDate>Mon, 28 Feb 2022 14:25:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27065#M18968</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-02-28T14:25:49Z</dc:date>
    </item>
    <item>
      <title>Re: Delta Live Tables has duplicates created by multiple workers</title>
      <link>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27066#M18969</link>
      <description>&lt;P&gt;@Shikha Mathew - Does your last answer mean that your issue is resolved? Would you be happy to mark whichever answer helped as best? Or, if it wasn't a specific one, would you tell us what worked?&lt;/P&gt;</description>
      <pubDate>Mon, 07 Mar 2022 00:25:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/delta-live-tables-has-duplicates-created-by-multiple-workers/m-p/27066#M18969</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-03-07T00:25:10Z</dc:date>
    </item>
  </channel>
</rss>

