<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Skewness / Salting with countDistinct in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/skewness-salting-with-countdistinct/m-p/106300#M42441</link>
    <description>&lt;P&gt;you can make use of databricks native feature "Liquid Clustering", cluster by the columns which you want to use in grouping statements, it will handle the performance issue due to data skewness .&lt;/P&gt;&lt;P&gt;For more information, please do visit :&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/en/delta/clustering.html" target="_blank"&gt;https://docs.databricks.com/en/delta/clustering.html&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 20 Jan 2025 11:52:12 GMT</pubDate>
    <dc:creator>Avinash_Narala</dc:creator>
    <dc:date>2025-01-20T11:52:12Z</dc:date>
    <item>
      <title>Skewness / Salting with countDistinct</title>
      <link>https://community.databricks.com/t5/data-engineering/skewness-salting-with-countdistinct/m-p/81919#M36450</link>
      <description>&lt;P&gt;Hey Everyone,&lt;/P&gt;&lt;P&gt;I experience data skewness for:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = (source_df
.unionByName(source_df.withColumn("region", lit("Country")))
.groupBy("zip_code", "region", "device_type")
.agg(countDistinct("device_id").alias("total_active_unique"), count("device_id").alias("total_active"))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Stats:&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2024-08-05 at 17.24.08.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/10142i4089442A2EDB55FE/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999" role="button" title="Screenshot 2024-08-05 at 17.24.08.png" alt="Screenshot 2024-08-05 at 17.24.08.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Is there a way to work with data skewness where I need to calculate a countDistinct in aggregation and avoid affecting the results?&lt;/P&gt;&lt;P&gt;I understand how to work with data skewness by adding salting, but it seems to be fine with count.&lt;BR /&gt;But when countDistinct comes to the picture salting seems to be affecting the results.&lt;BR /&gt;Is there some tricky way to still apply salting and secure deterministic results for countDistinct?&lt;BR /&gt;Or is there some other approach in such case to be applied for data skewness?&lt;/P&gt;</description>
      <pubDate>Mon, 05 Aug 2024 15:25:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skewness-salting-with-countdistinct/m-p/81919#M36450</guid>
      <dc:creator>KosmaS</dc:creator>
      <dc:date>2024-08-05T15:25:15Z</dc:date>
    </item>
    <item>
      <title>Re: Skewness / Salting with countDistinct</title>
      <link>https://community.databricks.com/t5/data-engineering/skewness-salting-with-countdistinct/m-p/82230#M36568</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/112385"&gt;@KosmaS&lt;/a&gt;,&amp;nbsp;To address data skewness with `countDistinct`, you can use several techniques:&lt;/P&gt;
&lt;P&gt;Double Aggregation involves salting the data, performing an aggregation, then removing the salt and aggregating again to reduce skewness.&lt;/P&gt;
&lt;P&gt;HyperLogLog (HLL) provides approximate results for `countDistinct`, balancing accuracy and performance.&lt;/P&gt;
&lt;P&gt;Broadcast Joins can help with small skewed datasets by reducing shuffle issues, while Partitioning the data based on skewed keys can distribute the load more evenly.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;If you have any more details or specific constraints, feel free to share!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 07 Aug 2024 13:30:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skewness-salting-with-countdistinct/m-p/82230#M36568</guid>
      <dc:creator>Retired_mod</dc:creator>
      <dc:date>2024-08-07T13:30:55Z</dc:date>
    </item>
    <item>
      <title>Re: Skewness / Salting with countDistinct</title>
      <link>https://community.databricks.com/t5/data-engineering/skewness-salting-with-countdistinct/m-p/83896#M37058</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;thanks for the reply. I tried to spend some time on your response.&lt;/P&gt;&lt;P&gt;You're suggesting 'double aggregation' and as I'd be guessing it should look more or less this way:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = (source_df
.unionByName(source_df.withColumn("region", lit("Country")))
.groupBy("zip_code", "region", "device_type", "salt")
.agg(countDistinct("device_id").alias("total_active_unique"), count("device_id").alias("total_active"))
.groupBy("zip_code", "region", "device_type")
.agg(countDistinct("device_id").alias("total_active_unique"), count("device_id").alias("total_active"))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So I can't see how countDistinct value won't be affected by salt. It'll be affected at the first step (with salt), so the second step will have inaccurate results. Or should this be done a bit differently? And did you mean something else?&lt;/P&gt;</description>
      <pubDate>Thu, 22 Aug 2024 08:17:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skewness-salting-with-countdistinct/m-p/83896#M37058</guid>
      <dc:creator>KosmaS</dc:creator>
      <dc:date>2024-08-22T08:17:08Z</dc:date>
    </item>
    <item>
      <title>Re: Skewness / Salting with countDistinct</title>
      <link>https://community.databricks.com/t5/data-engineering/skewness-salting-with-countdistinct/m-p/106193#M42420</link>
      <description>&lt;P&gt;What about salt function is function on device_id produces mutually exclusive results like hash(device_id) % 101 and then one more aggregation to sum of these counts group by zip_code, region, device_type&lt;/P&gt;</description>
      <pubDate>Sat, 18 Jan 2025 19:24:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skewness-salting-with-countdistinct/m-p/106193#M42420</guid>
      <dc:creator>singhvikash86</dc:creator>
      <dc:date>2025-01-18T19:24:29Z</dc:date>
    </item>
    <item>
      <title>Re: Skewness / Salting with countDistinct</title>
      <link>https://community.databricks.com/t5/data-engineering/skewness-salting-with-countdistinct/m-p/106300#M42441</link>
      <description>&lt;P&gt;you can make use of databricks native feature "Liquid Clustering", cluster by the columns which you want to use in grouping statements, it will handle the performance issue due to data skewness .&lt;/P&gt;&lt;P&gt;For more information, please do visit :&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/en/delta/clustering.html" target="_blank"&gt;https://docs.databricks.com/en/delta/clustering.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 20 Jan 2025 11:52:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skewness-salting-with-countdistinct/m-p/106300#M42441</guid>
      <dc:creator>Avinash_Narala</dc:creator>
      <dc:date>2025-01-20T11:52:12Z</dc:date>
    </item>
  </channel>
</rss>

