<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: RLS function : concat vs list in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/rls-function-concat-vs-list/m-p/117311#M45479</link>
    <description>&lt;P&gt;Checking.&lt;/P&gt;</description>
    <pubDate>Thu, 01 May 2025 07:00:36 GMT</pubDate>
    <dc:creator>NandiniN</dc:creator>
    <dc:date>2025-05-01T07:00:36Z</dc:date>
    <item>
      <title>RLS function : concat vs list</title>
      <link>https://community.databricks.com/t5/data-engineering/rls-function-concat-vs-list/m-p/110046#M43471</link>
      <description>&lt;P&gt;Hello all,&amp;nbsp;&lt;BR /&gt;I'm designing a function to implement RLS on Unity Catalog for multiple tables of different size (1k to 10G rows).&lt;BR /&gt;RLS will be applied on two columns and 150+ groups.&lt;BR /&gt;I wonder what would be more performant :&lt;/P&gt;&lt;P&gt;Solution 1: exhaustive (boring) listing of combinations&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;(col1 = 'A1' and col2 = 'B1' and is_account_group_member('GP-A1-B1'))
or (col1 = 'A2' and col2 = 'B2' and is_account_group_member('GP-A2-B2'))
or (col1 = 'A3' and col2 = 'B3' and is_account_group_member('GP-A3-B3'))
or (col1 = 'A4' and col2 = 'B4' and is_account_group_member('GP-A4-B4'))
or (col1 = 'A5' and col2 = 'B5' and is_account_group_member('GP-A5-B5'))
or (col1 = 'A6' and col2 = 'B6' and is_account_group_member('GP-A6-B6'))
... * 150&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Solution 2: simple with concat&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;is_account_group_member(concat('GP-',col1,'-', col2))&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN class=""&gt;Of course, I would prefer the second option but I'm afraid of the "cost" of the concat versus using literals.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Did someone already experiment that ?&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 12 Feb 2025 21:20:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/rls-function-concat-vs-list/m-p/110046#M43471</guid>
      <dc:creator>MaximeGendre</dc:creator>
      <dc:date>2025-02-12T21:20:23Z</dc:date>
    </item>
    <item>
      <title>Re: RLS function : concat vs list</title>
      <link>https://community.databricks.com/t5/data-engineering/rls-function-concat-vs-list/m-p/117311#M45479</link>
      <description>&lt;P&gt;Checking.&lt;/P&gt;</description>
      <pubDate>Thu, 01 May 2025 07:00:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/rls-function-concat-vs-list/m-p/117311#M45479</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-05-01T07:00:36Z</dc:date>
    </item>
    <item>
      <title>Re: RLS function : concat vs list</title>
      <link>https://community.databricks.com/t5/data-engineering/rls-function-concat-vs-list/m-p/137033#M50689</link>
      <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The more performant solution for Row-Level Security (RLS) in Unity Catalog, when applying to two columns and 150+ groups, generally depends on how much of the access check logic can be pushed into efficient, indexable predicates versus computed at runtime per row.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Solution Comparison&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Solution 1 (Exhaustive Listing)&lt;/STRONG&gt;&lt;BR /&gt;This approach writes out all group+column combinations as separate OR conditions. The optimizer can potentially leverage column statistics and indexes for the direct comparisons, leading to better query plans if those columns are indexed. However, as the number of combinations grows (such as your 150+ groups), the WHERE clause becomes very large, possibly overwhelming the optimizer or leading to very complex plans that are hard to cache and maintain. Managing and updating the logic is also a major operational headache.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Solution 2 (Concat in Function Argument)&lt;/STRONG&gt;&lt;BR /&gt;Here you use a dynamic function call:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;is_account_group_member(concat('GP-', col1, '-', col2))&lt;/CODE&gt;. This is far more maintainable. However, this approach introduces row-wise string operations and may make optimization less efficient, especially for very large tables, since the concatenation and function evaluation can't be indexed and must execute for every row, potentially impacting scan times.​&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Performance and Databricks Recommendations&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Databricks UDF/ABAC Policy Best Practices:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Avoid per-row expensive logic inside custom functions.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Heavy per-row computations, dynamic string building, and checks like&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;is_account_group_member()&lt;/CODE&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;within the row filter function make the queries less scalable for large tables.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Whenever possible, keep access logic outside UDFs or use indexed predicates for best performance.​&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;Known Bottlenecks:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Row filters that perform row-wise function calls (e.g., string concat + group membership check) on massive tables (10G+ rows) will always have some runtime penalty.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;Large, unindexed OR chains as in Solution 1 can cause the optimizer to generate slow plans or, in some engines, result in full table scans, though they perform better if all columns in the condition are indexed and the list is not too large.​&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;What to Do&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;For small tables or moderate group counts, the difference may be negligible.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;For large tables (&amp;gt; 1M rows), avoid dynamic per-row logic (Solution 2) if you need maximum performance.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;If possible,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;materialize the group information as a third column during ETL&lt;/STRONG&gt;, and then use a simple indexed equality check in your RLS, e.g.,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;CODE&gt;target_group = current_user_group&lt;/CODE&gt;, which is highly performant.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;If group memberships are dynamic and must be checked row-wise, measure both approaches on a large sample set with realistic filters to see actual impact.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Recommendation&lt;/H2&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;If absolute performance is a must (especially in the 10G row case), use materialized columns and indexed predicates, not string concatenation or long OR chains.​&lt;BR /&gt;If maintainability and flexibility are more important and performance hit is acceptable, use the concat-based function for clarity.&lt;BR /&gt;For Databricks Unity Catalog, prefer ABAC policy patterns with indexed access where feasible, and profile on your data.​&lt;/P&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;In summary, while Solution 2 (concat) is simpler, it almost always performs worse for very large tables because it cannot exploit indexing or optimized predicate pushdown, especially in Databricks/Unity Catalog context.​&lt;/P&gt;</description>
      <pubDate>Fri, 31 Oct 2025 15:15:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/rls-function-concat-vs-list/m-p/137033#M50689</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-10-31T15:15:39Z</dc:date>
    </item>
  </channel>
</rss>

