Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

RLS function: concat vs list

MaximeGendre
New Contributor III

Hello all, 
I'm designing a function to implement RLS on Unity Catalog for multiple tables of different sizes (1k to 10G rows).
RLS will be applied based on two columns and 150+ groups.
I wonder which would be more performant:

Solution 1: exhaustive (boring) listing of combinations

 

(col1 = 'A1' and col2 = 'B1' and is_account_group_member('GP-A1-B1'))
or (col1 = 'A2' and col2 = 'B2' and is_account_group_member('GP-A2-B2'))
or (col1 = 'A3' and col2 = 'B3' and is_account_group_member('GP-A3-B3'))
or (col1 = 'A4' and col2 = 'B4' and is_account_group_member('GP-A4-B4'))
or (col1 = 'A5' and col2 = 'B5' and is_account_group_member('GP-A5-B5'))
or (col1 = 'A6' and col2 = 'B6' and is_account_group_member('GP-A6-B6'))
... * 150

 

Solution 2: simple with concat

 

is_account_group_member(concat('GP-',col1,'-', col2))

 

Of course, I would prefer the second option, but I'm afraid of the "cost" of the concat versus using literals.

Has anyone already experimented with this?

Thank you.

 

2 REPLIES

NandiniN
Databricks Employee

Checking.

mark_ott
Databricks Employee

The more performant solution for Row-Level Security (RLS) in Unity Catalog, when applied to two columns and 150+ groups, generally depends on how much of the access-check logic the optimizer can push down as static, prunable predicates versus compute at runtime for every row.

Solution Comparison

  • Solution 1 (Exhaustive Listing)
    This approach writes out all group+column combinations as separate OR conditions. Because every comparison is against a literal, the optimizer can potentially leverage column statistics and data skipping, leading to better query plans. However, as the number of combinations grows (such as your 150+ groups), the predicate becomes very large, possibly overwhelming the optimizer or producing complex plans that are hard to cache and maintain. Managing and updating the logic is also a major operational headache.

  • Solution 2 (Concat in Function Argument)
    Here you use a dynamic function call: is_account_group_member(concat('GP-', col1, '-', col2)). This is far more maintainable. However, it introduces row-wise string operations that the optimizer cannot prune with statistics: the concatenation and membership check must execute for every row, potentially impacting scan times on very large tables. A sketch of how this variant would be declared and attached follows this list.
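
For context, a Unity Catalog row filter is a SQL UDF bound to the table. A minimal sketch of Solution 2 as a complete filter, assuming placeholder names (rls_filter, my_catalog.my_schema.sales):

-- Filter function: one membership check per row, built from the two columns.
CREATE OR REPLACE FUNCTION my_catalog.my_schema.rls_filter(col1 STRING, col2 STRING)
RETURN is_account_group_member(concat('GP-', col1, '-', col2));

-- Bind the filter to the table; Unity Catalog applies it on every read.
ALTER TABLE my_catalog.my_schema.sales
SET ROW FILTER my_catalog.my_schema.rls_filter ON (col1, col2);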

Performance and Databricks Recommendations

  • Databricks UDF/ABAC Policy Best Practices:

    • Avoid per-row expensive logic inside custom functions.

    • Heavy per-row computations, dynamic string building, and checks like is_account_group_member() within the row filter function make the queries less scalable for large tables.

    • Whenever possible, keep heavy access logic out of the per-row expression, or use simple, statically prunable predicates, for best performance (see the mapping-table sketch below).
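
One common pattern for this is delegating the combination lookup to a small mapping table consulted inside the filter. A sketch, where the entitlements table and all names are hypothetical:

-- Hypothetical mapping table: one row per (col1, col2) combination and its group.
CREATE TABLE my_catalog.my_schema.entitlements (
  col1 STRING,
  col2 STRING,
  group_name STRING
);

-- The filter matches the row against the small table; is_account_group_member
-- only has to evaluate against the handful of matching entries.
CREATE OR REPLACE FUNCTION my_catalog.my_schema.rls_lookup(c1 STRING, c2 STRING)
RETURN EXISTS (
  SELECT 1 FROM my_catalog.my_schema.entitlements e
  WHERE e.col1 = c1
    AND e.col2 = c2
    AND is_account_group_member(e.group_name)
);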

  • Known Bottlenecks:

    • Row filters that perform row-wise function calls (e.g., string concat + group membership check) on massive tables (10G+ rows) will always have some runtime penalty.

    • Large OR chains, as in Solution 1, can cause the optimizer to generate slow plans or, in some engines, fall back to full table scans, though they perform better when the optimizer can prune on the listed columns and the list is not too large. Inspecting the plan with EXPLAIN, as sketched below, shows what actually happens on your tables.
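
A quick way to compare the two candidates is to look at the physical plan each one produces on a representative table (the table name below is a placeholder):

-- Run once with each filter variant attached, then compare the plans:
-- look for pushed-down literal predicates vs. per-row UDF evaluation.
EXPLAIN FORMATTED
SELECT count(*)
FROM my_catalog.my_schema.sales
WHERE col1 = 'A1';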

What to Do

  • For small tables or moderate group counts, the difference may be negligible.

  • For large tables (> 1M rows), avoid dynamic per-row logic (Solution 2) if you need maximum performance.

  • If possible, materialize the group name as a third column during ETL, so the RLS function reduces to a single membership check on a precomputed column, which is highly performant (see the sketch after this list).

  • If group memberships are dynamic and must be checked row-wise, measure both approaches on a large sample set with realistic filters to see actual impact.
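
A minimal sketch of the materialized-column approach, assuming the GP-<col1>-<col2> naming from the question (table and column names are placeholders):

-- During ETL: compute the group name once per row instead of on every read.
ALTER TABLE my_catalog.my_schema.sales ADD COLUMN rls_group STRING;

UPDATE my_catalog.my_schema.sales
SET rls_group = concat('GP-', col1, '-', col2);

-- The row filter then reduces to one membership check on a plain column.
CREATE OR REPLACE FUNCTION my_catalog.my_schema.rls_materialized(grp STRING)
RETURN is_account_group_member(grp);

ALTER TABLE my_catalog.my_schema.sales
SET ROW FILTER my_catalog.my_schema.rls_materialized ON (rls_group);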

Recommendation

If absolute performance is a must (especially in the 10G-row case), use materialized columns and simple static predicates, not per-row string concatenation or long OR chains.
If maintainability and flexibility matter more and some performance hit is acceptable, use the concat-based function for clarity.
For Databricks Unity Catalog, prefer ABAC-style policy patterns with cheap, prunable predicates where feasible, and profile on your own data.

In summary, while Solution 2 (concat) is simpler, it tends to perform worse on very large tables because the per-row expression cannot exploit data skipping or predicate pushdown, especially in the Databricks/Unity Catalog context.