cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

LIKE Operator Fails with COLLATE UNICODE_CI Columns

Phani1
Databricks MVP

 

Hi All,

We are experiencing problem while we are querying the table which has collate unicode_ci attribute enabled, Kindly find the below findings and let me know if any alternative approach to fix this issue?

Problem Statement

The LIKE operator in a query does not work when COLLATE UNICODE_CI is enabled on a table column, even though the = (equals) operator works fine with the same column and value.
 

Steps to Reproduce

1. Table Created with UNICODE_CI Property on Name Column (Case-Insensitive)

CREATE TABLE test_catalog_ph.bronze.my_table_cs_new2 (
id INT,
name STRING COLLATE UNICODE_CI
);

2. Sample Data Inserted

INSERT INTO test_catalog_ph.bronze.my_table_cs_new2 VALUES (1, 'TEST');
INSERT INTO test_catalog_ph.bronze.my_table_cs_new2 VALUES (2, 'test');

Query with = Operator = Works

SELECT * FROM test_catalog_ph.bronze.my_table_cs_new2
WHERE name = 'Test';

-- Returns 2 rows: 'TEST' and 'test' (case-insensitive match works)

Query with LIKE Operator — Fails

SELECT * FROM test_catalog_ph.bronze.my_table_cs_new2
WHERE name LIKE 'Test';

-- Throws error

Error Message

[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "name LIKE Test"
due to data type mismatch: The first parameter requires the "STRING" type,
however "name" has the type "STRING COLLATE UNICODE_CI".
SQLSTATE: 42K09; line 102, pos 65

Our Findings

The root cause is a data type mismatch between the two sides of the LIKE expression:

Left side (column)-STRING COLLATE UNICODE_CI

Right side (literal)-STRING (UTF8_BINARY)

The Concern

  • The LIKE operator is one of the most commonly used SQL operators for pattern matching
  • Most end users and BI tools (including Power BI, Tableau, etc.) rely heavily on LIKE for filtering
  • If = works with UNICODE_CI columns but LIKE does not, this creates an inconsistent and confusing experience
  • End users should not be expected to add COLLATE UNICODE_CI to every LIKE query manually

Regards,

Phani

 

1 ACCEPTED SOLUTION

Accepted Solutions

SteveOstrowski
Databricks Employee
Databricks Employee

Hi,

This is a known limitation in Databricks SQL. The LIKE operator only supports a subset of collations, and UNICODE_CI is NOT one of them.


WHICH COLLATIONS DOES LIKE SUPPORT?

Per the official LIKE documentation, the supported collations are limited to:

- UTF8_BINARY
- UTF8_BINARY_RTRIM
- UTF8_LCASE
- UTF8_LCASE_RTRIM

If you try to use LIKE on a column with any other collation (including UNICODE_CI, UNICODE, or any locale-based collation), it will fail.

Docs: https://docs.databricks.com/aws/en/sql/language-manual/functions/like.html


WORKAROUNDS

There are a few approaches depending on your use case:

1. Use contains(), startswith(), or endswith() instead of LIKE

The Databricks documentation explicitly recommends these functions as alternatives when working with collations not supported by LIKE. These functions respect the collation of the input string, so they will honor your UNICODE_CI setting:

-- Instead of: SELECT * FROM my_table WHERE name LIKE '%john%'
SELECT * FROM my_table WHERE contains(name, 'john')

-- Instead of: SELECT * FROM my_table WHERE name LIKE 'john%'
SELECT * FROM my_table WHERE startswith(name, 'john')

-- Instead of: SELECT * FROM my_table WHERE name LIKE '%john'
SELECT * FROM my_table WHERE endswith(name, 'john')

Since your column uses UNICODE_CI, these functions will perform case-insensitive matching automatically -- no need for LOWER() wrapping.


2. Use LOWER() / UPPER() with LIKE on a non-collated column

If you need the full wildcard pattern-matching power of LIKE (e.g., LIKE 'J_hn%' with single-character wildcards), you can wrap both sides in LOWER():

SELECT * FROM my_table WHERE LOWER(name) LIKE LOWER('%John%')

Note: This approach may not give you the same Unicode-aware case folding that UNICODE_CI provides (e.g., certain locale-specific characters). It uses simple lowercase conversion.


3. Use ILIKE for case-insensitive pattern matching (with caveats)

The ILIKE operator performs case-insensitive LIKE matching, but it also only works with UTF8_BINARY and UTF8_LCASE collations. So you would need to explicitly cast the column:

SELECT * FROM my_table WHERE name COLLATE UTF8_BINARY ILIKE '%john%'

This forces the comparison to use UTF8_BINARY collation for that expression, bypassing the UNICODE_CI collation on the column. The trade-off is you lose UNICODE_CI's more sophisticated Unicode-aware case folding.

Docs: https://docs.databricks.com/aws/en/sql/language-manual/functions/ilike.html


4. Switch the column collation to UTF8_LCASE

If your primary need is case-insensitive comparison (and you don't specifically need the ICU Unicode-aware ordering that UNICODE_CI provides), consider using UTF8_LCASE instead:

ALTER TABLE my_table ALTER COLUMN name SET COLLATION UTF8_LCASE;

UTF8_LCASE is case-insensitive AND fully supported by both LIKE and ILIKE. The difference is that UTF8_LCASE does simple byte-level lowercase conversion while UNICODE_CI uses the ICU library for more linguistically accurate case folding.


SUMMARY

- contains() / startswith() / endswith(): Supports UNICODE_CI, no wildcards, Unicode-aware
- LOWER() + LIKE: Bypasses collation, full LIKE patterns, limited Unicode awareness
- ILIKE with COLLATE UTF8_BINARY: Overrides collation, full LIKE patterns, limited Unicode awareness
- Switch to UTF8_LCASE: Different collation, full LIKE patterns, limited Unicode awareness


DOCUMENTATION REFERENCES

- LIKE operator: https://docs.databricks.com/aws/en/sql/language-manual/functions/like.html
- Collation in Databricks SQL: https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation.html
- Supported collations: https://docs.databricks.com/aws/en/sql/language-manual/functions/collations.html
- ILIKE operator: https://docs.databricks.com/aws/en/sql/language-manual/functions/ilike.html
- Collation limitations with Delta Lake: https://docs.databricks.com/aws/en/delta/collation.html

The key takeaway is that LIKE has limited collation support by design, and contains()/startswith()/endswith() are the recommended alternatives for UNICODE_CI columns.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.

View solution in original post

4 REPLIES 4

Shetty_1338
New Contributor III

Greetings @Phani1 ,

This is an expected behaviour and follows ANSI rules. ANSI SQL requires operands of comparison operators (including LIKE) to have compatible character types and collations.

Since implicit conversion is not applied for `LIKE`, the query fails with a datatype mismatch error.

Below are two alternative approaches that can be used to address this limitation based on my experience.

Option 1 — View Wrapper (Recommended for End Users & BI Tools)

A compatibility view can be created by casting the collated column to a standard `STRING` datatype. End users and BI tools can then query the view without needing to handle collation explicitly.

Ex:

CREATE VIEW test_catalog_ph.bronze.my_table_cs_new2_vw AS

SELECT

    id,

    CAST(name AS STRING) AS name

FROM test_catalog_ph.bronze.my_table_cs_new2;

Usage:

SELECT *

FROM test_catalog_ph.bronze.my_table_cs_new2_vw

WHERE name LIKE 'Test';

Advantages:

* No changes required in user queries.

* Fully compatible with BI tools such as Power BI or Tableau.

* Hides collation complexity from end users.

* Centralized solution applied once at the view layer.

Option 2 — Helper Scalar Function

A reusable SQL function can be created to encapsulate the casting logic required for pattern matching.

Example Function

CREATE FUNCTION ci_like(col STRING, pattern STRING)

RETURNS BOOLEAN

RETURN CAST(col AS STRING) LIKE pattern;

Usage:

SELECT *

FROM test_catalog_ph.bronze.my_table_cs_new2

WHERE ci_like(name, 'Test');

Advantages:

* Standardizes pattern matching logic.

* Avoids repeated casting expressions in queries.

* Provides a cleaner and reusable workaround.

NOTE: Queries must use the helper function instead of the native `LIKE` operator.

Option 3: Alter to String if there is an option.

Alter the column data type to string if your use case does not need it to be collate.

Since automatic implicit conversion is not currently supported for `LIKE` operations on collated columns, the above approaches can be used as a workaround.

Hope this helps! 

Saicharan 

Phani1
Databricks MVP

Thanks for your quick response and for providing the solution. However, the view-based approach is causing the same issue—since it still involves converting the UNICODE_CI column to a regular string, we may lose case-insensitive behavior and not get the expected results.

SteveOstrowski
Databricks Employee
Databricks Employee

Hi,

This is a known limitation in Databricks SQL. The LIKE operator only supports a subset of collations, and UNICODE_CI is NOT one of them.


WHICH COLLATIONS DOES LIKE SUPPORT?

Per the official LIKE documentation, the supported collations are limited to:

- UTF8_BINARY
- UTF8_BINARY_RTRIM
- UTF8_LCASE
- UTF8_LCASE_RTRIM

If you try to use LIKE on a column with any other collation (including UNICODE_CI, UNICODE, or any locale-based collation), it will fail.

Docs: https://docs.databricks.com/aws/en/sql/language-manual/functions/like.html


WORKAROUNDS

There are a few approaches depending on your use case:

1. Use contains(), startswith(), or endswith() instead of LIKE

The Databricks documentation explicitly recommends these functions as alternatives when working with collations not supported by LIKE. These functions respect the collation of the input string, so they will honor your UNICODE_CI setting:

-- Instead of: SELECT * FROM my_table WHERE name LIKE '%john%'
SELECT * FROM my_table WHERE contains(name, 'john')

-- Instead of: SELECT * FROM my_table WHERE name LIKE 'john%'
SELECT * FROM my_table WHERE startswith(name, 'john')

-- Instead of: SELECT * FROM my_table WHERE name LIKE '%john'
SELECT * FROM my_table WHERE endswith(name, 'john')

Since your column uses UNICODE_CI, these functions will perform case-insensitive matching automatically -- no need for LOWER() wrapping.


2. Use LOWER() / UPPER() with LIKE on a non-collated column

If you need the full wildcard pattern-matching power of LIKE (e.g., LIKE 'J_hn%' with single-character wildcards), you can wrap both sides in LOWER():

SELECT * FROM my_table WHERE LOWER(name) LIKE LOWER('%John%')

Note: This approach may not give you the same Unicode-aware case folding that UNICODE_CI provides (e.g., certain locale-specific characters). It uses simple lowercase conversion.


3. Use ILIKE for case-insensitive pattern matching (with caveats)

The ILIKE operator performs case-insensitive LIKE matching, but it also only works with UTF8_BINARY and UTF8_LCASE collations. So you would need to explicitly cast the column:

SELECT * FROM my_table WHERE name COLLATE UTF8_BINARY ILIKE '%john%'

This forces the comparison to use UTF8_BINARY collation for that expression, bypassing the UNICODE_CI collation on the column. The trade-off is you lose UNICODE_CI's more sophisticated Unicode-aware case folding.

Docs: https://docs.databricks.com/aws/en/sql/language-manual/functions/ilike.html


4. Switch the column collation to UTF8_LCASE

If your primary need is case-insensitive comparison (and you don't specifically need the ICU Unicode-aware ordering that UNICODE_CI provides), consider using UTF8_LCASE instead:

ALTER TABLE my_table ALTER COLUMN name SET COLLATION UTF8_LCASE;

UTF8_LCASE is case-insensitive AND fully supported by both LIKE and ILIKE. The difference is that UTF8_LCASE does simple byte-level lowercase conversion while UNICODE_CI uses the ICU library for more linguistically accurate case folding.


SUMMARY

- contains() / startswith() / endswith(): Supports UNICODE_CI, no wildcards, Unicode-aware
- LOWER() + LIKE: Bypasses collation, full LIKE patterns, limited Unicode awareness
- ILIKE with COLLATE UTF8_BINARY: Overrides collation, full LIKE patterns, limited Unicode awareness
- Switch to UTF8_LCASE: Different collation, full LIKE patterns, limited Unicode awareness


DOCUMENTATION REFERENCES

- LIKE operator: https://docs.databricks.com/aws/en/sql/language-manual/functions/like.html
- Collation in Databricks SQL: https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation.html
- Supported collations: https://docs.databricks.com/aws/en/sql/language-manual/functions/collations.html
- ILIKE operator: https://docs.databricks.com/aws/en/sql/language-manual/functions/ilike.html
- Collation limitations with Delta Lake: https://docs.databricks.com/aws/en/delta/collation.html

The key takeaway is that LIKE has limited collation support by design, and contains()/startswith()/endswith() are the recommended alternatives for UNICODE_CI columns.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.

Thanks for your quick response and solution on this!!