3 weeks ago
We are experiencing problem while we are querying the table which has collate unicode_ci attribute enabled, Kindly find the below findings and let me know if any alternative approach to fix this issue?
CREATE TABLE test_catalog_ph.bronze.my_table_cs_new2 (
id INT,
name STRING COLLATE UNICODE_CI
);
INSERT INTO test_catalog_ph.bronze.my_table_cs_new2 VALUES (1, 'TEST');
INSERT INTO test_catalog_ph.bronze.my_table_cs_new2 VALUES (2, 'test');
SELECT * FROM test_catalog_ph.bronze.my_table_cs_new2
WHERE name = 'Test';
-- ✅ Returns 2 rows: 'TEST' and 'test' (case-insensitive match works)
SELECT * FROM test_catalog_ph.bronze.my_table_cs_new2
WHERE name LIKE 'Test';
-- ❌ Throws error
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "name LIKE Test"
due to data type mismatch: The first parameter requires the "STRING" type,
however "name" has the type "STRING COLLATE UNICODE_CI".
SQLSTATE: 42K09; line 102, pos 65
The root cause is a data type mismatch between the two sides of the LIKE expression:
Left side (column)-STRING COLLATE UNICODE_CI
Right side (literal)-STRING (UTF8_BINARY)
Regards,
Phani
2 weeks ago
Hi,
This is a known limitation in Databricks SQL. The LIKE operator only supports a subset of collations, and UNICODE_CI is NOT one of them.
WHICH COLLATIONS DOES LIKE SUPPORT?
Per the official LIKE documentation, the supported collations are limited to:
- UTF8_BINARY
- UTF8_BINARY_RTRIM
- UTF8_LCASE
- UTF8_LCASE_RTRIM
If you try to use LIKE on a column with any other collation (including UNICODE_CI, UNICODE, or any locale-based collation), it will fail.
Docs: https://docs.databricks.com/aws/en/sql/language-manual/functions/like.html
WORKAROUNDS
There are a few approaches depending on your use case:
1. Use contains(), startswith(), or endswith() instead of LIKE
The Databricks documentation explicitly recommends these functions as alternatives when working with collations not supported by LIKE. These functions respect the collation of the input string, so they will honor your UNICODE_CI setting:
-- Instead of: SELECT * FROM my_table WHERE name LIKE '%john%'
SELECT * FROM my_table WHERE contains(name, 'john')
-- Instead of: SELECT * FROM my_table WHERE name LIKE 'john%'
SELECT * FROM my_table WHERE startswith(name, 'john')
-- Instead of: SELECT * FROM my_table WHERE name LIKE '%john'
SELECT * FROM my_table WHERE endswith(name, 'john')
Since your column uses UNICODE_CI, these functions will perform case-insensitive matching automatically -- no need for LOWER() wrapping.
2. Use LOWER() / UPPER() with LIKE on a non-collated column
If you need the full wildcard pattern-matching power of LIKE (e.g., LIKE 'J_hn%' with single-character wildcards), you can wrap both sides in LOWER():
SELECT * FROM my_table WHERE LOWER(name) LIKE LOWER('%John%')
Note: This approach may not give you the same Unicode-aware case folding that UNICODE_CI provides (e.g., certain locale-specific characters). It uses simple lowercase conversion.
3. Use ILIKE for case-insensitive pattern matching (with caveats)
The ILIKE operator performs case-insensitive LIKE matching, but it also only works with UTF8_BINARY and UTF8_LCASE collations. So you would need to explicitly cast the column:
SELECT * FROM my_table WHERE name COLLATE UTF8_BINARY ILIKE '%john%'
This forces the comparison to use UTF8_BINARY collation for that expression, bypassing the UNICODE_CI collation on the column. The trade-off is you lose UNICODE_CI's more sophisticated Unicode-aware case folding.
Docs: https://docs.databricks.com/aws/en/sql/language-manual/functions/ilike.html
4. Switch the column collation to UTF8_LCASE
If your primary need is case-insensitive comparison (and you don't specifically need the ICU Unicode-aware ordering that UNICODE_CI provides), consider using UTF8_LCASE instead:
ALTER TABLE my_table ALTER COLUMN name SET COLLATION UTF8_LCASE;
UTF8_LCASE is case-insensitive AND fully supported by both LIKE and ILIKE. The difference is that UTF8_LCASE does simple byte-level lowercase conversion while UNICODE_CI uses the ICU library for more linguistically accurate case folding.
SUMMARY
- contains() / startswith() / endswith(): Supports UNICODE_CI, no wildcards, Unicode-aware
- LOWER() + LIKE: Bypasses collation, full LIKE patterns, limited Unicode awareness
- ILIKE with COLLATE UTF8_BINARY: Overrides collation, full LIKE patterns, limited Unicode awareness
- Switch to UTF8_LCASE: Different collation, full LIKE patterns, limited Unicode awareness
DOCUMENTATION REFERENCES
- LIKE operator: https://docs.databricks.com/aws/en/sql/language-manual/functions/like.html
- Collation in Databricks SQL: https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation.html
- Supported collations: https://docs.databricks.com/aws/en/sql/language-manual/functions/collations.html
- ILIKE operator: https://docs.databricks.com/aws/en/sql/language-manual/functions/ilike.html
- Collation limitations with Delta Lake: https://docs.databricks.com/aws/en/delta/collation.html
The key takeaway is that LIKE has limited collation support by design, and contains()/startswith()/endswith() are the recommended alternatives for UNICODE_CI columns.
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
3 weeks ago
Greetings @Phani1 ,
This is an expected behaviour and follows ANSI rules. ANSI SQL requires operands of comparison operators (including LIKE) to have compatible character types and collations.
Since implicit conversion is not applied for `LIKE`, the query fails with a datatype mismatch error.
Below are two alternative approaches that can be used to address this limitation based on my experience.
Option 1 — View Wrapper (Recommended for End Users & BI Tools)
A compatibility view can be created by casting the collated column to a standard `STRING` datatype. End users and BI tools can then query the view without needing to handle collation explicitly.
Ex:
CREATE VIEW test_catalog_ph.bronze.my_table_cs_new2_vw AS
SELECT
id,
CAST(name AS STRING) AS name
FROM test_catalog_ph.bronze.my_table_cs_new2;
Usage:
SELECT *
FROM test_catalog_ph.bronze.my_table_cs_new2_vw
WHERE name LIKE 'Test';
Advantages:
* No changes required in user queries.
* Fully compatible with BI tools such as Power BI or Tableau.
* Hides collation complexity from end users.
* Centralized solution applied once at the view layer.
Option 2 — Helper Scalar Function
A reusable SQL function can be created to encapsulate the casting logic required for pattern matching.
Example Function
CREATE FUNCTION ci_like(col STRING, pattern STRING)
RETURNS BOOLEAN
RETURN CAST(col AS STRING) LIKE pattern;
Usage:
SELECT *
FROM test_catalog_ph.bronze.my_table_cs_new2
WHERE ci_like(name, 'Test');
Advantages:
* Standardizes pattern matching logic.
* Avoids repeated casting expressions in queries.
* Provides a cleaner and reusable workaround.
NOTE: Queries must use the helper function instead of the native `LIKE` operator.
Option 3: Alter to String if there is an option.
Alter the column data type to string if your use case does not need it to be collate.
Since automatic implicit conversion is not currently supported for `LIKE` operations on collated columns, the above approaches can be used as a workaround.
Hope this helps!
Saicharan
3 weeks ago
Thanks for your quick response and for providing the solution. However, the view-based approach is causing the same issue—since it still involves converting the UNICODE_CI column to a regular string, we may lose case-insensitive behavior and not get the expected results.
2 weeks ago
Hi,
This is a known limitation in Databricks SQL. The LIKE operator only supports a subset of collations, and UNICODE_CI is NOT one of them.
WHICH COLLATIONS DOES LIKE SUPPORT?
Per the official LIKE documentation, the supported collations are limited to:
- UTF8_BINARY
- UTF8_BINARY_RTRIM
- UTF8_LCASE
- UTF8_LCASE_RTRIM
If you try to use LIKE on a column with any other collation (including UNICODE_CI, UNICODE, or any locale-based collation), it will fail.
Docs: https://docs.databricks.com/aws/en/sql/language-manual/functions/like.html
WORKAROUNDS
There are a few approaches depending on your use case:
1. Use contains(), startswith(), or endswith() instead of LIKE
The Databricks documentation explicitly recommends these functions as alternatives when working with collations not supported by LIKE. These functions respect the collation of the input string, so they will honor your UNICODE_CI setting:
-- Instead of: SELECT * FROM my_table WHERE name LIKE '%john%'
SELECT * FROM my_table WHERE contains(name, 'john')
-- Instead of: SELECT * FROM my_table WHERE name LIKE 'john%'
SELECT * FROM my_table WHERE startswith(name, 'john')
-- Instead of: SELECT * FROM my_table WHERE name LIKE '%john'
SELECT * FROM my_table WHERE endswith(name, 'john')
Since your column uses UNICODE_CI, these functions will perform case-insensitive matching automatically -- no need for LOWER() wrapping.
2. Use LOWER() / UPPER() with LIKE on a non-collated column
If you need the full wildcard pattern-matching power of LIKE (e.g., LIKE 'J_hn%' with single-character wildcards), you can wrap both sides in LOWER():
SELECT * FROM my_table WHERE LOWER(name) LIKE LOWER('%John%')
Note: This approach may not give you the same Unicode-aware case folding that UNICODE_CI provides (e.g., certain locale-specific characters). It uses simple lowercase conversion.
3. Use ILIKE for case-insensitive pattern matching (with caveats)
The ILIKE operator performs case-insensitive LIKE matching, but it also only works with UTF8_BINARY and UTF8_LCASE collations. So you would need to explicitly cast the column:
SELECT * FROM my_table WHERE name COLLATE UTF8_BINARY ILIKE '%john%'
This forces the comparison to use UTF8_BINARY collation for that expression, bypassing the UNICODE_CI collation on the column. The trade-off is you lose UNICODE_CI's more sophisticated Unicode-aware case folding.
Docs: https://docs.databricks.com/aws/en/sql/language-manual/functions/ilike.html
4. Switch the column collation to UTF8_LCASE
If your primary need is case-insensitive comparison (and you don't specifically need the ICU Unicode-aware ordering that UNICODE_CI provides), consider using UTF8_LCASE instead:
ALTER TABLE my_table ALTER COLUMN name SET COLLATION UTF8_LCASE;
UTF8_LCASE is case-insensitive AND fully supported by both LIKE and ILIKE. The difference is that UTF8_LCASE does simple byte-level lowercase conversion while UNICODE_CI uses the ICU library for more linguistically accurate case folding.
SUMMARY
- contains() / startswith() / endswith(): Supports UNICODE_CI, no wildcards, Unicode-aware
- LOWER() + LIKE: Bypasses collation, full LIKE patterns, limited Unicode awareness
- ILIKE with COLLATE UTF8_BINARY: Overrides collation, full LIKE patterns, limited Unicode awareness
- Switch to UTF8_LCASE: Different collation, full LIKE patterns, limited Unicode awareness
DOCUMENTATION REFERENCES
- LIKE operator: https://docs.databricks.com/aws/en/sql/language-manual/functions/like.html
- Collation in Databricks SQL: https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-collation.html
- Supported collations: https://docs.databricks.com/aws/en/sql/language-manual/functions/collations.html
- ILIKE operator: https://docs.databricks.com/aws/en/sql/language-manual/functions/ilike.html
- Collation limitations with Delta Lake: https://docs.databricks.com/aws/en/delta/collation.html
The key takeaway is that LIKE has limited collation support by design, and contains()/startswith()/endswith() are the recommended alternatives for UNICODE_CI columns.
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
2 weeks ago
Thanks for your quick response and solution on this!!