cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Case insensitive data

dpc
Contributor III

For all it's positives, one of the first general issues we had with databricks was case sensitivity.

We have a lot of data specific filters in our code

Problem is, we land and view data from lots of different case insensitive source systems e.g. SQL Server

As such, we have to be very careful with our code and convert columns to UPPER when making a comparison.

Most of our code is written in SQL.

 

About 18 months ago I asked whether there was going to be a catalog, schema or table setting for this i.e. make the object case insensitive.

I was told it was on its way.

Not heard anything since and cannot find anything.

 

Does anybody know whether this is in place or expected?

 

Thanks

 

2 ACCEPTED SOLUTIONS

Accepted Solutions

emma_s
Databricks Employee
Databricks Employee

Hi, You can set the default collation at Catalog level  or schema level and the tables in the catalog will inherit the collation. This is supported from DBR 17.1 and above.

View solution in original post

szymon_dybczak
Esteemed Contributor III

Hi @dpc ,

Like @emma_s  mentioned - you can set it at catalog/schema. Table will inherit collation. But you can also define it explicitly for a specific column within a table using collate modifier

szymon_dybczak_1-1768585044149.png

 

 

View solution in original post

8 REPLIES 8

szymon_dybczak
Esteemed Contributor III

Hi @dpc ,

I think you can try to use a collation for that purpose. A collation is a set of rules that determines how string comparisons are performed. Collations are used to compare strings in a case-insensitive, accent-insensitive, or trailing space insensitive manner, or to sort strings in a specific language-aware order.

Collation | Databricks on AWS

dpc
Contributor III

Thanks.

Collation is table specific though isn't it? and you have to apply it to each columns.

Is there a was to just say, this schema, catalog or table is case insensitive or can you only do it by column?

szymon_dybczak
Esteemed Contributor III

Hi @dpc ,

Like @emma_s  mentioned - you can set it at catalog/schema. Table will inherit collation. But you can also define it explicitly for a specific column within a table using collate modifier

szymon_dybczak_1-1768585044149.png

 

 

emma_s
Databricks Employee
Databricks Employee

Hi, You can set the default collation at Catalog level  or schema level and the tables in the catalog will inherit the collation. This is supported from DBR 17.1 and above.

dpc
Contributor III

Thanks.

I'll test collation at catalog, sschema and table level using 17.1

dpc
Contributor III

Thanks @emma_s  and @szymon_dybczak

Looks positive.

Some datatype issues moving to 17.1 but that's a separate issue.

I've run some simple tests and it works well so will trial it on out full data set

szymon_dybczak
Esteemed Contributor III

Great @dpc , good to hear that 🙂

dpc
Contributor III

It works but there's a scenario that causes an issue.

If I create a schema with defaultcollation UTF8_LCASE 

Then create a table, it marks all the string columns as UTF8_LCASE 

Which is fine and works

 

If I create the table, in the newly created UTF8_LCASE schema from an existing table using:

 

CREATE TABLE <destination> AS
SELECT *
FROM <source>
WHERE 1 = 2;
 
 
The table is UTF8_LCASE but all the columns aren't
So, when  I use it, it remains case sensitive
 
Any thoughts?
Am I doing something wrong here as I really want to create tables that are structurally the same with the exception of case
I am doing this as I land the data and the initial table is derived from the landed structure, before moving it.
 
Thanks