How to make delta table column values case-insensitive?

prasadvaze
Valued Contributor

We have many Delta tables with string columns as unique keys (the PK in a traditional relational DB), and we don't want a new row inserted just because the key value differs only in case.

It's a lot of code change to use the upper/lower functions in every column-value comparison (in the upsert logic), so I'm looking for an alternative.

I see that a CHECK constraint on a Delta table column can enforce consistent-case values, but it's too late; I already have mixed-case data in the tables.

Is there anything similar to SQL Server's collation feature?

spark.conf.set('spark.sql.caseSensitive', False) does not work as expected (meaning string comparison between mixed-case values still shows two different strings).

I also looked at spark.conf.set('spark.databricks.analyzer.batchResolveRelations', False), in vain.

I have tried the 7.3 LTS and 9.1 LTS Databricks runtimes on Azure.


8 REPLIES

Kaniz
Community Manager

Hi @prasadvaze! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise I will get back to you soon. Thanks.

prasadvaze
Valued Contributor
ACCEPTED SOLUTION

@Kaniz Fatma​ Thanks for reaching out, Kaniz. After much research, I came to the conclusion that this feature is not available in Spark or Delta Lake, so this would be a new feature request for the Spark community. JSON and Parquet are case-sensitive, so that may be the root cause.

Thank you for the update. Could you mark your response as the best answer? It might help future readers.

-werners-
Esteemed Contributor III

I don't recall there being collation support in Spark/Delta Lake.

Also, preventing data corruption/loss is definitely a main focus for Databricks, so I don't think there is an easy way to fix this.

What I would do is the following:

Overwrite the tables that have mixed-case values to uppercase (or lowercase, your choice).

That fixes your current data.

For the data you want to upsert, you can create a wrapper function around spark.read.parquet (or csv or json or whatever you are ingesting) which converts string columns to uppercase (see the sketch after this reply).

We had to do this for a similar issue (trimming all string columns).

Or you can just always call the upper/lower function.

Perhaps you can even convert everything to upper/lower while copying it to your storage.

But both approaches require work.

I don't see a quick solution.

You can of course leave the data as it is, and downstream always take into account that the data is mixed case. So when reading this mixed data, always apply upper() in filters etc.
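
A minimal sketch of such a wrapper, assuming a Databricks notebook where spark is already in scope (the name readParquetUpper and the single-path signature are illustrative, not -werners-' actual code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.StringType

// Read a parquet path and uppercase every string column, so values land
// in the Delta table with a consistent case.
def readParquetUpper(path: String): DataFrame = {
  val raw = spark.read.parquet(path)
  raw.schema.fields.foldLeft(raw) { (df, field) =>
    if (field.dataType == StringType) df.withColumn(field.name, upper(col(field.name)))
    else df
  }
}

The same fold works for csv or json sources by swapping the reader call.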

jose_gonzalez
Moderator

Hi @prasad vaze​,

By default, "spark.sql.caseSensitive" is false. If you would like to enforce case sensitivity, you need to set it to true. This feature has been available since Spark 2.x; you can find more details in this Jira.

Like @Werner Stinckens​ said, you can rename the column names and create a new table with the new columns.

@Jose Gonzalez​ I am solving for case-sensitive values inside the column, not the case-sensitive name of the column. spark.sql.caseSensitive = true makes column names case-sensitive, not the values in the column (see the sketch below).
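
A quick sketch of that distinction, assuming a notebook where spark is in scope:

import spark.implicits._

spark.conf.set("spark.sql.caseSensitive", "true")

val df = Seq("ABC", "abc").toDF("id")

// Column-NAME resolution is now strict: df.select($"ID") would fail
// analysis because only "id" exists.

// Column VALUES are unaffected by the setting either way:
df.filter($"id" === "abc").count()  // 1 -- "ABC" still does not match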

Sandeep
Contributor III

To make comparisons case-insensitive, we can convert both sides to upper or lower case before comparing.

E.g.:

import org.apache.spark.sql.functions._

df.filter(lower($"colname") === "somevalue")  // matches "SomeValue", "SOMEVALUE", etc.
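
The same idea extends to a case-insensitive equi-join (df1, df2, and the key column name "key" are placeholders, not from the thread):

import org.apache.spark.sql.functions.lower

// Join two DataFrames on a shared business key, ignoring case.
val joined = df1.join(df2, lower(df1("key")) === lower(df2("key")))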

lizou
Contributor II

Well, the unintended benefit is that I am now using int/bigint surrogate keys for all tables (the preferred approach in a DW).

All joins are made on integer data types, and query efficiency has also improved.

The string matching with upper() is done only in the ETL, when comparing source and target data (a merge sketch follows below).
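
A minimal sketch of that ETL-side matching with the Delta Lake MERGE API (the table name target_table, key column business_key, and sourceDf are hypothetical placeholders):

import io.delta.tables.DeltaTable

val target = DeltaTable.forName(spark, "target_table")

// Upsert that matches keys case-insensitively, so "ABC" and "abc"
// update the same row instead of inserting a duplicate.
target.as("t")
  .merge(sourceDf.as("s"), "upper(t.business_key) = upper(s.business_key)")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()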
