topic Re: How to make delta table column values case-insensitive? in Data Engineering

How to make delta table column values case-insensitive?

prasadvaze — Wed, 20 Oct 2021 19:56:28 GMT

we have many delta tables with string columns as unique key (PK in traditional relational db) and we don't want to insert new row because key value only differs in case.

Its lot of code change to use upper/lower function on column value compare (in upsert logic) so looking for alternative

i see a CHECK constraint on delta table column can enforce consistent case value but its too late , i already have mixed case data in tables.

is there anything similar to sql server collation feature?

spark.conf.set('spark.sql.caseSensitive', False) does not work as expected (meaning string comparison between mixed case value shows i have 2 different strings)

Also looked up spark.conf.set('spark.databricks.analyzer.batchResolveRelations', False) in vein

I have tried 7.3LTS and 9.1LTS databricks on azure

Re: How to make delta table column values case-insensitive?

-werners- — Thu, 21 Oct 2021 11:27:59 GMT

I don't recall there being a collation in Spark/Delta Lake.

Also data corruption/loss is definitely a main focus of Databricks so I don´t think there is an easy way for fixing this.

What I would do is the following:

overwrite the tables which have mixed cases to uppercase (or lowercase, your choice).

That fixes your current data.

For the data which you want to upsert, you can create a wrapper function around spark.read.parquet (or csv or json or whatever you are ingesting) which translates string columns to uppercase.

We have to do this for a similar issue (trim all string columns).

Or you can just always call the upper/lower function.

Perhaps you can even translate everything to upper/lower while copying it to your storage.

But both cases require work.

I don´t see a quick solution.

You can ofc leave the data as it is, and downstream always take into account that the data is mixed case. So when reading in this mixed data, always apply an upper in filters etc.

Re: How to make delta table column values case-insensitive?

jose_gonzalez — Thu, 21 Oct 2021 16:33:15 GMT

hi @prasad vaze ,

By default "spark.sql.caseSensitive" will be false. If you would like to enforce it, then you will need to set it to true. This feature was added since Spark 2.x, you could find more details in this Jira

Like @Werner Stinckens said, you can rename the column names and create a new table with the new cols

Re: How to make delta table column values case-insensitive?

prasadvaze — Sat, 30 Oct 2021 14:57:27 GMT

@Jose Gonzalez I am solving for case-sensitive values inside the column and not the case-sensitive name of the column. spark.sql.caseSensitive = Ture makes columns names case-sensitive and not the value in the column

Re: How to make delta table column values case-insensitive?

prasadvaze — Sat, 30 Oct 2021 14:59:45 GMT

@Kaniz Fatma Thanks for reaching out Kaniz. After much research, I came to the conclusion that this feature is not available in spark or delta lake so this would be a new feature request for spark community. JSON and Parquet are case-sensitive so this may be root cause .

Re: How to make delta table column values case-insensitive?

Sandeep — Wed, 10 Nov 2021 19:26:38 GMT

To make the comparisons case insensitive we can compare after converting columns to upper or lower case.

Eg:

import org.apache.spark.sql.functions._

df.filter(lower($"colname") == lower($"colname"))

Re: How to make delta table column values case-insensitive?

jose_gonzalez — Thu, 18 Nov 2021 17:49:13 GMT

Thank you for the update. Could you mark your response as best answer? it might help for future references.

Re: How to make delta table column values case-insensitive?

lizou — Mon, 29 Nov 2021 04:49:48 GMT

Well, the unintended benefit is now I am using int\big int as surrogate keys

for all tables (preferred in DW).

All joins are made on integer data types. Query efficiency is also improved.

The string matching using upper() is done only on ETL when comparing source and target data.