How to make delta table column values case-insensitive?

prasadvaze
Valued Contributor

We have many Delta tables with string columns as unique keys (the PK in a traditional relational DB), and we don't want a new row inserted just because the key value differs only in case.

It's a lot of code change to use the upper/lower functions in every column-value comparison (in the upsert logic), so I'm looking for an alternative.

I see that a CHECK constraint on a Delta table column can enforce consistent-case values, but it's too late; I already have mixed-case data in the tables.

Is there anything similar to SQL Server's collation feature?

spark.conf.set('spark.sql.caseSensitive', False) does not work as expected (meaning string comparison between mixed-case values still shows two different strings).

I also looked at spark.conf.set('spark.databricks.analyzer.batchResolveRelations', False), in vain.

I have tried the 7.3 LTS and 9.1 LTS Databricks runtimes on Azure.


8 REPLIES

Kaniz
Community Manager

Hi @prasadvaze! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise I will get back to you soon. Thanks.

prasadvaze
Valued Contributor
ACCEPTED SOLUTION

@Kaniz Fatma​ Thanks for reaching out, Kaniz. After much research, I came to the conclusion that this feature is not available in Spark or Delta Lake, so this would be a new feature request for the Spark community. JSON and Parquet are case-sensitive, so that may be the root cause.

Thank you for the update. Could you mark your response as the best answer? It might help future readers.

-werners-
Esteemed Contributor III

I don't recall there being collation support in Spark/Delta Lake.

Also, preventing data corruption/loss is definitely a main focus for Databricks, so I don't think there is an easy way to fix this.

What I would do is the following:

Overwrite the tables that have mixed-case values to uppercase (or lowercase, your choice).

That fixes your current data.

For the data you want to upsert, you can create a wrapper function around spark.read.parquet (or csv or json or whatever you are ingesting) which converts string columns to uppercase (see the sketch after this reply).

We had to do this for a similar issue (trimming all string columns).

Or you can just always call the upper/lower function.

Perhaps you can even convert everything to upper/lower while copying it to your storage.

But both approaches require work.

I don't see a quick solution.

You can of course leave the data as it is, and downstream always take into account that the data is mixed case. So when reading this mixed data, always apply upper() in filters etc.
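
A minimal sketch of such a wrapper, assuming a Databricks notebook where spark is already in scope (the name readParquetUpper and the single-path signature are illustrative, not -werners-' actual code):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.StringType

// Read a parquet path and uppercase every string column, so values land
// in the Delta table with a consistent case.
def readParquetUpper(path: String): DataFrame = {
  val raw = spark.read.parquet(path)
  raw.schema.fields.foldLeft(raw) { (df, field) =>
    if (field.dataType == StringType) df.withColumn(field.name, upper(col(field.name)))
    else df
  }
}

The same fold works for csv or json sources by swapping the reader call.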

jose_gonzalez
Moderator

Hi @prasad vaze​,

By default, "spark.sql.caseSensitive" is false. If you would like to enforce case sensitivity, you need to set it to true. This feature has been available since Spark 2.x; you can find more details in this Jira.

Like @Werner Stinckens​ said, you can rename the column names and create a new table with the new columns.

@Jose Gonzalez​ I am solving for case-sensitive values inside the column, not the case-sensitive name of the column. spark.sql.caseSensitive = true makes column names case-sensitive, not the values in the column (see the sketch below).
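
A quick sketch of that distinction, assuming a notebook where spark is in scope:

import spark.implicits._

spark.conf.set("spark.sql.caseSensitive", "true")

val df = Seq("ABC", "abc").toDF("id")

// Column-NAME resolution is now strict: df.select($"ID") would fail
// analysis because only "id" exists.

// Column VALUES are unaffected by the setting either way:
df.filter($"id" === "abc").count()  // 1 -- "ABC" still does not match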

Sandeep
Contributor III

To make comparisons case-insensitive, we can convert both sides to upper or lower case before comparing.

E.g.:

import org.apache.spark.sql.functions._

df.filter(lower($"colname") === "somevalue")  // matches "SomeValue", "SOMEVALUE", etc.
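
The same idea extends to a case-insensitive equi-join (df1, df2, and the key column name "key" are placeholders, not from the thread):

import org.apache.spark.sql.functions.lower

// Join two DataFrames on a shared business key, ignoring case.
val joined = df1.join(df2, lower(df1("key")) === lower(df2("key")))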

lizou
Contributor II

Well, the unintended benefit is that I am now using int/bigint surrogate keys for all tables (the preferred approach in a DW).

All joins are made on integer data types, and query efficiency has also improved.

The string matching with upper() is done only in the ETL, when comparing source and target data (a merge sketch follows below).
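
A minimal sketch of that ETL-side matching with the Delta Lake MERGE API (the table name target_table, key column business_key, and sourceDf are hypothetical placeholders):

import io.delta.tables.DeltaTable

val target = DeltaTable.forName(spark, "target_table")

// Upsert that matches keys case-insensitively, so "ABC" and "abc"
// update the same row instead of inserting a duplicate.
target.as("t")
  .merge(sourceDf.as("s"), "upper(t.business_key) = upper(s.business_key)")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()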
