10-20-2021 12:56 PM
We have many Delta tables with string columns as the unique key (the PK in a traditional relational DB), and we don't want to insert a new row just because the key value differs only in case.
It's a lot of code change to apply the upper/lower function to every column-value comparison (in the upsert logic), so I'm looking for an alternative.
I see that a CHECK constraint on a Delta table column can enforce consistent casing, but it's too late for that; I already have mixed-case data in the tables.
Is there anything similar to SQL Server's collation feature?
spark.conf.set('spark.sql.caseSensitive', False) does not work as expected (meaning string comparison between mixed-case values still shows two different strings).
I also looked into spark.conf.set('spark.databricks.analyzer.batchResolveRelations', False), in vain.
I have tried Databricks 7.3 LTS and 9.1 LTS on Azure.
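For reference, a minimal PySpark sketch of what I am seeing (the column name and values are just examples):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.caseSensitive', False)

df = spark.createDataFrame([('ABC',), ('abc',)], ['key_col'])
df.filter(df.key_col == 'abc').count()     # 1 -- only the lowercase row matches
df.select('key_col').distinct().count()    # 2 -- the values are still treated as different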
10-30-2021 07:59 AM
@Kaniz Fatma Thanks for reaching out, Kaniz. After much research, I came to the conclusion that this feature is not available in Spark or Delta Lake, so it would be a new feature request for the Spark community. JSON and Parquet are case-sensitive, so that may be the root cause.
11-18-2021 09:49 AM
Thank you for the update. Could you mark your response as the best answer? It might help for future reference.
10-21-2021 04:27 AM
I don't recall there being a collation in Spark/Delta Lake.
Also, avoiding data corruption/loss is definitely a main focus of Databricks, so I don't think there is an easy way to fix this.
What I would do is the following:
Overwrite the tables that have mixed-case key values, converting them to uppercase (or lowercase, your choice).
That fixes your current data.
For the data which you want to upsert, you can create a wrapper function around spark.read.parquet (or csv or json or whatever you are ingesting) which translates string columns to uppercase (see the sketch at the end of this reply).
We had to do this for a similar issue (trimming all string columns).
Or you can just always call the upper/lower function.
Perhaps you can even translate everything to upper/lower while copying it to your storage.
But both cases require work.
I don't see a quick solution.
You can of course leave the data as it is and always take into account downstream that the data is mixed case. So when reading this mixed data, always apply upper() in filters etc.
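To make the wrapper idea concrete, here is a rough PySpark sketch (Parquet input assumed; the function name and path are made up):
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import upper, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def read_parquet_upper(path: str) -> DataFrame:
    # read as usual, then upper-case every string column before returning
    df = spark.read.parquet(path)
    for field in df.schema.fields:
        if isinstance(field.dataType, StringType):
            df = df.withColumn(field.name, upper(col(field.name)))
    return df

# usage: source_df = read_parquet_upper('/mnt/raw/source_table')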
10-21-2021 09:33 AM
Hi @prasad vaze,
By default "spark.sql.caseSensitive" is false. If you would like to enforce case sensitivity, you will need to set it to true. This feature has been available since Spark 2.x; you can find more details in this Jira.
Like @Werner Stinckens said, you can rename the columns and create a new table with the new column names.
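Note that the flag is about how column/table names are resolved, for example (hypothetical column name):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.caseSensitive', True)
df = spark.createDataFrame([('a',)], ['KeyCol'])
df.select('KeyCol').show()    # exact-case reference resolves
# df.select('keycol')         # with caseSensitive = true this raises AnalysisException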
10-30-2021 07:57 AM
@Jose Gonzalez I am solving for case-sensitive values inside the column, not the case-sensitive name of the column. spark.sql.caseSensitive = True makes column names case-sensitive, not the values in the column.
11-10-2021 11:26 AM
To make the comparisons case-insensitive, we can compare after converting the columns to upper or lower case.
Eg:
import org.apache.spark.sql.functions._
// placeholder column names; note Spark's === for column equality
df.filter(lower($"colname1") === lower($"colname2"))
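For the upsert itself, the same trick can go into the MERGE condition; a sketch with placeholder table and column names:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# match keys case-insensitively by lower-casing both sides of the condition
spark.sql("""
    MERGE INTO target_table t
    USING source_updates s
    ON lower(t.key_col) = lower(s.key_col)
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
Be aware that if the data already contains keys differing only in case, a case-insensitive condition can let several source rows match the same target row, which MERGE rejects, so the existing mixed-case data may need to be deduplicated first (as suggested earlier in the thread).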
11-28-2021 08:49 PM
Well, the unintended benefit is that I am now using int/bigint surrogate keys for all tables (preferred in a DW).
All joins are made on integer data types, and query efficiency has also improved.
The string matching using upper() is done only in the ETL, when comparing source and target data.
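For anyone interested, a rough sketch of how that ETL lookup can look (all names are illustrative): the incoming business key is matched case-insensitively against the dimension once, and everything downstream joins on the integer surrogate key.
from pyspark.sql import SparkSession
from pyspark.sql.functions import upper, col

spark = SparkSession.builder.getOrCreate()

dim = spark.table('dim_customer').select('customer_sk', 'customer_key')
incoming = spark.read.parquet('/mnt/raw/customers')

resolved = (incoming.alias('s')
            .join(dim.alias('d'),
                  upper(col('s.customer_key')) == upper(col('d.customer_key')),
                  'left')
            .select('d.customer_sk', 's.*'))
# downstream joins and merges use customer_sk (int) instead of the string key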