-werners-
Esteemed Contributor III

I don't recall Spark/Delta Lake having a collation setting.

Also, preventing data corruption/loss is definitely a main focus of Databricks, so I don't think there is an easy way to fix this.

What I would do is the following:

Overwrite the tables that contain mixed-case values, converting them to uppercase (or lowercase, your choice).

That fixes your current data.

For the data you want to upsert, you can create a wrapper function around spark.read.parquet (or csv or json or whatever you are ingesting) that converts string columns to uppercase.

We have to do this for a similar issue (trim all string columns).

Or you can just always call the upper/lower function.

Perhaps you can even convert everything to upper/lower while copying it to your storage.

But both cases require work.

I don't see a quick solution.

You can of course leave the data as it is, and downstream always take into account that the data is mixed case. So when reading this mixed-case data, always apply upper() to both sides of the comparison in filters etc.