topic Re: What is the best way to take care of Drop and Rename a column in Schema evaluation. in Data Engineering

What is the best way to take care of Drop and Rename a column in Schema evaluation.

mickniz — Fri, 25 Nov 2022 07:30:37 GMT

I would need some suggestion from DataBricks Folks. As per documentation in Schema Evaluation for Drop and Rename Data is overwritten. Does it means we loose data (because I read data is not deleted but kind of staged). Is it possible to query old data using history and restore.

Re: What is the best way to take care of Drop and Rename a column in Schema evaluation.

yogu — Fri, 25 Nov 2022 09:34:39 GMT

if you want to rename columns from your data then you can use function withColumn() and withColumnRenamed().

Is it possible to query old data using history and restore--> yes we can query old data using Delta’s time travel capabilities. If you write into a Delta table or directory, every operation is automatically versioned. You can access the different versions of the data two different ways:

1) Using a timestamp

Scala syntax:

You can provide the timestamp or date string as an option to DataFrame reader:

val df = spark.read

.format("delta")

.option("timestampAsOf", "2019-01-01")

.load("/path/to/my/table")

In Python:

df = spark.read \

.format("delta") \

.option("timestampAsOf", "2019-01-01") \

.load("/path/to/my/table")

SQL syntax :

SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01"

SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)

SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01 01:30:00.000"

2) Using a version number

In Delta, every write has a version number, and you can use the version number to travel back in time as well.

Scala syntax:

val df = spark.read

.format("delta")

.option("versionAsOf", "5238")

.load("/path/to/my/table")

val df = spark.read

.format("delta")

.load("/path/to/my/table@v5238")

Python syntax:

df = spark.read \

.format("delta") \

.option("versionAsOf", "5238") \

.load("/path/to/my/table")

df = spark.read \

.format("delta") \

.load("/path/to/my/table@v5238")

SQL syntax:

SELECT count(*) FROM my_table VERSION AS OF 5238

SELECT count(*) FROM my_table@v5238

SELECT count(*) FROM delta.`/path/to/my/table@v5238`

Re: What is the best way to take care of Drop and Rename a column in Schema evaluation.

UmaMahesh1 — Fri, 25 Nov 2022 12:40:38 GMT

Can you please elaborate on what you are trying to do especially with that drop and rename part ?

As for the querying old data using history and restore, you can make of delta time travel if you are storing that data in a delta format. Above answer already has the querying commands.

If you want the timestamp/version you need to restore to, you can simply run a describe history <deltatable> for all the details.

Cheers..

Re: What is the best way to take care of Drop and Rename a column in Schema evaluation.

mickniz — Sat, 26 Nov 2022 10:15:11 GMT

basically i am creating function that takes care of schema changes(drop,type change and null values). As per Delta table dccumentation these changes only work when while writing we select overwrite option. My cocern is if my overwrite my previous data will be lost. Is there a way in DeltaTables to backup old data before overwriting. How to take care of old data when overwriting with new schema

Re: What is the best way to take care of Drop and Rename a column in Schema evaluation.

mickniz — Sat, 26 Nov 2022 10:16:57 GMT

thanks for response. I know this we can do but my question was when we overwrite data with new schema ,Will this data be available

Re: What is the best way to take care of Drop and Rename a column in Schema evaluation.

mickniz — Sat, 26 Nov 2022 11:28:15 GMT

@Yogita Chavan

thanks for response. I am aware I can fetch history using timestamp and version but I am asking incase i am overwrtiting data after droping or typechanges like in below code:

(spark.read.table(...)

.withColumn("birthDate", col("birthDate").cast("date"))

.write

.mode("overwrite")

.option("overwriteSchema", "true")

.saveAsTable(...)

)

But my question could it be automatically detected and done in below three case

-Null values

-Type changes

-Drop column

Re: What is the best way to take care of Drop and Rename a column in Schema evaluation.

SS2 — Tue, 29 Nov 2022 19:31:31 GMT

Overwritte option will overwritte your data. If you want to change column name then you can first alter the delta table as per your need then you can append new data as well. So both problems you can resolve