Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What is the best way to handle dropping and renaming a column in schema evolution?

mickniz
Contributor

I would need some suggestions from Databricks folks. As per the documentation on schema evolution, data is overwritten for drop and rename operations. Does that mean we lose the data (I read that the data is not deleted but rather kept in a staged form)? Is it possible to query the old data using history and restore it?

6 REPLIES

yogu
Honored Contributor III

If you want to rename columns in your data, you can use the withColumn() and withColumnRenamed() functions.
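For example, a minimal sketch in Python (the table and column names here are just placeholders):

from pyspark.sql.functions import col

df = spark.read.table("my_table")

# Rename a column; the underlying data is untouched
df = df.withColumnRenamed("old_name", "new_name")

# Replace a column with a casted version (a type change)
df = df.withColumn("birthDate", col("birthDate").cast("date"))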

Is it possible to query old data using history and restore? --> Yes, we can query old data using Delta's time travel capabilities. If you write into a Delta table or directory, every operation is automatically versioned. You can access the different versions of the data in two different ways:

1) Using a timestamp

Scala syntax:

You can provide the timestamp or date string as an option to the DataFrame reader:

val df = spark.read
  .format("delta")
  .option("timestampAsOf", "2019-01-01")
  .load("/path/to/my/table")

In Python:

df = spark.read \
  .format("delta") \
  .option("timestampAsOf", "2019-01-01") \
  .load("/path/to/my/table")

SQL syntax:

SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01"

SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)

SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01 01:30:00.000"

2) Using a version number

In Delta, every write has a version number, and you can use the version number to travel back in time as well.

Scala syntax:

val df = spark.read
  .format("delta")
  .option("versionAsOf", "5238")
  .load("/path/to/my/table")

val df = spark.read
  .format("delta")
  .load("/path/to/my/table@v5238")

Python syntax:

df = spark.read \
  .format("delta") \
  .option("versionAsOf", "5238") \
  .load("/path/to/my/table")

df = spark.read \
  .format("delta") \
  .load("/path/to/my/table@v5238")

SQL syntax:

SELECT count(*) FROM my_table VERSION AS OF 5238

SELECT count(*) FROM my_table@v5238

SELECT count(*) FROM delta.`/path/to/my/table@v5238`

Thanks for the response. I know we can do this, but my question was: when we overwrite data with a new schema, will the old data still be available?

@Yogita Chavan

Thanks for the response. I am aware that I can fetch history using a timestamp or version, but I am asking about the case where I am overwriting data after dropping a column or changing a type, like in the code below:

(spark.read.table(...)
  .withColumn("birthDate", col("birthDate").cast("date"))
  .write
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .saveAsTable(...)
)

But my question is: could it be automatically detected and handled in the three cases below?

- Null values
- Type changes
- Dropped columns

UmaMahesh1
Honored Contributor III

Can you please elaborate on what you are trying to do, especially with the drop and rename part?

As for querying old data using history and restore, you can make use of Delta time travel if you are storing the data in Delta format. The answer above already has the querying commands.

If you want the timestamp/version you need to restore to, you can simply run DESCRIBE HISTORY <delta table> for all the details.
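For example, a minimal sketch in Python, assuming the Delta table is registered in the metastore as my_table (a placeholder name):

# List the table's commit history: version, timestamp, operation, etc.
spark.sql("DESCRIBE HISTORY my_table").show(truncate=False)

# Restore the table to an earlier version once you know which one you need
spark.sql("RESTORE TABLE my_table TO VERSION AS OF 5")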

Cheers..

Uma Mahesh D

Basically, I am creating a function that takes care of schema changes (drop, type change, and null values). As per the Delta table documentation, these changes only work when we select the overwrite option while writing. My concern is that if I overwrite, my previous data will be lost. Is there a way in Delta tables to back up old data before overwriting? How do I take care of old data when overwriting with a new schema?

SS2
Valued Contributor

The overwrite option will overwrite your data. If you want to change a column name, you can first alter the Delta table as per your need and then append the new data. That way you can resolve both problems.
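A minimal sketch of that approach in Python, assuming a Delta table registered as my_table and a DataFrame new_df with the updated schema (both placeholder names). Note that renaming or dropping columns in place requires column mapping to be enabled on the table:

# Enable column mapping so RENAME COLUMN / DROP COLUMN work without rewriting data
spark.sql("""
    ALTER TABLE my_table SET TBLPROPERTIES (
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5',
        'delta.columnMapping.mode' = 'name'
    )
""")

# Rename or drop columns in place; existing data and table history are preserved
spark.sql("ALTER TABLE my_table RENAME COLUMN old_name TO new_name")
spark.sql("ALTER TABLE my_table DROP COLUMN obsolete_col")

# Then append new data that matches the updated schema
new_df.write.format("delta").mode("append").saveAsTable("my_table")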
