11-24-2022 11:30 PM
I would need some suggestions from the Databricks folks. As per the documentation on schema evolution, for drop and rename the data is overwritten. Does that mean we lose the data (I read that the data is not deleted but kind of staged)? Is it possible to query the old data using history and restore?
11-25-2022 01:34 AM
If you want to rename columns in your data, you can use the withColumn() and withColumnRenamed() functions.
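For example, a minimal sketch in PySpark (the table and column names here are hypothetical):
from pyspark.sql.functions import col

df = spark.read.table("my_table")
# Rename an existing column; the values are unchanged
df = df.withColumnRenamed("old_name", "new_name")
# Or derive a replacement column with withColumn, e.g. a type change
df = df.withColumn("amount", col("amount").cast("double"))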
Is it possible to query old data using history and restore --> Yes, we can query old data using Delta's time travel capabilities. If you write into a Delta table or directory, every operation is automatically versioned. You can access the different versions of the data in two different ways:
1) Using a timestamp
Scala syntax:
You can provide the timestamp or date string as an option to the DataFrame reader:
val df = spark.read
.format("delta")
.option("timestampAsOf", "2019-01-01")
.load("/path/to/my/table")
In Python:
df = spark.read \
.format("delta") \
.option("timestampAsOf", "2019-01-01") \
.load("/path/to/my/table")
SQL syntax:
SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01"
SELECT count(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1)
SELECT count(*) FROM my_table TIMESTAMP AS OF "2019-01-01 01:30:00.000"
2) Using a version number
In Delta, every write has a version number, and you can use the version number to travel back in time as well.
Scala syntax:
val df = spark.read
.format("delta")
.option("versionAsOf", "5238")
.load("/path/to/my/table")
val df = spark.read
.format("delta")
.load("/path/to/my/table@v5238")
Python syntax:
df = spark.read \
.format("delta") \
.option("versionAsOf", "5238") \
.load("/path/to/my/table")
df = spark.read \
.format("delta") \
.load("/path/to/my/table@v5238")
SQL syntax:
SELECT count(*) FROM my_table VERSION AS OF 5238
SELECT count(*) FROM my_table@v5238
SELECT count(*) FROM delta.`/path/to/my/table@v5238`
11-26-2022 02:16 AM
Thanks for the response. I know we can do this, but my question was: when we overwrite data with a new schema, will the old data still be available?
11-26-2022 03:28 AM
@Yogita Chavan
Thanks for the response. I am aware that I can fetch history using a timestamp or version, but I am asking about the case where I overwrite data after dropping columns or changing types, as in the code below:
from pyspark.sql.functions import col

(spark.read.table(...)
  .withColumn("birthDate", col("birthDate").cast("date"))  # type change
  .write
  .mode("overwrite")
  .option("overwriteSchema", "true")  # allow the table schema to be replaced
  .saveAsTable(...)
)
But my question is: could it be automatically detected and handled in the three cases below?
- Null values
- Type changes
- Dropped columns
11-25-2022 04:40 AM
Can you please elaborate on what you are trying to do, especially the drop and rename part?
As for querying old data using history and restore, you can make use of Delta time travel if you are storing that data in Delta format. The answer above already has the querying commands.
If you want the timestamp/version you need to restore to, you can simply run DESCRIBE HISTORY <deltatable> for all the details.
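For example, a minimal sketch in PySpark (the table name my_table is hypothetical):
# List the version, timestamp, and operation of each commit on the table
spark.sql("DESCRIBE HISTORY my_table").select("version", "timestamp", "operation").show()

# Once you have picked a version, you can restore the table to it
# (version 5 here is just an example)
spark.sql("RESTORE TABLE my_table TO VERSION AS OF 5")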
Cheers..
11-26-2022 02:15 AM
Basically, I am creating a function that takes care of schema changes (drop, type change, and null values). As per the Delta table documentation, these changes only work when we select the overwrite option while writing. My concern is that if I overwrite, my previous data will be lost. Is there a way in Delta tables to back up the old data before overwriting? How do I take care of the old data when overwriting with a new schema?
11-29-2022 11:31 AM
The overwrite option will overwrite your data. If you want to change a column name, you can first ALTER the Delta table as needed and then append the new data. That way you can resolve both problems.
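For example, a minimal sketch of that approach in PySpark (the table and column names are hypothetical; note that renaming a column in place requires Delta column mapping, which must be enabled first):
# Enable column mapping so columns can be renamed without rewriting data files
spark.sql("""
ALTER TABLE my_table SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5'
)
""")

# Rename the column in place; the existing data is preserved
spark.sql("ALTER TABLE my_table RENAME COLUMN old_name TO new_name")

# new_df is a placeholder for a DataFrame holding the new rows
new_df.write.mode("append").saveAsTable("my_table")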