Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks time travel - how to get ALL changes ever done to a table

oishimbo
New Contributor

Hi time travel gurus,

I am investigating creating a reporting solution with an AsOf functionality. Users will be able to create a report based on either the current data or the data as of some point in the past. Due to the nature of our data, this AsOf feature is quite important for reproducing reports.

I can create Databricks tables with time travel enabled, thus capturing all changes. However, I plan to import the data into a PowerBI dataset to allow users in PowerBI to set the correct AsOf date while creating reports. In order to do so, I would have to load ALL data changes from Databricks into the PowerBI dataset. For temporal tables in SQL Server this can be done by specifying "FOR SYSTEM_TIME ALL". Is there a similar query option available for Databricks?

I am aware that I would have to monitor the size of the underlying data very closely. At this stage I'm simply investigating whether Databricks tables with time travel enabled would give me access to all the changes.

Thanks for your help 🙂

4 REPLIES

jasbro
New Contributor II

Did you find a solution? We're investigating a similar situation.

Panda
Valued Contributor

@oishimbo @jasbro 
This can be tricky: to achieve this functionality, you need to align both the VACUUM retention and the time travel settings. You can use a query like the following to read data as of a point in time, but make sure the VACUUM retention policy covers the time range you want to access. Once the relevant data files are removed by VACUUM, those historical versions are no longer accessible.

 

-- Example: Query data as of yesterday
SELECT COUNT(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 1);

 

If you want to access data from the last 90 days, set the VACUUM retention to slightly more than 90 days (for example, 92 or 93 days) so the relevant data files are not removed prematurely.

 

-- Example: Query data as of 90 days ago
SELECT COUNT(*) FROM my_table TIMESTAMP AS OF date_sub(current_date(), 90);
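As a concrete illustration of aligning retention with your time travel window (the table name my_table is just a placeholder), the per-table Delta retention properties can be raised like this:

```sql
-- Keep removed data files and log history for 93 days so time travel
-- up to ~90 days back stays possible (the file retention default is 7 days).
ALTER TABLE my_table SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 93 days',
  'delta.logRetentionDuration' = 'interval 93 days'
);

-- A plain VACUUM honors deletedFileRetentionDuration; avoid passing a
-- RETAIN value shorter than the retention you rely on.
VACUUM my_table;
```

Note that longer retention means old data files are kept on storage, so the cost of the table grows with the window you keep.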

 

However, instead of querying all historical data directly from Delta Lake in real time, it is recommended to export periodic snapshots to a dedicated snapshot table. This approach gives consistent performance, avoids losing history when VACUUM deletes old data files, and reduces the load on your Delta tables.
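A minimal sketch of that snapshot approach (the table and column names here are hypothetical): a scheduled job appends a date-stamped copy of the current data, and PowerBI then filters on the snapshot date instead of using time travel directly.

```sql
-- Hypothetical snapshot table: same columns as my_table plus a stamp.
-- WHERE 1 = 0 creates the table empty, copying only the schema.
CREATE TABLE IF NOT EXISTS my_table_snapshots AS
SELECT *, current_date() AS as_of_date FROM my_table WHERE 1 = 0;

-- Run on a schedule (e.g. a daily job) to capture the current state.
INSERT INTO my_table_snapshots
SELECT *, current_date() AS as_of_date FROM my_table;
```

In PowerBI, the AsOf report then becomes a simple filter such as `WHERE as_of_date = '2024-06-01'`, independent of VACUUM settings.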

For more details, see the Databricks blog post introducing Delta time travel: https://www.databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes....

mcveyroosevelt
New Contributor

In Databricks, you can use time travel to access historical versions of a table using the versionAsOf or timestampAsOf options in a SELECT query. To retrieve all changes made to a table, you would typically query the table's historical versions, specifying the version number or timestamp. However, there isn’t a direct equivalent to SQL Server's FOR SYSTEM_TIME ALL for querying all changes in a single operation. Instead, you can iterate over the versions of the table using DESCRIBE HISTORY to find the relevant version numbers and then query those versions using SELECT * FROM <table> VERSION AS OF <version_number> to capture the entire history. You may need to script this process to load all historical data into PowerBI, managing the data size accordingly.
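To sketch what that scripted approach could look like (my_table is a placeholder, and the version numbers would come from DESCRIBE HISTORY rather than being hard-coded):

```sql
-- 1. List the available versions and their timestamps.
DESCRIBE HISTORY my_table;

-- 2. For each version found above, read that version and tag it with a
--    version column, building up a FOR SYSTEM_TIME ALL-style result:
SELECT *, 0 AS _version FROM my_table VERSION AS OF 0
UNION ALL
SELECT *, 1 AS _version FROM my_table VERSION AS OF 1;
-- ...repeat for each version, typically generated by a driver script.
```

Be aware that each version is a full snapshot of the table, not a delta, so the combined result can grow very large for tables with many versions.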

