cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Set version (VERSION AS OF) dynamically from return of a subquery

nameziane
New Contributor III

Hello,

We have a business request to compare the evolution in a certain delta table.

We would like to compare the latest version of the table with the previous one using Delta time travel.

The main issue we are facing is to retrieve programmatically using SQL the N-1 version and give it as a parameter to our SQL query.

The idea is to get something like this:

%sql
select * from schedule@vN
except all
select * from schedule@vN-1
 

We know that the latest version can be used by simply using the delta table name (as it uses by default the last version), but how can we retrieve the previous delta version.

We tried to use the 'hivevar' variable (like below ) but we are getting an error since the hivevar is not evaluated.

SET hive.variable.substitute=true;
SET hivevar:ver=SELECT VERSION FROM (DESCRIBE HISTORY schema.my_table) LIMIT 1;
SELECT * FROM schema.my_table VERSION AS OF ${hivevar:ver};

The error code below:

- Error in SQL stateรนent: parseException:
missing INTEGER_VALUE at 'SELECT'(line 1, pos 84)

Any idea how to achieve this ?

4 REPLIES 4

UmaMahesh1
Honored Contributor III

Hi @Nabil AMEZIANEโ€‹ ,

What you were doing works actually...

You just need to fetch the latest version into a variable like below

x= spark.sql("""select max(version) from (desc HISTORY <table-name>)""").collect()[0][0]

Be sure to check the operation type too in the history table if there are DELETE operations.

Now you have a integer x with latest version number available. For previous version all you need to do is to do a x-1. Then you have both version numbers.

Now pass these variables into your sql query reading those versions. I would do it as below

spark.sql("""
select * from table @v{0}
minus
select * from table @v{1}
""".format(x, x-1))

Hope this helps...Cheers

nameziane
New Contributor III

@Uma Maheswara Rao Desulaโ€‹ thank you for your Quick answer but i have the constraints to work only with sql format (i canโ€™t use scala/Spark or pyspark code)

apingle
Contributor

In the docs it says that "'Neither timestamp_expression nor version can be subqueries."

So it does sound challenging. I also tried playing with widgets to see if it could be populated using SQL but didn't succeed. With python it's really easy to do.

nameziane
New Contributor III

Yes, exactly with simple code it is very easy to do but not possible at that time with only SQL (i tried using hive/spark variable the result is same thing because the variable not executed) i think we need something like "hive cmd" on databricks to do the first part of collecting the result of the first query in variable then apply it to second query but we have to use shell and it is not the goal )

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!