cancel
Showing results for 
Search instead for 
Did you mean: 
Community Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.
cancel
Showing results for 
Search instead for 
Did you mean: 

Data lineage on views

Henrik
New Contributor III

I do not know if this is intended behavior of data lineage but for me it is weird.
When I create a view based on two tables the data lineage upstream looks correct. But when I replace the view to only use one of the tables, then data lineage upstream still telling me that the view is based on the two tables.
When I read the documentation, it looks like data lineage is based on history for the view for the last 30 days, but a view does not work with history the same way as a table, so to me this is weird.
Can anyone give some more details on this strange behavior?
This is my script for testing this:

/* create 2 tables */
create table if not exists test.test.table_1
(id int not null
,name string);

insert into test.test.table_1
values (1, "test 1");

create table if not exists test.test.table_2
(id int not null
,name string);

insert into test.test.table_2
values (1, "test 2");

/* create a view depending on the to tables */

create or replace view test.test.view_1 as
select
a.id
, b.name
from
test.test.table_1 as a
inner join test.test.table_2 as b
on a.id=b.id;

/*
Data lineage for view is correct upstream for the view are table_1 and table_2
-----------------------------------------------------------------------------------------------------------------------------------
*/

/* now change the view to depend only on table_1 */
alter view test.test.view_1 as
select
a.id
, a.name
from
test.test.table_1 as a;

/*
Data lineage has not changed! - we still have both tables in upstream for the view
*/

3 REPLIES 3

Kaniz_Fatma
Community Manager
Community Manager

Hi @Henrik , 

According to the documentation, lineage is computed on a 30-day rolling window, so lineage collected more than 30 days ago is not displayed. If a job or query reads data from Table A and writes to Table B, the link between Table A and Table B is displayed for only 30 days. Additionally, if a table is renamed, lineage is not captured for the renamed table. Therefore, it is possible that the data lineage upstream still shows the two tables even though the view only uses one because the lineage information for the other table is still within the 30-day window. It is also possible that the view was not adequately replaced and is still using both tables.
 

 

Vinay_M_R
Valued Contributor II
Valued Contributor II

Hi @Henrik  Good day!

I tried to repro this issue internally and I was able to repro this, this seems like limitation that:

  • Unity Catalog captures lineage to the column level as much as possible. However, there are some cases where column-level lineage cannot be captured.

When I checked view defination under details after altering the view to depend only on table_1. I found that view defination was updated with latest view but it's not getting reflected in lineage graph.

https://docs.databricks.com/data-governance/unity-catalog/data-lineage.html#limitations

Regards,

Vinay M R

Henrik
New Contributor III

After some thoughts, i have come to this conclusion:

Data lineage on views is working as one should expect. I strongly recommend that this feature is redesigned so it shows the result of the lastest view.

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!