Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Upstream column lineage missing from API call after some time

Mario_D
New Contributor III

I ran the following piece of code on 2 occasions.

table_name = "full path of table"
lineage = w.api_client.do(
"GET",
f"/api/2.0/lineage-tracking/column-lineage",
body={
"table_name": table_name,
"column_name": "column_x"
}
)

u_lineage_df = spark.createDataFrame(lineage['downstream_cols'])
d_lineage_df = spark.createDataFrame(lineage['ustream_cols'])

In week 1, I got both downstream and upstream data.
In week 2, however, I got only downstream data.

What could be the cause of this?

Note that this is a highly redacted version; I'm just wondering whether there are any restrictions on using the API.

2 REPLIES

ShamenParis
New Contributor II

Hi @Mario_D 

Great question. I've run into this exact issue before in my own projects! When lineage suddenly disappears, it's almost never an API restriction. Instead, it's usually one of three things happening under the hood in Unity Catalog:

  • Lost Permissions (Most Common): Unity Catalog hides lineage for security if your user or Service Principal lost BROWSE or SELECT access to the upstream tables between Week 1 and Week 2. Check if you can still see those upstream tables in the Catalog Explorer!

  • Table Re-creation: If someone (or a job) dropped and recreated the upstream table instead of just updating it, the historical lineage link breaks.

  • Pipeline Code Changes: Lineage is built dynamically from Spark execution plans. If someone changed the ETL notebook/job and it no longer actually pulls from that upstream column, the lineage will update to reflect that.

One quick catch on your code: I noticed a typo in your snippet! You wrote lineage['ustream_cols'] instead of lineage['upstream_cols'] (missing the 'p'). If that is exactly how it is in your live script, the lookup will raise a KeyError instead of returning the upstream data!
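Here is a minimal sketch of a more defensive way to read the response. It assumes `lineage` is the dict returned by the API call and that the keys are spelled `upstream_cols` and `downstream_cols`; the helper name `split_lineage` and the sample values are hypothetical. Using `.get()` means a response without an upstream side shows up as an empty list rather than an exception.

```python
from typing import Any


def split_lineage(response: dict[str, Any]) -> tuple[list[dict], list[dict]]:
    """Return (upstream, downstream) column entries from a lineage response."""
    # A missing or null key is treated as "no entries" instead of raising KeyError.
    upstream = response.get("upstream_cols") or []
    downstream = response.get("downstream_cols") or []
    return upstream, downstream


# Hand-written sample response with no upstream side (values are illustrative only).
sample = {"downstream_cols": [{"name": "column_y", "table_name": "t2"}]}
upstream, downstream = split_lineage(sample)
print(len(upstream), len(downstream))
```

If you then build DataFrames from these lists, it is probably worth checking for an empty list first, since as far as I know spark.createDataFrame cannot infer a schema from an empty list.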

Hope this points you in the right direction!

Ashwin_DSA
Databricks Employee

Hi @Mario_D,

From what I can gather, this can happen, and it's usually less about a restriction on calling the API itself and more about how lineage was captured or what the caller is allowed to see.

A few common reasons are:

  • The caller no longer has permission to see the upstream objects. Lineage follows the Unity Catalog permission model. Without at least BROWSE/SELECT on the upstream table, users can't explore that lineage, and internal examples show API responses where missing lineage is effectively permission-masked. So, if something has changed in week 2, this could be a reason.
  • The upstream side was read or written in a way that doesn't support full column lineage capture.
  • The query pattern changed between runs.

For example, Databricks documents that column lineage is only supported when both the source and target are referenced by table name. If either side is referenced as a path, column lineage may not be captured. The docs also call out other cases that can affect lineage capture, such as UDFs, RDDs, global temp views, checkpointing, and renames.
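To make the name-versus-path distinction concrete, here is a rough sketch of the two patterns side by side. The function names, table names, and paths are hypothetical, and `spark` is assumed to be an active SparkSession passed in by the caller. Only the first pattern references both source and target by table name.

```python
def copy_by_table_name(spark, source_table: str, target_table: str) -> None:
    """Read and write through table names on both sides."""
    df = spark.read.table(source_table)
    df.write.mode("append").saveAsTable(target_table)


def copy_by_path(spark, source_path: str, target_path: str) -> None:
    """Read and write through storage paths.

    Per the documented limitation, column lineage may not be captured
    when either side is referenced this way.
    """
    df = spark.read.format("delta").load(source_path)
    df.write.format("delta").mode("append").save(target_path)


# Illustrative calls (names and paths are placeholders, not real objects):
# copy_by_table_name(spark, "main.sales.orders", "main.sales.orders_copy")
# copy_by_path(spark, "/mnt/raw/orders", "/mnt/curated/orders")
```

If a job switched from the first style to the second between week 1 and week 2, that alone might explain the upstream side going missing.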

The public reference for this is here: Data lineage in Unity Catalog. That page is especially useful for the documented permissions requirements and lineage limitations.

So in your week 1 vs week 2 example, I'd first check:

  • Whether the same permissions still existed on the upstream tables
  • Whether the upstream read/write logic changed in any way
  • Whether the source and target were still both referenced by table name

If those all stayed the same and only upstream disappeared, then it may be worth validating the behaviour against the documented lineage limitations. If these are not due to limitations, it may be worth raising a support ticket for the team to investigate.
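If you saved the raw responses from both runs, a small comparison like the one below can show exactly which upstream columns disappeared, which is useful detail for a support ticket. This is only a sketch: the helper names are hypothetical, and the entry fields (`catalog_name`, `schema_name`, `table_name`, `name`) are an assumption about the response shape, so adjust them to whatever your responses actually contain.

```python
def column_keys(cols):
    """Turn lineage entries into comparable (catalog, schema, table, column) tuples."""
    return {
        (c.get("catalog_name"), c.get("schema_name"), c.get("table_name"), c.get("name"))
        for c in cols or []
    }


def missing_upstream(earlier: dict, later: dict) -> set:
    """Upstream columns present in the earlier response but absent from the later one."""
    return column_keys(earlier.get("upstream_cols")) - column_keys(later.get("upstream_cols"))


# Illustrative, hand-written responses (not real output):
week1 = {"upstream_cols": [
    {"catalog_name": "main", "schema_name": "raw", "table_name": "orders", "name": "amount"},
]}
week2 = {}  # no upstream side returned
print(missing_upstream(week1, week2))
```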

Another key point to note is that the native UC lineage REST endpoints do not appear to be publicly documented, even though public docs still reference "the API" in a generic sense. The publicly documented and recommended programmatic interface for native lineage is the lineage system tables. So, you should consider system.access.table_lineage and system.access.column_lineage as the documented programmatic path.

Also note that system tables update throughout the day, so if recent lineage is missing, you may want to try again later.
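As a rough sketch of the system-table route, the helper below builds a parameterized query for the upstream columns of a given target column. The helper name is hypothetical, and the column names (`source_table_full_name`, `source_column_name`, `target_table_full_name`, `target_column_name`, `event_time`) reflect my understanding of the system.access.column_lineage schema, so please verify them in your workspace. It also assumes a runtime where spark.sql accepts named parameter markers via `args`.

```python
def upstream_column_lineage_query(table_full_name: str, column_name: str):
    """Build SQL and arguments listing upstream columns for one target column."""
    sql = """
        SELECT source_table_full_name,
               source_column_name,
               MAX(event_time) AS last_seen
        FROM system.access.column_lineage
        WHERE target_table_full_name = :table_name
          AND target_column_name = :column_name
          AND source_table_full_name IS NOT NULL
        GROUP BY source_table_full_name, source_column_name
        ORDER BY last_seen DESC
    """
    # Values are passed separately so they are bound as parameters, not spliced into the SQL.
    args = {"table_name": table_full_name, "column_name": column_name}
    return sql, args


# Illustrative usage in a notebook where `spark` already exists:
# sql, args = upstream_column_lineage_query("catalog.schema.table", "column_x")
# spark.sql(sql, args=args).show(truncate=False)
```

The `last_seen` value makes it easier to tell whether the upstream link simply stopped being recorded after a certain date.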

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

 

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***