
Unity Catalog Lineage Not Working on GCP

4kb_nick
New Contributor III

Hello,

We have set up a lakehouse in Databricks for one of our clients. One of the features our client would like to use is the Unity Catalog data lineage view. This is a handy feature that we have used with other clients (in both AWS and Azure) without issue.

We noticed that lineage data is not being populated at all for the UC tables in our GCP workspaces. Even when running through the UC sample notebook, we do not see any lineage data being populated. Looking at the logs, we saw errors like the one below, which made us think the issue might be with the log4j config:

2024-05-07 13:08:02,895 Thread-168 WARN RollingFileAppender 'com.databricks.LineageLogging.appender': The bufferSize is set to 128000 but bufferedIO is not true

After modifying the log4j properties specified in the error message, we no longer see the log messages. However, the lineage service still does not appear to be working. Our GCP workspaces are allowed outbound access to the internet via our NAT gateways, and are not passing through any in-line firewalls. 

Has anyone run into this issue in GCP, and does anyone know how to resolve it if so?

---

As an aside, updating the log4j properties was not as straightforward as mentioned here:
https://kb.databricks.com/clusters/overwrite-log4j-logs

The file specified in the above KB article does not exist on the clusters we tested in GCP (single-node, 13.3.x-scala2.12). The log4j file we had to modify is located at: /databricks/spark/dbconf/log4j/driver/log4j2.xml
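For anyone who wants to check their own cluster for the same warning condition, here is a minimal Python sketch. It scans a Log4j2 XML config for appenders that set bufferSize while bufferedIO is explicitly not "true", which is the combination the warning above complains about. The SAMPLE fragment is hypothetical and only illustrates the shape of such an entry; it is not the actual contents of the file on a Databricks cluster, and the helper name find_unbuffered_appenders is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical config fragment for illustration only; the real
# log4j2.xml on a cluster will contain more appenders and attributes.
SAMPLE = """<Configuration>
  <Appenders>
    <RollingFile name="com.databricks.LineageLogging.appender"
                 bufferSize="128000" bufferedIO="false"/>
    <RollingFile name="other.appender"
                 bufferSize="8192" bufferedIO="true"/>
  </Appenders>
</Configuration>"""


def find_unbuffered_appenders(xml_text):
    """Return names of elements that set bufferSize while bufferedIO is not 'true'."""
    root = ET.fromstring(xml_text)
    flagged = []
    for elem in root.iter():
        if elem.get("bufferSize") is None:
            continue
        # Log4j2 treats bufferedIO as true when the attribute is absent,
        # so only an explicit non-true value is flagged.
        if elem.get("bufferedIO", "true").strip().lower() != "true":
            flagged.append(elem.get("name", elem.tag))
    return flagged


print(find_unbuffered_appenders(SAMPLE))
```

To run it against a real cluster, read the file at the path mentioned above (for example with open(path).read()) and pass the text to the function. Note that clearing this warning only addresses the logging configuration; as described earlier in this post, it did not make lineage start working for us.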

 

2 REPLIES

Yeshwanth
Honored Contributor

@4kb_nick 

Could you please check the requirements for the lineage feature and its limitations here: https://docs.gcp.databricks.com/en/data-governance/unity-catalog/data-lineage.html?_ga=2.115379718.1...

Kind regards,

Yesh

4kb_nick
New Contributor III

Sure - I've checked the requirements:

  • The workspace must have Unity Catalog enabled. It's enabled.

  • Tables must be registered in a Unity Catalog metastore. They are. I'm just using the sample Unity Catalog lineage notebook located here: https://notebooks.databricks.com/demos/uc-03-data-lineage/index.html

  • Queries must use the Spark DataFrame (for example, Spark SQL functions that return a DataFrame) or Databricks SQL interfaces. For examples of Databricks SQL and PySpark queries, see Examples. They are - I'm using the sample notebook, which is interfacing with UC via Databricks SQL.

  • To view the lineage of a table or view, users must have at least the BROWSE privilege on the table’s or view’s parent catalog. I am the owner of the catalog and have ALL PRIVILEGES on it as well.

  • To view lineage information for notebooks, workflows, or dashboards, users must have permissions on these objects as defined by the access control settings in the workspace. See Lineage permissions. I have permissions to all of the objects in the loop - the notebook, as well as the catalog.

  • To view lineage for a Unity Catalog-enabled pipeline, you must have CAN_VIEW permissions on the pipeline. I'm not using a pipeline in my testing.
