Unity Catalog Lineage Not Working on GCP
05-19-2024 01:33 PM - edited 05-19-2024 01:35 PM
Hello,
We have set up a lakehouse in Databricks for one of our clients. One of the features our client would like to use is the Unity Catalog data lineage view. This is a handy feature that we have used with other clients (in both AWS and Azure) without issue.
We noticed that lineage data is not being populated at all for the UC tables in our GCP workspaces. Even when simply running through the UC sample notebook, we do not see any lineage data being populated. Looking at the logs, we saw warnings like the one below, which made us think the issue might be with the log4j config:
2024-05-07 13:08:02,895 Thread-168 WARN RollingFileAppender 'com.databricks.LineageLogging.appender': The bufferSize is set to 128000 but bufferedIO is not true
After modifying the log4j properties named in that warning, we no longer see these log messages. However, the lineage service still does not appear to be working. Our GCP workspaces are allowed outbound access to the internet via our NAT gateways, and traffic does not pass through any in-line firewalls.
Has anyone run into this issue in GCP, and does anyone know how to resolve it if so?
---
As an aside, updating the log4j properties was not as straightforward as mentioned here:
https://kb.databricks.com/clusters/overwrite-log4j-logs
The file specified in the above KB article does not exist on the clusters we tested in GCP (single-node, 13.3.x-scala2.12). The log4j file we had to modify is located at: /databricks/spark/dbconf/log4j/driver/log4j2.xml
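For anyone wanting to check their own config for the same warning, here is a minimal sketch that scans a log4j2 XML document for appenders that set bufferSize without bufferedIO="true". The sample XML is a made-up stand-in, not the real contents of /databricks/spark/dbconf/log4j/driver/log4j2.xml, and the helper name find_buffer_mismatches is hypothetical.

```python
import xml.etree.ElementTree as ET


def find_buffer_mismatches(xml_text):
    """Return names of elements that set bufferSize but not bufferedIO="true"."""
    root = ET.fromstring(xml_text)
    mismatched = []
    for elem in root.iter():
        has_buffer_size = "bufferSize" in elem.attrib
        buffered_io = elem.attrib.get("bufferedIO", "").lower() == "true"
        if has_buffer_size and not buffered_io:
            # Fall back to the tag name if the appender has no name attribute.
            mismatched.append(elem.attrib.get("name", elem.tag))
    return mismatched


# Hypothetical sample; the real file layout on a cluster may differ.
SAMPLE = """<Configuration>
  <Appenders>
    <RollingFile name="com.databricks.LineageLogging.appender"
                 fileName="lineage.log" filePattern="lineage-%i.log"
                 bufferSize="128000"/>
    <RollingFile name="other.appender"
                 fileName="other.log" filePattern="other-%i.log"
                 bufferedIO="true" bufferSize="8192"/>
  </Appenders>
</Configuration>"""

print(find_buffer_mismatches(SAMPLE))
```

To run it against a real cluster file, read the file's text and pass it to the function. As noted above, clearing this warning did not restore lineage for us, so treat it as a config check only.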
05-19-2024 09:23 PM
Could you please check the requirements for the lineage feature and its limitations here: https://docs.gcp.databricks.com/en/data-governance/unity-catalog/data-lineage.html?_ga=2.115379718.1...
Kind regards,
Yesh
05-19-2024 10:27 PM
Sure - I've checked the requirements. Each one is quoted below, followed by my answer:
"The workspace must have Unity Catalog enabled." It's enabled.
"Tables must be registered in a Unity Catalog metastore." They are. I'm just using the sample Unity Catalog lineage notebook located here: https://notebooks.databricks.com/demos/uc-03-data-lineage/index.html
"Queries must use the Spark DataFrame (for example, Spark SQL functions that return a DataFrame) or Databricks SQL interfaces. For examples of Databricks SQL and PySpark queries, see Examples." They are - I'm using the sample notebook, which interfaces with UC via Databricks SQL.
"To view the lineage of a table or view, users must have at least the BROWSE privilege on the table’s or view’s parent catalog." I am the owner of the catalog and have ALL PRIVILEGES on it as well.
"To view lineage information for notebooks, workflows, or dashboards, users must have permissions on these objects as defined by the access control settings in the workspace. See Lineage permissions." I have permissions on all of the objects involved - the notebook, as well as the catalog.
"To view lineage for a Unity Catalog-enabled pipeline, you must have CAN_VIEW permissions on the pipeline." I'm not using a pipeline in my testing.
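One more check that does not depend on the lineage UI is to look for events in the lineage system table. This is only a sketch: it assumes the system.access schema is enabled for the metastore and that system.access.table_lineage is available in the workspace, which I have not confirmed for our GCP region. The helper lineage_query and the table name main.lineage_demo.menu are hypothetical.

```python
import re

# Accept only plain three-part names (catalog.schema.table) so the value
# can be embedded in the SQL text safely.
_THREE_PART_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*){2}$")


def lineage_query(table_full_name, limit=20):
    """Build a query listing recent lineage events that touch the given table."""
    if not _THREE_PART_NAME.match(table_full_name):
        raise ValueError(f"expected catalog.schema.table, got {table_full_name!r}")
    return (
        "SELECT source_table_full_name, target_table_full_name, entity_type, event_time "
        "FROM system.access.table_lineage "
        f"WHERE source_table_full_name = '{table_full_name}' "
        f"OR target_table_full_name = '{table_full_name}' "
        f"ORDER BY event_time DESC LIMIT {int(limit)}"
    )


# Hypothetical table name; substitute one written by the sample notebook.
print(lineage_query("main.lineage_demo.menu"))
```

In a notebook, the resulting string could be passed to spark.sql(...). An empty result would suggest that lineage events are not being captured at all, rather than the UI failing to display them.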
08-15-2024 08:33 PM
Hello,
It's been a few months since this exchange. No such limitation is documented anywhere, and the documentation implies that this feature should work in GCP:
https://docs.gcp.databricks.com/en/data-governance/unity-catalog/data-lineage.html
Is this feature just off the table for us? Is it not working as intended in Google Cloud? Is it not available in the northamerica-northeast1 region specifically?