Hello,
We have set up a lakehouse in Databricks for one of our clients. One of the features our client would like to use is the Unity Catalog data lineage view. This is a handy feature that we have used with other clients (in both AWS and Azure) without issue.
We noticed that lineage data is not being populated at all for UC tables in our GCP workspaces. Even when running through the UC Sample notebook, we do not see any lineage data appear. Looking at the driver logs, we saw warnings like the one below, which made us suspect the log4j configuration:
2024-05-07 13:08:02,895 Thread-168 WARN RollingFileAppender 'com.databricks.LineageLogging.appender': The bufferSize is set to 128000 but bufferedIO is not true
After modifying the log4j properties flagged in the warning, we no longer see those log messages; however, the lineage service still does not appear to be working. Our GCP workspaces are allowed outbound internet access via our NAT gateways and do not pass through any in-line firewalls.
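For context, that warning means the appender has bufferSize set while bufferedIO is false, so the buffer size is ignored; either setting bufferedIO="true" or removing bufferSize silences it. A sketch of what the patched appender entry might look like (only the appender name comes from the log line; every other attribute and value here is illustrative, not copied from the shipped file):

```xml
<!-- Illustrative RollingFile appender entry in log4j2.xml.
     bufferedIO="true" is the change; other values are placeholders. -->
<RollingFile name="com.databricks.LineageLogging.appender"
             fileName="logs/lineage.log"
             filePattern="logs/lineage.%i.log.gz"
             bufferedIO="true"
             bufferSize="128000">
  <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss,SSS} %t %p %m%n"/>
  <Policies>
    <SizeBasedTriggeringPolicy size="100MB"/>
  </Policies>
</RollingFile>
```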
Has anyone run into this issue in GCP, and does anyone know how to resolve it if so?
---
As an aside, updating the log4j properties was not as straightforward as described in this KB article:
https://kb.databricks.com/clusters/overwrite-log4j-logs
The file specified in the above KB article does not exist on the clusters we tested in GCP (single-node, 13.3.x-scala2.12). Instead, the log4j file we had to modify is located at: /databricks/spark/dbconf/log4j/driver/log4j2.xml
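Since hand edits to files on the driver do not survive a cluster restart, one way to make the change stick would be a cluster-scoped init script along these lines. This is a sketch, not what we actually ran; it assumes the shipped config sets bufferedIO="false" explicitly, so if the attribute is simply absent the script would need to insert it rather than substitute it:

```shell
#!/bin/bash
# Sketch of a cluster-scoped init script: flip bufferedIO to true on the
# driver's log4j2.xml so the configured bufferSize takes effect.
# Assumption: the file contains a literal bufferedIO="false" attribute.

enable_buffered_io() {
  # $1: path to the log4j2.xml to patch
  sed -i 's/bufferedIO="false"/bufferedIO="true"/g' "$1"
}

# On the GCP clusters we tested (DBR 13.3.x-scala2.12), the driver config is:
# enable_buffered_io /databricks/spark/dbconf/log4j/driver/log4j2.xml
```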