I’ve observed that table lineage visibility in Databricks differs depending on how the data is referenced, and I would like to confirm whether this is the expected behavior.
1. When referencing a Delta table by name as the source in a query (e.g., df = spark.table("catalog_test.schema.dinner")), the source table appears correctly under the lineage section.
2. However, when referencing a file path (e.g., df1 = spark.read.format("delta").load("s3://path/")), the lineage does not track any source table names, as the source is a file location rather than a registered table.
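For reference, the two access patterns described above can be sketched side by side as follows. This is only a minimal illustration: the helper names read_from_table and read_from_path are hypothetical, `spark` is assumed to be the SparkSession that a Databricks notebook provides, and the table name and path are the placeholders from the examples above.

```python
def read_from_table(spark, table_name):
    """Read a registered table by its name.

    In my tests, this pattern showed the source table under lineage.
    """
    return spark.table(table_name)


def read_from_path(spark, path):
    """Read Delta files directly from a storage location.

    In my tests, this pattern showed no source table under lineage.
    """
    return spark.read.format("delta").load(path)


# Usage in a notebook where `spark` is already defined (assumption):
# df = read_from_table(spark, "catalog_test.schema.dinner")
# df1 = read_from_path(spark, "s3://path/")
```

Both helpers return a DataFrame over the same kind of data; the only difference is whether the read is expressed as a table name or as a file location.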
Is it correct that lineage tracking in Databricks primarily works at the table level and won’t capture lineage from data sources referenced by file paths? If so, are there recommended best practices for maintaining lineage visibility when using file locations as sources?