Databricks Community

sms101 · 09-18-2024

I’ve observed differences in table lineage visibility in Databricks depending on how the data is referenced, and I would like to confirm whether this is the expected behavior.

1. When referencing a Delta table as the source in a query (e.g., df = spark.table("catalog_test.schema.dinner")), the table lineage correctly tracks the source table under the lineage section.

2. However, when referencing a file path (e.g., df1 = spark.read.format("delta").load("s3://path/")), the lineage does not track any source table names, as the source is a file location rather than a registered table.

Is it correct that lineage tracking in Databricks primarily works at the table level and won’t capture lineage from data sources referenced by file paths? If so, are there recommended best practices for maintaining lineage visibility when using file locations as sources?

Brahmareddy · 09-22-2024

Hi @sms101,

How are you doing today?

As far as I understand, yes: lineage tracking in Databricks works primarily at the table level. When you reference a registered table directly, as in spark.table("catalog_test.schema.dinner"), the source table is captured in the lineage. When you load from a file path, as in spark.read.format("delta").load("s3://path/"), no source table is tracked, because Databricks sees the source as just a file location, not a registered table.

For better lineage visibility, consider registering the data at that location as a Delta table in the catalog and then referencing it by name in your queries. Using catalog tables consistently instead of direct file paths is the recommended practice for keeping lineage complete across your workflow.
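Here is a minimal sketch of that approach, in case it helps. It assumes the path already holds a Delta table and that you have the permissions needed to register it as an external table (in Unity Catalog that generally means the path is covered by an external location you can use). The table name catalog_test.schema.dinner_ext and both helper functions are hypothetical names I made up for illustration; adjust them to your environment.

```python
def register_path_sql(table_name: str, path: str) -> str:
    """Build a statement that registers an existing Delta path as a table.

    Inputs are assumed to be trusted values; no escaping is applied.
    """
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING DELTA LOCATION '{path}'"
    )


def read_via_catalog(spark, table_name: str, path: str):
    """Register the path once, then read by table name instead of by path.

    `spark` is the active SparkSession (predefined in Databricks notebooks).
    """
    spark.sql(register_path_sql(table_name, path))
    # Reading by name is the pattern that showed lineage in the question.
    return spark.table(table_name)


# Usage in a notebook (hypothetical table name):
# df1 = read_via_catalog(spark, "catalog_test.schema.dinner_ext", "s3://path/")
```

After that, downstream code keeps using df1 as before; the only change is that the read goes through the catalog name rather than the raw path.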

Please let me know if it works and have a good day.

Regards,

Brahma