Hi @dh thanks for your question!
I believe it’s possible to run Polars with Delta Lake on Databricks, but automatic data lineage tracking is not available outside of Spark jobs. You would likely need to implement custom lineage tracking or integrate external tooling, since Databricks’ built-in lineage features (e.g. through Unity Catalog) are designed around Spark.
If your top priority is Databricks’ built-in lineage, then moving off the standard Spark-based stack will complicate things. A custom lineage approach with Polars is possible, but it adds operational overhead. For critical production scenarios where automated lineage tracking is a key requirement, relying on Databricks’ native Spark-based lineage is likely more practical.
To address performance and cost concerns with smaller datasets, you may still consider:
- Using Photon on Databricks to speed up Spark workloads and reduce infrastructure costs.
- Employing a smaller, auto-scaling cluster or even a single-node cluster to control costs for sub-1TB datasets.
- Utilizing Delta Live Tables for structured pipelines, which provide built-in lineage tracking and can help manage costs by simplifying pipeline complexity.
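For the single-node option above, a cluster spec along these lines is a reasonable starting point (a sketch only; node type and runtime version are placeholders you would adjust for your workspace):

```json
{
  "cluster_name": "small-etl-single-node",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 0,
  "spark_conf": {
    "spark.databricks.cluster.profile": "singleNode",
    "spark.master": "local[*]"
  },
  "custom_tags": {
    "ResourceClass": "SingleNode"
  },
  "autotermination_minutes": 20
}
```

For sub-1TB datasets this keeps costs down while still running inside Databricks, so Unity Catalog lineage and Photon remain available to your Spark jobs.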