
Data Lineage without Spark, but with Polars (and Delta Lake) instead

dh
New Contributor

Some context: I am completely new to Databricks; have heard good stuff, but also some things that worry me.

One thing that worries me is the performance (and resulting cost) of running Spark on smaller (sub-1TB) datasets. However, one requirement from our architects is "Data Lineage", which is why they are pushing for PySpark.

Soon we will have a hackathon where I intend to find a way of keeping Data Lineage while working around Spark if possible, because I've heard that Polars is a much better fit for smaller (sub-200GB) datasets, in the end saving us money and (run)time.

Then there's the required use of Delta Lake, which I'm pretty sure works with Polars.

So my question is: is it even possible to run a Python application with Polars on Databricks, while enabling Data Lineage in one way or another and storing data in Delta Lake?

1 REPLY

VZLA
Databricks Employee

Hi @dh, thanks for your question!

I believe it's possible to run Polars with Delta Lake on Databricks, but automatic data lineage tracking is not available outside of Spark jobs. You would likely need to implement custom lineage tracking or integrate external tooling, as Databricks' built-in lineage features (e.g. through Unity Catalog) are designed around Spark.
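
To make the first part concrete, here is a minimal sketch of the Polars + Delta Lake combination, assuming a recent Polars version and the deltalake package installed on the cluster; the volume path and column names are hypothetical:

    import polars as pl

    # Read an existing Delta table into a Polars DataFrame (no Spark involved)
    df = pl.read_delta("/Volumes/main/default/demo/events")

    # A simple aggregation, executed entirely by Polars
    daily = df.group_by("event_date").agg(pl.len().alias("n_events"))

    # Write the result back as a Delta table
    daily.write_delta("/Volumes/main/default/demo/daily_counts", mode="overwrite")

This runs fine on a single-node cluster, but note that Unity Catalog will not record lineage for these reads and writes the way it does for Spark queries.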

If your top priority is Databricks' built-in lineage features, then moving off the standard Spark-based stack will complicate things. While a custom lineage approach with Polars is possible, it adds operational overhead. For critical production scenarios where automated lineage tracking is a key requirement, relying on Databricks' native Spark-based lineage might be more practical.
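
As a rough illustration of what custom lineage tracking could look like, here is a sketch that appends one lineage record per pipeline step to a JSON-lines file; the file path, helper name, and record format are all hypothetical, not a Databricks API (external tooling such as OpenLineage offers a more standardized take on the same idea):

    import json
    from datetime import datetime, timezone

    def log_lineage(step: str, inputs: list[str], outputs: list[str]) -> None:
        """Append one lineage record: step name plus source and target tables."""
        record = {
            "step": step,
            "inputs": inputs,
            "outputs": outputs,
            "ts": datetime.now(timezone.utc).isoformat(),
        }
        with open("/tmp/lineage_log.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")

    # Record the Polars job above as one lineage edge
    log_lineage(
        step="daily_counts",
        inputs=["/Volumes/main/default/demo/events"],
        outputs=["/Volumes/main/default/demo/daily_counts"],
    )

You would then still need something to collect and visualize these records, which is exactly the operational overhead mentioned above.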

To address performance and cost concerns with smaller datasets, you may still consider:

  1. Using Photon on Databricks to speed up Spark workloads and reduce infrastructure costs.
  2. Employing a smaller, auto-scaling cluster or even a single-node cluster to control costs for sub-1TB datasets.
  3. Utilizing Delta Live Tables for structured pipelines, which provide built-in lineage tracking and can help manage costs by simplifying pipeline complexity (see the sketch after this list).
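
On point 3, here is a minimal sketch of what a Delta Live Tables pipeline definition looks like in Python; it assumes the code runs inside a DLT pipeline (where the dlt module and the spark session are provided by the runtime), and the table and path names are hypothetical:

    import dlt
    from pyspark.sql import functions as F

    @dlt.table(comment="Raw events loaded from a Delta location")
    def events_raw():
        return spark.read.format("delta").load("/Volumes/main/default/demo/events")

    @dlt.table(comment="Daily event counts derived from events_raw")
    def daily_counts():
        # dlt.read() declares the dependency explicitly; the pipeline
        # graph (and the lineage view) is built from these references
        return (
            dlt.read("events_raw")
            .groupBy("event_date")
            .agg(F.count("*").alias("n_events"))
        )

Because the dependencies are declared in code, DLT gives you the pipeline graph and Unity Catalog lineage without any custom tracking.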
