
Lakeflow jobs

Nidhig — Mon, 17 Nov 2025 16:02:51 GMT

Hi,
I am currently migrating all of our ADF jobs to LakeFlow jobs and have a few questions:

Pipeline cost: What is the cost model for running LakeFlow pipelines? Is there any documentation comparing ADF and LakeFlow jobs?
Job reuse: Do LakeFlow jobs reuse the same compute for each notebook activity, or is each notebook run treated as an independent job?
Event-based triggers: It seems event-based triggering won’t work when the same file name is used every time. Is there a workaround? For this migration I cannot change the existing setup, so the file name will be the same on every run.

Re: Lakeflow jobs

bianca_unifeye — Mon, 17 Nov 2025 16:52:09 GMT

Pipeline cost

ADF charges per activity run + Data Integration Units (DIUs) used for copy/mapping activities, plus Integration Runtime time.
LakeFlow charges for compute usage, not per step. There are fewer pricing knobs, but your cost is dominated by:
- cluster size/type
- runtime (how long the job runs)
- how often you trigger it

You will also pay for storage.

For many migrations, you replace a large ADF pipeline full of small activities with one or a few LakeFlow jobs, so you trade lots of per-activity charges for a simpler, cluster-based cost model.
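To make the three cost drivers above concrete, here is a rough back-of-the-envelope sketch. It assumes classic (non-serverless) compute where each hour costs a DBU charge plus a VM charge; the function name and every number are hypothetical placeholders, not real prices, so check the pricing calculator for actual rates.

```python
def estimate_monthly_cost(dbu_per_hour, dbu_rate, vm_rate_per_hour,
                          run_hours, runs_per_month):
    """Rough monthly compute cost for one job.

    dbu_per_hour     -- DBUs the cluster consumes per hour (cluster size/type)
    dbu_rate         -- price per DBU (depends on plan, cloud, region)
    vm_rate_per_hour -- underlying VM price per hour (assumed classic compute)
    run_hours        -- how long one run takes
    runs_per_month   -- how often the job is triggered
    """
    hourly_cost = dbu_per_hour * dbu_rate + vm_rate_per_hour
    return hourly_cost * run_hours * runs_per_month


# Hypothetical numbers, for illustration only.
once_a_day = estimate_monthly_cost(4.0, 0.25, 1.0, run_hours=0.5, runs_per_month=30)
once_an_hour = estimate_monthly_cost(4.0, 0.25, 1.0, run_hours=0.5, runs_per_month=720)
print(once_a_day, once_an_hour)
```

With the same cluster and run time, the trigger frequency alone scales the bill linearly, which is why scheduling choices matter as much as cluster size.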

Job reuse

Within a single LakeFlow job run you can reuse the same compute across multiple notebook tasks – if you configure it that way. Each job run itself is independent.

A LakeFlow job is a DAG of tasks (notebooks, pipelines, Python scripts, etc.).
Each task has a compute configuration:
- shared job cluster (recommended), or
- its own cluster, or
- serverless, or
- an existing interactive cluster.

If you attach all notebook activities in that job to the same job cluster / serverless definition, then in one run:

- the cluster is started once,
- all tasks run on that same cluster,
- the cluster is then terminated based on your settings.

It does not automatically reuse compute across different job runs or across different jobs, unless you deliberately target a long-running interactive cluster.

For an ADF pipeline with N Databricks notebook activities, the usual pattern is:

- create one LakeFlow job with N notebook tasks,
- attach them all to one shared job cluster → this typically reduces cost vs many independent jobs.
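As a minimal sketch of that pattern, the helper below builds a job definition in the shape the Jobs API 2.1 `jobs/create` endpoint accepts: one entry in `job_clusters` and N notebook tasks that all point at it through `job_cluster_key`. The helper name, the notebook paths, the sequential `depends_on` chain, and the cluster settings (`spark_version`, `node_type_id`, `num_workers`) are illustrative assumptions; adjust them to your workspace.

```python
import json


def build_job_payload(job_name, notebook_paths, cluster_key="shared_cluster"):
    """Build a job definition where every notebook task shares one job cluster."""
    tasks = []
    previous_key = None
    for index, path in enumerate(notebook_paths):
        task_key = f"task_{index + 1}"
        task = {
            "task_key": task_key,
            "notebook_task": {"notebook_path": path},
            # Same key on every task -> same cluster within a run.
            "job_cluster_key": cluster_key,
        }
        if previous_key is not None:
            # Assumption: tasks run one after another, like a linear ADF pipeline.
            task["depends_on"] = [{"task_key": previous_key}]
        tasks.append(task)
        previous_key = task_key

    return {
        "name": job_name,
        "job_clusters": [
            {
                "job_cluster_key": cluster_key,
                "new_cluster": {
                    # Placeholder values; pick ones available in your workspace.
                    "spark_version": "15.4.x-scala2.12",
                    "node_type_id": "Standard_DS3_v2",
                    "num_workers": 2,
                },
            }
        ],
        "tasks": tasks,
    }


# Hypothetical notebook paths standing in for the former ADF notebook activities.
payload = build_job_payload(
    "migrated_adf_pipeline",
    ["/Workspace/etl/ingest", "/Workspace/etl/transform", "/Workspace/etl/publish"],
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be sent to the Jobs API or translated into whatever deployment tooling you already use; the point is only that there is one cluster definition and every task references it.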

Re: Lakeflow jobs

bianca_unifeye — Mon, 17 Nov 2025 16:54:38 GMT

File-arrival triggers are based on the creation of a new file. If an upstream system always overwrites the same file name in place (for example, landing/data.csv every time), a file-arrival trigger will generally not fire for every overwrite.

Because you can’t change the filename, here are realistic workarounds:

Option A – Use a marker file for the trigger (best if you can change the upstream minimally)

Keep your existing data file as-is (same name, same path), but:

- After writing data.csv, the upstream (or a tiny helper job) also writes a small marker file with a unique name each time, e.g. markers/run_2025-11-18T120001.txt.
- Configure the LakeFlow event trigger on the marker folder.
- The first step of the pipeline just reads data.csv (the file whose name cannot change).

Your data contract stays the same, but the trigger sees each new marker file.
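A minimal sketch of the marker-writing helper, assuming the landing area is reachable as a normal file path (for example a mounted volume). The function name, the folder layout, and the short random suffix are assumptions for illustration; the suffix only guards against two writes in the same second. The trigger settings at the bottom use the Jobs API `file_arrival` trigger shape with a placeholder URL.

```python
import uuid
from datetime import datetime, timezone
from pathlib import Path


def write_marker(marker_dir, data_file):
    """Write a uniquely named marker file after data_file has been overwritten.

    The marker's content is the path of the data file, so the triggered job
    can read it if needed. Returns the path of the new marker.
    """
    marker_dir = Path(marker_dir)
    marker_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%S")
    marker = marker_dir / f"run_{stamp}_{uuid.uuid4().hex[:8]}.txt"
    marker.write_text(str(data_file))
    return marker


# Trigger block for the job definition; the URL is a placeholder for the
# marker folder, not a real location.
FILE_ARRIVAL_TRIGGER = {
    "file_arrival": {
        "url": "abfss://<container>@<account>.dfs.core.windows.net/markers/"
    }
}
```

The upstream process (or the helper job) would call something like `write_marker("markers", "landing/data.csv")` only after the data file has been fully written, so the job never starts on a half-written file.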

Option B – Time-based trigger + “change detection”

If you really can’t modify upstream at all:

- Switch that pipeline to a schedule-based trigger (e.g. every X minutes).
- In your first task, read the last-modified timestamp of the file or a small metadata table.
- Compare it with a stored watermark; only continue if it has changed since the last run.

You lose pure event-driven behaviour, but still avoid repeated heavy processing.
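The change-detection step could look roughly like the sketch below. It assumes the data file is visible as a regular file path and stores the watermark in a small JSON file; both choices, and all function names, are illustrative (a metadata table would work just as well). The watermark is saved by a separate call so that it only advances after processing has succeeded.

```python
import json
import os
from pathlib import Path


def load_watermark(watermark_path):
    """Return the last processed modification time in ns, or None if unset."""
    path = Path(watermark_path)
    if not path.exists():
        return None
    return json.loads(path.read_text())["mtime_ns"]


def save_watermark(watermark_path, mtime_ns):
    """Persist the modification time of the file version just processed."""
    Path(watermark_path).write_text(json.dumps({"mtime_ns": mtime_ns}))


def should_process(data_path, watermark_path):
    """Return (changed, mtime_ns): changed is True when the file's
    last-modified timestamp differs from the stored watermark."""
    mtime_ns = os.stat(data_path).st_mtime_ns
    return mtime_ns != load_watermark(watermark_path), mtime_ns
```

In the first task you would call `should_process(...)`, exit early (or set a task value that a downstream condition checks) when it reports no change, and call `save_watermark(...)` at the end of a successful run.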

Option C – Change only the path, not the filename (where possible)

If the file name must stay identical but you’re allowed to adjust the folder:

- Upstream writes into time-partitioned folders (e.g. /landing/2025/11/18/data.csv).
- LakeFlow trigger watches /landing/**.
- A later step merges or copies into your legacy fixed path if other systems rely on it.
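As a small sketch of the path handling in Option C, the helpers below build the date-partitioned location and copy the file back to a fixed legacy path. They assume plain file-system access; the function names and the `legacy/data.csv` target are hypothetical.

```python
import shutil
from datetime import date
from pathlib import Path


def partitioned_path(landing_root, day, filename="data.csv"):
    """Return <landing_root>/YYYY/MM/DD/<filename> for the given date."""
    return Path(landing_root) / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}" / filename


def publish_to_legacy(source, legacy_path):
    """Copy the newly arrived file to the fixed path other systems still read."""
    legacy_path = Path(legacy_path)
    legacy_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, legacy_path)
    return legacy_path


print(partitioned_path("/landing", date(2025, 11, 18)).as_posix())
```

The file name never changes, only its parent folder, so every delivery shows up as a new file for the trigger while downstream consumers keep reading the legacy path.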

Re: Lakeflow jobs

Raman_Unifeye — Mon, 17 Nov 2025 17:26:22 GMT

Very detailed answer, @bianca_unifeye.

@Nidhig - There is no silver bullet: migrating workloads to Databricks will not always reduce the cost. It depends on multiple factors, such as the job configuration and the type of clusters you use. However, our last migration from ADF to Workflows certainly simplified our job pipelines and increased auditability and observability, besides reducing the visible cost and, more importantly, the operational cost.

Re: Lakeflow jobs

szymon_dybczak — Mon, 17 Nov 2025 17:39:33 GMT

Hi @Nidhig ,

1. Regarding pipeline cost - here you're mostly paying for compute usage, so the exact price depends on which plan you are on and which cloud provider you are using. For instance, for the Azure Premium plan and the US East region you have the following cost per DBU:

[Image: DBU pricing for the Azure Premium plan, US East region]

You can use the pricing calculator if you want a more detailed cost estimate, because there are other factors, such as whether Photon is enabled:
Pricing Calculator Page | Databricks

2. Regarding job reuse - within one job you can have multiple tasks, and those tasks will reuse your compute. But, for example, if you define a for_each task and run a job inside each iteration, then each of those jobs will spawn its own job compute.

3. This is a limitation of file-based event triggers as of now, but there are plenty of workarounds. For example, you can subscribe to an Azure Event Grid system topic; when a new file arrives, an Azure Function starts the processing by passing file_path as an argument to a job.
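A rough sketch of the function body for that workaround, using only the Python standard library. It assumes the event arrives in the Event Grid schema (`eventType`, `data.url`) and that the job defines a job-level parameter named `file_path`; the helper names, host, token, and job ID are placeholders. The request is only built here, not sent.

```python
import json
import urllib.request


def blob_url_from_event(event):
    """Return the blob URL from a BlobCreated event, or None for other events."""
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return None
    return event["data"]["url"]


def build_run_now_request(host, token, job_id, file_path):
    """Build a POST to the Jobs API run-now endpoint, passing file_path along."""
    body = json.dumps({
        "job_id": job_id,
        # Assumption: the job has a job-level parameter called file_path.
        "job_parameters": {"file_path": file_path},
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Inside the Azure Function you would send it, for example:
#   with urllib.request.urlopen(request) as response:
#       run = json.load(response)
```

Because the trigger is the storage event itself rather than a new file name, this route does not depend on the file name changing between deliveries.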