12 hours ago
Hi
I am currently working on migrating all ADF jobs to LakeFlow jobs. I have a few questions:
Pipeline cost: What is the cost model for running LakeFlow pipelines? Any documentation available? ADF vs Lakeflow Job?
Job reuse: Do LakeFlow jobs reuse the same compute/job for each notebook activity, or is each notebook run treated as an independent job?
12 hours ago - last edited 12 hours ago
Pipeline cost
ADF charges per activity run + Data Integration Units (DIUs) used for copy/mapping activities, plus Integration Runtime time.
LakeFlow charges for compute usage, not per step. There are fewer moving pricing knobs, but your cost is dominated by:
cluster size/type
runtime (how long the job runs)
how often you trigger it
You will also pay for storage.
For many migrations, you replace a large ADF pipeline full of small activities with one or a few LakeFlow jobs, so you trade lots of per-activity charges for a simpler, cluster-based cost model.
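As a rough, back-of-the-envelope illustration only (every rate below is a placeholder assumption, not real pricing; check the pricing calculator for your cloud, region and plan):

```python
# Rough LakeFlow job cost estimate. Every rate here is a PLACEHOLDER
# assumption -- look up real DBU and VM prices for your plan/region.
dbu_per_node_hour = 0.75        # assumed DBUs consumed per node-hour
dbu_price = 0.30                # assumed $ per DBU
vm_price_per_node_hour = 0.50   # assumed cloud VM cost per node-hour

nodes = 4                       # cluster size/type
hours_per_run = 0.5             # runtime of one job run
runs_per_month = 30 * 24        # how often you trigger it (hourly here)

node_hours = nodes * hours_per_run * runs_per_month
dbu_cost = node_hours * dbu_per_node_hour * dbu_price
vm_cost = node_hours * vm_price_per_node_hour

print(f"~${dbu_cost + vm_cost:,.0f}/month (DBU ${dbu_cost:,.0f} + VM ${vm_cost:,.0f})")
```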
Within a single LakeFlow job run you can reuse the same compute across multiple notebook tasks, if you configure it that way. Each job run itself is independent.
A LakeFlow job is a DAG of tasks (notebooks, pipelines, Python scripts, etc.).
Each task has a compute configuration:
shared job cluster (recommended), or
its own cluster, or
serverless, or
an existing interactive cluster.
If you attach all notebook activities in that job to the same job cluster / serverless definition, then in one run:
the cluster is started once,
all tasks run on that same cluster,
the cluster is then terminated based on your settings.
It does not automatically reuse compute across different job runs or across different jobs, unless you deliberately target a long-running interactive cluster.
For an ADF pipeline with N Databricks notebook activities, the usual pattern is:
create one LakeFlow job with N notebook tasks,
attach them all to one shared job cluster; this typically reduces cost vs many independent jobs (see the sketch below).
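For illustration, here is a minimal sketch of that pattern as a Jobs API 2.1 create-job payload sent from Python. The workspace URL, token, job name, notebook paths and cluster sizing are all made-up placeholders:

```python
import requests

# Sketch: one LakeFlow job with N notebook tasks that all share a single job
# cluster (Jobs API 2.1). Host, token, paths and cluster sizing are
# placeholder assumptions -- adapt them to your workspace.
HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
N = 3  # number of notebook tasks in this sketch

tasks = []
for i in range(1, N + 1):
    task = {
        "task_key": f"notebook_{i}",
        "notebook_task": {"notebook_path": f"/Workspace/etl/step_{i}"},
        "job_cluster_key": "shared_cluster",  # every task reuses the same cluster
    }
    if i > 1:
        # chain the tasks sequentially, like ADF activities wired one after another
        task["depends_on"] = [{"task_key": f"notebook_{i - 1}"}]
    tasks.append(task)

job_spec = {
    "name": "adf_migration_example",
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
            },
        }
    ],
    "tasks": tasks,
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])
```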
12 hours ago
File-arrival triggers are based on the creation of a new file. If an upstream system always overwrites the same file name in place (for example, landing/data.csv every time), a file-arrival trigger will generally not fire for every overwrite.
Because you can't change the filename, here are realistic workarounds:
Keep your existing data file as-is (same name, same path), but:
After writing data.csv, the upstream (or a tiny helper job) also writes a small marker file with a unique name each time, e.g.
markers/run_2025-11-18T120001.txt.
Configure the LakeFlow event trigger on the marker folder.
The first step of the pipeline just reads data.csv (the file whose name cannot change).
Your data contract stays the same, but the trigger sees each new marker file.
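A tiny helper for the marker step could look roughly like this (paths are assumptions, and it uses the plain local file system; swap in dbutils/ADLS calls if your landing zone is cloud storage):

```python
from datetime import datetime, timezone
from pathlib import Path

# Sketch of the marker-file workaround: after data.csv is overwritten in
# place, drop a uniquely named marker so the file-arrival trigger sees a
# brand-new file each time. Paths are placeholder assumptions.
MARKER_DIR = Path("/mnt/landing/markers")

def write_marker() -> Path:
    MARKER_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%S")
    marker = MARKER_DIR / f"run_{stamp}.txt"
    marker.write_text("data.csv refreshed")  # content is irrelevant, only the creation event matters
    return marker

if __name__ == "__main__":
    print("Wrote marker:", write_marker())
```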
If you really can't modify upstream at all:
Switch that pipeline to a schedule-based trigger (e.g. every X minutes).
In your first task, read the last-modified timestamp of the file or a small metadata table.
Compare it with a stored watermark; only continue if it has changed since the last run.
You lose pure event-driven behaviour, but still avoid repeated heavy processing.
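The first task of such a scheduled job might gate the rest of the run like this (file path and watermark table are assumptions; dbutils and spark are available automatically inside a Databricks notebook):

```python
# Sketch of a "has data.csv changed since the last run?" gate.
# FILE_PATH and WATERMARK_TABLE are placeholder assumptions; the table is
# assumed to have columns (path STRING, mtime BIGINT).
FILE_PATH = "/mnt/landing/data.csv"
WATERMARK_TABLE = "ops.file_watermarks"

# FileInfo.modificationTime (ms since epoch) is available on recent DBR versions
current_mtime = dbutils.fs.ls(FILE_PATH)[0].modificationTime

last = spark.sql(
    f"SELECT max(mtime) AS mtime FROM {WATERMARK_TABLE} WHERE path = '{FILE_PATH}'"
).collect()[0]["mtime"]

if last is not None and current_mtime <= last:
    # Nothing new since the last run -- exit early so the heavy work is skipped.
    dbutils.notebook.exit("NO_NEW_DATA")

# ... heavy processing of data.csv would go here ...

# Record the new watermark for the next scheduled run.
spark.sql(f"INSERT INTO {WATERMARK_TABLE} VALUES ('{FILE_PATH}', {current_mtime})")
```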
If the file name must stay identical but you're allowed to adjust the folder:
Upstream writes into time-partitioned folders (e.g. /landing/2025/11/18/data.csv).
LakeFlow trigger watches /landing/**.
A later step merges or copies into your legacy fixed path if other systems rely on it.
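For example, the later copy-back step could be as small as this (paths are assumptions; dbutils is available inside a Databricks notebook):

```python
from datetime import datetime, timezone

# Sketch of the time-partitioned landing pattern. Paths are placeholder
# assumptions. The upstream writes today's file under a dated folder; this
# task copies it back to the legacy fixed path for downstream systems.
today = datetime.now(timezone.utc)
partitioned_path = f"/mnt/landing/{today:%Y/%m/%d}/data.csv"
legacy_path = "/mnt/landing/data.csv"

dbutils.fs.cp(partitioned_path, legacy_path)
```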
11 hours ago
Well-detailed answer, @bianca_unifeye.
@Nidhig - There is no silver bullet: migrating workloads to Databricks will not always reduce cost. It depends on multiple factors such as the job configuration, the types of clusters you use, etc. That said, our last migration from ADF to Workflows certainly simplified our job pipelines and increased auditability and observability, besides reducing the visible cost and, more importantly, the operational cost.
11 hours ago - last edited 11 hours ago
Hi @Nidhig ,
1. Regarding pipeline cost - here you're mostly paying for compute usage, so the exact price depends on which plan you are on and which cloud provider you use. For instance, the DBU rate for the Azure premium plan in the US East region differs from other plans and regions.
You can use the pricing calculator for a more detailed cost estimate, because other factors (e.g., whether Photon is enabled) also affect the price:
Pricing Calculator Page | Databricks
2. Regarding job reuse - within one job you can have multiple tasks, and those tasks will reuse your compute. But if, for example, you define a for_each task and inside each iteration you run a job, then each of those jobs will spawn its own job compute.
3. This is a limitation of file-arrival event triggers as of now, but there are plenty of workarounds. For example, you can subscribe to an Azure Event Grid system topic; when a new file arrives, an Azure Function starts the processing by passing file_path as an argument to the job run.
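A minimal sketch of that last hop, an Event Grid-triggered Azure Function calling the Jobs API run-now endpoint (job id, environment variable names and the file_path parameter are assumptions):

```python
import os

import azure.functions as func
import requests

# Sketch: Azure Function with an Event Grid trigger that starts a Databricks
# job whenever a blob is created/overwritten. Job id, env var names and the
# notebook parameter name are placeholder assumptions.
DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-xxxx.azuredatabricks.net
DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]
JOB_ID = int(os.environ["DATABRICKS_JOB_ID"])

def main(event: func.EventGridEvent) -> None:
    # Microsoft.Storage.BlobCreated events carry the blob URL in data["url"]
    file_path = event.get_json()["url"]

    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        # notebook_params are exposed to notebook tasks as widget parameters
        json={"job_id": JOB_ID, "notebook_params": {"file_path": file_path}},
    )
    resp.raise_for_status()
    print("Started run:", resp.json()["run_id"])
```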