<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Lakeflow jobs in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/lakeflow-jobs/m-p/139403#M51191</link>
    <description>&lt;P&gt;File-arrival triggers are based on the &lt;STRONG&gt;creation of a new file&lt;/STRONG&gt;. If an upstream system always overwrites the same file name in place (for example, landing/data.csv every time), a file-arrival trigger will generally &lt;STRONG&gt;not&lt;/STRONG&gt; fire for every overwrite.&lt;/P&gt;&lt;P&gt;Because you can’t change the filename, here are realistic workarounds:&lt;/P&gt;&lt;H4&gt;Option A – Use a &lt;EM&gt;marker&lt;/EM&gt; file for the trigger (best if you can change the upstream minimally)&lt;/H4&gt;&lt;P&gt;Keep your existing data file as-is (same name, same path), but:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;After writing data.csv, the upstream (or a tiny helper job) also writes a &lt;STRONG&gt;small marker file&lt;/STRONG&gt; with a unique name each time, e.g.&lt;BR /&gt;markers/run_2025-11-18T120001.txt.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Configure the LakeFlow event trigger on the &lt;STRONG&gt;marker folder&lt;/STRONG&gt;.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The first step of the pipeline just reads data.csv (the file whose name cannot change).&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Your data contract stays the same, but the trigger sees each new marker file.&lt;/P&gt;&lt;H4&gt;Option B – Time-based trigger + “change detection”&lt;/H4&gt;&lt;P&gt;If you really can’t modify upstream at all:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;Switch that pipeline to a &lt;STRONG&gt;schedule-based trigger&lt;/STRONG&gt; (e.g. 
every X minutes).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;In your first task, read the &lt;STRONG&gt;last-modified timestamp&lt;/STRONG&gt; of the file or a small metadata table.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Compare it with a stored watermark; only continue if it has changed since the last run.&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;You lose pure event-driven behaviour, but still avoid repeated heavy processing.&lt;/P&gt;&lt;H4&gt;Option C – Change only the &lt;EM&gt;path&lt;/EM&gt;, not the filename (where possible)&lt;/H4&gt;&lt;P&gt;If the file name must stay identical but you’re allowed to adjust the folder:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Upstream writes into time-partitioned folders (e.g. /landing/2025/11/18/data.csv).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;LakeFlow trigger watches /landing/**.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;A later step merges or copies into your legacy fixed path if other systems rely on it.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
    <pubDate>Mon, 17 Nov 2025 16:54:38 GMT</pubDate>
    <dc:creator>bianca_unifeye</dc:creator>
    <dc:date>2025-11-17T16:54:38Z</dc:date>
    <item>
      <title>Lakeflow jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/lakeflow-jobs/m-p/139394#M51186</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Hi,&lt;BR /&gt;I am currently working on migrating all ADF jobs to LakeFlow jobs. I have a few questions:&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Pipeline cost:&lt;/STRONG&gt; What is the cost model for running LakeFlow pipelines? Is any documentation available? How does ADF compare with LakeFlow jobs?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Job reuse:&lt;/STRONG&gt; Do LakeFlow jobs reuse the same compute/job for each notebook activity, or is each notebook run treated as an independent job?&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Event-based triggers:&lt;/STRONG&gt; It seems event-based triggering won’t work when the same file name is used. Is there any workaround for this? Keep in mind that for this migration we cannot change the existing setup; the file name will be the same every time.&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Mon, 17 Nov 2025 16:02:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/lakeflow-jobs/m-p/139394#M51186</guid>
      <dc:creator>Nidhig</dc:creator>
      <dc:date>2025-11-17T16:02:51Z</dc:date>
    </item>
    <item>
      <title>Re: Lakeflow jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/lakeflow-jobs/m-p/139402#M51190</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Pipeline cost&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;ADF&lt;/STRONG&gt; charges per &lt;EM&gt;activity run&lt;/EM&gt; + Data Integration Units (DIUs) used for copy/mapping activities, plus Integration Runtime time.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;LakeFlow&lt;/STRONG&gt; charges for &lt;STRONG&gt;compute usage&lt;/STRONG&gt;, not per step. There are fewer pricing knobs, but your cost is dominated by:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;cluster size/type&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;runtime (how long the job runs)&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;how often you trigger it&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;You will also pay for storage.&lt;/P&gt;&lt;P&gt;For many migrations, you replace a large ADF pipeline full of small activities with &lt;STRONG&gt;one or a few LakeFlow jobs&lt;/STRONG&gt;, so you trade lots of per-activity charges for a simpler, cluster-based cost model.&lt;/P&gt;&lt;H3&gt;Job reuse&lt;/H3&gt;&lt;P&gt;&lt;STRONG&gt;Within a single LakeFlow job run you can reuse the same compute across multiple notebook tasks – if you configure it that way.&lt;/STRONG&gt; Each job &lt;EM&gt;run&lt;/EM&gt; itself is independent.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;A LakeFlow job is a DAG of tasks (notebooks, pipelines, Python scripts, etc.).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Each task has a &lt;STRONG&gt;compute configuration&lt;/STRONG&gt;:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;shared &lt;STRONG&gt;job cluster&lt;/STRONG&gt; (recommended), or&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;its own cluster, or&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;serverless, or&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;an existing interactive cluster.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If you attach &lt;STRONG&gt;all notebook activities in that job&lt;/STRONG&gt; to the &lt;STRONG&gt;same job cluster / serverless definition&lt;/STRONG&gt;, then in one run:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;the cluster is started once,&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;all tasks run on that same cluster,&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;the cluster is then terminated based on your settings.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;It does &lt;EM&gt;not&lt;/EM&gt; automatically reuse compute &lt;STRONG&gt;across different job runs&lt;/STRONG&gt; or across different jobs, unless you deliberately target a long-running interactive cluster.&lt;/P&gt;&lt;P&gt;For an ADF pipeline with N Databricks notebook activities, the usual pattern is:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;create &lt;STRONG&gt;one LakeFlow job&lt;/STRONG&gt; with N notebook tasks,&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;attach them all to one shared job cluster → this typically reduces cost vs many independent jobs.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Mon, 17 Nov 2025 16:52:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/lakeflow-jobs/m-p/139402#M51190</guid>
      <dc:creator>bianca_unifeye</dc:creator>
      <dc:date>2025-11-17T16:52:09Z</dc:date>
    </item>
    <item>
      <title>Re: Lakeflow jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/lakeflow-jobs/m-p/139403#M51191</link>
      <description>&lt;P&gt;File-arrival triggers are based on the &lt;STRONG&gt;creation of a new file&lt;/STRONG&gt;. If an upstream system always overwrites the same file name in place (for example, landing/data.csv every time), a file-arrival trigger will generally &lt;STRONG&gt;not&lt;/STRONG&gt; fire for every overwrite.&lt;/P&gt;&lt;P&gt;Because you can’t change the filename, here are realistic workarounds:&lt;/P&gt;&lt;H4&gt;Option A – Use a &lt;EM&gt;marker&lt;/EM&gt; file for the trigger (best if you can change the upstream minimally)&lt;/H4&gt;&lt;P&gt;Keep your existing data file as-is (same name, same path), but:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;After writing data.csv, the upstream (or a tiny helper job) also writes a &lt;STRONG&gt;small marker file&lt;/STRONG&gt; with a unique name each time, e.g.&lt;BR /&gt;markers/run_2025-11-18T120001.txt.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Configure the LakeFlow event trigger on the &lt;STRONG&gt;marker folder&lt;/STRONG&gt;.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;The first step of the pipeline just reads data.csv (the file whose name cannot change).&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Your data contract stays the same, but the trigger sees each new marker file.&lt;/P&gt;&lt;H4&gt;Option B – Time-based trigger + “change detection”&lt;/H4&gt;&lt;P&gt;If you really can’t modify upstream at all:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;Switch that pipeline to a &lt;STRONG&gt;schedule-based trigger&lt;/STRONG&gt; (e.g. 
every X minutes).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;In your first task, read the &lt;STRONG&gt;last-modified timestamp&lt;/STRONG&gt; of the file or a small metadata table.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Compare it with a stored watermark; only continue if it has changed since the last run.&lt;/P&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;You lose pure event-driven behaviour, but still avoid repeated heavy processing.&lt;/P&gt;&lt;H4&gt;Option C – Change only the &lt;EM&gt;path&lt;/EM&gt;, not the filename (where possible)&lt;/H4&gt;&lt;P&gt;If the file name must stay identical but you’re allowed to adjust the folder:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Upstream writes into time-partitioned folders (e.g. /landing/2025/11/18/data.csv).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;LakeFlow trigger watches /landing/**.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;A later step merges or copies into your legacy fixed path if other systems rely on it.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Mon, 17 Nov 2025 16:54:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/lakeflow-jobs/m-p/139403#M51191</guid>
      <dc:creator>bianca_unifeye</dc:creator>
      <dc:date>2025-11-17T16:54:38Z</dc:date>
    </item>
    <item>
      <title>Re: Lakeflow jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/lakeflow-jobs/m-p/139409#M51193</link>
      <description>&lt;P&gt;Well-detailed answer,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/193092"&gt;@bianca_unifeye&lt;/a&gt;.&lt;/P&gt;&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175584"&gt;@Nidhig&lt;/a&gt;&amp;nbsp;- There is no silver bullet: migrating workloads to Databricks will not always reduce cost. It depends on multiple factors, such as the job configurations and the type of clusters you use. However, our last migration from ADF to Workflows certainly simplified our job pipelines and increased auditability and observability, besides reducing the visible cost and, more importantly, the operational cost.&lt;/P&gt;</description>
      <pubDate>Mon, 17 Nov 2025 17:26:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/lakeflow-jobs/m-p/139409#M51193</guid>
      <dc:creator>Raman_Unifeye</dc:creator>
      <dc:date>2025-11-17T17:26:22Z</dc:date>
    </item>
    <item>
      <title>Re: Lakeflow jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/lakeflow-jobs/m-p/139413#M51195</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175584"&gt;@Nidhig&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;1. &lt;STRONG&gt;Regarding pipeline cost&lt;/STRONG&gt; - here you're mostly paying for compute usage, so the exact price depends on which plan you are on and which cloud provider you are using. For instance, for the Azure Premium plan in the US East region, the DBU costs are as follows:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1763400738912.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21776i5B2C740307CA341B/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1763400738912.png" alt="szymon_dybczak_0-1763400738912.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;You can use the pricing calculator if you want a more detailed cost estimate, because other factors, such as whether Photon is enabled, also affect the price:&lt;BR /&gt;&lt;A href="https://www.databricks.com/product/pricing/product-pricing/instance-types" target="_blank" rel="noopener"&gt;Pricing Calculator Page | Databricks&lt;/A&gt;&lt;/P&gt;&lt;P&gt;2. &lt;STRONG&gt;Regarding job&amp;nbsp;reuse&lt;/STRONG&gt; - within one job you can have multiple tasks, and those tasks will reuse your compute. But if, for example, you define a for_each task and run a job inside each iteration, then each of those jobs will spawn its own job compute.&lt;/P&gt;&lt;P&gt;3. This is a limitation of file-based event triggers as of now. But of course there are plenty of workarounds. For example, you can subscribe to an Azure Event Grid system topic, and when a new file arrives, an Azure Function starts processing by passing file_path as an argument to a job.&lt;/P&gt;</description>
      <pubDate>Mon, 17 Nov 2025 17:39:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/lakeflow-jobs/m-p/139413#M51195</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-11-17T17:39:33Z</dc:date>
    </item>
  </channel>
</rss>

