Kesavan31
New Contributor II

[Image: Architecture_image_front.png]

Beyond ADLS Limitations: Making File Arrival Triggers Work for Existing File Updates Using a Flag File Mechanism

In modern data engineering, automation is the backbone of stable, scalable, production-ready pipelines. Databricks File Arrival Triggers play a crucial role by automatically running workflows whenever a file is created in cloud storage.

But there is one fundamental limitation that becomes a major operational blocker:

"Databricks File Arrival Triggers do not fire when an existing file is overwritten with the same name"

This behavior is expected from ADLS and Event Grid - but it breaks automation in real pipelines.

In this blog, I’ll walk through a production-ready mechanism we built to overcome this limitation at scale:

The Flag File Mechanism

A simple yet powerful pattern that guarantees reliable pipeline triggers - even when files are updated without changing their names.

The Root Problem: Triggers Only Work on “Create”, Not “Modify” Events

Databricks File Arrival Triggers listen to Event Grid "Create" events in ADLS.
This works perfectly when:

  • A brand-new file is uploaded

  • A file with a new unique name is placed in the folder

But it fails completely when:

  • A file is overwritten with the same name

  • CI/CD deployments replace files using fixed filenames

  • Metadata files (table_config.csv, mapping_rules.csv) evolve without renaming

In all these cases:

No Create event → No trigger → No pipeline execution

This leads to silent failures - one of the most dangerous issues in data engineering.
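
For reference, this is roughly how a file arrival trigger is attached to a job through the Databricks Jobs API. The payload below is an illustrative sketch, not our production configuration; the storage URL, workspace URL, token, and timing values are placeholders.

```python
# Illustrative Jobs API 2.1 payload with a file arrival trigger.
# The abfss:// URL, workspace URL, and token are placeholders (assumptions).
import requests

payload = {
    "name": "PL_CSV_to_Delta",
    "trigger": {
        "pause_status": "UNPAUSED",
        "file_arrival": {
            # Folder the trigger watches; it only fires for files it has not seen before.
            "url": "abfss://metadata@<storage-account>.dfs.core.windows.net/dropzone/",
            "min_time_between_triggers_seconds": 60,
        },
    },
    # ... tasks, cluster settings, etc. omitted for brevity ...
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
resp.raise_for_status()
```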

Why This Limitation Matters in Real Data Platforms

In our environment, this limitation affected our central metadata ingestion pipeline:

PL_CSV_to_Delta

[Image: csv_to_delta_home_image.jpg]

This pipeline converts metadata in ADLS into Delta format, powering downstream ETL logic.

However:

  • Metadata filenames rarely change

  • CI/CD deploys updates with the same name

  • ADLS does not emit a Modify event

  • Databricks never receives a trigger

This meant that updated rules, settings, or mappings were never ingested, causing pipelines to run with outdated logic.

We needed a solution that was:

  • Fully automated

  • CI/CD friendly

  • No renaming hacks

  • No timestamp-based filenames

  • No external schedulers like ADF or Airflow

  • 100% reliable

The Solution: The Flag File Mechanism (Elegant & Production-Proven)

At its core, the mechanism uses a small “signal” file to deliberately generate a fresh file-arrival event for the trigger.

This file is named:

trigger_flag.csv

This file acts as an event generator - a guaranteed new file arrival every time a deployment occurs, regardless of whether the main file’s name changes.

Let’s walk through the mechanism in detail.

1. The Repo Always Contains the Flag File

Inside our DevOps repo:

databricks_ConfigData/metadata/

[Image: CI_CD_Home.png]

We keep:

  • The actual metadata file(s)

  • trigger_flag.csv

The repo intentionally retains the flag file, but ADLS does not keep it (it is deleted at the start of every pipeline run, as described in step 3).
This difference is the core of the mechanism.

2. CI/CD Syncs the Repo to ADLS Automatically

[Image: CI_CD_automation.png]

The YAML pipeline (azure-pipelines.yml) automatically deploys any changes in metadata to ADLS, including:

  • Updates to existing files (e.g., new rows in EDP_settings.csv)
  • The flag file (trigger_flag.csv)

CI/CD does the following:

  • Detects changes
  • Pushes both files to ADLS
  • ADLS receives a new copy of trigger_flag.csv
  • Databricks trigger fires

This mechanism ensures that every deploy, every update, every modification triggers the pipeline.
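
To make the deployment step concrete, here is a minimal Python sketch of what the YAML pipeline effectively does, using the azure-storage-file-datalake SDK. The account, container, and folder names are placeholders; in our setup the equivalent copy is performed by azure-pipelines.yml.

```python
# Minimal sketch of the CI/CD copy step (placeholder names, not production values).
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
dropzone = service.get_file_system_client("metadata")  # container name is an assumption

# Push the changed metadata file(s) AND the flag file on every deployment.
for local_file in [
    "databricks_ConfigData/metadata/EDP_settings.csv",
    "databricks_ConfigData/metadata/trigger_flag.csv",
]:
    remote_name = local_file.split("/")[-1]
    file_client = dropzone.get_file_client(f"dropzone/{remote_name}")
    with open(local_file, "rb") as data:
        # Overwriting the metadata file emits no usable event, but uploading the
        # flag file is a brand-new Create event, because the pipeline deleted it earlier.
        file_client.upload_data(data, overwrite=True)
```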

3. Databricks Deletes the Flag File at Pipeline Start

In the first cell of PL_CSV_to_Delta.py, we delete:

trigger_flag.csv  

[Image: csv_to_delta_RM.png]

The flag is removed in this very first cell; a sketch of what that cell looks like follows the points below.

  • This step is not accidental - it is purposefully designed so that once the pipeline begins executing, the ADLS landing zone is reset to a controlled state.
  • Even though the flag exists permanently in the Azure DevOps (CI/CD) repository, it gets removed from ADLS during every pipeline run.
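
A minimal sketch of that first cell, with placeholder storage paths (the real container and folder names differ):

```python
# First cell of PL_CSV_to_Delta.py (sketch; the abfss:// path is a placeholder).
# Removing the flag resets the ADLS landing zone so that the next CI/CD upload of
# trigger_flag.csv is seen as a brand-new file arrival.
flag_path = "abfss://metadata@<storage-account>.dfs.core.windows.net/dropzone/trigger_flag.csv"

try:
    dbutils.fs.rm(flag_path)  # delete the signal file from the landing zone
except Exception:
    # If the flag is already absent (e.g. a manual run), there is nothing to clean up.
    pass

print(f"Landing zone reset; flag absent at {flag_path}")
```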

To understand this mechanism more easily, consider the simple example below (just for explanation, not real production values):

 

| Location (Example) | File Count After Pipeline Start | Purpose / Interpretation |
| --- | --- | --- |
| Azure DevOps Repo (CI/CD) | 2 files (metadata + flag) | Flag always preserved → acts as the permanent trigger initiator |
| ADLS Dropzone | 1 file after deletion | Ensures the next arrival of the flag file is treated as new, so it triggers the pipeline again |

Note: The counts above are only to demonstrate the concept.
In a real production environment, you may have multiple incoming data files - but the flag file logic remains exactly the same.

By keeping the flag in CI/CD and continuously deleting it inside ADLS, we ensure:

  • Every commit can re-trigger the workflow
  • Even updates to existing files generate a fresh execution event
  • No manual re-upload or name-change hacks required

This add-and-remove mechanism (keep the flag in the repo, delete it from ADLS) is what makes the entire trigger solution stable, repeatable, and production-friendly.

4. Updated Files Now Trigger Processing Automatically

This is the magic.

Let’s say we update EDP_settings.csv by adding a new row.

CI/CD deploys:

  • Updated EDP_settings.csv
  • Fresh trigger_flag.csv (new event for ADLS)

What ADLS sees:

New file arrived → trigger_flag.csv

What Databricks sees:

Event Grid new file → Trigger pipeline

Even though the actual updated file (EDP_settings.csv) existed earlier, the flag file tricks the trigger system into firing.

The pipeline then:

  1. Deletes the flag file.
  2. Processes all files (including the updated one).
  3. Writes the data to Delta.

This workflow is entirely automated and consistent.
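
For completeness, here is a simplified sketch of the processing that follows once the flag is gone. File paths, database, and table names are placeholders; the real pipeline derives them from metadata.

```python
# Simplified sketch of the CSV-to-Delta conversion (placeholder paths and table names).
dropzone = "abfss://metadata@<storage-account>.dfs.core.windows.net/dropzone/"

for file_info in dbutils.fs.ls(dropzone):
    if not file_info.name.endswith(".csv"):
        continue  # skip anything that is not a metadata CSV
    df = (spark.read
          .option("header", True)
          .csv(file_info.path))
    target_table = "metadata_db." + file_info.name.replace(".csv", "")
    # Overwrite so the Delta table always reflects the latest deployed metadata.
    (df.write
       .format("delta")
       .mode("overwrite")
       .saveAsTable(target_table))
```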

How We Overcame the Limitation (Clear Breakdown)

The limitation was simple but severe:

Updating an existing file in ADLS does NOT generate a trigger.

This meant that even if metadata changed, the Databricks pipeline never executed.

To solve this without renaming files, adding timestamps, or using external schedulers, we introduced a controlled event-generating workflow using an empty flag file.

Here is exactly how we overcame the limitation:

Before (Problem State)

  • ADLS only emits Create events
  • Overwriting a file with the same name emits no event
  • Databricks File Arrival Trigger never fires
  • CI/CD deployments with fixed file names do not activate the pipeline
  • Metadata updates often go unnoticed
  • ETL logic runs using outdated rules
  • Manual intervention was needed to force pipeline execution

In short, here is what was happening before the fix:

  • When a file was updated in ADLS, no event was generated
  • Because no event was generated, the Databricks trigger did not run
  • Since the trigger did not run, the pipeline never executed

This broke the promise of full automation.

After (Solution State - Using Flag File Mechanism)

We introduced a new file:

trigger_flag.csv

This file acts as a guaranteed new arrival during every CI/CD deployment.

Here’s how the solution works clearly:

1. The flag file always exists in the repo

The repo permanently stores trigger_flag.csv.

2. CI/CD pushes the flag file to ADLS on every update

Even if the real metadata file name hasn't changed, the deployment introduces the flag file as a new creation.

3. ADLS treats the flag as a fresh event

Since it's a new file, ADLS generates a Create event.

4. Databricks File Arrival Trigger activates

The trigger runs because ADLS sent a new file event.

5. Pipeline deletes the flag file immediately

This ensures:

  • The landing zone resets to a state where the flag is absent
  • The next CI/CD deployment introduces the flag file again
  • ADLS will treat it as a new file every time

6. Updated metadata is processed successfully

Even if the main file (EDP_settings.csv) has the same name, the pipeline still runs and picks up the updated content.

Why This Solution Works So Well

Thanks to the flag file:

  • Every update results in a fresh ADLS “Create” event
  • Every CI/CD commit automatically triggers the pipeline
  • Every metadata change is processed instantly
  • No renaming or manual actions are required
  • The entire pipeline becomes deterministic, reliable, and fully automated

In essence:

We bypassed the ADLS overwrite limitation by generating our own controlled event - the flag file.

This is how we converted a platform constraint into a predictable and enterprise-ready automation pattern.

Final Thoughts

This solution may look simple - but its impact is massive.

By introducing a small, intelligent event signal file, we turned Databricks File Arrival Triggers into a fully dynamic, update-aware, DevOps-friendly automation system that works even with unchanged filenames.

It addresses a real limitation with a pragmatic, production-ready approach that is:

  • Elegant
  • Maintainable
  • Stable
  • Cloud-native
  • Enterprise-scalable

This pattern has proven extremely reliable in production and can be applied to any metadata-driven or CI/CD-driven data platform.

Sometimes, the simplest ideas unlock the biggest automation wins.

1 Comment
Louis_Frolio
Databricks Employee

@Kesavan31 , you’re calling out a problem a lot of teams run into in practice—usually only after something breaks and nobody’s quite sure why.

What really stands out to me is that you didn’t try to fight the platform. You leaned into the reality that file arrival triggers are Create-event driven and designed a clean, deterministic pattern around that constraint. The flag file mechanism is simple, intentional, and—most importantly—production-proven.

It also avoids all the usual workarounds we’ve all seen: timestamped filenames, forced renames, cron jobs, or bolting on an external scheduler just to paper over storage behavior. Instead, you establish a clear contract between CI/CD, ADLS, and Databricks that guarantees a trigger when it actually matters.

That delete-and-reintroduce loop is the fundamental insight. It turns a platform limitation into a predictable automation pattern and restores something easy to lose in data platforms: trust. When metadata changes, pipelines run. Every time.

Sometimes the best solutions aren’t complex—they’re thoughtfully engineered. This is an excellent example of that.

Cheers, Louis.