Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

File Arrival Trigger - Multiple tables

maddy08
New Contributor II

I have 100+ tables. With CDC, I'm getting files in a GCS bucket every 15 minutes, or at random times depending on source changes.

I have enabled a file arrival trigger for each table.

Is this a good approach, or should I consolidate the tables into one job with a single trigger?

2 REPLIES

balajij8
Contributor

You can use separate triggers for tables that require high priority, but 100+ file arrival triggers can cause issues. Consider grouping around 20 tables per job and using CDC to reduce complexity.

SteveOstrowski
Databricks Employee

@maddy08

Managing 100+ CDC tables with file arrival triggers is a common architecture decision, and there are tradeoffs either way. Here is a breakdown to help you decide.


UNDERSTANDING THE LIMITS

The most important thing to know is that without file events enabled on your external location, there is a hard limit of 50 jobs with file arrival triggers per workspace. Since you have 100+ tables, you would hit this limit if you create one job per table.

With file events enabled, this limit is removed. File events allow Databricks to use cloud provider change notifications (rather than polling) to detect new files, which is both faster and more scalable. Existing triggers start benefiting within minutes of enabling file events, and new triggers benefit within seconds.

To enable file events, you need to be the owner of the external location or have the MANAGE privilege on it.

Reference: https://docs.databricks.com/aws/en/jobs/file-arrival-triggers


OPTION 1: ONE TRIGGER PER TABLE (WITH FILE EVENTS ENABLED)

If you enable file events, you can safely have 100+ separate jobs each with their own file arrival trigger. This gives you:

- Independent scheduling and monitoring per table
- Isolated failure handling (one table failing does not block others)
- Clear lineage and easier debugging
- Ability to set different priority/SLA per table

The downside is more jobs to manage and monitor, and higher cluster startup overhead if each job spins up its own compute.
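If you go the job-per-table route, the trigger lives in the job definition. As a rough sketch (the job name, notebook path, and GCS path below are placeholders, and the field names follow the Jobs API 2.1 `trigger` block as I understand it), a create-job payload would look something like:

```python
import json

# Hypothetical Jobs API 2.1 payload: one job, one file arrival trigger
# watching a single table's subfolder. Names and paths are illustrative.
job_payload = {
    "name": "cdc_ingest_table_a",
    "trigger": {
        "file_arrival": {"url": "gs://bucket/cdc/table_a/"},
        "pause_status": "UNPAUSED",
    },
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {
                "notebook_path": "/Repos/cdc/ingest_table",
                "base_parameters": {"table": "table_a"},
            },
        }
    ],
}

print(json.dumps(job_payload, indent=2))
```

With file events enabled you could stamp out 100+ of these programmatically (one per table) via the Jobs API or the Databricks SDK.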


OPTION 2: CONSOLIDATE TABLES INTO FEWER JOBS

You can use a single file arrival trigger on a parent directory and then process multiple tables within that job using multi-task jobs. Databricks jobs support multiple tasks with dependencies, so you can structure this as:

- A single file arrival trigger monitoring a parent path (e.g., gs://bucket/cdc/)
- Multiple tasks within that job, each processing a different table subfolder
- Or use a For Each task to iterate over a list of table names dynamically

A For Each task lets you pass a JSON array of table names and run a nested notebook task for each one, with configurable parallelism. This is a clean way to process N tables in one triggered job.

# Example: For Each task input (JSON array)
["table_a", "table_b", "table_c", ...]

# In the nested notebook, read the current table name from the task parameter:
table_name = dbutils.widgets.get("input")
df = (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load(f"gs://bucket/cdc/{table_name}/"))

Reference on For Each tasks: https://docs.databricks.com/aws/en/jobs/for-each
Reference on configuring multi-task jobs: https://docs.databricks.com/aws/en/jobs/configure-task


OPTION 3: USE LAKEFLOW DECLARATIVE PIPELINES (SDP)

Since you are working with CDC data, another approach worth considering is Lakeflow Spark Declarative Pipelines (SDP). SDP pipelines can:

- Use Auto Loader to incrementally ingest new files with exactly-once guarantees
- Apply CDC processing with APPLY CHANGES INTO for merge/upsert logic
- Handle multiple tables within a single pipeline definition
- Be triggered by a file arrival trigger on the job that runs the pipeline

This is often the cleanest architecture for CDC workloads with many tables, since the pipeline handles schema evolution, data quality checks, and incremental processing natively.

Reference: https://docs.databricks.com/aws/en/ldp/
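As a rough sketch of what one table's CDC flow looks like in a declarative pipeline (the table name, key column `id`, and sequence column `_commit_timestamp` are assumptions about your source schema, and this code only runs inside a pipeline, not a plain notebook):

```python
import dlt
from pyspark.sql.functions import col

# Raw ingestion with Auto Loader (path is a placeholder)
@dlt.table(name="table_a_raw")
def table_a_raw():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("gs://bucket/cdc/table_a/"))

# Target streaming table receiving the merged CDC output
dlt.create_streaming_table("table_a")

# APPLY CHANGES handles inserts/updates/deletes in order;
# keys and sequence_by are assumed columns from your CDC feed
dlt.apply_changes(
    target="table_a",
    source="table_a_raw",
    keys=["id"],
    sequence_by=col("_commit_timestamp"),
)
```

For 100+ tables you could generate these definitions in a loop over a table list, keeping everything in one pipeline.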


MY RECOMMENDATION

For 100+ CDC tables arriving on GCS:

1. Enable file events on your external location -- this is the single most impactful step regardless of which approach you choose.

2. If tables have different SLAs or arrival patterns, group them into logical batches (e.g., 10-20 tables per job) with For Each tasks. This balances manageability with isolation.

3. If all tables follow the same pattern, consider a Lakeflow Spark Declarative Pipeline (SDP) with Auto Loader sources for each table. One pipeline, one trigger, and SDP handles the rest.

4. Avoid 100+ individual jobs without file events, as you will hit the 50-job workspace limit and create unnecessary operational overhead.


TIPS FOR FILE ARRIVAL TRIGGERS AT SCALE

- If you monitor a parent directory, create a Unity Catalog volume mapped to your specific subdirectory rather than monitoring a deep subpath. This isolates your target path as the trigger's effective root and reduces unrelated change noise.
- File arrival triggers check for new files recursively across all subdirectories.
- Only new files trigger runs -- overwriting existing files does not fire the trigger.
- You can configure a minimum time between triggers and a wait time after last change to batch arrivals.
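Those two timing knobs live in the trigger configuration itself. A small sketch (the path and the 300/60 second values are arbitrary examples; the field names follow the Jobs API `file_arrival` block as I understand it):

```python
# Hypothetical trigger settings: fire at most every 5 minutes, and wait
# 60s of quiet after the last new file so a burst of arrivals batches
# into a single run.
trigger_settings = {
    "file_arrival": {
        "url": "gs://bucket/cdc/",
        "min_time_between_triggers_seconds": 300,
        "wait_after_last_change_seconds": 60,
    },
    "pause_status": "UNPAUSED",
}
```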

Reference: https://docs.databricks.com/aws/en/jobs/file-arrival-triggers

Hope this helps. Let us know which approach you go with and if you run into any issues.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
