Ashwin_DSA
Databricks Employee

Hi @maikel,

You don't have to build a custom solution for this. Databricks now has native components that align very well with what you want.

If you want the job to start as soon as new files land in a volume, the recommended approach is to use file-arrival triggers on a Unity Catalog volume or external location, and have that trigger start your ingestion job or Lakeflow pipeline. You point the trigger at something like /Volumes/<catalog>/<schema>/<volume>/incoming/, and Databricks polls for new files (roughly once a minute) and fires the job when it sees new arrivals, so GitHub Actions no longer needs to orchestrate that part. See the documentation on file arrival triggers and Unity Catalog volumes for details.
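As a rough illustration, here is a minimal sketch of attaching a file-arrival trigger to an existing job with the Databricks SDK for Python. The job ID and volume path are placeholders, and the exact class names may differ slightly between SDK versions, so treat this as a starting point rather than a copy-paste recipe:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

# Authenticates via environment variables (DATABRICKS_HOST / DATABRICKS_TOKEN) or a config profile.
w = WorkspaceClient()

# Placeholder job ID and volume path -- point these at your ingestion job and landing folder.
w.jobs.update(
    job_id=123456,
    new_settings=jobs.JobSettings(
        trigger=jobs.TriggerSettings(
            file_arrival=jobs.FileArrivalTriggerConfiguration(
                url="/Volumes/<catalog>/<schema>/<volume>/incoming/"
            ),
            pause_status=jobs.PauseStatus.UNPAUSED,
        )
    ),
)
```

You can configure the same thing from the job's UI (under the job's schedules and triggers panel), which is usually the easier route if you only have one job.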

For “how do I know which files have been uploaded/processed?”, the key is to lean on Auto Loader rather than rolling your own state tracking. When you read from the volume with spark.readStream.format("cloudFiles") (for example, with cloudFiles.format = "csv"), Auto Loader persists file metadata in its checkpoint and uses that to guarantee that each file is processed exactly once and that the stream can resume safely after failures. You don’t need a separate audit table just to avoid reprocessing the same file. See the Auto Loader documentation, in particular the “How does Auto Loader track ingestion progress?” section.
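As a rough sketch of what that looks like in practice (the volume path, checkpoint location, and target table name are placeholders you'd adapt, and the CSVs are assumed to have a header row):

```python
from pyspark.sql import functions as F

# Placeholders -- adjust to your catalog/schema/volume layout.
source_path = "/Volumes/<catalog>/<schema>/<volume>/incoming/"
checkpoint_path = "/Volumes/<catalog>/<schema>/<volume>/_checkpoints/bronze_ingest"

(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", checkpoint_path)  # Auto Loader infers and tracks the schema here
    .option("header", "true")
    .load(source_path)
    # Capture the source file and load time so the bronze table is self-describing.
    .withColumn("source_file", F.col("_metadata.file_path"))
    .withColumn("ingested_at", F.current_timestamp())
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)  # process whatever is new, then stop -- fits a file-arrival-triggered job
    .toTable("<catalog>.<schema>.bronze_raw")
)
```

The checkpoint is where Auto Loader keeps its per-file state, which is what gives you the "each file processed exactly once" guarantee.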

If you want human-readable observability ("which files, when, and in which batch?"), then yes, it’s common to add an ingestion log table on top, either by querying Auto Loader’s cloud_files_state metadata (which stores per-file state including commit_time) or by logging the source file path (for example, the _metadata.file_path column) from your stream in a small foreachBatch into a Delta table. That gives you a clean audit trail without owning the low-level dedup logic yourself; the heavy lifting still comes from Auto Loader’s internal state. The relevant options and the cloud_files_state table-valued function are documented in the Auto Loader reference.
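Both approaches are sketched loosely below; the checkpoint path and table names are placeholders, and the exact columns returned by cloud_files_state can vary by runtime version:

```python
from pyspark.sql import functions as F

# Option A: inspect Auto Loader's own per-file state (no extra table needed).
spark.sql(
    "SELECT path, commit_time "
    "FROM cloud_files_state('/Volumes/<catalog>/<schema>/<volume>/_checkpoints/bronze_ingest')"
).show(truncate=False)

# Option B: keep your own ingestion log as part of the stream.
def log_files(batch_df, batch_id):
    (
        batch_df.select("source_file")
        .distinct()
        .withColumn("batch_id", F.lit(batch_id))
        .withColumn("logged_at", F.current_timestamp())
        .write.mode("append")
        .saveAsTable("<catalog>.<schema>.ingestion_log")
    )

# Attach with .writeStream.foreachBatch(log_files), or fold the logging into the
# foreachBatch you already use for the bronze/silver writes.
```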

A robust pattern for your scenario: land CSVs (from GitHub Actions or manual upload) into a Unity Catalog volume, trigger a job on file arrival, use Auto Loader to ingest from that volume into a bronze table, then do your validation/normalisation and any pivoting into silver, and finally MERGE into the final table. This keeps uploads simple, makes ingestion incremental and mostly self-driven, and still lets you add an explicit audit table if you want extra transparency about which files were processed when.
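For the last hop, the MERGE could look roughly like this (table names and the join key are placeholders for your actual silver and target tables):

```python
from delta.tables import DeltaTable

# Placeholders -- your validated/pivoted silver table and the final target table.
silver_df = spark.table("<catalog>.<schema>.silver_normalised")
target = DeltaTable.forName(spark, "<catalog>.<schema>.final_table")

(
    target.alias("t")
    .merge(silver_df.alias("s"), "t.business_key = s.business_key")
    .whenMatchedUpdateAll()     # update rows that already exist
    .whenNotMatchedInsertAll()  # insert rows that are new
    .execute()
)
```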

By the way, are you exporting data to CSV from an upstream system and then uploading it to the volume for a specific reason (governance, network, tooling, etc.)? If you have direct access to the source system, you might also look at pulling data straight into Databricks with Lakeflow Connect instead of going via CSV. Lakeflow Connect provides managed connectors for common SaaS apps and databases, with incremental ingestion into streaming tables, which can remove a lot of custom file-handling logic. See the Lakeflow Connect documentation if you're interested in learning more.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***
