Hello Community!
I am writing to share my idea for a data ingestion job that we have to implement in our project.
The data we have is in CSV format, and it differs slightly depending on the case. Before uploading, we pivot the CSV files to a unified schema. Currently we use GitHub Actions to copy the data to a volume, and once all files are copied we start the ingestion job. The same can be done via manual upload and running the job manually.
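For illustration, the pre-upload pivot step looks roughly like this (the file and column names are made-up placeholders):

```python
# Rough sketch of the pre-upload pivot step; file and column names
# ("record_id", "attribute", "value") are made-up placeholders.
import pandas as pd

df = pd.read_csv("case_a.csv")

# Reshape the case-specific long format into the unified wide schema.
unified = (
    df.pivot_table(index="record_id", columns="attribute",
                   values="value", aggfunc="first")
      .reset_index()
)

unified.to_csv("case_a_unified.csv", index=False)
```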
The ingestion job is responsible for validation, data transformation (let's say normalization), and merging the data into the final table.
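To make this concrete, here is roughly what the job does today (table and column names are made up for illustration; `spark` is the session provided in Databricks notebooks/jobs):

```python
# Sketch of the ingestion job: validate -> normalize -> merge.
# Table and column names are made up; `spark` is provided by Databricks.
from pyspark.sql import functions as F

STAGING_TABLE = "my_catalog.my_schema.staging_csv"  # made-up name
FINAL_TABLE = "my_catalog.my_schema.final"          # made-up name

df = spark.table(STAGING_TABLE)

# Validation: reject rows without a business key.
df = df.filter(F.col("id").isNotNull())

# Normalization: trim strings and cast to the unified types.
df = (df.withColumn("name", F.trim("name"))
        .withColumn("amount", F.col("amount").cast("decimal(18,2)")))

df.createOrReplaceTempView("updates")

# Merge the cleaned batch into the final Delta table.
spark.sql(f"""
    MERGE INTO {FINAL_TABLE} AS t
    USING updates AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```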
We would like to automate our pipeline as much as we can. Our first thought is to trigger the job automatically as soon as new files are added to the volume. However, is there a possibility to know which files have been uploaded? As far as I can tell from browsing the documentation, it seems there is not. So I guess we have to create something like an audit table to track which files have already been ingested, correct?
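For context, this is a minimal sketch of the audit-table approach I have in mind (the catalog, schema, table, and path names are made up; `spark` is provided by Databricks, and volumes are FUSE-mounted so they can be listed with `os`):

```python
# Sketch of the audit-table idea: compare files in the volume against
# a Delta table of already-ingested files. All names are made up.
import os

VOLUME_PATH = "/Volumes/my_catalog/my_schema/landing"  # made-up path
AUDIT_TABLE = "my_catalog.my_schema.ingestion_audit"   # made-up name

spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {AUDIT_TABLE}
    (file_name STRING, processed_at TIMESTAMP)
""")

landed = {f for f in os.listdir(VOLUME_PATH) if f.endswith(".csv")}
seen = {r.file_name
        for r in spark.table(AUDIT_TABLE).select("file_name").collect()}

for name in sorted(landed - seen):
    df = spark.read.option("header", True).csv(f"{VOLUME_PATH}/{name}")
    # ... validate / normalize / merge as above ...
    spark.sql(
        f"INSERT INTO {AUDIT_TABLE} VALUES ('{name}', current_timestamp())"
    )
```

That way the job could run on any trigger (file arrival, schedule, or manual) and would simply skip files that are already recorded.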
If you have any suggestions on how to approach this data ingestion in general, I would be really thankful!
Thank you very much!