Parametrize the DLT pipeline for dynamic loading of many tables
03-21-2024 10:53 AM
I am trying to ingest hundreds of tables with CDC, and I want to create a generic/dynamic pipeline which can accept parameters (e.g. table_name, schema, file path) and run the same logic for each table. However, I am not able to find a way to pass parameters to the pipeline.
PS: I am aware of using a metadata table to iterate; however, I intend to trigger the pipeline via the REST API and pass parameters to it at that point.
Also, I cannot use pipeline configurations for this, as they cannot be set dynamically right before triggering execution.
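For context, here is a minimal sketch of the kind of generic, parameterized CDC definition described above. The function name ingest_cdc_table, the key column id, the ordering column _commit_timestamp, and the Auto Loader options are all illustrative assumptions, not something taken from the thread.

import dlt
from pyspark.sql.functions import col

def ingest_cdc_table(table_name, schema, path):
    # Hypothetical generic CDC flow for one source table.
    @dlt.view(name=f"{table_name}_changes")
    def changes():
        # Auto Loader over the table's change feed; the file format is an assumption.
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .schema(schema)
            .load(path)
        )

    dlt.create_streaming_table(name=table_name)

    dlt.apply_changes(
        target=table_name,
        source=f"{table_name}_changes",
        keys=["id"],                           # assumed primary key
        sequence_by=col("_commit_timestamp"),  # assumed ordering column
    )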
- Labels:
- Delta Lake
- Workflows
03-21-2024 03:32 PM
If you have a different folder for each of your source tables, you can use Python loops to iterate over the folders naturally.
To do this, create a create_pipeline function that takes table_name, schema, and path as parameters. Inside this function, put the DLT function that creates your raw or bronze table, using those parameters.
You can then simply call that function in a loop over each folder in your path using dbutils:
# Iterate over every source folder and register a table for each one.
for folder in dbutils.fs.ls("<your path>"):
    table_name = folder.name[:-1]  # folder.name ends with "/", so trim it
    create_pipeline(table_name, schema, folder.path)
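For reference, a rough sketch of what such a create_pipeline function could look like; the bronze_ prefix, the Parquet format, and the Auto Loader options are assumptions rather than anything stated in this reply.

import dlt

def create_pipeline(table_name, schema, path):
    # Hypothetical bronze-table definition; adjust the format/options to your source files.
    @dlt.table(name=f"bronze_{table_name}")
    def bronze():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .schema(schema)
            .load(path)
        )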
03-22-2024 04:14 AM
Hi @Gilg, thank you for your response.
However, as I am working with Unity Catalog, this solution might not be suitable for me. Also, the plan is to use another orchestrator to trigger the jobs, so the parameters need to be passed in separately.
03-22-2024 06:30 AM
The method I mentioned will definitely work in workspaces that have UC enabled, as I am doing the same thing.
Also, I think I misinterpreted what you mean by schema: are you talking about the catalog schema or the data schema? If it is the catalog schema, then you just need to remove it from the functions and create a separate pipeline for the tables in each catalog schema.
I'm not sure which orchestration tool you are planning to use, but this will work with Databricks Workflows and ADF. You can even build your own ETL framework using Databricks.
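As a rough illustration of that adjustment, assuming the schema argument in the earlier snippet referred to the catalog schema and has been dropped from create_pipeline, while the target schema is fixed in each pipeline's settings:

# One pipeline per catalog schema: the target schema comes from the pipeline
# settings, so create_pipeline only needs the table name and its source path.
for folder in dbutils.fs.ls("<your path>"):
    table_name = folder.name[:-1]
    create_pipeline(table_name, folder.path)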

