DLT - runtime parameterisation of execution
03-13-2024 12:24 PM
I have started to use DLT in a prototype framework and now face the challenge below, for which any help would be appreciated.
First let me give a brief context:
- I have metadata sitting in a .json file that I read as the first task and put it into a log table with all the relevant attributes (including the list of tables to be processed by the DLT pipeline)
- That log table has multiple records, including those of past executions, so I have to filter it down to the current one using a run identifier that includes a timestamp (e.g. IngestAdventureWorks_20240314)
- For that I need to pass that ID as a parameter to the DLT pipeline so it can be used in a SQL query to find the relevant records and build the list of tables to be processed.
- When I hardcode it as a key-value pair at design time, I can access the value easily using the spark.conf.get("ID", None) syntax, but I have not found a way to set that value dynamically at runtime when the pipeline is triggered.
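For reference, this is roughly the pattern in the DLT notebook (the log table and column names below are only illustrative, not my actual schema):

```python
import dlt
from pyspark.sql import functions as F

# Value set as a key-value pair in the pipeline's Configuration at design time,
# e.g. ID = IngestAdventureWorks_20240314
run_id = spark.conf.get("ID", None)

@dlt.table(name="tables_to_process", comment="Metadata rows for the current run")
def tables_to_process():
    # "meta.ingest_log" and "execution_id" are placeholder names for the
    # log table and its run-identifier column described above.
    return (
        spark.read.table("meta.ingest_log")
             .filter(F.col("execution_id") == run_id)
    )
```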
03-17-2024 01:25 PM
Thanks Kaniz for your response. It would have been great to have an approach similar to widgets in a normal notebook. Specifying these parameters at design time does not allow the flexibility needed to run my DLT pipeline in a truly metadata-driven way.
I was also leaning towards using the Jobs REST API from a notebook, but in the end I tweaked my configuration tables so that I can use a hardcoded parameter in the DLT definition and still keep the pipeline dynamic.
If the REST API call functionality could later be integrated into Workflows so that these values can be passed to DLT pipelines just as they are to other tasks, that would be really great!
I accept it as a solution because your third suggestion would work. I still hope a more integrated approach will come in the future.
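For anyone who does want to go down the REST route, a rough sketch of the idea is below. It uses the Delta Live Tables Pipelines API (not the Jobs API) to overwrite a configuration value and then trigger a run; the workspace URL, token and pipeline ID are placeholders, and the exact payloads should be checked against the current API docs:

```python
import requests

# Placeholders - supply your own workspace URL, token and pipeline ID.
HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"
PIPELINE_ID = "<pipeline-id>"
headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Fetch the current pipeline spec.
spec = requests.get(
    f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}", headers=headers
).json()["spec"]

# 2. Overwrite the configuration key that the DLT code reads with spark.conf.get("ID", None).
spec.setdefault("configuration", {})["ID"] = "IngestAdventureWorks_20240314"
requests.put(f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}", headers=headers, json=spec)

# 3. Start a pipeline update with the new configuration.
requests.post(f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}/updates", headers=headers, json={})
```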
08-27-2024 03:10 AM
Hi @MartinIsti, how did you manage to tweak the metadata to handle this dynamically? Could you please elaborate on what you described below?
"I ended up tweaking my configuration tables in a way that I can utilise a hardcoded parameter in the DLT definition and still have it dynamic."
08-27-2024 02:50 PM
Sure, and for the record I'm still not fully happy with how parameters need to be set at design time.
As mentioned, I store the metadata in a .json file that I read using a standard notebook. I then save its content to DBFS as a Delta table, overwriting any previous version. The DLT notebook reads from that table, and I only need to specify the name of the process (e.g. IngestAdventureWorks); that name matches the name of the DLT pipeline itself (or can be derived from it).
Once I determine which table to read from, the DLT pipeline can be driven by the metadata in that table, roughly as in the sketch below.
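A simplified version of the pattern (the column names and source format are only illustrative, not my exact schema):

```python
import dlt
from pyspark.sql import functions as F

# The only value fixed at design time: the process name, which matches
# (or can be derived from) the name of the DLT pipeline itself.
process_name = spark.conf.get("process_name", "IngestAdventureWorks")

# Metadata written by the preceding standard notebook as a Delta table
# ("meta.pipeline_config" is a placeholder name).
config_rows = (
    spark.read.table("meta.pipeline_config")
         .filter(F.col("process_name") == process_name)
         .collect()
)

def make_table(source_path: str, target_name: str):
    # Factory function so each generated table captures its own parameters.
    @dlt.table(name=target_name)
    def _load():
        return spark.read.format("parquet").load(source_path)
    return _load

# One DLT table per metadata row - the pipeline is driven entirely by the table content.
for row in config_rows:
    make_table(row["source_path"], row["target_table"])
```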
I still find working with DLT somewhat inconsistent with the orchestration of standard notebook-driven data handling; it is an odd one out that often needs a slightly different way of handling, but so far I have found a workaround for every one of these small inconsistencies.
08-28-2024 10:42 PM
@MartinIsti thanks for your detailed explanation.
03-29-2024 12:00 PM
@Retired_mod Can you please provide a reference for the REST API approach? I do not see it in the docs.
TIA