DLT - runtime parameterisation of execution
03-13-2024 12:24 PM
I have started to use DLT in a prototype framework and now face the challenge below, for which any help would be appreciated.
First let me give a brief context:
- I have metadata sitting in a .json file that I read as the first task and put it into a log table with all the relevant attributes (including the list of tables to be processed by the DLT pipeline)
- That log table has multiple records, including those of past executions, so I have to filter it down to the current one using a run identifier that includes a timestamp (e.g. IngestAdventureWorks_20240314)
- For that I need to pass that ID as a parameter to the DLT pipeline so it can be used in a SQL query to find the relevant records and build the list of tables to be processed.
- When I hardcode it as a key-value pair at design time, I can access the value easily using the spark.conf.get("ID", None) syntax, but I have not found a way to set that value dynamically at runtime when the pipeline is triggered.
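For reference, this is roughly the pattern in the DLT notebook (the log table and column names below are only illustrative, not my actual schema):

```python
import dlt
from pyspark.sql import functions as F

# Value set as a key-value pair in the pipeline's Configuration at design time,
# e.g. ID = IngestAdventureWorks_20240314
run_id = spark.conf.get("ID", None)

@dlt.table(name="tables_to_process", comment="Metadata rows for the current run")
def tables_to_process():
    # "meta.ingest_log" and "execution_id" are placeholder names for the
    # log table and its run-identifier column described above.
    return (
        spark.read.table("meta.ingest_log")
             .filter(F.col("execution_id") == run_id)
    )
```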
03-17-2024 01:25 PM
Thanks Kaniz for your response. It would have been great to have an approach similar to widgets in a normal notebook. Specifying these parameters at design time does not allow the flexibility needed to run my DLT pipeline in a truly metadata-driven way.
I was also leaning towards using the Jobs REST API from a notebook, but in the end I tweaked my configuration tables so that I can use a hardcoded parameter in the DLT definition and still keep the pipeline dynamic.
If the REST API call functionality could later be integrated into Workflows so that these values can be passed to DLT pipelines just as they are to other tasks, that would be really great!
I accept it as a solution because your third suggestion would work. I still hope a more integrated approach will come in the future.
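For anyone who does want to go down the REST route, a rough sketch of the idea is below. It uses the Delta Live Tables Pipelines API (not the Jobs API) to overwrite a configuration value and then trigger a run; the workspace URL, token and pipeline ID are placeholders, and the exact payloads should be checked against the current API docs:

```python
import requests

# Placeholders - supply your own workspace URL, token and pipeline ID.
HOST = "https://<workspace-url>"
TOKEN = "<personal-access-token>"
PIPELINE_ID = "<pipeline-id>"
headers = {"Authorization": f"Bearer {TOKEN}"}

# 1. Fetch the current pipeline spec.
spec = requests.get(
    f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}", headers=headers
).json()["spec"]

# 2. Overwrite the configuration key that the DLT code reads with spark.conf.get("ID", None).
spec.setdefault("configuration", {})["ID"] = "IngestAdventureWorks_20240314"
requests.put(f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}", headers=headers, json=spec)

# 3. Start a pipeline update with the new configuration.
requests.post(f"{HOST}/api/2.0/pipelines/{PIPELINE_ID}/updates", headers=headers, json={})
```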
08-27-2024 03:10 AM
Hi @MartinIsti, how did you manage to tweak the metadata to handle this dynamically? Could you please elaborate on what you described below?
"I ended up tweaking my configuration tables in a way that I can utilise a hardcoded parameter in the DLT definition and still have it dynamic."
08-27-2024 02:50 PM
Sure, and for the record I'm still not fully happy with how parameters need to be set at design time.
As mentioned, I store the metadata in a .json file that I read using a standard notebook. I then save its content to DBFS as a Delta table, overwriting any previous version. The DLT notebook reads from that table, and I only need to specify the name of the process (e.g. IngestAdventureWorks); that name matches the name of the DLT pipeline itself (or can be derived from it).
Once I determine which table to read from, the DLT pipeline can be driven by the metadata in that table, roughly as in the sketch below.
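A simplified version of the pattern (the column names and source format are only illustrative, not my exact schema):

```python
import dlt
from pyspark.sql import functions as F

# The only value fixed at design time: the process name, which matches
# (or can be derived from) the name of the DLT pipeline itself.
process_name = spark.conf.get("process_name", "IngestAdventureWorks")

# Metadata written by the preceding standard notebook as a Delta table
# ("meta.pipeline_config" is a placeholder name).
config_rows = (
    spark.read.table("meta.pipeline_config")
         .filter(F.col("process_name") == process_name)
         .collect()
)

def make_table(source_path: str, target_name: str):
    # Factory function so each generated table captures its own parameters.
    @dlt.table(name=target_name)
    def _load():
        return spark.read.format("parquet").load(source_path)
    return _load

# One DLT table per metadata row - the pipeline is driven entirely by the table content.
for row in config_rows:
    make_table(row["source_path"], row["target_table"])
```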
I still find working with DLT somewhat inconsistent with the orchestration of standard notebook-driven data handling; it is an odd one out that often needs a slightly different way of handling, but so far I have found a workaround for every one of these small inconsistencies.
08-28-2024 10:42 PM
@MartinIsti thanks for your detailed explanation.
03-29-2024 12:00 PM
@Retired_mod Can you please provide a reference for the REST API approach? I do not see it in the docs.
TIA