Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Parameterized Delta live table pipeline

Edthehead
New Contributor III

I'm trying to create an ETL framework on Delta Live Tables and essentially reuse the same pipeline for all the transformations from bronze to silver to gold.

This works absolutely fine when I hard-code the tables and the SQL transformations as an array within the notebook itself. Now I need to move this config to another location so it can be maintained without touching the notebook.

I can't use a JSON file because I may have multiple transformations that need to be executed in sequence, so I need to sort the transformations from the config and then execute them one by one.

When I try putting the config in another Delta table, I get an error while trying to convert it into a pandas DataFrame to iterate over the rows.

I referred to this article https://docs.databricks.com/en/delta-live-tables/create-multiple-tables.html but even there the config is hard-coded in the pipeline. Is there an example of this kind of use case, or is there an alternative way to source the config from outside the DLT pipeline?
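To make the goal concrete, here's a stripped-down sketch of the pattern I'm after. It runs outside Databricks, so the dlt/spark calls are only shown in comments; the table names, sequence column, and SQL statements are placeholders, not my real config:

```python
# Config rows as they might come back from
# spark.table("etl_meta.transform_config").collect() at the top of the
# pipeline notebook (outside any @dlt.table function). The schema
# (target_table, seq, sql_stmt) and all names are placeholders.
config_rows = [
    {"target_table": "gold_orders_daily", "seq": 2,
     "sql_stmt": "SELECT order_date, count(*) AS n "
                 "FROM LIVE.silver_orders GROUP BY order_date"},
    {"target_table": "silver_orders", "seq": 1,
     "sql_stmt": "SELECT * FROM LIVE.bronze_orders "
                 "WHERE order_id IS NOT NULL"},
]

# Sort by the sequence column so dependent transformations are
# registered in the right order.
ordered = sorted(config_rows, key=lambda r: r["seq"])

def register_table(target, sql):
    # Inside a real DLT pipeline this body would be:
    #     @dlt.table(name=target)
    #     def _t():
    #         return spark.sql(sql)
    # Here it just returns the pair so the sketch runs anywhere.
    return (target, sql)

plan = [register_table(r["target_table"], r["sql_stmt"]) for r in ordered]
```

The hard-coded `config_rows` list is exactly what I want to replace with rows read from a Delta table.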

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @Edthehead, configuring your ETL framework for Delta Live Tables (DLT) can be done in a flexible and maintainable way. Let’s explore some options:

  1. Pipeline Settings in DLT:

    • DLT provides a user-friendly interface for configuring pipeline settings. You can use the UI to define and edit these settings.
    • Additionally, you have the option to display and edit settings in JSON format.
    • Most settings can be configured either through the UI or by specifying a JSON configuration.
    • Some advanced options are only available via JSON configuration.
    • Databricks recommends starting with the UI to familiarize yourself with the settings, but you can directly edit the JSON configuration in the workspace if needed.
    • JSON configuration files are useful when deploying pipelines to new environments or when using the CLI or REST API.
    • For a comprehensive reference to DLT JSON configuration settings, refer to the Delta Live Tables pipeline configurations documentation.
  2. Product Edition Selection:

    • Choose the appropriate DLT product edition based on your pipeline requirements:
      • Core: Suitable for streaming ingest workloads without advanced features like change data capture (CDC) or DLT expectations.
      • Pro: Supports streaming ingest and CDC workloads, including updating tables based on changes in source data.
      • Advanced: Includes Core and Pro features, plus support for enforcing data quality constraints using DLT expectations.
  3. Pipeline Source Code:

    • You can configure the source code defining your pipeline using the file selector in the DLT UI.
    • Pipeline source code can be defined in Databricks notebooks or in SQL/Python scripts stored in workspace files.
    • This approach allows you to separate the configuration from the notebook itself, making it easier to maintain without touching the notebook directly.
  4. Parameterization:

    • Consider parameterizing your pipeline settings.
    • You can pass configuration parameters to your pipeline, even if you’re using serverless compute resources.
    • While compute settings like Enhanced Autoscaling, cluster policies, and instance types are not available for serverless pipelines, you can still set parameters in the JSON configuration.
  5. Cloud Storage Configuration:

    • Specify the storage location for your pipeline output tables.
    • Ensure that the target schema for these tables is well-defined.
  6. Compute Settings:

    • Configure compute settings such as instance types and autoscaling.
    • Note that serverless pipelines have fully managed compute resources, so some settings may not apply.
  7. Pipeline Trigger Interval:

    • Set the interval at which your pipelines trigger.
    • This ensures timely execution of transformations.
  8. Error Handling and Notifications:

    • Add email notifications for pipeline events.
    • Control tombstone management for SCD type 1 queries.
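On point 4 (Parameterization): a parameter set in the pipeline's JSON settings under `"configuration"` can be read back in the notebook with `spark.conf.get`. A minimal sketch of that pattern — the key name `etl.config_table` is an assumption, and a plain dict stands in for `spark.conf` so the sketch runs outside Databricks:

```python
# In the pipeline's JSON settings you would set something like:
#     "configuration": { "etl.config_table": "etl_meta.transform_config" }
# and the notebook would read it back with spark.conf.get(key, default).
# _FakeConf mimics that lookup so the example is self-contained; the key
# and table names are assumptions for illustration only.
class _FakeConf:
    def __init__(self, values):
        self._values = values

    def get(self, key, default=None):
        return self._values.get(key, default)

conf = _FakeConf({"etl.config_table": "etl_meta.transform_config"})

# In a real notebook: config_table = spark.conf.get("etl.config_table", ...)
config_table = conf.get("etl.config_table", "etl_meta.transform_config")
```

This keeps the location of the metadata table out of the notebook entirely, so the same source code can serve several pipelines.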

Remember that DLT is designed to simplify ETL workflows, and its declarative approach allows you to focus on defining the desired target state. By leveraging the right settings and separating configuration from code, you can create a robust and maintainable ETL framework. If you encounter specific issues while converting to a pandas DataFrame, feel free to share more details, and I’ll be happy to assist! 🚀

 

Thanks, but I found what I was looking for in THIS article. It shows how to access a metadata Delta table that is maintained outside the pipeline and convert the data into a dictionary that can drive different processing within the pipeline.
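In case it helps the next person: the core of that approach is collecting the metadata rows once, outside any table definition, and turning them into a plain dict the pipeline code can consume. A rough sketch, with the `spark.table` call shown in a comment and placeholder rows standing in so it runs anywhere (all names are made up):

```python
# In the real pipeline the rows would come from the metadata table:
#     rows = spark.table("etl_meta.transform_config").collect()
# Placeholder (target, sql) tuples stand in here for illustration.
rows = [
    ("silver_orders", "SELECT * FROM LIVE.bronze_orders"),
    ("gold_orders", "SELECT * FROM LIVE.silver_orders"),
]

# One dict entry per target table: name -> transformation SQL.
config = {target: sql for target, sql in rows}

# Each entry can then drive a generated table definition, e.g.
#     for target, sql in config.items():
#         create_live_table(target, sql)   # hypothetical helper wrapping @dlt.table
```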
