The Open Source DLT Meta Framework

RiyazAliM
Honored Contributor

DLT Meta is an open-source framework developed by Databricks Labs that enables the automation of bronze and silver data pipelines through metadata configuration rather than manual code development.

At its core, the framework uses a Dataflowspec - a JSON-based specification file that contains all the metadata needed to define source connections, target schemas, data quality rules, and transformation logic.

A high-level process flow is depicted below:

[Image: high-level DLT Meta process flow]

How DLT Meta Works: The framework operates through three key components:

1. Onboarding JSON (Dataflowspec):

This metadata file defines the source details and source format, along with the bronze, silver, and gold table details and their storage locations (catalog and schema).

Example:

{
  "tables": [
    {
      "source_format": "cloudFiles",
      "source_details": {
        "source_path": "/path/to/source",
        "source_schema_path": "/path/to/schema"
      },
      "target_format": "delta",
      "target_details": {
        "database": "bronze_db",
        "table": "customer_data"
      }
    }
  ]
}
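
For intuition, this bronze entry is a declarative stand-in for the Auto Loader ingestion you would otherwise hand-write in a DLT notebook. A rough hand-written equivalent is sketched below; the cloudFiles.format value and the schema-tracking location are assumptions for illustration, and the actual code DLT Meta generates may differ:

import dlt

@dlt.table(name="customer_data")  # target table from target_details; the database comes from the pipeline settings
def customer_data():
    return (
        spark.readStream.format("cloudFiles")                      # source_format: cloudFiles (Auto Loader)
        .option("cloudFiles.format", "json")                       # assumed file format
        .option("cloudFiles.schemaLocation", "/path/to/_schemas")  # hypothetical schema-tracking path
        .load("/path/to/source")                                   # source_path from source_details
    )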

2. Data Quality Expectations:

This is a separate JSON file that defines the quality rules applied to the bronze and bronze quarantine tables:

{
   "expect_or_drop": {
      "no_rescued_data": "_rescued_data IS NULL",
      "valid_id": "id IS NOT NULL",
      "valid_operation": "operation IN ('APPEND', 'DELETE', 'UPDATE')"
   },
   "expect_or_quarantine": {
      "quarantine_rule": "_rescued_data IS NOT NULL OR id IS NULL OR operation IS NULL"
   }
}
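
Under the hood, rules like these map onto standard DLT expectation decorators. A hand-written equivalent would look roughly like the sketch below; the table names and the upstream customer_data_raw source are hypothetical, and DLT Meta wires all of this up for you from the JSON:

import dlt

expect_or_drop = {
    "no_rescued_data": "_rescued_data IS NULL",
    "valid_id": "id IS NOT NULL",
    "valid_operation": "operation IN ('APPEND', 'DELETE', 'UPDATE')",
}
quarantine_rule = "_rescued_data IS NOT NULL OR id IS NULL OR operation IS NULL"

@dlt.table(name="customer_data")
@dlt.expect_all_or_drop(expect_or_drop)  # rows failing any rule are dropped from bronze
def customer_data():
    return dlt.read_stream("customer_data_raw")  # hypothetical upstream ingest, e.g. the Auto Loader read shown earlier

@dlt.table(name="customer_data_quarantine")
def customer_data_quarantine():
    # rows matching the quarantine rule are captured for later inspection
    return dlt.read_stream("customer_data_raw").where(quarantine_rule)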

3. Silver Transformations:

Business logic transformations, defined as SQL select expressions applied to the bronze tables to create the silver layer:

[
  {
    "target_table": "customers_silver",
    "select_exp": [
      "address",
      "email",
      "firstname",
      "id",
      "lastname",
      "operation_date",
      "operation",
      "_rescued_data"
    ]
  },
  {
    "target_table": "transactions_silver",
    "select_exp": [
      "id",
      "customer_id",
      "amount",
      "item_count",
      "operation_date",
      "operation",
      "_rescued_data"
    ]
  }
]
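
In effect, each silver entry becomes a selectExpr over the corresponding bronze table. A hand-written equivalent of the first entry might look like the sketch below; the customers_bronze source name is hypothetical:

import dlt

customers_select_exp = [
    "address", "email", "firstname", "id",
    "lastname", "operation_date", "operation", "_rescued_data",
]

@dlt.table(name="customers_silver")
def customers_silver():
    # each entry in select_exp may be a plain column name or a SQL expression
    return dlt.read_stream("customers_bronze").selectExpr(*customers_select_exp)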

Once you have created all of the JSONs, you can deploy them to build the spec tables using the onboard dataflow spec script in the src folder. I've created an onboarding job that passes the parameters to the notebook via dbutils widgets.

The onboarding notebook looks like this:

[Image: onboarding notebook]

The parameters passed to the onboarding job are as follows:

[Image: onboarding job parameters]
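
In case the screenshots don't render, here is a rough sketch of such an onboarding notebook. The widget names and the OnboardDataflowspec entry point follow the pattern documented in the dlt-meta repo, but treat the parameter map as illustrative and check the README of the version you install:

# Job parameters arrive via widgets and are handed to dlt-meta's onboarding
# entry point, which writes the bronze/silver spec tables.
onboarding_params_map = {
    "onboarding_file_path": dbutils.widgets.get("onboarding_file_path"),
    "database": dbutils.widgets.get("database"),
    "env": dbutils.widgets.get("env"),
    "bronze_dataflowspec_table": dbutils.widgets.get("bronze_dataflowspec_table"),
    "silver_dataflowspec_table": dbutils.widgets.get("silver_dataflowspec_table"),
    "import_author": dbutils.widgets.get("import_author"),
    "version": dbutils.widgets.get("version"),
    "overwrite": dbutils.widgets.get("overwrite"),
}

from src.onboard_dataflowspec import OnboardDataflowspec
OnboardDataflowspec(spark, onboarding_params_map).onboard_dataflow_specs()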

Once the onboarding job runs successfully, you'll have bronze, silver, and gold spec tables that your DLT pipeline takes as its configuration.

The typical end-to-end process looks like this:

[Image: typical onboarding process flow]

Let's proceed to create the DLT pipeline that executes the medallion flow defined in the onboarding JSON and stored in the spec tables.

The JSON config to create the DLT pipeline is as follows:

{
    "pipeline_type": "WORKSPACE",
    "clusters": [
        {
            "label": "default",
            "node_type_id": "Standard_D3_v2",
            "driver_node_type_id": "Standard_D3_v2",
            "num_workers": 1
        }
    ],
    "development": true,
    "continuous": false,
    "channel": "CURRENT",
    "photon": false,
    "libraries": [
        {
            "notebook": {
                "path": "path/to/the/dlt_meta_notebook"
            }
        }
    ],
    "name": "your_dlt_pipeline_name",
    "edition": "ADVANCED",
    "catalog": "catalog_name",
    "configuration": {
        "layer": "bronze_silver_gold",
        "bronze.dataflowspecTable": "<bronze_spec_table_details>",
        "bronze.group": "<dataflow_group_defined_in_the_onboarding>",
        "silver.dataflowspecTable": "<silver_spec_table_details>",
        "silver.group": "<dataflow_group_defined_in_the_onboarding>",
        "gold.dataflowspecTable": "<gold_spec_table_details>",
        "gold.group": "<dataflow_group_defined_in_the_onboarding>",
    },
    "schema": "<schema_name>"
}

Setting layer to bronze_silver_gold triggers all of the tables across the three layers defined in the spec tables.
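
If you'd rather create the pipeline programmatically than paste this JSON into the UI, the same settings can be submitted with the Databricks SDK for Python. This is a minimal sketch, assuming the SDK is installed and authenticated; the placeholder values simply mirror the JSON above:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import pipelines

w = WorkspaceClient()  # assumes standard SDK authentication (env vars or a configured profile)

created = w.pipelines.create(
    name="your_dlt_pipeline_name",
    edition="ADVANCED",
    catalog="catalog_name",
    target="<schema_name>",  # target schema for the pipeline's tables
    development=True,
    continuous=False,
    channel="CURRENT",
    photon=False,
    clusters=[
        pipelines.PipelineCluster(
            label="default",
            node_type_id="Standard_D3_v2",
            driver_node_type_id="Standard_D3_v2",
            num_workers=1,
        )
    ],
    libraries=[
        pipelines.PipelineLibrary(
            notebook=pipelines.NotebookLibrary(path="path/to/the/dlt_meta_notebook")
        )
    ],
    configuration={
        "layer": "bronze_silver_gold",
        "bronze.dataflowspecTable": "<bronze_spec_table_details>",
        "bronze.group": "<dataflow_group_defined_in_the_onboarding>",
        "silver.dataflowspecTable": "<silver_spec_table_details>",
        "silver.group": "<dataflow_group_defined_in_the_onboarding>",
        "gold.dataflowspecTable": "<gold_spec_table_details>",
        "gold.group": "<dataflow_group_defined_in_the_onboarding>",
    },
)
print(created.pipeline_id)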

The dlt_meta_notebook defined in the source code is shown below:

[Image: dlt_meta_notebook source]
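
In case the screenshot doesn't render, the notebook itself is only a few lines: it reads the layer value from the pipeline configuration and hands control to DLT Meta. The entry point below follows the pattern documented in the dlt-meta repo; double-check the import path against the version you're using:

# Sketch of the dlt_meta_notebook attached to the DLT pipeline.
# "layer" comes from the pipeline configuration above (bronze_silver_gold).
layer = spark.conf.get("layer", None)

from src.dataflow_pipeline import DataflowPipeline
DataflowPipeline.invoke_dlt_pipeline(spark, layer)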

When you finally start the pipeline, it will request resources from the cloud provider (or from Databricks, if it's serverless) and initiate the DAG for your pipeline.

The DAG for my use case, which is a combination of streaming tables and materialized views, looks like this:

[Image: pipeline DAG with streaming tables and materialized views]

If you want to check this out yourself, take a look at the Databricks Labs GitHub repo: https://github.com/databrickslabs/dlt-meta

Please let me know if you have any questions. Thank you!

 

 

Riz
4 REPLIES

Advika
Databricks Employee

Great breakdown of DLT Meta’s architecture and process flow. Thanks for sharing, @RiyazAliM!

RiyazAliM
Honored Contributor

Thank you @Advika 🙂

Riz

sridharplv
Valued Contributor II

Great article, Riyaz. Keep sharing more knowledge.

RiyazAliM
Honored Contributor

Thank you @sridharplv 

Riz