The Open Source DLT Meta Framework

RiyazAliM
Honored Contributor

DLT Meta is an open-source framework developed by Databricks Labs that enables the automation of bronze and silver data pipelines through metadata configuration rather than manual code development.

At its core, the framework uses a Dataflowspec - a JSON-based specification file that contains all the metadata needed to define source connections, target schemas, data quality rules, and transformation logic.

A high-level process flow is depicted below:

[Image: high-level DLT Meta process flow]

How DLT Meta Works: The framework operates through three key components:

1. Onboarding JSON (Dataflowspec):

This metadata file defines the source details and source format, along with the bronze, silver, and gold table details and their storage locations (catalog and schema).

Example:

{
  "tables": [
    {
      "source_format": "cloudFiles",
      "source_details": {
        "source_path": "/path/to/source",
        "source_schema_path": "/path/to/schema"
      },
      "target_format": "delta",
      "target_details": {
        "database": "bronze_db",
        "table": "customer_data"
      }
    }
  ]
}
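
For intuition, this bronze entry is a declarative stand-in for the Auto Loader ingestion you would otherwise hand-write in a DLT notebook. A rough hand-written equivalent is sketched below; the cloudFiles.format value and the schema-tracking location are assumptions for illustration, and the actual code DLT Meta generates may differ:

import dlt

@dlt.table(name="customer_data")  # target table from target_details; the database comes from the pipeline settings
def customer_data():
    return (
        spark.readStream.format("cloudFiles")                      # source_format: cloudFiles (Auto Loader)
        .option("cloudFiles.format", "json")                       # assumed file format
        .option("cloudFiles.schemaLocation", "/path/to/_schemas")  # hypothetical schema-tracking path
        .load("/path/to/source")                                   # source_path from source_details
    )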

2. Data Quality Expectations:

This is a separate JSON file that defines the quality rules applied to the bronze and bronze quarantine tables:

{
   "expect_or_drop": {
      "no_rescued_data": "_rescued_data IS NULL",
      "valid_id": "id IS NOT NULL",
      "valid_operation": "operation IN ('APPEND', 'DELETE', 'UPDATE')"
   },
   "expect_or_quarantine": {
      "quarantine_rule": "_rescued_data IS NOT NULL OR id IS NULL OR operation IS NULL"
   }
}
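
Under the hood, rules like these map onto standard DLT expectation decorators. A hand-written equivalent would look roughly like the sketch below; the table names and the upstream customer_data_raw source are hypothetical, and DLT Meta wires all of this up for you from the JSON:

import dlt

expect_or_drop = {
    "no_rescued_data": "_rescued_data IS NULL",
    "valid_id": "id IS NOT NULL",
    "valid_operation": "operation IN ('APPEND', 'DELETE', 'UPDATE')",
}
quarantine_rule = "_rescued_data IS NOT NULL OR id IS NULL OR operation IS NULL"

@dlt.table(name="customer_data")
@dlt.expect_all_or_drop(expect_or_drop)  # rows failing any rule are dropped from bronze
def customer_data():
    return dlt.read_stream("customer_data_raw")  # hypothetical upstream ingest, e.g. the Auto Loader read shown earlier

@dlt.table(name="customer_data_quarantine")
def customer_data_quarantine():
    # rows matching the quarantine rule are captured for later inspection
    return dlt.read_stream("customer_data_raw").where(quarantine_rule)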

3. Silver Transformations:

Business logic transformations, defined as SQL select expressions applied to the bronze tables to create the silver layer:

[
  {
    "target_table": "customers_silver",
    "select_exp": [
      "address",
      "email",
      "firstname",
      "id",
      "lastname",
      "operation_date",
      "operation",
      "_rescued_data"
    ]
  },
  {
    "target_table": "transactions_silver",
    "select_exp": [
      "id",
      "customer_id",
      "amount",
      "item_count",
      "operation_date",
      "operation",
      "_rescued_data"
    ]
  }
]
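
In effect, each silver entry becomes a selectExpr over the corresponding bronze table. A hand-written equivalent of the first entry might look like the sketch below; the customers_bronze source name is hypothetical:

import dlt

customers_select_exp = [
    "address", "email", "firstname", "id",
    "lastname", "operation_date", "operation", "_rescued_data",
]

@dlt.table(name="customers_silver")
def customers_silver():
    # each entry in select_exp may be a plain column name or a SQL expression
    return dlt.read_stream("customers_bronze").selectExpr(*customers_select_exp)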

Once you have created all of the JSONs, you can deploy them to build the spec tables using the onboard dataflow spec script in the src folder. I've created an onboarding job that passes the parameters to the notebook via dbutils widgets.

The onboarding notebook looks like this:

[Image: onboarding notebook]

The parameters passed to the onboarding job are as follows:

[Image: onboarding job parameters]
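
In case the screenshots don't render, here is a rough sketch of such an onboarding notebook. The widget names and the OnboardDataflowspec entry point follow the pattern documented in the dlt-meta repo, but treat the parameter map as illustrative and check the README of the version you install:

# Job parameters arrive via widgets and are handed to dlt-meta's onboarding
# entry point, which writes the bronze/silver spec tables.
onboarding_params_map = {
    "onboarding_file_path": dbutils.widgets.get("onboarding_file_path"),
    "database": dbutils.widgets.get("database"),
    "env": dbutils.widgets.get("env"),
    "bronze_dataflowspec_table": dbutils.widgets.get("bronze_dataflowspec_table"),
    "silver_dataflowspec_table": dbutils.widgets.get("silver_dataflowspec_table"),
    "import_author": dbutils.widgets.get("import_author"),
    "version": dbutils.widgets.get("version"),
    "overwrite": dbutils.widgets.get("overwrite"),
}

from src.onboard_dataflowspec import OnboardDataflowspec
OnboardDataflowspec(spark, onboarding_params_map).onboard_dataflow_specs()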

Once the onboarding job runs successfully, you'll have bronze, silver, and gold spec tables that your DLT pipeline takes as its configuration.

The typical end-to-end process looks like this:

[Image: typical onboarding process flow]

Let's proceed to create the DLT pipeline that executes the medallion flow defined in the onboarding JSON and stored in the spec tables.

The JSON config to create the DLT pipeline is as follows:

{
    "pipeline_type": "WORKSPACE",
    "clusters": [
        {
            "label": "default",
            "node_type_id": "Standard_D3_v2",
            "driver_node_type_id": "Standard_D3_v2",
            "num_workers": 1
        }
    ],
    "development": true,
    "continuous": false,
    "channel": "CURRENT",
    "photon": false,
    "libraries": [
        {
            "notebook": {
                "path": "path/to/the/dlt_meta_notebook"
            }
        }
    ],
    "name": "your_dlt_pipeline_name",
    "edition": "ADVANCED",
    "catalog": "catalog_name",
    "configuration": {
        "layer": "bronze_silver_gold",
        "bronze.dataflowspecTable": "<bronze_spec_table_details>",
        "bronze.group": "<dataflow_group_defined_in_the_onboarding>",
        "silver.dataflowspecTable": "<silver_spec_table_details>",
        "silver.group": "<dataflow_group_defined_in_the_onboarding>",
        "gold.dataflowspecTable": "<gold_spec_table_details>",
        "gold.group": "<dataflow_group_defined_in_the_onboarding>",
    },
    "schema": "<schema_name>"
}

Setting layer to bronze_silver_gold triggers all of the tables across the three layers defined in the spec tables.
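
If you'd rather create the pipeline programmatically than paste this JSON into the UI, the same settings can be submitted with the Databricks SDK for Python. This is a minimal sketch, assuming the SDK is installed and authenticated; the placeholder values simply mirror the JSON above:

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import pipelines

w = WorkspaceClient()  # assumes standard SDK authentication (env vars or a configured profile)

created = w.pipelines.create(
    name="your_dlt_pipeline_name",
    edition="ADVANCED",
    catalog="catalog_name",
    target="<schema_name>",  # target schema for the pipeline's tables
    development=True,
    continuous=False,
    channel="CURRENT",
    photon=False,
    clusters=[
        pipelines.PipelineCluster(
            label="default",
            node_type_id="Standard_D3_v2",
            driver_node_type_id="Standard_D3_v2",
            num_workers=1,
        )
    ],
    libraries=[
        pipelines.PipelineLibrary(
            notebook=pipelines.NotebookLibrary(path="path/to/the/dlt_meta_notebook")
        )
    ],
    configuration={
        "layer": "bronze_silver_gold",
        "bronze.dataflowspecTable": "<bronze_spec_table_details>",
        "bronze.group": "<dataflow_group_defined_in_the_onboarding>",
        "silver.dataflowspecTable": "<silver_spec_table_details>",
        "silver.group": "<dataflow_group_defined_in_the_onboarding>",
        "gold.dataflowspecTable": "<gold_spec_table_details>",
        "gold.group": "<dataflow_group_defined_in_the_onboarding>",
    },
)
print(created.pipeline_id)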

The dlt_meta_notebook defined in the source code is shown below:

[Image: dlt_meta_notebook source]
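
In case the screenshot doesn't render, the notebook itself is only a few lines: it reads the layer value from the pipeline configuration and hands control to DLT Meta. The entry point below follows the pattern documented in the dlt-meta repo; double-check the import path against the version you're using:

# Sketch of the dlt_meta_notebook attached to the DLT pipeline.
# "layer" comes from the pipeline configuration above (bronze_silver_gold).
layer = spark.conf.get("layer", None)

from src.dataflow_pipeline import DataflowPipeline
DataflowPipeline.invoke_dlt_pipeline(spark, layer)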

When you finally start the pipeline, it will request resources from the cloud provider (or from Databricks, if it's serverless) and initiate the DAG for your pipeline.

The DAG for my use case, which is a combination of streaming tables and materialized views, looks like this:

[Image: pipeline DAG with streaming tables and materialized views]

If you want to check this out yourself, take a look at the Databricks Labs GitHub repo: https://github.com/databrickslabs/dlt-meta

Please let me know if you have any questions. Thank you!

 

 

Riz
4 REPLIES

Advika
Databricks Employee

Great breakdown of DLT Meta’s architecture and process flow. Thanks for sharing, @RiyazAliM!

RiyazAliM
Honored Contributor

Thank you @Advika 🙂

Riz

sridharplv
Valued Contributor II

Great article, Riyaz. Keep sharing more knowledge.

RiyazAliM
Honored Contributor

Thank you @sridharplv 

Riz