Re: How to create a DLT pipeline with SQL statemen...

guangyi · ‎09-06-2024

I need a DLT pipeline to create a materialized view for fetching event logs. All the ways below I tried are failed:

Attach a notebook with pure SQL inside: No magic cell like `%sql` are failed
Attach a notebook with `spark.sql` python code: Failed because I cannot use the `CREATE VIEW` statement.
Attach a file with `.sql` format: Failed because sql format is not supported yet.

I follow the instructions here to query event log: https://docs.databricks.com/en/delta-live-tables/observability.html#monitor-delta-live-tables-pipeli...

How can I achieve this with the DLT pipeline?

szymon_dybczak · ‎09-06-2024

Hi @guangyi ,

It should be sufficient to attach notebook with sql language set as a default. What error did you get?

filipniziol · ‎09-06-2024

Hi @guangyi ,
As @szymon_dybczak informed, .py and .sql formats are supported. The 2 first options you mentioned are not supported: you cannot use SQL in .py files, but SQL commands supported when .sql file is used.

Check the docs:

https://learn.microsoft.com/en-us/azure/databricks/delta-live-tables/tutorial-pipelines#sql-example

View solution in original post

guangyi · ‎09-08-2024

Hi @szymon_dybczak @filipniziol

Here is how I implement the 3rd option and how it failed:

I create a JSON file with the DLT pipeline definition inside

{
    "name": "query_data_quality_event_log_pipeline",
    "clusters": [
        {
            "label": "default",
            "spark_conf": {
                "spark.databricks.acl.needAdminPermissionToViewLogs": "false"
            },
            "policy_id": "xxxxxx",
            "autoscale": {
                "min_workers": 1,
                "max_workers": 2,
                "mode": "ENHANCED"
            }
        },
        {
            "label": "maintenance",
            "policy_id": "xxxxxx"
        }
    ],
    "development": true,
    "continuous": false,
    "channel": "PREVIEW",
    "edition": "CORE",
    "catalog": "xxxxxx",
    "target": "xxxxxx",
    "libraries": [
        {
            "notebook": {
                "path": "/Workspace/Users/xxx@xxx/query_data_quality_event_log.sql"
            }
        }
    ]
}

Then create the pipeline via Databricks CLI

databricks pipelines create --json "$(cat single-dlt.json)" -p PID

The pipeline can be created successfully. However, when I clicked the running button, It show me this result:

BAD_REQUEST: Failed to load notebook '/Workspace/Users/xxx@xxx/query_data_quality_event_log.sql'. Only SQL and Python notebooks are supported currently.

guangyi · ‎09-08-2024

After just finishing my last reply, I realized what’s wrong with my code: I should use “file” property instead of “notebook” in the libraries section.

It works now. Thank you guys, you are my rubber duck!

How to create a DLT pipeline with SQL statement