Friday
AWS Databricks
I want to create data quality monitoring and an event-driven architecture that does not trigger on file arrival, but instead runs once at deploy time.
I plan to create a job that triggers once at deploy.
The job runs these tasks sequentially:
1. Run a script that creates external tables (if they do not exist) over the Delta-format data in S3 and registers them in the landing schema, configured with properties such as
delta.enableChangeDataFeed = true
delta.enableRowTracking = true
delta.enableDeletionVectors = true
to enable incremental updates in the downstream materialized views (a sketch follows after this list).
2. DLT task (see the sketch after this list)
- create materialized views in the bronze schema with expectations that warn on violation, triggered on update
- create materialized views in the silver schema with expectations that drop rows on violation, triggered on update
- create a materialized view for a data profile based on the DAMA framework, triggered on a schedule; it pulls the data quality metrics produced by the lakehouse monitoring feature
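Roughly what I have in mind for task 1, as a minimal sketch of the setup script; the catalog, schema, table name, and S3 path are placeholders, not real values:

```python
# Task 1 (setup script sketch): register an external Delta table over data that
# already exists in S3, then set the table properties for incremental refresh.
# "main.landing.events" and the S3 path are placeholder names for illustration.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.landing.events
    USING DELTA
    LOCATION 's3://my-bucket/landing/events'
""")

spark.sql("""
    ALTER TABLE main.landing.events SET TBLPROPERTIES (
        'delta.enableChangeDataFeed'  = 'true',
        'delta.enableRowTracking'     = 'true',
        'delta.enableDeletionVectors' = 'true'
    )
""")
```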
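And a sketch of the DLT task (task 2), with a warning expectation on bronze and a drop expectation on silver; as far as I understand, a batch @dlt.table definition materializes as a materialized view. The table names, columns, and constraints here are made up for illustration:

```python
import dlt
from pyspark.sql import functions as F

# Bronze materialized view: reads the landing external table registered in
# Unity Catalog (not the raw S3 path). The expectation only warns/logs.
@dlt.table(name="bronze_events", comment="Bronze layer over the landing table")
@dlt.expect("valid_event_id", "event_id IS NOT NULL")      # warn: keep rows, record violations
def bronze_events():
    return spark.read.table("main.landing.events")          # placeholder table name

# Silver materialized view: the expectation drops violating rows.
@dlt.table(name="silver_events", comment="Silver layer, cleaned")
@dlt.expect_or_drop("non_negative_amount", "amount >= 0")    # drop: remove violating rows
def silver_events():
    return dlt.read("bronze_events").withColumn("processed_at", F.current_timestamp())
```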
Does this make sense, and is it realistic?
15 hours ago
I found out that a materialized view can't be incrementally updated when it references an external location.
This architecture doesn't work.
15 hours ago
@tana_sakakimiya just out of curiosity, where did you find this out?
I'm looking at the docs right now for incremental refreshes for materialized views: https://docs.databricks.com/aws/en/optimizations/incremental-refresh
This section seems to say external tables are supported?
Could you point to where it says otherwise? I appreciate I might be chucking a red herring out there 🙂.
All the best,
BS
14 hours ago
I found it in the Azure documentation:
Incremental refresh for materialized views - Azure Databricks | Microsoft Learn
I'm not sure whether I misunderstand it or not.
It says:
"Sources such as volumes, external locations, and foreign catalogs are not supported."
So I think an external table is not supported. What do you think?
Thank you.
14 hours ago
Maybe it works only when the data stored in S3 is in Delta format.
12 hours ago - last edited 12 hours ago
@tana_sakakimiya I do understand the confusion. In the screenshot I sent you, it looks like it should work. In the screenshot you sent me, it looks like it shouldn't work. I guess our best hope is Delta format if using an External Location.
Could you give that a try and see if it gives some success 🙏. Fingers crossed.
All the best,
BS
12 hours ago
@tana_sakakimiya ah, I think I see the difference.
My screenshot says that "external tables" backed by Delta Lake will work. This means you'll need to have the table already created in Databricks from your external location, i.e. make an external table.
Perhaps you could include that as part of your pipeline? External Location -> External Table -> Execute Rest of Pipeline 🤔.
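For example (all names and paths below are placeholders), the difference would look roughly like this: the materialized view selects from the registered table, not from the path:

```python
import dlt

# Placeholder names. The point is that the materialized view's source is the
# external table registered in Unity Catalog, not the external location itself.
@dlt.table(name="bronze_events")
def bronze_events():
    # Source is a registered Delta table, which is the case the AWS doc page
    # lists as supported for incremental refresh.
    return spark.read.table("main.landing.events")
    # Reading the path directly instead, e.g.
    # spark.read.format("delta").load("s3://my-bucket/landing/events"),
    # would be the "external location" case that the Azure doc note says
    # is not supported.
```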
All the best,
BS