Hello, Databricks Community!
I'd like to ask for advice on how you approach reading and maintaining bucket files in Delta format created through a Fivetran Managed Data Lake Service (MDLS) destination.
I've set up a Fivetran destination to a GCP bucket, which stores the data in the layout below. Each table contains fewer than 50k rows.
example_bucket/landing/
+-- connector_1/
| +-- table_1/
| | +-- _delta_log/
| | +-- data/
| | +-- metadata/
| +-- table_2/
| | +-- _delta_log/
| | +-- data/
| | +-- metadata/
+-- connector_2/
| +-- ...
Now, to read these tables and add them to a pipeline that refreshes daily, I use this SQL syntax:
USE CATALOG example_catalog;
USE SCHEMA example_schema_1;
CREATE OR REFRESH MATERIALIZED VIEW table_1 AS
SELECT * FROM `delta`.`gs://example_bucket/landing/connector_1/table_1`;
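The same pattern repeats for every table, e.g. for table_2 from the layout above (same placeholder names):
-- one statement per table, repeated for each table the connector lands
CREATE OR REFRESH MATERIALIZED VIEW table_2 AS
SELECT * FROM `delta`.`gs://example_bucket/landing/connector_1/table_2`;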
I'm also aware that Fivetran has an option in their setup to update Unity Catalog for a Databricks workspace upon syncing to the bucket. However, I don't think this is applicable to me because I want to replicate this setup across 3 Databricks workspaces (dev, staging, prod) via Databricks Asset Bundles (DAB), and it wouldn't be cost-efficient to create multiple Fivetran connectors pointing to different Databricks workspaces. Therefore, I'm inclined to land the data in the bucket first and have all 3 workspaces reference the same bucket to read the data.
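For the workspaces to read that path, my assumption is that each metastore just needs an external location over the landing prefix, something along these lines (placeholder names, and assuming a GCP storage credential is already set up):
-- hypothetical names; one external location per metastore that reads the landing path
CREATE EXTERNAL LOCATION IF NOT EXISTS fivetran_landing
URL 'gs://example_bucket/landing'
WITH (STORAGE CREDENTIAL example_gcp_credential);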
With all that said, I'd like to ask:
- Is this good practice? What improvements could be made to this setup?
- Is it possible for Databricks to create an external catalog that simply references the whole bucket? Right now I'm creating the materialized views and schemas one by one, which is time-consuming to set up individually. What if a connector has 50+ tables?
- Would Auto Loader be better in this scenario?
Thanks in advance!