Pipelines using dlt modules from the Unity Catalog

rt-slowth — Wed, 03 Jan 2024 00:01:11 GMT

[Situation]
I am using AWS DMS to write MySQL CDC data to S3 as Parquet files.
I have implemented a streaming pipeline using the dlt module (Delta Live Tables).
The target destination is Unity Catalog.

[Questions and issues]
- Where are the tables and materialized views defined in Unity Catalog stored: in DBFS or in the metastore?
- Can I delete a Parquet file from S3 once it has been read by readStream?
- Is there a way to save a DataFrame that applies join and window operations to a table read with dlt.read from a streaming Delta Live Table as a table instead of a materialized view?
- The output of the @dlt.table decorator seems to be created as a materialized view. Is it possible to change it to a table?

You can answer them one by one.

Re: Pipelines using dlt modules from the Unity Catalog

rt-slowth — Thu, 04 Jan 2024 02:39:48 GMT

Here is the Delta Live Tables pipeline I am creating.

1. Import the CDC changes and the source data using AWS DMS.
2. After importing dlt, create a streaming table with dlt.create_streaming_table (SCD type 1).
3. Read the streaming table with dlt.read and perform operations such as joins.
4. Save the result of step 3 with the @dlt.table decorator.

Even though I specify the @dlt.table decorator in step 4, the result is saved as a materialized view.
I am currently using Unity Catalog.
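
For reference, steps 2 to 4 above can be sketched roughly as follows. This is a minimal sketch and not the actual pipeline code: the table names (orders, customers), the key columns, the sequencing column dms_timestamp, the S3 layout, and the use of Auto Loader (cloudFiles) to read the Parquet files are all assumptions made for illustration. The definitions are wrapped in a hypothetical helper, define_pipeline, that takes the dlt module and the Spark session as arguments only so that the snippet can be loaded outside a pipeline.

```python
def define_pipeline(dlt, spark, landing_root):
    """Register the datasets for steps 2-4.

    dlt          -- the Delta Live Tables module (import dlt)
    spark        -- the SparkSession provided by the pipeline
    landing_root -- hypothetical S3 prefix where AWS DMS writes Parquet files
    """

    def register_cdc_table(name, key):
        # Streaming view over the Parquet files that DMS wrote for this table.
        @dlt.view(name=f"{name}_cdc")
        def cdc_source():
            return (
                spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "parquet")
                .load(f"{landing_root}/{name}/")
            )

        # Step 2: streaming table kept up to date as SCD type 1.
        dlt.create_streaming_table(name)
        dlt.apply_changes(
            target=name,
            source=f"{name}_cdc",
            keys=[key],
            sequence_by="dms_timestamp",  # assumed ordering column
            stored_as_scd_type=1,
        )

    register_cdc_table("orders", "order_id")
    register_cdc_table("customers", "customer_id")

    # Steps 3 and 4: batch read with dlt.read, join, return from @dlt.table.
    # This is the dataset that shows up as a materialized view.
    @dlt.table(name="orders_enriched")
    def orders_enriched():
        orders = dlt.read("orders")
        customers = dlt.read("customers")
        return orders.join(customers, on="customer_id", how="left")
```

In a pipeline notebook, this would be called once at the top level after import dlt, passing the module, the pipeline's spark session, and the landing prefix.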
