04-06-2022 03:47 PM
I am building an ETL pipeline which reads data from a Kafka topic (the data is serialized in Thrift format) and writes it to a Delta table in Databricks. I want to have two layers:
Bronze Layer -> which has raw Kafka data
Silver Layer -> which has deserialized data
I can think of two ways to do it:
The first way is to read data from Kafka, write the raw data to bronze, then read the data back from bronze, decode it, and write it to silver.
The second way is to read data from Kafka and, in a single stream, write the raw data to bronze while simultaneously decoding it and writing it to silver.
I am trying to understand the advantages and disadvantages of each solution. Solution two is much easier to implement, but it feels like solution one is more fault tolerant.
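For reference, here is a rough PySpark Structured Streaming sketch of the first approach (Kafka -> bronze, then bronze -> silver). The broker address, topic, table names, checkpoint paths, and the decode_thrift helper are all placeholders, not anything from the actual pipeline:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hop 1: land the raw Kafka records in the bronze table unchanged (key/value stay binary).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .load()
)

(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/bronze")       # placeholder path
    .toTable("bronze.events_raw"))

# Hypothetical UDF wrapping the Thrift deserializer.
@F.udf(StringType())
def decode_thrift(value):
    # Plug in the actual Thrift deserialization of the Kafka value bytes here.
    return None

# Hop 2: stream from bronze, decode, and append to silver.
decoded = (
    spark.readStream
    .table("bronze.events_raw")
    .withColumn("payload", decode_thrift(F.col("value")))
)

(decoded.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/silver")       # placeholder path
    .toTable("silver.events"))
```

The second approach would instead feed a single Kafka read into a foreachBatch sink that writes the raw batch to bronze and the decoded batch to silver in one stream; the tradeoff is that bronze and silver then share one checkpoint and failure domain, whereas the two-hop version lets silver be rebuilt later by replaying bronze.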
Accepted Solutions
04-07-2022 03:15 AM
@John Constantine, "Bronze Layer -> which has raw Kafka data"
If you use confluent.io, you can also utilize a direct sink to Data Lake Storage for the bronze layer.
"Silver Layer -> which has deserialized data"
Then use Delta Live Tables to process it into the silver Delta table (file notification mode is recommended).
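A minimal Delta Live Tables sketch of that flow, assuming the Confluent sink connector has already landed raw files under a cloud storage path; the path, the file format, and the decode_thrift helper are placeholders you would replace with your own:

```python
import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hypothetical UDF wrapping the Thrift deserializer.
@F.udf(StringType())
def decode_thrift(value):
    # Plug in the actual Thrift deserialization here.
    return None

@dlt.table(name="events_bronze", comment="Raw files landed by the Confluent sink connector.")
def events_bronze():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")          # match the sink's output format
        .option("cloudFiles.useNotifications", "true")   # file notification mode
        .load("abfss://landing@yourstorage.dfs.core.windows.net/events/")  # placeholder path
    )

@dlt.table(name="events_silver", comment="Deserialized Thrift payloads.")
def events_silver():
    return (
        dlt.read_stream("events_bronze")
        .withColumn("payload", decode_thrift(F.col("value")))
    )
```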