11-07-2024 11:28 AM
I have created an Auto Loader job that reads data from S3 (files with no extension) containing JSON, using cloudFiles.format = text. The job is supposed to run every 4 hours and read all the new data that has arrived. But before writing into a Delta table, I want to make sure only one occurrence of each id, the one with the max timestamp, gets written to the Delta table.
I am doing this per micro batch using row_number() == 1. But it gives me the max-timestamp occurrence within the micro batch, not across the entire data read when the script is run.
How do I tackle this? My dataset is huge, so I can't use maxFiles or maxBytes. Also, I need to append, not merge, into the final Delta table.
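For reference, a minimal sketch of what the setup described above might look like; the S3 paths, checkpoint location, schema, and the `id`/`timestamp` column names are placeholders I've assumed, not details from the original post:

```python
# Minimal sketch of the job described above; bucket paths, checkpoint location,
# schema and the `id`/`timestamp` column names are assumptions for illustration.
from pyspark.sql import functions as F, Window

schema = "id STRING, timestamp TIMESTAMP, payload STRING"  # placeholder schema

raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "text")      # files have no extension, read each line as text
       .load("s3://my-bucket/landing/"))         # placeholder path

parsed = raw.select(F.from_json("value", schema).alias("j")).select("j.*")

def dedupe_and_append(batch_df, batch_id):
    # row_number() only sees the rows of THIS micro-batch, which is why
    # duplicates can still appear across batches within the same 4-hour run.
    w = Window.partitionBy("id").orderBy(F.col("timestamp").desc())
    (batch_df
     .withColumn("rn", F.row_number().over(w))
     .filter("rn = 1")
     .drop("rn")
     .write.format("delta").mode("append").saveAsTable("target_table"))

(parsed.writeStream
 .foreachBatch(dedupe_and_append)
 .option("checkpointLocation", "s3://my-bucket/checkpoints/target_table/")  # placeholder
 .trigger(availableNow=True)   # run as a one-shot batch every 4 hours
 .start())
```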
11-07-2024 11:02 PM
@hkmodi
That's what Medallion Architecture was designed for.
Basically, you load everything you get into the Bronze layer (meaning that you'll have duplicates in there), then do the deduplication when loading the data into the Silver layer.
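A rough sketch of that Bronze-to-Silver flow; the table names, paths, schema, and the `id`/`timestamp` columns are assumptions on my side, not part of the original question:

```python
# Sketch of the Bronze -> Silver pattern described above; table names, paths,
# schema and the `id`/`timestamp` columns are assumptions, not from the post.
from pyspark.sql import functions as F, Window

# 1) Bronze: land everything exactly as received (duplicates included).
bronze_query = (spark.readStream
                .format("cloudFiles")
                .option("cloudFiles.format", "text")
                .load("s3://my-bucket/landing/")
                .writeStream
                .option("checkpointLocation", "s3://my-bucket/checkpoints/bronze/")
                .trigger(availableNow=True)
                .toTable("bronze_events"))
bronze_query.awaitTermination()   # availableNow stops once the backlog is processed

# 2) Silver: parse the JSON and keep only the latest row per id.
schema = "id STRING, timestamp TIMESTAMP, payload STRING"  # placeholder schema
w = Window.partitionBy("id").orderBy(F.col("timestamp").desc())
(spark.table("bronze_events")
 .select(F.from_json("value", schema).alias("j")).select("j.*")
 .withColumn("rn", F.row_number().over(w))
 .filter("rn = 1")
 .drop("rn")
 .write.format("delta")
 .mode("overwrite")
 .saveAsTable("silver_events"))
```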
11-08-2024 08:30 AM
I did have this approach in mind, but every 4 hours I receive roughly 1.2 TB of data.
I want to avoid writing all that into a table and then performing deduplication.
Is there a way I can create a temp view and do the deduplication before writing it into a table?
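One possible reading of that idea, sketched below, is to register the micro-batch as a temp view inside foreachBatch and deduplicate with SQL before the append; note that the view still only covers the current micro-batch, which is the same limitation described earlier (all names are placeholders):

```python
# Register the micro-batch as a temp view and deduplicate with SQL before appending.
# The view only covers the current micro-batch, not the whole 4-hour run.
def dedupe_and_append(batch_df, batch_id):
    batch_df.createOrReplaceTempView("incoming_batch")
    deduped = (batch_df.sparkSession.sql("""
        SELECT *,
               row_number() OVER (PARTITION BY id ORDER BY timestamp DESC) AS rn
        FROM incoming_batch
    """).filter("rn = 1").drop("rn"))
    deduped.write.format("delta").mode("append").saveAsTable("target_table")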
11-08-2024 04:17 AM - edited 11-08-2024 04:18 AM
Hi @hkmodi,
Basically, as @daniel_sahal said, the bronze layer should reflect the source system. The silver layer is dedicated to deduplication/cleaning/enrichment of the dataset. If you still need to deduplicate at the bronze layer, you have two options:
- use a merge statement (but you said that you cannot use merge; a rough sketch follows below)
- after the bronze table is loaded, run a process that deduplicates the table (I don't like this approach ;))
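For completeness, a hedged sketch of the first option, a MERGE inside foreachBatch that keeps only the newest row per id; the target table name and the `id`/`timestamp` columns are assumptions for illustration:

```python
# Sketch of option 1: a MERGE inside foreachBatch that keeps only the newest
# row per id. Table name and `id`/`timestamp` columns are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

def merge_batch(batch_df, batch_id):
    # First deduplicate within the micro-batch so the MERGE source has one row per id.
    w = Window.partitionBy("id").orderBy(F.col("timestamp").desc())
    latest = (batch_df.withColumn("rn", F.row_number().over(w))
              .filter("rn = 1").drop("rn"))

    target = DeltaTable.forName(batch_df.sparkSession, "bronze_events")
    (target.alias("t")
     .merge(latest.alias("s"), "t.id = s.id")
     .whenMatchedUpdateAll(condition="s.timestamp > t.timestamp")  # keep the newest row only
     .whenNotMatchedInsertAll()
     .execute())
```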