Perform row_number() filter in autoloader
11-07-2024 11:28 AM
I have created an Auto Loader job that reads JSON data from S3 (the files have no extension, so I read them with cloudFiles.format = text). The job is supposed to run every 4 hours and pick up all the new data that has arrived. Before writing into a Delta table, I want to make sure that only one occurrence of each id, the one with the max timestamp, gets written into the Delta table.
I am doing this per micro-batch using a row_number() == 1 filter. But that gives me the max-timestamp occurrence within each micro-batch, not across all the data read when the script runs.
How do I tackle this? My dataset is huge, so I can't use maxFiles or maxBytes. Also, I need to append, not merge, into the final Delta table.
11-07-2024 11:02 PM
@hkmodi
That's what Medallion Architecture was designed for.
Basically, you load everything you get into Bronze layer (meaning that you'll have duplicates in there), then do the deduplication when loading the data into the Silver layer.
11-08-2024 08:30 AM
I did have this approach in mind, but every 4 hours I receive roughly 1.2 TB of data.
I want to avoid writing all of that into a table and only then performing deduplication.
Is there a way I can create a temp view and deduplicate before writing into a table?
11-08-2024 04:17 AM - edited 11-08-2024 04:18 AM
Hi @hkmodi ,
Basically, as @daniel_sahal said, the bronze layer should reflect the source system. The silver layer is dedicated to deduplication/cleaning/enrichment of the dataset. If you still need to deduplicate at the bronze layer, you have 2 options:
- use a MERGE statement (but you said that you cannot use merge)
- after the bronze table is loaded, run a process that deduplicates the table (I don't like this approach ;))

