11-07-2024 11:28 AM
I have created an Auto Loader job that reads data from S3 (files with no extension) containing JSON, using cloudFiles.format = text. The job is supposed to run every 4 hours and read all the new data that has arrived. But before writing into a Delta table, I want to make sure only one occurrence of each id, the one with the max timestamp, gets written to the Delta table.
I am doing this per micro batch using row_number() == 1. But it gives me the max-timestamp occurrence within the micro batch, not across the entire data read when the script is run.
How do I tackle this? My dataset is huge, so I can't use maxFiles or maxBytes. Also, I need to append, not merge, into the final Delta table.
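For reference, a minimal sketch of what the setup described above might look like; the S3 paths, checkpoint location, schema, and the `id`/`timestamp` column names are placeholders I've assumed, not details from the original post:

```python
# Minimal sketch of the job described above; bucket paths, checkpoint location,
# schema and the `id`/`timestamp` column names are assumptions for illustration.
from pyspark.sql import functions as F, Window

schema = "id STRING, timestamp TIMESTAMP, payload STRING"  # placeholder schema

raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "text")      # files have no extension, read each line as text
       .load("s3://my-bucket/landing/"))         # placeholder path

parsed = raw.select(F.from_json("value", schema).alias("j")).select("j.*")

def dedupe_and_append(batch_df, batch_id):
    # row_number() only sees the rows of THIS micro-batch, which is why
    # duplicates can still appear across batches within the same 4-hour run.
    w = Window.partitionBy("id").orderBy(F.col("timestamp").desc())
    (batch_df
     .withColumn("rn", F.row_number().over(w))
     .filter("rn = 1")
     .drop("rn")
     .write.format("delta").mode("append").saveAsTable("target_table"))

(parsed.writeStream
 .foreachBatch(dedupe_and_append)
 .option("checkpointLocation", "s3://my-bucket/checkpoints/target_table/")  # placeholder
 .trigger(availableNow=True)   # run as a one-shot batch every 4 hours
 .start())
```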
11-07-2024 11:02 PM
@hkmodi
That's what Medallion Architecture was designed for.
Basically, you load everything you get into the Bronze layer (meaning that you'll have duplicates in there), then do the deduplication when loading the data into the Silver layer.
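A rough sketch of that Bronze-to-Silver flow; the table names, paths, schema, and the `id`/`timestamp` columns are assumptions on my side, not part of the original question:

```python
# Sketch of the Bronze -> Silver pattern described above; table names, paths,
# schema and the `id`/`timestamp` columns are assumptions, not from the post.
from pyspark.sql import functions as F, Window

# 1) Bronze: land everything exactly as received (duplicates included).
bronze_query = (spark.readStream
                .format("cloudFiles")
                .option("cloudFiles.format", "text")
                .load("s3://my-bucket/landing/")
                .writeStream
                .option("checkpointLocation", "s3://my-bucket/checkpoints/bronze/")
                .trigger(availableNow=True)
                .toTable("bronze_events"))
bronze_query.awaitTermination()   # availableNow stops once the backlog is processed

# 2) Silver: parse the JSON and keep only the latest row per id.
schema = "id STRING, timestamp TIMESTAMP, payload STRING"  # placeholder schema
w = Window.partitionBy("id").orderBy(F.col("timestamp").desc())
(spark.table("bronze_events")
 .select(F.from_json("value", schema).alias("j")).select("j.*")
 .withColumn("rn", F.row_number().over(w))
 .filter("rn = 1")
 .drop("rn")
 .write.format("delta")
 .mode("overwrite")
 .saveAsTable("silver_events"))
```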
11-08-2024 08:30 AM
I did have this approach in mind, but every 4 hours I receive roughly 1.2 TB of data.
I want to avoid writing all that into a table and then performing deduplication.
Is there a way I can create a temp view and do the deduplication before writing it into a table?
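One possible reading of that idea, sketched below, is to register the micro-batch as a temp view inside foreachBatch and deduplicate with SQL before the append; note that the view still only covers the current micro-batch, which is the same limitation described earlier (all names are placeholders):

```python
# Register the micro-batch as a temp view and deduplicate with SQL before appending.
# The view only covers the current micro-batch, not the whole 4-hour run.
def dedupe_and_append(batch_df, batch_id):
    batch_df.createOrReplaceTempView("incoming_batch")
    deduped = (batch_df.sparkSession.sql("""
        SELECT *,
               row_number() OVER (PARTITION BY id ORDER BY timestamp DESC) AS rn
        FROM incoming_batch
    """).filter("rn = 1").drop("rn"))
    deduped.write.format("delta").mode("append").saveAsTable("target_table")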
11-08-2024 04:17 AM - edited 11-08-2024 04:18 AM
Hi @hkmodi,
Basically, as @daniel_sahal said, the bronze layer should reflect the source system. The silver layer is dedicated to deduplication/cleaning/enrichment of the dataset. If you still need to deduplicate at the bronze layer, you have two options:
- use a merge statement (but you said that you cannot use merge; a rough sketch follows below)
- after the bronze table is loaded, run a process that deduplicates the table (I don't like this approach ;))
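For completeness, a hedged sketch of the first option, a MERGE inside foreachBatch that keeps only the newest row per id; the target table name and the `id`/`timestamp` columns are assumptions for illustration:

```python
# Sketch of option 1: a MERGE inside foreachBatch that keeps only the newest
# row per id. Table name and `id`/`timestamp` columns are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

def merge_batch(batch_df, batch_id):
    # First deduplicate within the micro-batch so the MERGE source has one row per id.
    w = Window.partitionBy("id").orderBy(F.col("timestamp").desc())
    latest = (batch_df.withColumn("rn", F.row_number().over(w))
              .filter("rn = 1").drop("rn"))

    target = DeltaTable.forName(batch_df.sparkSession, "bronze_events")
    (target.alias("t")
     .merge(latest.alias("s"), "t.id = s.id")
     .whenMatchedUpdateAll(condition="s.timestamp > t.timestamp")  # keep the newest row only
     .whenNotMatchedInsertAll()
     .execute())
```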