Data Engineering

Perform row_number() filter in autoloader

hkmodi
New Contributor II

I have created an Auto Loader job that reads data from S3 (files with no extension) containing JSON, using cloudFiles.format set to text. This job is supposed to run every 4 hours and read all the new data that has arrived. But before writing into a Delta table, I want to make sure that only one occurrence of each id, the one with the max timestamp, is written into the Delta table.

I am doing this per micro-batch, filtering on row_number() == 1. But that gives me the max-timestamp occurrence within each micro-batch, not across all the data read when the script runs.

How do I tackle this? My dataset is huge, so I can't use maxFiles or maxBytes. Also, I need to append, not merge, into the final Delta table.
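
For reference, the per-micro-batch pattern described above looks roughly like this. It is a minimal sketch, not the original job: the S3 paths, target table name, JSON schema, and the id/ts column names are all assumptions. Because row_number() runs inside foreachBatch, it can only deduplicate within a single micro-batch, which is exactly the limitation described.

```python
# Sketch of the per-micro-batch dedup described above; all names are assumptions.
# `spark` is the ambient SparkSession in a Databricks notebook.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def write_latest_per_id(batch_df, batch_id):
    # cloudFiles.format "text" yields a single `value` column; parse the JSON first.
    schema = "id STRING, ts TIMESTAMP"  # assumed payload schema
    parsed = batch_df.select(F.from_json("value", schema).alias("j")).select("j.*")
    # Keep only the newest record per id *within this micro-batch*.
    w = Window.partitionBy("id").orderBy(F.col("ts").desc())
    deduped = (parsed
               .withColumn("rn", F.row_number().over(w))
               .filter(F.col("rn") == 1)
               .drop("rn"))
    deduped.write.format("delta").mode("append").saveAsTable("target_table")

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "text")       # files have no extension, JSON content
    .load("s3://my-bucket/raw/")               # hypothetical path
    .writeStream
    .foreachBatch(write_latest_per_id)
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/raw/")
    .trigger(availableNow=True)                # run-and-finish, e.g. every 4 hours
    .start())
```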

 

3 REPLIES

daniel_sahal
Esteemed Contributor

@hkmodi 
That's what the Medallion Architecture was designed for.
Basically, you load everything you get into the Bronze layer (meaning that you'll have duplicates in there), then do the deduplication when loading the data into the Silver layer.
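
A minimal sketch of that flow, under the same assumptions as above (table names bronze_events/silver_events, an id/ts payload, and hypothetical S3 paths):

```python
# Sketch of the Bronze -> Silver flow; table names, paths, and schema are assumptions.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Bronze: append everything as-is, duplicates included.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "text")
    .load("s3://my-bucket/raw/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/bronze/")
    .trigger(availableNow=True)
    .toTable("bronze_events")
    .awaitTermination())

# Silver: keep the latest record per id across the *whole* Bronze table.
schema = "id STRING, ts TIMESTAMP"  # assumed schema of the JSON payload
w = Window.partitionBy("id").orderBy(F.col("ts").desc())
(spark.table("bronze_events")
    .select(F.from_json("value", schema).alias("j")).select("j.*")
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
    .write.format("delta").mode("overwrite").saveAsTable("silver_events"))
```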

hkmodi
New Contributor II

I did have this approach in mind. But every 4 hours of data that I have is equivalent to 1.2 TB, and I want to avoid writing all of that into a table and then performing deduplication.

Is there a way I can create a temp view and do deduplication before writing that into a table?

szymon_dybczak
Contributor III

Hi @hkmodi,

Basically, as @daniel_sahal said, the bronze layer should reflect the source system. The silver layer is dedicated to deduplication/cleaning/enrichment of the dataset. If you still need to deduplicate at the bronze layer, you have 2 options:
- use a merge statement (but you said that you cannot use merge)

- after the bronze table is loaded, run a process that deduplicates the table, as sketched below (I don't like this approach ;))
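
A sketch of that second option, assuming the bronze table is named bronze_events and already exposes id and ts columns (both assumptions):

```python
# Sketch of option 2: dedupe the already-loaded Bronze table in place.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id").orderBy(F.col("ts").desc())
deduped = (spark.table("bronze_events")      # assumed table name
           .withColumn("rn", F.row_number().over(w))
           .filter("rn = 1")
           .drop("rn"))

# Delta reads from a fixed snapshot, so overwriting the table it was read from
# is supported; a plain Parquet table would reject this.
deduped.write.format("delta").mode("overwrite").saveAsTable("bronze_events")
```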
