
How to avoid reprocessing old files without Delta?

William_Scardua
Valued Contributor

Hi guys,

Consider this case: Company ACME (a hypothetical company).

The company does not use Delta; it uses open-source Spark to process raw data into .parquet files. There is a 'sales' process that receives a new dataset (.csv) every hour in '/raw/sales' and writes it to '/insight/sales' (df.write.parquet ...). After a few hours, how do you avoid reprocessing the older datasets?
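For reference, the hourly job as described, which re-reads the whole '/raw/sales' folder on every run, might look like this minimal sketch (the SparkSession setup and header option are assumptions, not from the post):

from pyspark.sql import SparkSession

# Naive hourly job: every run re-reads ALL CSVs under /raw/sales,
# so older datasets get reprocessed each time.
spark = SparkSession.builder.appName("sales-hourly").getOrCreate()
df = spark.read.option("header", True).csv("/raw/sales")
df.write.mode("append").parquet("/insight/sales")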

7 REPLIES

Kaniz_Fatma
Community Manager

Hi William Scardua! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's first see if your peers in the community have an answer; if not, I'll get back to you soon. Thanks.

-werners-
Esteemed Contributor III

You can land your raw data in a folder with a date/timestamp in its path,

e.g. /raw/sales/2021/10/27/dataset.csv

In your Spark program, you can then read only from the path for a certain date (today, yesterday, etc.).
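A minimal sketch of that approach (the header option and append mode are assumptions; for a full daily extract you might overwrite a dated output path instead):

from datetime import date

# Read only today's landing folder instead of the whole /raw/sales tree.
today = date.today()
daily_path = f"/raw/sales/{today.year}/{today.month:02d}/{today.day:02d}/"

df = spark.read.option("header", True).csv(daily_path)
df.write.mode("append").parquet("/insight/sales")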

If you get full extracts every day, this works fine. The downside is that your raw storage holds a lot of redundant data (because of the daily full extracts), but with data lake storage being so cheap, that is not a big issue.

The biggest plus for Delta, IMO, is the merge functionality, which enables you to go for an incremental scenario.
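For comparison, a hedged sketch of such a merge (assumes the delta-spark package, a Delta table at a hypothetical '/insight/sales_delta' path, and a 'sale_id' key column; none of these come from the thread):

from delta.tables import DeltaTable

# Upsert the latest extract into the target Delta table.
target = DeltaTable.forPath(spark, "/insight/sales_delta")
updates = spark.read.option("header", True).csv("/raw/sales/2021/10/27/dataset.csv")

(target.alias("t")
    .merge(updates.alias("u"), "t.sale_id = u.sale_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())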

Hubert-Dudek
Esteemed Contributor III

In such a case, the best option is to use Spark Auto Loader to detect new CSVs (or new records) in /raw/sales and then append the transformed records to insights.
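A minimal sketch of that Auto Loader pipeline (runs only on Databricks; the schema and checkpoint locations are illustrative assumptions):

# Auto Loader incrementally discovers new CSVs in /raw/sales;
# the checkpoint guarantees each file is processed exactly once.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/insight/_schemas/sales")
    .load("/raw/sales")
    .writeStream
    .format("parquet")
    .option("checkpointLocation", "/insight/_checkpoints/sales")
    .start("/insight/sales"))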

jose_gonzalez
Moderator

Hi @William Scardua,

As @Hubert Dudek mentioned, the best option is probably to use Auto Loader. You can find docs and examples on how to use it here

William_Scardua
Valued Contributor

Hi @Jose Gonzalez,

I agree the best option is to use Auto Loader, but in some cases you don't have the Databricks platform and don't use Delta; in those cases you need to build a way to process only the new raw files.
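One way to build that on plain open-source Spark is to keep a ledger of already-processed files. A sketch under those assumptions (the ledger path is hypothetical, and on cloud storage you would list files through the appropriate storage API rather than os.listdir):

import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-incremental").getOrCreate()

LEDGER = "/insight/_processed_sales_files.txt"  # hypothetical bookkeeping file

# Load the set of files already processed by earlier runs.
processed = set()
if os.path.exists(LEDGER):
    with open(LEDGER) as f:
        processed = {line.strip() for line in f}

# List the raw folder and keep only files not seen yet.
all_files = [os.path.join("/raw/sales", name)
             for name in os.listdir("/raw/sales") if name.endswith(".csv")]
new_files = [p for p in all_files if p not in processed]

if new_files:
    df = spark.read.option("header", True).csv(new_files)
    df.write.mode("append").parquet("/insight/sales")
    # Record the newly processed files so the next run skips them.
    with open(LEDGER, "a") as f:
        f.writelines(p + "\n" for p in new_files)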

-werners-
Esteemed Contributor III

In that case, I suggest my suggestion 🙂 We work like that for quite a few data streams.

Hubert-Dudek
Esteemed Contributor III

Databricks Auto Loader also works excellently with other file types, like CSV. If you don't want to run it as a stream, you can trigger it once.
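That one-shot trigger might look like this (same hedged Auto Loader sketch as above, Databricks only):

# trigger(once=True) processes all pending files once, then stops,
# so the job can run as a scheduled batch instead of a long-lived stream.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/insight/_schemas/sales")
    .load("/raw/sales")
    .writeStream
    .format("parquet")
    .option("checkpointLocation", "/insight/_checkpoints/sales")
    .trigger(once=True)
    .start("/insight/sales"))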
