<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: How not to reprocess old files without delta? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12335#M7150</link>
    <description>&lt;P&gt;You can land your raw data in a folder with a date/timestamp, e.g. /raw/sales/2021/10/27/dataset.csv.&lt;/P&gt;&lt;P&gt;In your Spark program you can then read only from the path for a certain date (today, yesterday, etc.).&lt;/P&gt;&lt;P&gt;If you get full extracts every day, this works fine. The downside is that your raw storage holds a lot of redundant data (because of the daily full extracts), but with data lake storage being so cheap that is not a big issue.&lt;/P&gt;&lt;P&gt;The biggest plus for Delta, imo, is the merge functionality, which enables an incremental scenario.&lt;/P&gt;</description>
    <pubDate>Wed, 27 Oct 2021 06:17:54 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2021-10-27T06:17:54Z</dc:date>
    <item>
      <title>How not to reprocess old files without delta?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12333#M7148</link>
      <description>&lt;P&gt;Hi guys,&lt;/P&gt;&lt;P&gt;Consider this case: company ACME (a hypothetical company).&lt;/P&gt;&lt;P&gt;This company does not use Delta; it uses open-source Spark to process raw data into .parquet files. We have a 'sales' process that receives a new dataset (.csv) in '/raw/sales' every hour and writes it to '/insight/sales' (df.write.parquet ...). After a few hours, how do you avoid reprocessing the older datasets?&lt;/P&gt;</description>
      <pubDate>Tue, 26 Oct 2021 18:42:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12333#M7148</guid>
      <dc:creator>William_Scardua</dc:creator>
      <dc:date>2021-10-26T18:42:15Z</dc:date>
    </item>
    <item>
      <title>Re: How not to reprocess old files without delta?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12335#M7150</link>
      <description>&lt;P&gt;You can land your raw data in a folder with a date/timestamp, e.g. /raw/sales/2021/10/27/dataset.csv.&lt;/P&gt;&lt;P&gt;In your Spark program you can then read only from the path for a certain date (today, yesterday, etc.).&lt;/P&gt;&lt;P&gt;If you get full extracts every day, this works fine. The downside is that your raw storage holds a lot of redundant data (because of the daily full extracts), but with data lake storage being so cheap that is not a big issue.&lt;/P&gt;&lt;P&gt;The biggest plus for Delta, imo, is the merge functionality, which enables an incremental scenario.&lt;/P&gt;</description>
      <pubDate>Wed, 27 Oct 2021 06:17:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12335#M7150</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-27T06:17:54Z</dc:date>
    </item>
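The date-partitioned layout suggested above can be sketched roughly as follows. This is a minimal sketch, not from the thread itself: the base path follows the /raw/sales/2021/10/27 example, and the commented-out Spark calls assume a SparkSession named `spark` is available.

```python
from datetime import date, timedelta

def dated_raw_path(base: str, day: date) -> str:
    """Build the base/YYYY/MM/DD path for a given day's extract."""
    return f"{base}/{day.year:04d}/{day.month:02d}/{day.day:02d}"

# Read only yesterday's extract instead of re-scanning the whole /raw/sales tree.
yesterday = date(2021, 10, 27) - timedelta(days=1)
path = dated_raw_path("/raw/sales", yesterday)

# With a SparkSession in hand, this would look something like:
# df = spark.read.csv(path, header=True)
# df.write.mode("overwrite").parquet("/insight/sales")
print(path)  # /raw/sales/2021/10/26
```

Scheduling the job daily with "today" or "yesterday" as the target date is what keeps old extracts from being reprocessed.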
    <item>
      <title>Re: How not to reprocess old files without delta?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12336#M7151</link>
      <description>&lt;P&gt;In such a case, the best option is to use Spark Auto Loader to detect new CSVs (or new records) in /raw/sales and then append the transformed records to /insight/sales.&lt;/P&gt;</description>
      <pubDate>Wed, 27 Oct 2021 08:42:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12336#M7151</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-27T08:42:39Z</dc:date>
    </item>
    <item>
      <title>Re: How not to reprocess old files without delta?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12337#M7152</link>
      <description>&lt;P&gt;Hi @William Scardua,&lt;/P&gt;&lt;P&gt;As @Hubert Dudek mentioned, the best option is probably to use Auto Loader. You can find docs and examples on how to use it &lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html" alt="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader.html" target="_blank"&gt;here&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 23:49:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12337#M7152</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2021-10-29T23:49:04Z</dc:date>
    </item>
    <item>
      <title>Re: How not to reprocess old files without delta?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12338#M7153</link>
      <description>&lt;P&gt;Hi @Jose Gonzalez,&lt;/P&gt;&lt;P&gt;I agree that the best option is to use Auto Loader, but in some cases you don't have the Databricks platform and don't use Delta; in those cases you need to build your own way to process the new raw files.&lt;/P&gt;</description>
      <pubDate>Mon, 01 Nov 2021 18:44:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12338#M7153</guid>
      <dc:creator>William_Scardua</dc:creator>
      <dc:date>2021-11-01T18:44:24Z</dc:date>
    </item>
    <item>
      <title>Re: How not to reprocess old files without delta?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12339#M7154</link>
      <description>&lt;P&gt;In that case I'd go with my suggestion &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; We work like that for quite a few data streams.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Nov 2021 08:35:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12339#M7154</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-11-02T08:35:23Z</dc:date>
    </item>
    <item>
      <title>Re: How not to reprocess old files without delta?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12340#M7155</link>
      <description>&lt;P&gt;Databricks Auto Loader also works well with other file types, such as CSV. If you don't want to run it as a continuous stream, you can trigger it once.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Nov 2021 12:01:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-not-to-reprocess-old-files-without-delta/m-p/12340#M7155</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-11-02T12:01:37Z</dc:date>
    </item>
  </channel>
</rss>

