
How to avoid reprocessing old files without Delta?

William_Scardua
Valued Contributor

Hi guys,

Consider this case: Company ACME (a hypothetical company).

This company does not use Delta; it uses open source Spark to process raw data into .parquet. We have a 'sales' process that receives a new dataset (.csv) every hour in '/raw/sales' and writes it to '/insight/sales' (df.write.parquet ...). After a few hours, how do you avoid reprocessing the older datasets?
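For illustration, a minimal sketch of the hourly job as described; the header option and append mode are assumptions, not part of the original question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-hourly").getOrCreate()

# Reads every .csv under /raw/sales on each run, including files
# already processed in previous hours, which is exactly the problem.
df = spark.read.option("header", "true").csv("/raw/sales")
df.write.mode("append").parquet("/insight/sales")
```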

6 REPLIES

-werners-
Esteemed Contributor III

You can land your raw data in a folder with a date/timestamp,

e.g. /raw/sales/2021/10/27/dataset.csv

In your Spark program you can then read only from the path for a certain date (today, yesterday, etc.).

If you get full extracts every day, this works fine. The downside is that your raw storage holds a lot of redundant data (because of the daily full extracts), but with data lake storage being so cheap, that is not a big issue.
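A minimal PySpark sketch of that approach (the header option and append mode are assumptions):

```python
from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-daily").getOrCreate()

# Landing path for today's extract, e.g. /raw/sales/2021/10/27
path = f"/raw/sales/{date.today():%Y/%m/%d}"

# Only today's dataset is read; older folders are never touched again.
df = spark.read.option("header", "true").csv(path)
df.write.mode("append").parquet("/insight/sales")
```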

The biggest plus for Delta, IMO, is the merge functionality, which enables an incremental scenario.

Hubert-Dudek
Esteemed Contributor III

In such a case the best option is to use Spark Auto Loader to detect new CSVs (or new records) in /raw/sales and then append the transformed records to /insight/sales.
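A minimal Auto Loader sketch along those lines (Databricks only; the checkpoint and schema locations under /checkpoints/sales are assumed names):

```python
# Auto Loader tracks which files it has already ingested, so each run
# picks up only the new CSVs landing in /raw/sales.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/checkpoints/sales/schema")
    .option("header", "true")
    .load("/raw/sales")
)

# The default file sink format is parquet, matching /insight/sales.
(
    stream.writeStream
    .option("checkpointLocation", "/checkpoints/sales")
    .outputMode("append")
    .start("/insight/sales")
)
```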

jose_gonzalez
Databricks Employee

Hi @William Scardua,

As @Hubert Dudek mentioned, maybe the best option is to use Auto Loader. You can find docs and examples on how to use it here.

William_Scardua
Valued Contributor

Hi @Jose Gonzalez,

I agree the best option is to use Auto Loader, but in some cases you don't have the Databricks platform and don't use Delta; in those cases you need to build a way to process only the new raw files.
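One way to do that in open source Spark, with no Databricks or Delta, is the plain file-stream source: the checkpoint records which files have already been processed, so old datasets are skipped on later runs. A sketch, where the schema columns and checkpoint path are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, DoubleType

spark = SparkSession.builder.appName("sales-stream").getOrCreate()

# Streaming file sources require an explicit schema (hypothetical columns).
schema = (
    StructType()
    .add("order_id", StringType())
    .add("amount", DoubleType())
)

# Only files not yet recorded in the checkpoint are picked up.
stream = (
    spark.readStream
    .option("header", "true")
    .schema(schema)
    .csv("/raw/sales")
)

(
    stream.writeStream
    .option("checkpointLocation", "/checkpoints/sales")
    .outputMode("append")
    .start("/insight/sales")
)
```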

-werners-
Esteemed Contributor III

In that case I suggest my suggestion 🙂 We work like that for quite a few data streams.

Hubert-Dudek
Esteemed Contributor III

Databricks Auto Loader also works great with other file types like CSV. If you don't want to run it as a stream, you can trigger it once.
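A sketch of that trigger-once pattern, reusing the `stream` from the Auto Loader sketch above (checkpoint path assumed):

```python
# Run the stream as a one-shot job: Trigger.Once processes everything new
# since the last run (tracked in the checkpoint) and then stops.
(
    stream.writeStream
    .trigger(once=True)
    .option("checkpointLocation", "/checkpoints/sales")
    .start("/insight/sales")
)
```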
