10-26-2021 11:42 AM
Hi guys,
Consider this case: Company ACME (a hypothetical company).
This company does not use Delta, but uses open-source Spark to convert raw data to .parquet. We have a 'sales' process that receives a new dataset (.csv) every hour in '/raw/sales' and writes it to '/insight/sales' (df.write.parquet ... ). After a few hours, how do you avoid reprocessing the older datasets?
10-26-2021 11:17 PM
You can land your raw data in a folder named with a date/timestamp,
e.g. /raw/sales/2021/10/27/dataset.csv
In your Spark program you can then read only from the path for a specific date (today, yesterday, etc.).
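A minimal sketch of that idea in PySpark, assuming the /raw/sales/YYYY/MM/DD layout from the example path. The helper names (`raw_sales_path`, `process_day`), the `header` option, and the append write mode are illustrative assumptions, not part of the original setup.

```python
from datetime import date


def raw_sales_path(day: date, base: str = "/raw/sales") -> str:
    """Build the dated landing folder, e.g. /raw/sales/2021/10/27."""
    return f"{base}/{day:%Y/%m/%d}"


def process_day(spark, day: date, out_path: str = "/insight/sales") -> None:
    """Read only one day's CSV files and append them to the Parquet output.

    `spark` is an existing SparkSession. The header option is an assumption
    about the CSV files.
    """
    df = spark.read.option("header", "true").csv(raw_sales_path(day))
    df.write.mode("append").parquet(out_path)
```

A scheduled job would call something like `process_day(spark, date.today())`. For hourly arrivals the same pattern probably needs an hour-level folder as well, so that each run reads only the newest folder.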
If you get full extracts every day, this works fine. The downside is that your raw storage holds a lot of redundant data (because of the daily full extracts), but with data lake storage being so cheap, that is not a big issue.
The biggest plus for Delta, imo, is the merge functionality, which enables an incremental scenario.
10-27-2021 01:42 AM
In such a case the best option is to use Databricks Auto Loader to detect new CSV files (or new records) in /raw/sales and then use append mode to add the transformed records to /insight/sales.
10-29-2021 04:49 PM
Hi @William Scardua ,
As @Hubert Dudek mentioned, the best option may be to use Auto Loader. You can find docs and examples on how to use it here
11-01-2021 11:44 AM
Hi @Jose Gonzalez ,
I agree the best option is to use Auto Loader, but in some cases you don't have the Databricks platform and don't use Delta. In these cases you need to build your own way to process only the new raw files.
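One way to build this without Databricks is to keep a small manifest of the files that have already been processed and skip them on the next run. The sketch below is only an illustration: it assumes a local or mounted filesystem (on HDFS or object storage the listing call would differ), and the names `find_new_files`, `process_new_files`, and the manifest file are hypothetical.

```python
from pathlib import Path
from typing import Callable, List


def find_new_files(raw_dir: str, manifest: str) -> List[str]:
    """Return the CSV files in raw_dir that are not yet listed in the manifest."""
    manifest_path = Path(manifest)
    done = set(manifest_path.read_text().splitlines()) if manifest_path.exists() else set()
    candidates = sorted(str(p) for p in Path(raw_dir).glob("*.csv"))
    return [f for f in candidates if f not in done]


def process_new_files(raw_dir: str, manifest: str,
                      process: Callable[[List[str]], None]) -> List[str]:
    """Run `process` on unseen files only, then record them in the manifest."""
    new_files = find_new_files(raw_dir, manifest)
    if new_files:
        # With Spark, `process` could be something like:
        # lambda files: spark.read.csv(files).write.mode("append").parquet("/insight/sales")
        process(new_files)
        # Record the files only after processing succeeded.
        with open(manifest, "a") as fh:
            fh.writelines(f + "\n" for f in new_files)
    return new_files
```

The manifest is updated only after `process` returns, so a failed run leaves its files eligible for the next attempt. If the job crashes between the write and the manifest update, those files would be processed twice.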
11-02-2021 01:35 AM
In that case I'd stick with my suggestion (date-stamped landing folders) 🙂 We work like that for quite a few data streams.
11-02-2021 05:01 AM
Databricks Auto Loader also works very well with other file types such as CSV. If you don't want to run it as a continuous stream, you can trigger it once.
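A rough sketch of a trigger-once Auto Loader run, which only works on Databricks. It assumes CSV files with a header row; the `state_path` location and the function names are hypothetical, and the input and output paths are taken from the original question.

```python
def autoloader_options(schema_location: str) -> dict:
    """Reader options for Auto Loader on CSV input (header row is an assumption)."""
    return {
        "cloudFiles.format": "csv",
        "cloudFiles.schemaLocation": schema_location,
        "header": "true",
    }


def run_once(spark, raw_path: str = "/raw/sales", out_path: str = "/insight/sales",
             state_path: str = "/insight/_state/sales") -> None:
    """Process files not seen by earlier runs, then stop.

    `spark` is an existing SparkSession on a Databricks cluster.
    """
    reader = spark.readStream.format("cloudFiles")
    for key, value in autoloader_options(f"{state_path}/schema").items():
        reader = reader.option(key, value)
    stream = reader.load(raw_path)

    query = (
        stream.writeStream.format("parquet")
        # The checkpoint is where the stream remembers which files it has ingested.
        .option("checkpointLocation", f"{state_path}/checkpoint")
        .trigger(once=True)
        .start(out_path)
    )
    query.awaitTermination()
```

Scheduling `run_once(spark)` every hour should pick up only the files that arrived since the previous run, as long as the checkpoint location is kept between runs.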