How to avoid reprocessing old files without Delta?
10-26-2021 11:42 AM
Hi guys,
Consider this case: ACME (a hypothetical company).
This company does not use Delta, but uses open source Spark to process raw data into .parquet. We have a 'sales' process that receives a new dataset (.csv) every hour in '/raw/sales' and writes it to '/insight/sales' (df.write.parquet ...). After a few hours, how do you avoid reprocessing the older datasets?
- Labels:
  - Delta
  - Open Source Spark
  - Pyspark
  - Spark
10-26-2021 11:17 PM
You can land your raw data in a folder with a date/timestamp,
e.g. /raw/sales/2021/10/27/dataset.csv
In your Spark program you can then read only from the path for a certain date (today, yesterday, etc.), as in the sketch below.
If you get full extracts every day, this works fine. The downside is that your raw storage holds a lot of redundant data (because of the daily full extracts), but with data lake storage being so cheap that is not a big issue.
The biggest plus for delta imo is the merge functionality which enables you to go for an incremental scenario.
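A minimal PySpark sketch of this date-partitioned layout, assuming paths like /raw/sales/YYYY/MM/DD/ and /insight/sales/YYYY/MM/DD/ (both illustrative):

```python
from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-daily").getOrCreate()

# Pick the single date partition this run should process, e.g. yesterday's drop.
run_date = date.today() - timedelta(days=1)
raw_path = f"/raw/sales/{run_date:%Y/%m/%d}/*.csv"

df = spark.read.option("header", "true").csv(raw_path)

# Write the output under the same date folder, so each run only touches its
# own partition and older datasets are never read or rewritten.
(df.write
   .mode("overwrite")
   .parquet(f"/insight/sales/{run_date:%Y/%m/%d}"))
```

Each scheduled run then only ever sees the files for its own date, which is what keeps the old data from being reprocessed.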
10-27-2021 01:42 AM
In such a case the best option is to use Spark Auto Loader to detect new CSVs (or new records) in /raw/sales and then append the transformed records to /insight/sales.
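A sketch of the Auto Loader approach (this requires Databricks; the checkpoint path and schema below are illustrative assumptions):

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema for the hourly sales CSVs.
sales_schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

# `spark` is the SparkSession provided by the Databricks runtime.
stream = (spark.readStream
    .format("cloudFiles")                  # Auto Loader source
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .schema(sales_schema)
    .load("/raw/sales"))

(stream.writeStream
    .format("parquet")
    .option("checkpointLocation", "/checkpoints/sales")  # records which files were already ingested
    .outputMode("append")
    .start("/insight/sales"))
```

The checkpoint is what prevents old files from being picked up again on later runs.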
10-29-2021 04:49 PM
Hi @William Scardua ,
As @Hubert Dudek mentioned, the best option is probably Auto Loader. You can find docs and examples on how to use it here
11-01-2021 11:44 AM
Hi @Jose Gonzalez ,
I agree the best option is to use Auto Loader, but in some cases you don't have the Databricks platform and don't use Delta; in those cases you need to build your own way to process only the new raw files.
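One option that works on plain open source Spark, without Delta or Databricks, is Structured Streaming's file source: it keeps its own record of processed files in the checkpoint directory. A minimal sketch, with the paths and schema below as illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("sales-incremental").getOrCreate()

sales_schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])

# The file source only picks up files that are not yet listed in the checkpoint.
new_files = (spark.readStream
    .schema(sales_schema)
    .option("header", "true")
    .csv("/raw/sales"))

(new_files.writeStream
    .format("parquet")
    .option("checkpointLocation", "/checkpoints/sales_oss")
    .outputMode("append")
    .trigger(once=True)          # run as a scheduled batch job instead of a long-lived stream
    .start("/insight/sales")
    .awaitTermination())
```

Scheduling this hourly gives incremental processing of new CSVs without any Databricks-specific features.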
11-02-2021 01:35 AM
In that case I'd go with my suggestion above 🙂 We work like that for quite a few data streams.
11-02-2021 05:01 AM
Databricks Auto Loader also works very well with other file types such as CSV. If you don't want to run it as a continuous stream, you can trigger it once.
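For example, building on the Auto Loader `stream` from the sketch above, the same checkpoint path being an illustrative assumption:

```python
# Run the Auto Loader query as a one-shot batch instead of a continuous stream;
# the checkpoint still prevents already-seen files from being reprocessed.
(stream.writeStream
    .format("parquet")
    .option("checkpointLocation", "/checkpoints/sales")
    .outputMode("append")
    .trigger(once=True)   # newer Spark releases also offer availableNow=True
    .start("/insight/sales"))
```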