Hi User Community,
Requesting some advice on the issue below, please:
I have four Databricks notebooks that run in a single pipeline every 5 minutes:
- Notebook 1 ingests metric data from a Kafka topic (many servers) and dumps it in Parquet format to a raw location.
- Notebook 2 reads from the raw location, performs some cleanup (e.g., JSON explosion), and writes the result to a bronze location.
- Notebook 3 reads from bronze, does some further processing, and writes to a silver location.
- Notebook 4 (gold) reads from silver and writes to a gold location.
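For context, the bronze cleanup step is shaped roughly like this (a minimal sketch; the `value` column, the `metrics` field, and the schema argument are placeholders, not my real names):

```python
def explode_metrics(raw_df, metric_schema):
    """Sketch of the bronze clean-up: parse the raw JSON payload and
    explode the nested metrics array into one row per metric.

    raw_df is assumed to have a string `value` column holding the JSON;
    metric_schema is the expected pyspark.sql.types schema of that payload.
    """
    # Imports deferred so the sketch can be defined without a Spark cluster.
    from pyspark.sql.functions import from_json, explode, col

    return (
        raw_df
        .withColumn("parsed", from_json(col("value"), metric_schema))
        .withColumn("metric", explode(col("parsed.metrics")))
        .select("metric.*")
    )
```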
I have another notebook that performs deletes and vacuuming for retention and optimization purposes; it runs every Sunday and pauses the other notebooks. The vacuum notebook takes a while to complete, and once it finishes, the other four notebooks resume. This leads to increased processing time (previously around 20 minutes on average, now 2hr+), which I believe is caused by a backlog of data that needs processing. Can the vacuum notebook run concurrently with the other notebooks? Bear in mind the vacuum/delete retention periods are as follows:
- Raw -> 30 Days
- Bronze -> 3 Days
- Silver & Gold -> 30 Days
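Concretely, the weekly maintenance notebook issues statements along these lines (a sketch: the table names and the `event_date` column are placeholders, and I'm mapping the retention days above onto VACUUM's `RETAIN ... HOURS` clause):

```python
def vacuum_sql(table, retention_days):
    """Build a Delta VACUUM statement, converting retention days into the
    RETAIN ... HOURS clause that VACUUM expects."""
    return f"VACUUM {table} RETAIN {retention_days * 24} HOURS"

def delete_sql(table, retention_days):
    """Build the matching DELETE that trims rows older than the cutoff.
    `event_date` is a placeholder partition/date column."""
    return (f"DELETE FROM {table} "
            f"WHERE event_date < current_date() - INTERVAL {retention_days} DAYS")

# Per-layer retention from the list above: raw 30d, bronze 3d, silver/gold 30d.
RETENTION = {"raw_tbl": 30, "bronze_tbl": 3, "silver_tbl": 30, "gold_tbl": 30}

# On Databricks each statement would run as spark.sql(vacuum_sql(t, days)), etc.
```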
I believe this would reduce the backlog and maintain the consistent 20-minute processing time. Is this possible? I previously ran into concurrency errors when I was using a shorter retention period of 0.
Note:
- Vacuuming/optimization is run concurrently for the different locations (raw, bronze, silver, gold) to reduce processing time.
- Data is stored in Delta format for the bronze, silver, and gold notebooks.
- The notebooks use Structured Streaming (readStream/writeStream) with checkpoints defined.
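Each stage notebook is structured roughly like this (a sketch; the paths and the trigger choice are placeholders, not my exact configuration):

```python
def run_stage(spark, source_path, target_path, checkpoint_path):
    """Sketch of one pipeline stage: stream Delta in, stream Delta out,
    with a checkpoint location so each 5-minute run resumes where the
    previous one left off."""
    return (
        spark.readStream.format("delta").load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # drain whatever is pending, then stop
        .start(target_path)
    )
```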