Vacuum and Streaming Issue

ndatabricksuser
New Contributor

Hi Databricks Community,

I'm requesting some advice on the issue below, please:

I have four Databricks notebooks. The first ingests data from a Kafka topic (metric data from many servers) and dumps it in Parquet format into a specified raw location. My second notebook reads the data from the location where the raw data was dumped, performs some cleanup (e.g., JSON explosion), and dumps the result into a location called bronze. My third notebook reads from the bronze location, does some further processing, and dumps the data into a location called silver. My gold notebook reads data from silver and dumps it into a gold location. These notebooks run in a single pipeline that executes every 5 minutes.
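For concreteness, here is a minimal sketch of the first hop (Kafka to raw) under this design; the broker address, topic name, and paths are hypothetical placeholders, and the bronze/silver/gold hops would follow the same readStream/writeStream pattern:

```python
# Sketch of the Kafka -> raw ingestion notebook. Broker, topic, and paths are
# assumptions, not the poster's actual values; `spark` is the notebook's
# built-in SparkSession.
raw_query = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
        .option("subscribe", "server-metrics")              # hypothetical topic
        .load()
        .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
        .writeStream
        .format("parquet")                                  # raw layer is Parquet
        .option("path", "/mnt/raw")                         # hypothetical path
        .option("checkpointLocation", "/mnt/checkpoints/raw")
        # Drain whatever is available, then stop; fits a 5-minute scheduled
        # job (requires Spark 3.3+ / DBR 10.4+).
        .trigger(availableNow=True)
        .start()
)
raw_query.awaitTermination()
```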

 

I have another notebook that performs deletes and vacuuming for retention and optimization purposes. It runs every Sunday and pauses the other notebooks. The vacuum notebook takes a while to complete, and once it finishes, the other four notebooks resume; this leads to increased processing time (previously about 20 minutes on average, now 2+ hours), which I believe is caused by the backlog of data that accumulates while the pipeline is paused. Can the vacuum notebook run concurrently with the other notebooks? Bear in mind the retention periods for the vacuums/deletes are as follows:

  • Raw -> 30 Days
  • Bronze -> 3 Days
  • Silver & Gold -> 30 Days

I believe this would reduce the backlog and thereby maintain the consistent 20-minute processing time. Is this possible? I previously ran into concurrency errors when I used a shorter retention period of 0.
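VACUUM can generally run while the streaming notebooks are writing; the failure mode with a retention of 0 is that it deletes files a concurrent reader or a checkpointed stream may still reference. A minimal sketch of the weekly vacuum over the Delta layers, using the retention windows above and hypothetical paths (note that Delta blocks retention below 7 days unless the safety check is relaxed):

```python
# Sketch of the weekly vacuum over the Delta layers, using the retention
# windows listed above. Paths are hypothetical placeholders; `spark` is the
# notebook's built-in SparkSession.

# Bronze's 3-day window is below Delta's 7-day safety floor, so the check
# must be relaxed; keep every window longer than any stream's lookback.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

spark.sql("VACUUM delta.`/mnt/bronze` RETAIN 72 HOURS")   # 3 days
spark.sql("VACUUM delta.`/mnt/silver` RETAIN 720 HOURS")  # 30 days
spark.sql("VACUUM delta.`/mnt/gold` RETAIN 720 HOURS")    # 30 days
```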

Note:

  • Vacuuming/optimization is done concurrently for the different locations (raw, bronze, silver, gold) to reduce processing time.
  • Data is stored in Delta format for the bronze, silver, and gold tables (raw is plain Parquet; see the cleanup sketch after these notes).
  • The notebooks use Structured Streaming (readStream/writeStream) with checkpoints defined.
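Because the raw layer is plain Parquet rather than Delta, VACUUM does not apply to it; here is a minimal sketch of age-based cleanup for raw, assuming a hypothetical path, files sitting directly under it, and a recent Databricks Runtime where FileInfo exposes modificationTime:

```python
# Age-based cleanup for the (non-Delta) raw layer: remove anything older than
# 30 days. RAW_PATH is a hypothetical placeholder; `dbutils` is the Databricks
# notebook utility object.
import time

RAW_PATH = "/mnt/raw"
cutoff_ms = (time.time() - 30 * 24 * 3600) * 1000  # 30 days ago, epoch millis

for info in dbutils.fs.ls(RAW_PATH):
    # FileInfo.modificationTime is milliseconds since the epoch.
    if info.modificationTime < cutoff_ms:
        dbutils.fs.rm(info.path, recurse=True)
```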

 

 

2 REPLIES

Kaniz
Community Manager

Hi @ndatabricksuser,

Running the vacuuming and optimization process alongside other notebooks in Databricks is possible and can save time. To ensure smooth concurrent processing, it's best to use separate Databricks clusters for different pipeline steps, optimize cluster resources, and fine-tune your processing code for performance. Keep an eye on resource usage, prioritize jobs as needed, and have robust error handling in place. This careful management will help your pipeline work efficiently.
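One concrete knob for letting vacuum coexist safely with checkpointed streams is to encode the retention policy on the tables themselves, so that any VACUUM run without an explicit RETAIN honors it; a sketch assuming hypothetical paths:

```python
# Sketch: pin each Delta table's deleted-file retention via a table property,
# matching the retention windows stated in the question. Paths are hypothetical.
retention = {
    "/mnt/bronze": "interval 3 days",
    "/mnt/silver": "interval 30 days",
    "/mnt/gold": "interval 30 days",
}

for path, interval in retention.items():
    spark.sql(f"""
        ALTER TABLE delta.`{path}`
        SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '{interval}')
    """)
```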

jose_gonzalez
Moderator

Hi @ndatabricksuser,

Just a friendly follow-up. Have you had a chance to review my colleague's response to your inquiry? Did it prove helpful, or are you still in need of assistance? Your response would be greatly appreciated.
