
Vacuum and Streaming Issue

ndatabricksuser
New Contributor

Hi User Community,

Requesting some advice on the below issue please:

I have 4 Databricks notebooks. The 1st ingests data from a Kafka topic (metric data from many servers) and dumps it in Parquet format into a specified location. The 2nd reads the raw data from that location and performs some cleanup (e.g., JSON explosion); its output is written to a location called bronze. The 3rd reads from the bronze location, does some further processing, and writes to a location called silver. The 4th (gold) notebook reads from silver and writes to a gold location. These notebooks run in a single pipeline that runs every 5 minutes.

 

I have another notebook that performs deletes and vacuuming for retention and optimization purposes. It runs every Sunday and pauses the other notebooks while it runs. The vacuum notebook takes a while to complete, and once it finishes, the other four notebooks start running again. This leads to an increased processing time (from 20 minutes on average to 2+ hours), which I believe is caused by a backlog of data that needs processing. Can the vacuum notebook run concurrently with the other notebooks? Bear in mind the vacuum/delete retention periods are as follows:

  • Raw -> 30 Days
  • Bronze -> 3 Days
  • Silver & Gold -> 30 Days

I believe this would reduce the backlog and keep the processing time at a consistent 20 minutes. Is this possible? I previously got some concurrency errors when I was using a shorter retention period of 0.

Note:

  • Vacuuming/optimization is done concurrently for the different locations (raw, bronze, silver, gold) to reduce the processing time.
  • Data is stored in Delta format for the bronze, silver, and gold notebooks.
  • The notebooks use Structured Streaming (readStream/writeStream) with a checkpoint defined.
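
For illustration, the retention windows listed above could be turned into VACUUM statements roughly as follows. This is only a minimal sketch: it assumes the listed periods are used as the VACUUM retention window, the storage paths are hypothetical placeholders, and raw is left out because it is stored as Parquet rather than Delta. Note that Delta's default safety check rejects retention windows shorter than 168 hours (7 days) unless `spark.databricks.delta.retentionDurationCheck.enabled` is set to `false`, so the 3-day bronze window would be affected.

```python
# Retention per Delta layer, in days (taken from the list in the post).
RETENTION_DAYS = {"bronze": 3, "silver": 30, "gold": 30}

# Hypothetical storage locations; substitute the real paths.
LAYER_PATHS = {
    "bronze": "/mnt/lake/bronze",
    "silver": "/mnt/lake/silver",
    "gold": "/mnt/lake/gold",
}

# Delta's default minimum retention enforced by the safety check (7 days).
DEFAULT_MIN_RETENTION_HOURS = 168


def vacuum_statement(path: str, retention_days: int) -> str:
    """Build a VACUUM statement for a path-based Delta table."""
    hours = retention_days * 24
    return f"VACUUM delta.`{path}` RETAIN {hours} HOURS"


def needs_safety_override(retention_days: int) -> bool:
    """True if the window is shorter than Delta's default 7-day threshold."""
    return retention_days * 24 < DEFAULT_MIN_RETENTION_HOURS


statements = {
    layer: vacuum_statement(LAYER_PATHS[layer], days)
    for layer, days in RETENTION_DAYS.items()
}

# In a notebook, each statement would be run with spark.sql(stmt).
for layer, stmt in statements.items():
    flag = " (below default 7-day threshold)" if needs_safety_override(RETENTION_DAYS[layer]) else ""
    print(f"{layer}: {stmt}{flag}")
```

With windows of several days, VACUUM only removes files that are no longer referenced and are older than the window, which is different from the retention period of 0 that produced the errors mentioned above.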

 

 

2 REPLIES

jose_gonzalez
Databricks Employee

Hi @ndatabricksuser ,

Just a friendly follow-up. Have you had a chance to review my colleague's response to your inquiry? Did it prove helpful, or are you still in need of assistance? Your response would be greatly appreciated.

mroy
Contributor

Vacuuming is also a lot faster with inventory tables!
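
As a rough illustration of what that could look like: the open-source Delta Lake documentation (3.2 and later) describes a `USING INVENTORY` clause for VACUUM, which takes a table or subquery listing the files (columns `path`, `length`, `isDir`, `modificationTime`) so that VACUUM does not have to list the storage location itself. The sketch below only builds the statement; the table names are hypothetical, and whether the clause is available depends on your runtime, so treat it as an assumption to verify.

```python
def vacuum_with_inventory(table: str, inventory_table: str, retain_hours: int) -> str:
    """Build a VACUUM statement that reads the file listing from an inventory table.

    Assumes the USING INVENTORY clause from open-source Delta Lake 3.2+;
    check that your runtime supports it before relying on this.
    """
    return (
        f"VACUUM {table} "
        f"USING INVENTORY {inventory_table} "
        f"RETAIN {retain_hours} HOURS"
    )


# Hypothetical table names, for illustration only.
stmt = vacuum_with_inventory("metrics.silver", "ops.silver_file_inventory", 720)
print(stmt)
# In a notebook, the statement would be run with spark.sql(stmt).
```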
