Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Expectations vs Great Expectations with Databricks DLT pipelines

RevathiTiger
New Contributor II

Hi All,

We are working on creating a DQ framework on DLT pipelines in Databricks. 

Our Databricks DLT pipelines read incoming data from Kafka and file sources. Once data is ingested, data validation must happen on top of it. The customer is evaluating whether DLT Expectations or Great Expectations will work on real-time streaming pipelines. The DQ rules will be applied to the incoming data as a chunk read per event, not to individual rows.

3 REPLIES

dataoculus_app
New Contributor III

All of those DQ tools are built on a SQL architecture, so they are not designed for streaming; nor are they built for batch datasets with efficiency and complex DQ checks in mind.
This is why we built the most comprehensive data quality/monitoring platform. Happy to share how you can build one too; you can DM me.

chanukya-pekala
Contributor II

I recommend using Spark Structured Streaming or Auto Loader with micro-batch processing. This approach processes data in discrete chunks (e.g., every 10 seconds, or using availableNow for backfill scenarios) rather than handling individual rows.

By setting an appropriate trigger interval (e.g., processingTime = '10 seconds', or availableNow), each micro-batch contains a manageable volume of data (e.g., 100,000 records), making it feasible and efficient to apply transformations plus data quality validations such as null-value checks, custom data profiling rules, etc.

Within each micro-batch, the validation rules can be implemented with fully custom Spark logic, as in the sketch below.
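
A minimal sketch of what this could look like with Auto Loader and foreachBatch; the source path, checkpoint path, target table, and the event_id column are assumptions for illustration, not part of the original question:

```python
# Sketch only: batch-level DQ checks with Structured Streaming + foreachBatch.
# Paths, table names, and the "event_id" column are hypothetical.
from pyspark.sql import functions as F

def validate_batch(batch_df, batch_id):
    # Rules run once per micro-batch (the "chunk"), not per row.
    total = batch_df.count()
    null_ids = batch_df.filter(F.col("event_id").isNull()).count()
    if total > 0 and null_ids / total > 0.01:  # fail the batch if >1% null IDs
        raise ValueError(f"batch {batch_id}: {null_ids}/{total} null event_ids")
    batch_df.write.mode("append").saveAsTable("bronze.events")

(spark.readStream
    .format("cloudFiles")                      # Auto Loader
    .option("cloudFiles.format", "json")
    .load("/Volumes/raw/events/")              # assumed landing path
    .writeStream
    .foreachBatch(validate_batch)
    .trigger(processingTime="10 seconds")      # or .trigger(availableNow=True)
    .option("checkpointLocation", "/Volumes/chk/events/")  # assumed path
    .start())
```

Raising inside foreachBatch stops the stream on a bad batch; you could instead route failing records to a quarantine table if you want the pipeline to keep running.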

Chanukya

chanukya-pekala
Contributor II

If you have decided to use DLT, it handles micro-batching and checkpointing for you. But you can take more control by rewriting the logic with Auto Loader or Structured Streaming, pointing the stream at your own checkpoint directory and maintaining it yourself. DLT does all of this for you behind the scenes, so both approaches do the same thing, except for the pricing: a job cluster is far cheaper than DLT.
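
For comparison, the native DLT route keeps the DQ rules declarative and applies them per micro-batch. A minimal sketch using DLT expectations; the table name, column names, and source path are assumptions for illustration:

```python
# Sketch only: native DLT expectations on a streaming table.
# Table/column names and the source path are hypothetical.
import dlt

@dlt.table(name="bronze_events")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")  # drop violating rows
@dlt.expect("recent_event", "event_ts >= '2024-01-01'")        # warn only, keep rows
def bronze_events():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/raw/events/"))
```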

Chanukya
