topic Re: Expectations vs Great expectations with Databricks DLT pipelines in Data Engineering

Expectations vs Great expectations with Databricks DLT pipelines

RevathiTiger — Wed, 19 Feb 2025 05:57:08 GMT

Hi All,

We are working on creating a DQ framework on DLT pipelines in Databricks.

Databricks DLT pipelines read incoming data from Kafka and file sources. Once the data is ingested, data validation must run on top of it. The customer is evaluating whether Expectations or Great Expectations will work on real-time streaming pipelines. The DQ rules will be applied to the incoming data as a chunk read per event, not to individual rows.

Re: Expectations vs Great expectations with Databricks DLT pipelines

dataoculus_app — Wed, 18 Jun 2025 06:43:45 GMT

All of those DQ tools are built on a SQL architecture, so they are not built for streaming. They are also not built with efficiency or complex DQ checks on batch datasets in mind.
This is why we built a comprehensive data quality/monitoring platform. Happy to share how you can build one too; you can DM me.

Re: Expectations vs Great expectations with Databricks DLT pipelines

chanukya-pekala — Wed, 18 Jun 2025 11:02:32 GMT

I recommend using Spark Structured Streaming or Auto Loader with micro-batch processing. This approach allows processing data in discrete chunks (e.g., every 10 seconds or using availableNow for backfill scenarios), rather than handling individual rows.

By setting an appropriate trigger interval (e.g., processingTime = '10 seconds' or using availableNow), each micro-batch can include a manageable volume of data (e.g., 100,000 records), making it feasible and efficient to apply transformations and data quality validations such as null-value checks and custom data profiling rules.

Within each micro-batch, data validation rules can be implemented using custom Spark logic.
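
A minimal sketch of that idea, assuming a streaming DataFrame with an `id` column. The thresholds, the `dq_demo.clean` and `dq_demo.quarantine` table names, and the helper names are all hypothetical, chosen only for illustration. It computes metrics once per micro-batch inside a `foreachBatch` callback and then accepts or quarantines the chunk as a whole, rather than filtering individual rows.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # Spark is only needed for type hints at import time
    from pyspark.sql import DataFrame
    from pyspark.sql.streaming import StreamingQuery

# Hypothetical batch-level thresholds; tune them to your own rules.
MAX_NULL_RATIO = 0.01
MIN_ROWS = 1


def evaluate_batch(total_rows: int, null_ids: int) -> list[str]:
    """Return the names of the batch-level rules that failed."""
    failures = []
    if total_rows < MIN_ROWS:
        failures.append("min_rows")
    elif null_ids / total_rows > MAX_NULL_RATIO:
        failures.append("null_ratio_id")
    return failures


def validate_and_write(batch_df: DataFrame, batch_id: int) -> None:
    """foreachBatch callback: one aggregation per chunk, then route the chunk."""
    from pyspark.sql import functions as F

    metrics = batch_df.agg(
        F.count(F.lit(1)).alias("total_rows"),
        # count() skips nulls, so this counts only rows where id IS NULL
        F.count(F.when(F.col("id").isNull(), 1)).alias("null_ids"),
    ).first()
    failures = evaluate_batch(metrics["total_rows"], metrics["null_ids"])
    # Assumed table names; a failing chunk is quarantined as a whole.
    target = "dq_demo.quarantine" if failures else "dq_demo.clean"
    batch_df.write.mode("append").saveAsTable(target)


def start_stream(stream_df: DataFrame, checkpoint_path: str) -> StreamingQuery:
    """Attach the validator to a streaming DataFrame with a 10-second trigger."""
    return (
        stream_df.writeStream
        .foreachBatch(validate_and_write)
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="10 seconds")
        .start()
    )
```

Keeping `evaluate_batch` free of Spark calls means the rules can be unit tested without a cluster. For a backfill, the trigger could be swapped for `.trigger(availableNow=True)`.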

Re: Expectations vs Great expectations with Databricks DLT pipelines

chanukya-pekala — Wed, 18 Jun 2025 11:05:14 GMT

If you have decided to use DLT, it handles micro-batching and checkpointing for you. You can take more control by rewriting the logic with Auto Loader or Structured Streaming and maintaining the checkpoint directory yourself, whereas DLT does all of that for you behind the scenes. Both approaches do the same thing; the main difference is pricing. A job cluster is much cheaper than DLT.