Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Expectations vs Great Expectations with Databricks DLT pipelines

RevathiTiger
New Contributor II

Hi All,

We are working on creating a DQ framework on DLT pipelines in Databricks. 

Our Databricks DLT pipelines read incoming data from Kafka and file sources. Once data is ingested, data validation must happen on top of it. The customer is evaluating whether DLT Expectations or Great Expectations will work on real-time streaming pipelines. The DQ rules will be applied to the incoming data as a chunk read per event, not to individual rows.

3 REPLIES

dataoculus_app
New Contributor III

All of those DQ tools are built on a SQL architecture, so they are not designed for streaming; nor are they built for batch datasets with efficiency and complex DQ checks in mind.
This is why we built the most comprehensive data quality/monitoring platform. Happy to share how you can build one too; you can DM me.

chanukya-pekala
Contributor II

I recommend using Spark Structured Streaming or Auto Loader with micro-batch processing. This approach processes data in discrete chunks (e.g., every 10 seconds, or using availableNow for backfill scenarios) rather than handling individual rows.

By setting an appropriate trigger interval (e.g., processingTime = '10 seconds', or availableNow), each micro-batch contains a manageable volume of data (e.g., 100,000 records), making it feasible and efficient to apply transformations plus data quality validations such as null-value checks, custom data profiling rules, etc.

Within each micro-batch, the validation rules can be implemented with fully custom Spark logic, as in the sketch below.
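
A minimal sketch of what this could look like with Auto Loader and foreachBatch; the source path, checkpoint path, target table, and the event_id column are assumptions for illustration, not part of the original question:

```python
# Sketch only: batch-level DQ checks with Structured Streaming + foreachBatch.
# Paths, table names, and the "event_id" column are hypothetical.
from pyspark.sql import functions as F

def validate_batch(batch_df, batch_id):
    # Rules run once per micro-batch (the "chunk"), not per row.
    total = batch_df.count()
    null_ids = batch_df.filter(F.col("event_id").isNull()).count()
    if total > 0 and null_ids / total > 0.01:  # fail the batch if >1% null IDs
        raise ValueError(f"batch {batch_id}: {null_ids}/{total} null event_ids")
    batch_df.write.mode("append").saveAsTable("bronze.events")

(spark.readStream
    .format("cloudFiles")                      # Auto Loader
    .option("cloudFiles.format", "json")
    .load("/Volumes/raw/events/")              # assumed landing path
    .writeStream
    .foreachBatch(validate_batch)
    .trigger(processingTime="10 seconds")      # or .trigger(availableNow=True)
    .option("checkpointLocation", "/Volumes/chk/events/")  # assumed path
    .start())
```

Raising inside foreachBatch stops the stream on a bad batch; you could instead route failing records to a quarantine table if you want the pipeline to keep running.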

Chanukya

chanukya-pekala
Contributor II

If you have decided to use DLT, it handles micro-batching and checkpointing for you. But you can take more control by rewriting the logic with Auto Loader or Structured Streaming, pointing the stream at your own checkpoint directory and maintaining it yourself. DLT does all of this for you behind the scenes, so both approaches do the same thing, except for the pricing: a job cluster is far cheaper than DLT.
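
For comparison, the native DLT route keeps the DQ rules declarative and applies them per micro-batch. A minimal sketch using DLT expectations; the table name, column names, and source path are assumptions for illustration:

```python
# Sketch only: native DLT expectations on a streaming table.
# Table/column names and the source path are hypothetical.
import dlt

@dlt.table(name="bronze_events")
@dlt.expect_or_drop("valid_event_id", "event_id IS NOT NULL")  # drop violating rows
@dlt.expect("recent_event", "event_ts >= '2024-01-01'")        # warn only, keep rows
def bronze_events():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/raw/events/"))
```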

Chanukya
