Data getting missed while reading from Azure Event Hub using Spark streaming

Rishi045
New Contributor III

Hi All,

I am facing an issue where records go missing.

I read data from Azure Event Hub, flatten the JSON, and store it in a Parquet file. A separate Databricks notebook then adds some ETL columns and merges the data into my Delta table.

However, somewhere in between, records are being lost.

The job is scheduled to run every hour.

Can someone please help me with this?

11 REPLIES

Rishi045
New Contributor III

As of now, I do not have any foreachBatch in my code.

I am performing dedup on the entire data coming from Event Hub.

Hubert_Dudek1
Esteemed Contributor III

- In Event Hub, you can preview the event hub job using Azure Analytics, so please first check whether all records are there.

- Please set up Databricks so that the stream is saved directly to a bronze Delta table without any aggregation, just 1 to 1, and check whether all records are there.

- Please consider using Delta Live Tables for ingestion from Event Hub. It will make your life easier regarding stream monitoring, data quality, and performing a full refresh when needed.
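For the second point, here is a rough sketch of what a 1-to-1 bronze write could look like. It assumes you read Event Hub through its Kafka-compatible endpoint with the Kafka source that ships with Databricks; your job may use a different connector, in which case only the write side applies. The function names, `checkpoint_path`, and `bronze_table` are placeholders I made up for illustration, not anything from your code.

```python
def eventhub_kafka_options(namespace, event_hub, connection_string):
    """Build Kafka source options for an Event Hub (hypothetical helper).

    Assumes the Event Hubs Kafka endpoint and the shaded Kafka client
    class name used on Databricks runtimes.
    """
    jaas = (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
        f'required username="$ConnectionString" password="{connection_string}";'
    )
    return {
        "kafka.bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "subscribe": event_hub,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
        # Start from the oldest retained event on the first run.
        "startingOffsets": "earliest",
        # Fail loudly instead of silently skipping expired offsets.
        "failOnDataLoss": "true",
    }


def start_bronze_stream(spark, options, checkpoint_path, bronze_table):
    """Append every event as-is to a bronze Delta table (no dedup, no merge)."""
    raw = spark.readStream.format("kafka").options(**options).load()

    # Keep the payload plus partition/offset so gaps can be detected later.
    bronze = raw.selectExpr(
        "CAST(value AS STRING) AS body",
        "partition",
        "offset",
        "timestamp AS enqueued_time",
    )

    return (
        bronze.writeStream.format("delta")
        .outputMode("append")
        # The checkpoint stores the offsets already processed, so each
        # hourly run resumes where the previous one stopped.
        .option("checkpointLocation", checkpoint_path)
        # Process everything available, then stop (fits a scheduled job).
        .trigger(availableNow=True)
        .toTable(bronze_table)
    )
```

With the raw events landed this way, you can compare the bronze row count (or the offsets per partition) against Event Hub before any flattening, dedup, or merge runs. That should tell you whether records are lost during ingestion or in the later notebook.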
