Question about stateful processing
09-21-2024 01:17 PM
I'm experiencing an issue that I don't understand. I am using Python's arbitrary stateful processing with Structured Streaming to calculate metrics per item/ID. A timeout is set; when it expires, I clear the state for that item/ID and emit the ID with its statistics into a resulting data frame. I also save this data frame to a Delta file. However, when I load the data frame back from the Delta file and compare it to the computed data frame displayed in the notebook, the metrics differ. Why could this be happening, and which data frame is the correct one? No new data should factor into the resulting data frame, because an item/ID is only yielded into it once no new data has arrived for that ID within the defined timeout, and the data frame written to the file is exactly this resulting data frame.
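Roughly, the pipeline looks like the sketch below (column names, schemas, and paths are simplified placeholders, not my actual code), using applyInPandasWithState with a processing-time timeout:

```python
from typing import Iterator, Tuple
import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

# Assumed input: `events` is a streaming DataFrame with columns (id STRING, value DOUBLE).
output_schema = "id STRING, event_count LONG, value_sum DOUBLE"
state_schema = "event_count LONG, value_sum DOUBLE"

def compute_metrics(
    key: Tuple[str], pdfs: Iterator[pd.DataFrame], state: GroupState
) -> Iterator[pd.DataFrame]:
    if state.hasTimedOut:
        # No new data for this id within the timeout: emit its final metrics
        # and clear the state. This is the only place a row is yielded.
        count, total = state.get
        state.remove()
        yield pd.DataFrame(
            {"id": [key[0]], "event_count": [count], "value_sum": [total]}
        )
    else:
        # New data arrived: update the running metrics and reset the timeout.
        count, total = state.get if state.exists else (0, 0.0)
        for pdf in pdfs:
            count += len(pdf)
            total += float(pdf["value"].sum())
        state.update((count, total))
        state.setTimeoutDuration(60 * 1000)  # 1-minute inactivity timeout

result = events.groupBy("id").applyInPandasWithState(
    compute_metrics,
    outputStructType=output_schema,
    stateStructType=state_schema,
    outputMode="append",
    timeoutConf=GroupStateTimeout.ProcessingTimeTimeout,
)

# The same resulting data frame is what gets written to Delta.
query = (
    result.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/metrics")  # placeholder path
    .outputMode("append")
    .start("/tmp/delta/metrics")  # placeholder path
)
```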
09-22-2024 12:48 PM
Hi @fperry,
How are you doing today?
From my understanding, consider checking for differences in how the stateful streaming function writes and persists data. It's possible that although the state is cleared after the timeout, some state persists or is recalculated when saving to Delta. You might want to compare the data frames immediately before and after writing to Delta to confirm the output is consistent. Also ensure no new data is ingested after the timeout and that your job is not picking up additional micro-batches: if the stream keeps running, later appends to the Delta table will make it differ from the snapshot you displayed earlier in the notebook. If possible, validate your streaming logic to confirm that the data being saved to Delta is exactly what is displayed in the notebook.
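For the comparison, something along these lines should work (`displayed_df` and the path are placeholders for your resulting data frame and your Delta location):

```python
# Placeholder names: `displayed_df` is the resulting data frame shown in the
# notebook, "/path/to/delta/metrics" is the location the stream wrote to.
from_delta = spark.read.format("delta").load("/path/to/delta/metrics")

# Rows present on one side but missing on the other (exceptAll keeps duplicates).
displayed_df.exceptAll(from_delta).show()
from_delta.exceptAll(displayed_df).show()
```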
Please give it a try and let me know if it works.
Regards,
Brahma

