Hi All,
Could you please share the best practices for handling early arriving facts in Databricks when the data is streamed and processed in near real time using Structured Streaming?
There are many ways to handle this use case in batch or mini-batch processing. We are specifically looking for best practices for handling it with Structured Streaming in near real time.
Example of an early arriving fact:
Please refer to the tables below, which illustrate the early arriving fact scenario.
- One record (highlighted in red) is received in the SalesDetail transaction data whose corresponding customer (C4) has not yet been loaded into the DimCustomer dimension.
- In other words, the data for the fact table (FactSalesDetail) arrived earlier than the corresponding dimension data (C4 in DimCustomer).
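To make the scenario concrete, here is a minimal sketch in plain Python (not Structured Streaming, and not a proposed solution). The sample rows, column names, and the helper `split_early_arriving` are made up for illustration; only SalesDetail, DimCustomer, and C4 come from our actual case.

```python
# Hypothetical sample data: DimCustomer currently holds C1..C3 only.
dim_customer = {"C1": "Customer 1", "C2": "Customer 2", "C3": "Customer 3"}

# Hypothetical SalesDetail rows; the second one references C4,
# which has not been loaded into DimCustomer yet.
sales_detail = [
    {"sale_id": 1, "customer_id": "C1", "amount": 100.0},
    {"sale_id": 2, "customer_id": "C4", "amount": 250.0},
    {"sale_id": 3, "customer_id": "C3", "amount": 75.0},
]


def split_early_arriving(facts, dimension):
    """Separate facts whose customer key exists in the dimension
    from facts that arrived before their dimension row."""
    matched, early = [], []
    for row in facts:
        if row["customer_id"] in dimension:
            matched.append(row)
        else:
            early.append(row)
    return matched, early


matched, early = split_early_arriving(sales_detail, dim_customer)
print("matched sale ids:", [r["sale_id"] for r in matched])
print("early arriving sale ids:", [r["sale_id"] for r in early])
```

The sale that references C4 is the early arriving fact. The question is how best to handle such rows when the check runs continuously against a stream instead of a static list.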
Regards,
Phani