Help design my streaming pipeline

### Data Source
- AWS RDS
- Database migration tasks have been created using AWS DMS
- The relevant CDC data is stored in a dedicated S3 bucket

### Data frequency
- Once a day (exact time unknown, sometime after 6 pm)

### Development environment
- Databricks
- Delta Live Tables (DLT) on Databricks

### Data Status
- CLOSE_DT, CURR_F_CD, and CURR_T_CD form the primary key and are also the JOIN conditions
- CLOSE_DT is DATE type
- Data arrives from the source (RDS) once a day on weekdays.
- This data is written to S3 as CDC records via AWS DMS

### Processing requirements
- No data arrives at the source on weekends or holidays, but those dates must still be filled with the most recent available data.
- On weekdays data arrives once a day, and the presence or absence of rows with today's CLOSE_DT indicates whether today's data has arrived.
- For example, suppose today is 2023-12-28.
- It is not known when the data with a CLOSE_DT of 2023-12-28 will arrive during the day.
- Until it arrives, the 2023-12-28 data is generated from the most recent data (2023-12-27).
- When the actual 2023-12-28 data arrives, it replaces the generated data.
- On holidays no data arrives at all, so each day's data must be generated from the most recent data.
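
To make the requirement concrete, here is a minimal plain-Python sketch of the fill-and-swap rule described above. It is not DLT code, only the logic I would expect the pipeline to reproduce. The `RATE` value column, the `IS_FILLED` flag, and the function names `rows_for_date` and `fill_calendar` are hypothetical and only there for illustration; the only columns taken from my actual table are CLOSE_DT, CURR_F_CD, and CURR_T_CD.

```python
from datetime import date, timedelta


def rows_for_date(rows, target):
    """Return the rows to expose for `target`.

    `rows` holds only real source rows (dicts with a CLOSE_DT date).
    If real rows for `target` exist they are returned as-is; otherwise
    the rows of the most recent earlier CLOSE_DT are copied and restamped.
    IS_FILLED is a hypothetical flag marking generated rows.
    """
    actual = [r for r in rows if r["CLOSE_DT"] == target]
    if actual:
        return [dict(r, IS_FILLED=False) for r in actual]

    earlier_dates = [r["CLOSE_DT"] for r in rows if r["CLOSE_DT"] < target]
    if not earlier_dates:
        return []  # nothing to carry forward yet

    latest = max(earlier_dates)
    return [
        dict(r, CLOSE_DT=target, IS_FILLED=True)
        for r in rows
        if r["CLOSE_DT"] == latest
    ]


def fill_calendar(rows, start, end):
    """Build one set of rows per calendar day from `start` to `end` inclusive."""
    out = []
    day = start
    while day <= end:
        out.extend(rows_for_date(rows, day))
        day += timedelta(days=1)
    return out


# Hypothetical sample: one currency pair with a made-up RATE column.
source = [
    {"CLOSE_DT": date(2023, 12, 27), "CURR_F_CD": "USD", "CURR_T_CD": "KRW", "RATE": 1294.5},
]

# Before the 2023-12-28 data arrives, it is generated from 2023-12-27.
before = rows_for_date(source, date(2023, 12, 28))

# Once the real 2023-12-28 row arrives, it replaces the generated one.
source.append(
    {"CLOSE_DT": date(2023, 12, 28), "CURR_F_CD": "USD", "CURR_T_CD": "KRW", "RATE": 1290.0}
)
after = rows_for_date(source, date(2023, 12, 28))
```

Because the output is always recomputed from the real source rows, the "swap" happens automatically once the real CLOSE_DT shows up, and a run of consecutive holidays keeps copying from the last real date rather than from an earlier generated one.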