Data Engineering

Help design my streaming pipeline

rt-slowth
Contributor

### Data Source
- AWS RDS
- Database migration tasks have been created using AWS DMS
- The resulting CDC records are stored in a dedicated S3 bucket

### Data frequency
- Once a day, at an unpredictable time (sometime after 6 PM)

### Development environment
- Databricks
- Delta Live Tables on Databricks

### Data Status
- CLOSE_DT, CURR_F_CD, and CURR_T_CD together form the primary key and are used as the JOIN conditions
- CLOSE_DT is of type DATE
- Data arrives from the source (RDS) once a day on weekdays
- AWS DMS writes this data to S3 as CDC records

### Processing requirements
- No data arrives from the source on weekends or holidays, but those days must still be populated from the most recent available data.
- On weekdays data arrives once a day, so whether rows with today's CLOSE_DT exist tells you if today's data has arrived yet.
- For example, suppose today is 2023-12-28.
- You don't know at what time the rows with a CLOSE_DT of 2023-12-28 will arrive.
- Until they arrive, the 2023-12-28 rows have to be generated from the most recent data, which is 2023-12-27.
- Once the real 2023-12-28 data arrives, it replaces the generated rows.
- On holidays no data arrives at all, so each such day must be generated from the most recent data.
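
To make the requirement above concrete, here is a minimal plain-Python sketch of the carry-forward rule, not a Delta Live Tables implementation. The key columns CLOSE_DT, CURR_F_CD, and CURR_T_CD come from the description above; the `RATE` column, the `SOURCE_CLOSE_DT` marker, and the function names are hypothetical and only there for illustration. It assumes "most recent data" means the latest CLOSE_DT on or before the target date for each (CURR_F_CD, CURR_T_CD) pair.

```python
from datetime import date, timedelta


def rows_for_date(rows, target):
    """Build the rows for `target` from the latest row on or before it, per currency pair.

    `rows` is a list of dicts holding CLOSE_DT, CURR_F_CD, CURR_T_CD plus any
    value columns. SOURCE_CLOSE_DT (hypothetical) records which day the values
    really came from, so carried-forward rows can be told apart from real ones.
    """
    latest = {}
    for row in rows:
        if row["CLOSE_DT"] > target:
            continue  # ignore anything later than the day being built
        key = (row["CURR_F_CD"], row["CURR_T_CD"])
        if key not in latest or row["CLOSE_DT"] > latest[key]["CLOSE_DT"]:
            latest[key] = row
    return [
        {**row, "CLOSE_DT": target, "SOURCE_CLOSE_DT": row["CLOSE_DT"]}
        for row in latest.values()
    ]


def fill_calendar(rows, start, end):
    """Apply rows_for_date to every calendar day from start to end inclusive."""
    filled = []
    day = start
    while day <= end:
        filled.extend(rows_for_date(rows, day))
        day += timedelta(days=1)
    return filled
```

Recomputing a day after its real rows arrive gives the "swap" behaviour: before arrival `SOURCE_CLOSE_DT` points at the previous day, afterwards it equals CLOSE_DT. In a Delta Live Tables pipeline the same rule would more likely be expressed as a join between a calendar of dates and the CDC-applied table than as a Python loop; that translation is not shown here.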

