I am designing a strategy to ingest 10 tables from a PostgreSQL 10 database into the Bronze layer using only Databricks (no ADF or other external tools).
- Full Load: 7 tables will be fully reloaded on each run (see the JDBC read sketch after this list).
- Incremental Load: 3 tables have >10M rows, so I plan to perform an initial full load followed by incremental ingestion.
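For context, this is roughly how I read one of the full-load tables today, in a Databricks notebook where `spark` and `dbutils` are predefined. The host, database, secret scope, table name, and Bronze path are all placeholders:

```python
jdbc_url = "jdbc:postgresql://pg-host:5432/mydb"   # placeholder host/db
user = dbutils.secrets.get(scope="pg-prod", key="pg-user")
password = dbutils.secrets.get(scope="pg-prod", key="pg-password")

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "public.customers")   # one of the 7 full-load tables
      .option("user", user)
      .option("password", password)
      .option("fetchsize", "10000")            # larger fetch size, fewer round trips
      .load())

(df.write.format("delta")
   .mode("overwrite")                          # full refresh on every run
   .save("/mnt/bronze/customers"))             # placeholder Bronze path
```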
I am looking for best practices and efficient approaches to achieve this in Databricks, considering:
- The best way to track and implement incremental loads for the 3 high-volume tables (my watermark idea is sketched after this list).
- Handling schema changes, deduplication, and performance optimization (see the MERGE sketch below).
- Ensuring transactional consistency across the ingestion process.
- Secure storage and retrieval of PostgreSQL credentials within Databricks (secret-scope snippet below).
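For the incremental tables, I am considering a small Delta control table of watermarks plus a pushdown subquery. This is only a sketch: it assumes each large table has a monotonically increasing `updated_at` column and a numeric `id` key, and the control-table path and bounds are placeholders:

```python
from pyspark.sql import functions as F

# jdbc_url, user, password as in the full-load snippet above.
jdbc_url = "jdbc:postgresql://pg-host:5432/mydb"
user = dbutils.secrets.get(scope="pg-prod", key="pg-user")
password = dbutils.secrets.get(scope="pg-prod", key="pg-password")

# Read the last successful watermark from a small Delta control table.
last_wm = (spark.read.format("delta")
           .load("/mnt/bronze/_watermarks")               # placeholder path
           .filter(F.col("table_name") == "orders")
           .agg(F.max("watermark_value"))
           .first()[0])

# Wrap the filter in a subquery so PostgreSQL prunes rows, not Spark.
src = f"(SELECT * FROM public.orders WHERE updated_at > '{last_wm}') AS src"

incr = (spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", src)
        .option("user", user)
        .option("password", password)
        .option("partitionColumn", "id")       # parallelize the large read
        .option("lowerBound", "1")
        .option("upperBound", "10000000")      # rough id range, placeholder
        .option("numPartitions", "8")
        .load())
```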
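For deduplication and schema drift, my current thinking is a Delta `MERGE` with automatic schema evolution enabled. Again a sketch, assuming `id` is the primary key, `updated_at` orders versions, and `incr` is the DataFrame from the previous snippet:

```python
from delta.tables import DeltaTable
from pyspark.sql import Window, functions as F

# Allow MERGE to add columns that appear in the source but not yet the target.
spark.conf.set("spark.databricks.delta.schemaEvolution.autoMerge.enabled", "true")

# Keep only the latest row per key within the batch before merging.
latest = (incr.withColumn(
              "rn",
              F.row_number().over(
                  Window.partitionBy("id").orderBy(F.col("updated_at").desc())))
          .filter("rn = 1")
          .drop("rn"))

(DeltaTable.forPath(spark, "/mnt/bronze/orders").alias("t")
 .merge(latest.alias("s"), "t.id = s.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```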
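On the last point, I plan to pull credentials from a Databricks secret scope at runtime; the scope and key names here are placeholders I would register beforehand (Databricks-backed or Azure Key Vault-backed):

```python
# Secret values are redacted when displayed in notebook output, so they
# should not leak into logs.
user = dbutils.secrets.get(scope="pg-prod", key="pg-user")
password = dbutils.secrets.get(scope="pg-prod", key="pg-password")
```

Is this overall direction sound, and what would you change?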