Best Strategy for Ingesting PostgreSQL Data into Bronze Layer in Databricks
02-01-2025 04:36 PM
I am designing an ingestion strategy for loading 10 tables from a PostgreSQL 10 database into the Bronze layer using Databricks only (without ADF or other external tools).
- Full Load: 7 tables will be fully loaded in each run.
- Incremental Load: 3 tables have >10M rows, so I plan to perform an initial full load followed by incremental ingestion.
I am looking for best practices and efficient approaches to achieve this in Databricks, considering:
- The best way to track and implement incremental loads for the 3 high-volume tables.
- Handling schema changes, deduplication, and performance optimization.
- Ensuring transactional consistency across the ingestion process.
- Secure storage and retrieval of PostgreSQL credentials within Databricks.
1 REPLY
02-01-2025 06:09 PM
Hello @ijaza0489,
Here are key points to keep in mind:
- Tracking and Implementing Incremental Loads:
- Delta Lake: Use Delta Lake to manage incremental loads. Delta Lake supports ACID transactions and efficient upserts; for the three high-volume tables, use the MERGE operation to apply incremental updates (see the sketch below).
- Change Data Capture (CDC): Since you are not using external CDC tooling, track changes with a watermark column on the source (for example an updated_at timestamp or a monotonically increasing key) and pull only rows newer than the last value already in Bronze. If downstream layers need row-level changes, enable Delta Lake's Change Data Feed (the delta.enableChangeDataFeed table property) on the Bronze tables.
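A minimal sketch of a watermark-plus-MERGE incremental load for one of the large tables, assuming the source has an updated_at timestamp and an order_id key; the table names, column names, host, and secret scope are placeholders:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Placeholder connection details; credentials come from a secret scope
# (see the secrets section at the end).
jdbc_url = "jdbc:postgresql://<host>:5432/<database>"
props = {
    "user": dbutils.secrets.get(scope="postgres", key="user"),
    "password": dbutils.secrets.get(scope="postgres", key="password"),
    "driver": "org.postgresql.Driver",
}

bronze = DeltaTable.forName(spark, "bronze.orders")

# High-water mark already in Bronze; on the very first run do a plain full load instead.
last_ts = bronze.toDF().agg(F.max("updated_at")).first()[0]

# Push the filter down to PostgreSQL so only changed rows cross the wire.
src_query = f"(SELECT * FROM public.orders WHERE updated_at > '{last_ts}') AS src"
changes = spark.read.jdbc(jdbc_url, src_query, properties=props)

# Upsert the changed rows into the Bronze Delta table, keyed on the primary key.
(bronze.alias("t")
    .merge(changes.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```

Recomputing the high-water mark from the Bronze table (as above) keeps each run idempotent without a separate watermark table.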
- Handling Schema Changes:
- Schema Evolution: Delta Lake supports schema evolution, which allows you to handle changes in the schema of your source tables. You can enable schema evolution by setting the mergeSchema option to true when writing data to Delta tables.
- Schema Inference: If you stage PostgreSQL extracts as files in cloud storage first, Databricks Auto Loader (the cloudFiles source) can automatically infer and evolve the schema as new files arrive.
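A minimal sketch of the mergeSchema option mentioned above when appending a batch to a Bronze table; bronze.orders is a placeholder name:

```python
# Append new data and allow columns added on the PostgreSQL side to be picked up.
(changes.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("bronze.orders"))
```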
- Deduplication:
- Primary Key Constraints: You can declare primary key constraints on Delta tables in Unity Catalog to document uniqueness, but they are informational only and not enforced, so duplicates must still be prevented in the ingestion logic itself.
- Deduplication Logic: Implement deduplication in your ingestion pipeline, e.g. keep the latest record per unique identifier (or combination of columns) before merging into Bronze (see the sketch below).
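A minimal sketch of that deduplication step, keeping the most recent row per business key within the incoming batch before the MERGE; the key and ordering columns are assumptions:

```python
from pyspark.sql import Window, functions as F

# Keep only the latest version of each order_id in the batch.
w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
deduped = (changes
    .withColumn("_rn", F.row_number().over(w))
    .filter("_rn = 1")
    .drop("_rn"))
```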
- Performance Optimization:
- Optimize and Z-Order: Use the OPTIMIZE command to compact small files and improve query performance. Combine it with ZORDER to physically sort data by frequently queried columns.
- Partitioning: Partition your Delta tables on columns that frequently appear in filter conditions. Avoid over-partitioning: for tables under roughly 1 TB, partitioning is often unnecessary.
- Auto Optimize: Enable the table properties delta.autoOptimize.optimizeWrite and delta.autoOptimize.autoCompact so file sizes are optimized automatically during writes (see the sketch below).
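A minimal sketch of these maintenance settings run from a notebook; the table name and Z-Order column are placeholders:

```python
# Compact small files and co-locate rows by a frequently filtered column.
spark.sql("OPTIMIZE bronze.orders ZORDER BY (customer_id)")

# Opt the table into optimized writes and auto compaction.
spark.sql("""
    ALTER TABLE bronze.orders SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```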
- Transactional Consistency:
- ACID Transactions: Delta Lake ensures transactional consistency with ACID properties. Use Delta Lake's transaction log to maintain consistency across your ingestion process.
- Checkpointing: Implement checkpointing in your streaming pipelines to ensure that data is processed exactly once and to recover from failures.
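If part of the pipeline runs as Structured Streaming (for example, Auto Loader over staged extract files), a minimal sketch of checkpointed ingestion; all paths and table names are placeholders:

```python
# Incrementally ingest staged files into Bronze; the checkpoint lets a failed run
# resume without reprocessing files it has already committed.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/mnt/bronze/_schemas/orders")
    .load("/mnt/landing/orders")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/bronze/_checkpoints/orders")
    .trigger(availableNow=True)   # process everything available, then stop
    .toTable("bronze.orders"))
```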
- Secure Storage and Retrieval of PostgreSQL Credentials:
- Databricks Secrets: Store PostgreSQL credentials securely using Databricks Secrets. Create a secret scope and store your credentials as secrets. Access these secrets in your notebooks or jobs using the dbutils.secrets.get function.
- Environment Variables: Avoid hardcoding credentials in your code. Use environment variables or secret management tools to securely manage and retrieve credentials.
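A minimal sketch of pulling credentials from a secret scope for the JDBC read; the scope name "postgres", the key names, and the connection details are assumptions:

```python
# Retrieve credentials from a Databricks secret scope instead of hardcoding them.
user = dbutils.secrets.get(scope="postgres", key="user")
password = dbutils.secrets.get(scope="postgres", key="password")

df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")
    .option("dbtable", "public.orders")
    .option("user", user)
    .option("password", password)
    .option("driver", "org.postgresql.Driver")
    .load())
```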

