Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Synchronize SQL Server tables to Databricks

peritus
New Contributor

I'm new to Databricks and I'm looking to get data from an external database into Databricks and keep it synchronized when changes occur in the source tables. It seems like I may be able to use some form of change data capture together with Delta Live Tables. I've looked at some documentation and online tutorials related to CDC and Databricks, but I haven't found anything that really shows an example of setting it all up, so I'm looking for someone to point me in the right direction. My question is basically: how can I configure an external database for synchronization and maintain a replica of that data within Databricks?






3 REPLIES

Walter_C
Databricks Employee

fmadeiro
New Contributor

Step 1: Set Up a JDBC Connection

  1. Install the JDBC Driver

    • Ensure you have the JDBC driver for the external database (e.g., MySQL, Postgres, SQL Server).
    • Make the driver available on the cluster, typically by uploading the driver JAR as a cluster library (JDBC drivers are Java JARs, so %pip will not install them); drivers for common databases such as SQL Server are typically already bundled with the Databricks Runtime.
  2. Create a JDBC Connection String
    Use the syntax for the database. For example:

    jdbc:mysql://<hostname>:<port>/<database>?user=<username>&password=<password>

    Replace <hostname>, <port>, <database>, <username>, and <password> with the appropriate values (a SQL Server variant is shown after this list).

  3. Read Data from the External Database
    Use the following Databricks command to read data:

    python
    jdbc_url = "jdbc:mysql://<hostname>:<port>/<database>"
    properties = {
        "user": "<username>",
        "password": "<password>",
        "driver": "com.mysql.cj.jdbc.Driver"
    }
    df = spark.read.jdbc(url=jdbc_url, table="source_table", properties=properties)
    df.display()
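
Since the original question is about SQL Server rather than MySQL, the connection string in step 2 would look roughly like this instead (a sketch; recent versions of the Microsoft JDBC driver may additionally require encrypt and trustServerCertificate options):

    jdbc:sqlserver://<hostname>:<port>;databaseName=<database>;user=<username>;password=<password>

In the Python example above, the driver class would then be "com.microsoft.sqlserver.jdbc.SQLServerDriver".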

Step 2: Enable Change Data Capture (CDC)

Most databases have built-in CDC capabilities. Here are some examples:

  • MySQL: Use binlog with tools like Debezium.
  • SQL Server: Enable CDC for the tables with (a fuller T-SQL sketch follows this list):
    sql
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'schema_name',
        @source_name   = N'table_name',
        @role_name     = NULL;
  • Postgres: Use logical replication slots or tools like Debezium.
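
Since the thread is specifically about SQL Server, a slightly fuller T-SQL sketch: CDC has to be enabled at the database level before sys.sp_cdc_enable_table will work, and the captured changes can then be read back from the generated change function. The dbo_orders capture instance below is a placeholder:

    sql
    -- Enable CDC at the database level first (requires sysadmin).
    EXEC sys.sp_cdc_enable_db;

    -- After sys.sp_cdc_enable_table (previous list item), changes are exposed
    -- through a function generated for the capture instance (dbo_orders here);
    -- __$operation is 1 = delete, 2 = insert, 3/4 = update before/after image.
    DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_orders');
    DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();
    SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_orders(@from_lsn, @to_lsn, N'all');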

Step 3: Stream Data into Databricks

  1. Set Up a Streaming Framework
    Use a tool like Apache Kafka, AWS DMS, or a Databricks partner integration to stream data into Databricks.

  2. Configure Delta Live Tables (DLT)
    Create a Delta Live Table pipeline for CDC using the Databricks SQL syntax.

    Example:

    sql
    CREATE OR REFRESH STREAMING LIVE TABLE cdc_table AS
    SELECT * FROM (
      SELECT *, _change_type
      FROM cloud_files("/path/to/cdc/files", "json", map("mergeSchema", "true"))
    )
    WHERE _change_type IN ('insert', 'update', 'delete');
  3. Merge Updates into Delta Table
    Use the MERGE INTO command to handle inserts, updates, and deletions:

    sql
    MERGE INTO target_table t
    USING cdc_table c
      ON t.id = c.id
    WHEN MATCHED AND c._change_type = 'update' THEN UPDATE SET *
    WHEN MATCHED AND c._change_type = 'delete' THEN DELETE
    WHEN NOT MATCHED THEN INSERT *;
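
As an alternative to hand-writing the MERGE above, Delta Live Tables offers APPLY CHANGES INTO, which handles ordering and deletes for CDC feeds. A minimal sketch, assuming the change feed is available as the streaming table cdc_table keyed on id; the _commit_timestamp ordering column is a placeholder:

    sql
    CREATE OR REFRESH STREAMING LIVE TABLE target_table;

    APPLY CHANGES INTO live.target_table
    FROM stream(live.cdc_table)
    KEYS (id)
    APPLY AS DELETE WHEN _change_type = 'delete'
    SEQUENCE BY _commit_timestamp
    COLUMNS * EXCEPT (_change_type, _commit_timestamp)
    STORED AS SCD TYPE 1;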

Step 4: Schedule Synchronization

  • Use Databricks Workflows to schedule your DLT pipelines to run periodically and keep the target data synchronized.

Step 5: Monitor and Optimize

  • Monitor the pipeline using the Databricks Jobs and DLT dashboard for errors and performance issues.
  • Optimize the process by using partitioning, Z-order indexing, and caching.
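
For the optimization step, a minimal example, assuming the target table and merge key from the MERGE example above:

    sql
    OPTIMIZE target_table ZORDER BY (id);
    VACUUM target_table;  -- remove data files no longer referenced by the table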

This approach allows you to maintain a replica of your external database in Databricks with minimal latency and high accuracy.

RiyazAli
Valued Contributor

Hey @peritus ,

I would suggest using Lakehouse Federation and creating a DLT pipeline to read the tables as materialised views. If you trigger a refresh of that pipeline at a scheduled interval, you should have the SQL Server data replicated in Databricks. A sketch of that setup follows below.
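
A rough sketch of that setup, assuming a Unity Catalog workspace; the connection, catalog, host, and credential values are placeholders (in practice the password would come from a secret rather than being inlined):

sql
-- 1. Register the SQL Server instance as a Lakehouse Federation connection.
CREATE CONNECTION sqlserver_conn TYPE sqlserver
OPTIONS (host '<hostname>', port '1433', user '<username>', password '<password>');

-- 2. Expose one of its databases as a foreign catalog.
CREATE FOREIGN CATALOG sqlserver_cat
USING CONNECTION sqlserver_conn
OPTIONS (database '<database>');

-- 3. In a DLT pipeline, materialise a federated table and refresh it on a schedule.
CREATE OR REFRESH MATERIALIZED VIEW source_table_mv AS
SELECT * FROM sqlserver_cat.dbo.source_table;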

Riz
