Wednesday
I'm new to Databricks and I'm looking to get data from an external database into Databricks and keep it synchronized when changes occur in the source tables. It seems like I may be able to use some form of change data capture with Delta Live Tables. I've looked at some documentation and online tutorials related to CDC and Databricks, but I haven't found anything that really shows an example of setting it all up, so I guess I'm looking for someone to point me in the right direction here. My question is basically: how can I configure an external database for synchronization and maintain a replica of that data within Databricks?
Wednesday
You can refer to https://docs.databricks.com/en/connect/index.html and specifically to a connection by using JDBC: https://docs.databricks.com/en/connect/external-systems/jdbc.html
Wednesday
Install the JDBC Driver
Create a JDBC Connection String
Use the connection-string syntax for your database, replacing <hostname>, <port>, <database>, <username>, and <password> with the appropriate values.
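For instance, a PostgreSQL connection string (hostname, port, and database name below are hypothetical placeholders) can be assembled like this:

```python
# Hypothetical values -- substitute your own source database details.
hostname = "db.example.com"
port = 5432
database = "sales"

# Standard JDBC URL shape for PostgreSQL; other databases use a
# different prefix (e.g. jdbc:sqlserver://, jdbc:mysql://).
jdbc_url = f"jdbc:postgresql://{hostname}:{port}/{database}"
```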
Read Data from the External Database
Use the following Databricks command to read data:
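A minimal PySpark sketch, assuming hypothetical PostgreSQL connection details and a hypothetical `public.orders` table (the read call is guarded so the snippet is harmless outside a Databricks notebook, where `spark` is predefined):

```python
# Hypothetical connection details -- substitute your own.
jdbc_url = "jdbc:postgresql://db.example.com:5432/sales"
connection_properties = {
    "user": "<username>",
    "password": "<password>",
    "driver": "org.postgresql.Driver",
}

try:
    # Read the source table into a DataFrame via JDBC.
    df = spark.read.jdbc(url=jdbc_url, table="public.orders",
                         properties=connection_properties)
    df.show(5)
except NameError:
    pass  # no SparkSession available outside Databricks
```

For large tables, the same call accepts partitioning options (`column`, `lowerBound`, `upperBound`, `numPartitions`) to parallelize the read.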
Most databases have built-in CDC capabilities. Some examples: SQL Server Change Data Capture, PostgreSQL logical replication, MySQL binary-log (binlog) replication, and Oracle redo-log-based tools such as GoldenGate.
Set Up a Streaming Framework
Use a tool like Apache Kafka, AWS DMS, or a Databricks partner integration to stream data into Databricks.
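If the stream carries Debezium-formatted events, each message is a JSON envelope with before/after row images and an operation code. A minimal pure-Python sketch of interpreting one such event (the envelope is simplified, and the row values are hypothetical; real events also carry schema and source metadata):

```python
import json

# Simplified Debezium-style change event for an update ("op": "u").
event = json.loads("""
{
  "payload": {
    "op": "u",
    "before": {"id": 1, "status": "new"},
    "after":  {"id": 1, "status": "shipped"},
    "ts_ms": 1735000000000
  }
}
""")

payload = event["payload"]
if payload["op"] in ("c", "u", "r"):
    # Create, update, or snapshot read: upsert the "after" image.
    row = payload["after"]
elif payload["op"] == "d":
    # Delete: the key to remove is in the "before" image.
    row = payload["before"]
```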
Configure Delta Live Tables (DLT)
Create a Delta Live Table pipeline for CDC using the Databricks SQL syntax.
Example:
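A minimal sketch in the DLT SQL form, assuming a staged CDC feed `customers_cdc_raw` with an operation column `op` and an ordering column `ts_ms` (all table and column names here are hypothetical):

```sql
-- Target streaming table maintained by APPLY CHANGES.
CREATE OR REFRESH STREAMING TABLE customers;

-- Apply the staged change feed: upserts keyed on customer_id,
-- ordered by ts_ms, with rows flagged "D" treated as deletes.
APPLY CHANGES INTO live.customers
FROM STREAM(live.customers_cdc_raw)
KEYS (customer_id)
APPLY AS DELETE WHEN op = "D"
SEQUENCE BY ts_ms
STORED AS SCD TYPE 1;
```

`STORED AS SCD TYPE 2` instead keeps full history of each row rather than the latest state.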
Merge Updates into Delta Table
Use the MERGE INTO command to handle inserts, updates, and deletions:
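A sketch of such a merge issued from Python, with hypothetical table names (`target` for the replica, `updates` for the staged changes) and an assumed `op` column marking deleted rows:

```python
# Hypothetical table and column names -- adapt to your schema.
merge_sql = """
MERGE INTO target AS t
USING updates AS s
  ON t.id = s.id
WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

try:
    spark.sql(merge_sql)  # `spark` is predefined in a Databricks notebook
except NameError:
    pass  # no SparkSession outside Databricks
```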
This approach allows you to maintain a replica of your external database in Databricks with minimal latency and high accuracy.
yesterday
Hey @peritus ,
I would suggest using Lakehouse Federation and creating a DLT pipeline that reads the tables as Materialised Views. If you trigger a refresh of that pipeline at a scheduled interval, you should have the SQL Server data replicated in Databricks.
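As a sketch of that approach, assuming the federated SQL Server connection is registered as a catalog named `sqlserver_cat` (name hypothetical), a materialized view in the pipeline could be as simple as:

```sql
-- Snapshot of the federated source table; refreshed on the
-- pipeline's schedule.
CREATE OR REFRESH MATERIALIZED VIEW orders_replica AS
SELECT * FROM sqlserver_cat.dbo.orders;
```

Note this re-reads the source on each refresh rather than consuming a change feed, so it trades CDC-style low latency for simplicity.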
17 hours ago
First off, I'd like to thank the community here for all their detailed timely responses, even during the holidays. You have a great group here. I will look into these avenues and may post some follow-up questions.
16 hours ago - last edited 16 hours ago
To synchronize data from an external database into Databricks with change data capture (CDC), you can use Delta Live Tables (DLT). Start by configuring a JDBC connection to your source database and use a CDC tool (like Debezium or database-native CDC mechanisms) to capture changes. In Databricks, create a DLT pipeline that reads the captured changes (e.g., from a Kafka topic or directly from change tables) and writes them to a Delta table. This ensures your Databricks Delta table stays in sync with the source. Check Databricks' documentation on DLT and CDC for setup steps and examples.