Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Synchronize SQLServer tables to Databricks

peritus
New Contributor

I'm new to Databricks, and I'm looking to get data from an external database into Databricks and keep it synchronized when changes occur in the source tables. It seems like I may be able to use some form of change data capture together with Delta Live Tables. I've looked at some documentation and online tutorials related to CDC and Databricks, but I haven't found anything that really shows an example of setting it all up, so I'm looking for someone to point me in the right direction here. My question is basically: how can I configure an external database for synchronization and maintain a replica of that data within Databricks?






3 REPLIES

Walter_C
Databricks Employee

fmadeiro
Visitor

Step 1: Set Up a JDBC Connection

  1. Install the JDBC Driver

    • Ensure you have the JDBC driver for the external database (e.g., MySQL, Postgres, SQL Server).
    • Attach the driver to the cluster as a library (for example, via Maven coordinates or an uploaded JAR); drivers for common databases such as SQL Server and MySQL already ship with the Databricks Runtime.
  2. Create a JDBC Connection String
    Use the syntax for the database. For example:

    jdbc:mysql://<hostname>:<port>/<database>?user=<username>&password=<password>

    Replace <hostname>, <port>, <database>, <username>, and <password> with the appropriate values.

  3. Read Data from the External Database
    Use the following Databricks command to read data (a SQL Server variant that pulls credentials from a secret scope is sketched after this list):

    python
    jdbc_url = "jdbc:mysql://<hostname>:<port>/<database>"
    properties = {
        "user": "<username>",
        "password": "<password>",
        "driver": "com.mysql.cj.jdbc.Driver"
    }
    df = spark.read.jdbc(url=jdbc_url, table="source_table", properties=properties)
    df.display()
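
Since the question is about SQL Server, here is a minimal sketch of the same read against SQL Server, pulling credentials from a Databricks secret scope instead of hardcoding them. The secret scope name ("sqlserver-creds"), the keys, and the table name are illustrative assumptions, not values from this thread.

python
# Minimal sketch for SQL Server; the secret scope, keys, and table name are placeholders.
jdbc_url = "jdbc:sqlserver://<hostname>:1433;databaseName=<database>"

properties = {
    "user": dbutils.secrets.get(scope="sqlserver-creds", key="user"),
    "password": dbutils.secrets.get(scope="sqlserver-creds", key="password"),
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# This reads a full snapshot of the table; change capture is handled in the next steps.
df = spark.read.jdbc(url=jdbc_url, table="dbo.source_table", properties=properties)
df.display()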

Step 2: Enable Change Data Capture (CDC)

Most databases have built-in CDC capabilities. Here are some examples:

  • MySQL: Use binlog with tools like Debezium.
  • SQL Server: First enable CDC at the database level with EXEC sys.sp_cdc_enable_db, then enable it for each table with:
    sql
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'schema_name',
        @source_name   = N'table_name',
        @role_name     = NULL;
  • Postgres: Use logical replication slots or tools like Debezium.
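
Once CDC is enabled in SQL Server, the captured changes can be queried through the cdc.fn_cdc_get_all_changes_<capture_instance> table-valued function. The sketch below pulls every captured change for one capture instance over JDBC; the capture instance name (dbo_source_table) is the default for dbo.source_table, and a real pipeline would persist the last processed LSN between runs instead of always reading from the minimum.

python
# Hedged sketch: read SQL Server CDC changes over JDBC.
# jdbc_url / properties are the SQL Server values from Step 1; the capture
# instance name "dbo_source_table" is an assumption for dbo.source_table.
cdc_query = """
    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_source_table(
        sys.fn_cdc_get_min_lsn('dbo_source_table'),
        sys.fn_cdc_get_max_lsn(),
        'all'
    )
"""

changes_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("query", cdc_query)
    .option("user", properties["user"])
    .option("password", properties["password"])
    .option("driver", properties["driver"])
    .load()
)

# __$operation: 1 = delete, 2 = insert, 3 = update (before image), 4 = update (after image)
changes_df.display()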

Step 3: Stream Data into Databricks

  1. Set Up a Streaming Framework
    Use a tool like Apache Kafka, AWS DMS, or a Databricks partner integration to stream data into Databricks.

  2. Configure Delta Live Tables (DLT)
    Create a Delta Live Table pipeline for CDC using the Databricks SQL syntax.

    Example:

    sql
    CREATE OR REFRESH STREAMING LIVE TABLE cdc_table AS
    SELECT * FROM (
      SELECT *, _change_type
      FROM cloud_files(
        "/path/to/cdc/files",
        "json",
        map("mergeSchema", "true")
      )
    )
    WHERE _change_type IN ('insert', 'update', 'delete');
  3. Merge Updates into Delta Table
    Use the MERGE INTO command to handle inserts, updates, and deletes (a Python alternative using the DLT APPLY CHANGES API is sketched after this list):

    sql
    MERGE INTO target_table t
    USING cdc_table c
      ON t.id = c.id
    WHEN MATCHED AND c._change_type = 'update' THEN
      UPDATE SET *
    WHEN MATCHED AND c._change_type = 'delete' THEN
      DELETE
    WHEN NOT MATCHED THEN
      INSERT *;
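
Delta Live Tables also has a purpose-built CDC API, APPLY CHANGES, which generates this merge logic for you and handles out-of-order events. A minimal Python sketch, assuming the change feed lands as JSON files that carry an "id" key, a "_commit_timestamp" ordering column, and the "_change_type" column used above (all of these names are illustrative assumptions):

python
import dlt
from pyspark.sql.functions import col, expr

# Hedged sketch of the DLT APPLY CHANGES (apply_changes) API; the path,
# key column, sequencing column, and change-type column are placeholders.
@dlt.view
def cdc_source():
    # Incrementally ingest the raw change files with Auto Loader.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/path/to/cdc/files")
    )

# Target streaming table that APPLY CHANGES keeps in sync with the source.
dlt.create_streaming_table("target_table")

dlt.apply_changes(
    target="target_table",
    source="cdc_source",
    keys=["id"],                               # primary key of the source table
    sequence_by=col("_commit_timestamp"),      # ordering for late/out-of-order events
    apply_as_deletes=expr("_change_type = 'delete'"),
    except_column_list=["_change_type", "_commit_timestamp"],
)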

Step 4: Schedule Synchronization

  • Use Databricks Workflows to schedule your DLT pipelines to run periodically and keep the target data synchronized.
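
If you would rather script the schedule than configure it in the Workflows UI, the Jobs API can attach a cron schedule to an existing DLT pipeline. A rough sketch against the Jobs 2.1 REST API; the workspace URL, token secret, and pipeline ID are placeholders you would supply:

python
import requests

# Hedged sketch: create a scheduled job that runs an existing DLT pipeline.
host = "https://<workspace-url>"
token = dbutils.secrets.get(scope="<secret-scope>", key="<token-key>")

job_spec = {
    "name": "sync-sqlserver-to-databricks",
    "tasks": [
        {
            "task_key": "run_cdc_pipeline",
            "pipeline_task": {"pipeline_id": "<dlt-pipeline-id>"},
        }
    ],
    # Run every 15 minutes; tune the cron expression to your latency requirements.
    "schedule": {
        "quartz_cron_expression": "0 0/15 * * * ?",
        "timezone_id": "UTC",
    },
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # contains the new job_id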

Step 5: Monitor and Optimize

  • Monitor the pipeline using the Databricks Jobs and DLT dashboard for errors and performance issues.
  • Optimize the process by using partitioning, Z-order indexing, and caching.
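
The Delta maintenance commands behind those suggestions can be run directly from a notebook or scheduled as a separate task. A small sketch; the table name and Z-order column are the placeholders used earlier:

python
# Hedged sketch of routine maintenance on the replicated Delta table.

# Compact small files and co-locate rows by the join/filter key.
spark.sql("OPTIMIZE target_table ZORDER BY (id)")

# Remove data files no longer referenced by the table (default retention: 7 days).
spark.sql("VACUUM target_table")

# Cache the table for repeated interactive queries on this cluster.
spark.sql("CACHE TABLE target_table")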


This approach allows you to maintain a replica of your external database in Databricks with minimal latency and high accuracy.

RiyazAli
Valued Contributor

Hey @peritus ,

I would suggest using Lakehouse Federation and creating a DLT pipeline that reads the tables as materialized views. If you trigger a refresh of that pipeline at a scheduled interval, you should have the SQL Server data replicated in Databricks. A rough sketch is below.
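
A rough sketch of what that could look like, assuming a Unity Catalog connection and foreign catalog are set up for the SQL Server instance and the pipeline defines the table as a materialized view; all names, options, and credentials below are illustrative placeholders:

python
import dlt

# One-time setup (run outside the pipeline): register the SQL Server connection
# and expose it as a foreign catalog in Unity Catalog.
spark.sql("""
  CREATE CONNECTION IF NOT EXISTS sqlserver_conn TYPE sqlserver
  OPTIONS (host '<hostname>', port '1433', user '<username>', password '<password>')
""")
spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS sqlserver_cat
  USING CONNECTION sqlserver_conn
  OPTIONS (database '<database>')
""")

# In the DLT pipeline: a materialized view that re-snapshots the federated
# table on every scheduled pipeline refresh.
@dlt.table(name="source_table_replica", comment="Replica of SQL Server dbo.source_table")
def source_table_replica():
    return spark.read.table("sqlserver_cat.dbo.source_table")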

Riz
