Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Identify source of data in query

turagittech
Contributor

Hi All,

I have an issue. I have several databases with the same schema that I need to source data from, and they will eventually be aggregated in a data warehouse. The problem is that the id column in each database means something different: for example, clientid 1 in one database is not the same client as clientid 1 in the others. With Data Factory I can add the source database as a column, but I want to move away from Data Factory for this project, as I think Databricks may be more cost effective as a total solution for ETL and the data warehouse.

I have created a connection to the source databases under External Data and have an associated schema in the catalog. This looks like a simple solution in itself, but I can't see how, as I copy data into other tables, I would identify which connection and schema the data is being sourced from using SQL. Perhaps this is a Python job.

Any recommendations appreciated

1 REPLY

mark_ott
Databricks Employee

Migrating from Data Factory to Databricks for ETL and warehousing is a solid choice, especially for flexibility and cost-effectiveness in data engineering projects. The core issue—disambiguating “id” fields that are only unique within each source database—is a common challenge in multi-source data consolidation. Here’s a concise overview and actionable architecture that solves your problem:

Key Recommendation

Always create a surrogate key or composite key in your target tables that uniquely identifies the origin of each row, by combining the original "id" with a unique database (or source system) identifier.


Why This Matters

  • Source context is lost if you only store the native id in the warehouse. “1” in DB1 ≠ “1” in DB2.

  • Keeping track of data provenance (which database, which schema, etc.) is critical for auditing and downstream data accuracy.


Options for Including Source Metadata in Databricks

1. During ETL with PySpark / Databricks SQL

  • Read each source database separately (using the External Data connection or JDBC).

  • Add a new column (e.g., source_db or source_schema) when loading the DataFrame.

  • Write to the warehouse with this extra column, or use it to construct a composite key (e.g., source_db || '-' || clientid).

Example in PySpark:

python
from pyspark.sql.functions import lit, concat

# Read source
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "clients")
    .option("user", user)
    .option("password", password)
    .load()
)

# Add source identifier
df_with_source = df.withColumn("source_db", lit("DB1"))

# Optionally, create a composite key
df_with_source = df_with_source.withColumn(
    "warehouse_client_id",
    concat(df_with_source["source_db"], lit("_"), df_with_source["clientid"])
)

# Write to warehouse
df_with_source.write.format("delta").mode("append").save("/mnt/warehouse/clients")
  • Repeat for each database, adjusting the source_db value.

2. SQL-Only Approach

  • If you are using Databricks SQL with catalogs, you can reference the originating catalog and schema in your queries, but you must copy that information into your target tables during ETL; unlike Data Factory's data source mapping features, there is no automatic "source" column. For example:
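
A minimal sketch of that pattern, run through spark.sql from a notebook (the same statement also works in a %sql cell or a Databricks SQL query). The three-level names sqlserver_db1.dbo.clients and warehouse.staging.clients, and the client_name column, are placeholders for whatever you created under External Data and in your warehouse schema:

python
# Hypothetical names: sqlserver_db1.dbo.clients is the foreign catalog table created
# under External Data; warehouse.staging.clients is the target table in the warehouse.
spark.sql("""
    INSERT INTO warehouse.staging.clients (source_db, clientid, warehouse_client_id, client_name)
    SELECT
        'DB1'                        AS source_db,            -- literal recorded per source connection
        clientid,
        concat('DB1', '_', clientid) AS warehouse_client_id,  -- composite key
        client_name                                           -- plus any other source columns
    FROM sqlserver_db1.dbo.clients
""")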

3. Automation

  • Automate the above with a loop in your Databricks notebook or job, iterating over your connection configs and adding the connection/source name each time, as sketched below.
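
A rough sketch of that loop, assuming JDBC sources. The connection details, secret scope, and target table name are made-up placeholders; in practice keep credentials in a secret scope rather than in the notebook:

python
from pyspark.sql.functions import concat, lit

# Placeholder connection configs -- replace with your real hosts/databases.
sources = [
    {"name": "DB1", "url": "jdbc:sqlserver://host1;databaseName=sales"},
    {"name": "DB2", "url": "jdbc:sqlserver://host2;databaseName=sales"},
]

for src in sources:
    df = (
        spark.read.format("jdbc")
        .option("url", src["url"])
        .option("dbtable", "clients")
        .option("user", dbutils.secrets.get("jdbc_scope", "user"))      # assumed secret scope/keys
        .option("password", dbutils.secrets.get("jdbc_scope", "password"))
        .load()
    )

    (
        df.withColumn("source_db", lit(src["name"]))
          .withColumn(
              "warehouse_client_id",
              concat(lit(src["name"]), lit("_"), df["clientid"].cast("string")),
          )
          .write.format("delta")
          .mode("append")
          .saveAsTable("warehouse.staging.clients")  # assumed target table
    )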


Best Practices

  • Never rely on native id fields alone for cross-source uniqueness.

  • Always include a source identifier (database, schema, or ETL job name) in your staged/landing tables so you can trace each row’s origin.

  • Consider using a surrogate key (a warehouse-side unique id) if you must join across sources, but keep the original id and its source as columns for traceability (see the sketch after this list).

  • Document source system mappings in your data catalog or metadata store.
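
If you do go down the surrogate key path, one common pattern is a hashed key derived from the source identifier plus the native id (a Delta IDENTITY column on the target table is another option). A minimal sketch, assuming the table and column names used in the examples above:

python
from pyspark.sql.functions import sha2, concat_ws, col

# Assumed table/column names from the examples above.
clients = spark.table("warehouse.staging.clients")

clients_with_sk = clients.withColumn(
    "client_sk",
    # Hash source_db + clientid into a stable warehouse-side key;
    # source_db and clientid stay in the table so every row remains traceable to its origin.
    sha2(concat_ws("||", col("source_db"), col("clientid").cast("string")), 256),
)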


Summary Table

Data Field            Description
clientid              Native client id from the source DB
source_db             Which source database the row came from (e.g., "DB1")
warehouse_client_id   Composite key: source_db + clientid

Final Advice

  • Python jobs in Databricks give you full control to annotate and move data, solving your issue and avoiding ambiguity.

  • Databricks’ orchestration (with jobs, workflows, and Delta Lake) is a strong, scalable alternative to Data Factory for these use cases.