Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Identify source of data in query

turagittech
Contributor

Hi All,

I have an issue. I have several databases with the same schema that I need to source data from, and they will eventually be aggregated in a data warehouse. The problem is that the id column in each database means something different: for example, clientid 1 in one database is not the same client as clientid 1 in the others. With Data Factory I can add the source database as a column, but I want to move away from Data Factory for this project, as I think Databricks may be more cost effective as a total solution for ETL and the data warehouse.

I have created a connection to the source databases under External Data and have an associated schema in the catalog. This looks like a simple solution in itself, but I can't see how, as I copy data into other tables, I would identify which connection and schema the data is being sourced from using SQL. Perhaps this is a Python job.

Any recommendations appreciated

1 REPLY

mark_ott
Databricks Employee

Migrating from Data Factory to Databricks for ETL and warehousing is a solid choice, especially for flexibility and cost-effectiveness in data engineering projects. The core issue—disambiguating “id” fields that are only unique within each source database—is a common challenge in multi-source data consolidation. Here’s a concise overview and actionable architecture that solves your problem:

Key Recommendation

Always create a surrogate key or composite key in your target tables that uniquely identifies the origin of each row, by combining the original "id" with a unique database (or source system) identifier.


Why This Matters

  • Source context is lost if you only store the native id in the warehouse. “1” in DB1 ≠ “1” in DB2.

  • Keeping track of data provenance (which database, which schema, etc.) is critical for auditing and downstream data accuracy.


Options for Including Source Metadata in Databricks

1. During ETL with PySpark / Databricks SQL

  • Read each source database separately (using the External Data connection or JDBC).

  • Add a new column (e.g., source_db or source_schema) when loading the DataFrame.

  • Write to the warehouse with this extra column, or use it to construct a composite key (e.g., source_db || '-' || clientid).

Example in PySpark:

python
from pyspark.sql.functions import lit, concat

# Read source
df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "clients")
    .option("user", user)
    .option("password", password)
    .load()
)

# Add source identifier
df_with_source = df.withColumn("source_db", lit("DB1"))

# Optionally, create a composite key
df_with_source = df_with_source.withColumn(
    "warehouse_client_id",
    concat(df_with_source["source_db"], lit("_"), df_with_source["clientid"])
)

# Write to warehouse
df_with_source.write.format("delta").mode("append").save("/mnt/warehouse/clients")
  • Repeat for each database, adjusting the source_db value.

2. SQL-Only Approach

  • If you are using Databricks SQL with catalogs, you can reference the originating catalog and schema in your queries, but you must copy that information into your target tables during ETL; unlike Data Factory's data source mapping features, there is no automatic "source" column. For example:
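
A minimal sketch of that pattern, run through spark.sql from a notebook (the same statement also works in a %sql cell or a Databricks SQL query). The three-level names sqlserver_db1.dbo.clients and warehouse.staging.clients, and the client_name column, are placeholders for whatever you created under External Data and in your warehouse schema:

python
# Hypothetical names: sqlserver_db1.dbo.clients is the foreign catalog table created
# under External Data; warehouse.staging.clients is the target table in the warehouse.
spark.sql("""
    INSERT INTO warehouse.staging.clients (source_db, clientid, warehouse_client_id, client_name)
    SELECT
        'DB1'                        AS source_db,            -- literal recorded per source connection
        clientid,
        concat('DB1', '_', clientid) AS warehouse_client_id,  -- composite key
        client_name                                           -- plus any other source columns
    FROM sqlserver_db1.dbo.clients
""")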

3. Automation

  • Automate the above with a loop in your Databricks notebook or job, iterating over your connection configs and adding the connection/source name each time, as sketched below.
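
A rough sketch of that loop, assuming JDBC sources. The connection details, secret scope, and target table name are made-up placeholders; in practice keep credentials in a secret scope rather than in the notebook:

python
from pyspark.sql.functions import concat, lit

# Placeholder connection configs -- replace with your real hosts/databases.
sources = [
    {"name": "DB1", "url": "jdbc:sqlserver://host1;databaseName=sales"},
    {"name": "DB2", "url": "jdbc:sqlserver://host2;databaseName=sales"},
]

for src in sources:
    df = (
        spark.read.format("jdbc")
        .option("url", src["url"])
        .option("dbtable", "clients")
        .option("user", dbutils.secrets.get("jdbc_scope", "user"))      # assumed secret scope/keys
        .option("password", dbutils.secrets.get("jdbc_scope", "password"))
        .load()
    )

    (
        df.withColumn("source_db", lit(src["name"]))
          .withColumn(
              "warehouse_client_id",
              concat(lit(src["name"]), lit("_"), df["clientid"].cast("string")),
          )
          .write.format("delta")
          .mode("append")
          .saveAsTable("warehouse.staging.clients")  # assumed target table
    )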


Best Practices

  • Never rely on native id fields alone for cross-source uniqueness.

  • Always include a source identifier (database, schema, or ETL job name) in your staged/landing tables so you can trace each row’s origin.

  • Consider using a surrogate key (a warehouse-side unique id) if you must join across sources, but keep the original id and its source as columns for traceability (see the sketch after this list).

  • Document source system mappings in your data catalog or metadata store.
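
If you do go down the surrogate key path, one common pattern is a hashed key derived from the source identifier plus the native id (a Delta IDENTITY column on the target table is another option). A minimal sketch, assuming the table and column names used in the examples above:

python
from pyspark.sql.functions import sha2, concat_ws, col

# Assumed table/column names from the examples above.
clients = spark.table("warehouse.staging.clients")

clients_with_sk = clients.withColumn(
    "client_sk",
    # Hash source_db + clientid into a stable warehouse-side key;
    # source_db and clientid stay in the table so every row remains traceable to its origin.
    sha2(concat_ws("||", col("source_db"), col("clientid").cast("string")), 256),
)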


Summary Table

Data Field            Description
clientid              Native client id from the source DB
source_db             Which source database the row came from (e.g., "DB1")
warehouse_client_id   Composite key: source_db + clientid

Final Advice

  • Python jobs in Databricks give you full control to annotate and move data, solving your issue and avoiding ambiguity.

  • Databricks’ orchestration (with jobs, workflows, and Delta Lake) is a strong, scalable alternative to Data Factory for these use cases.