Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to implement a source-to-target ETL mapping sheet in PySpark using Delta tables

anonymous1
New Contributor III

Schema design:

Source: multiple CSV files (SourceFile1, SourceFile2)

Target: a Delta table (Target_Table)

Excel file: ETL_Mapping_Sheet

File columns: SourceTable, SourceColumn, TargetTable, TargetColumn, MappingLogic

The MappingLogic column contains SQL statements such as SELECT * FROM TABLE or SELECT * FROM SourceFile1 A LEFT JOIN SourceFile2 B ON A.ID = B.ID.

Question: How can I use the MappingLogic column values in a DataFrame to build the mapping logic?

Can I directly execute a SQL statement using the column values?

My approach:

  1. Load the Excel file into a DataFrame (df_mapping)
  2. Assign the value of the MappingLogic column (a SQL SELECT statement) to a variable
  3. Call spark.sql(variablename), which should execute the SQL query -- not 100% sure how to do this (rough sketch below)
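
Rough sketch of what I have in mind (not 100% sure this is right; it assumes the sheet can be read with pandas/openpyxl and that the source CSVs are registered as temp views so the SQL in MappingLogic can refer to them, and the file paths are just placeholders):

import pandas as pd

# Read the mapping sheet (path and openpyxl availability are assumptions)
pdf_mapping = pd.read_excel("/dbfs/FileStore/ETL_Mapping_Sheet.xlsx")
df_mapping = spark.createDataFrame(pdf_mapping)

# Register the source CSVs as temp views so the MappingLogic SQL can reference them
spark.read.option("header", True).csv("/path/SourceFile1.csv").createOrReplaceTempView("SourceFile1")
spark.read.option("header", True).csv("/path/SourceFile2.csv").createOrReplaceTempView("SourceFile2")

# Pull one MappingLogic value (a string, not a Row) and execute it
mapping_sql = df_mapping.select("MappingLogic").collect()[0][0]
df_result = spark.sql(mapping_sql)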

Update: added sample rows from the ETL mapping sheet:

[screenshot: sample rows from the ETL mapping sheet]

7 REPLIES

anonymous1
New Contributor III

@Aman Sehgal​   @Hubert Dudek​ @Piper Wilson​ @Werner Stinckens​ 

Can someone please check this question?

-werners-
Esteemed Contributor III

I struggle to understand the question, so please correct me here:

If I understand correctly, you have an Excel file filled with SQL expressions (or source-to-sink field mappings) and you want to use the content of that Excel file in your Spark code?

Technically I think it is possible: you could read the Excel file into Python or into a Spark DataFrame and extract the values (e.g. with the collect() function).

But is this really the way you want to go? Because basically you are putting your mapping logic into an Excel file, which is opening the gates to hell IMO.

I would rather go for a selectExpr() expression. That way the mappings reside in code, you can check them into git, have versioning, etc.
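
For example, a minimal sketch of what that could look like (the file paths and column names here are illustrative, not taken from your actual mapping sheet):

from pyspark.sql import functions as F

df_source1 = spark.read.option("header", True).csv("/path/to/SourceFile1.csv")
df_source2 = spark.read.option("header", True).csv("/path/to/SourceFile2.csv")

df_target = (
    df_source1.alias("A")
    .join(df_source2.alias("B"), F.expr("A.ID = B.ID"), "left")
    .selectExpr(
        "A.ID AS id",                          # straight rename
        "A.some_col AS target_col1",           # hypothetical mapping
        "upper(B.other_col) AS target_col2",   # SQL expression as mapping logic
    )
)

df_target.write.format("delta").mode("overwrite").saveAsTable("Target_Table")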

anonymous1
New Contributor III

Thanks for your response.

Your understanding is correct.

I updated the sample ETL mapping in the question.

As you can see, the mapping sheet contains SQL statements to get the target values, and I have 500 mappings like this, so I was thinking of using the mapping sheet directly for the logic.

Don't you think it would be a good approach?

-werners-
Esteemed Contributor III

A wise man once said: violence and excel are never the answer 🙂

The issue with the excel approach is that it will be hard to figure out the data lineage.

You also have to consult two locations: the notebook and the Excel file.

Also, what if someone else opens the Excel file when you have to edit it? Stuff like that.

IMO Excel is good for data analysis; it does not belong in data engineering.

Hubert-Dudek
Esteemed Contributor III

I think you can construct the SQL queries and use a loop to fill them in from your mapping sheet:

spark.sql(f"INSERT INTO {Target} ....

Or even better, use MERGE INTO
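
A minimal sketch of that loop (assuming df_mapping holds the sheet, each MappingLogic value is a complete SELECT, and source and target share an ID key for the MERGE):

for row in df_mapping.select("TargetTable", "MappingLogic").collect():
    target, source_query = row["TargetTable"], row["MappingLogic"]

    # Straight insert of the mapped result into the target Delta table
    spark.sql(f"INSERT INTO {target} {source_query}")

    # Or, for incremental loads, a MERGE INTO built the same way
    # (assumes an ID key column exists in both source and target):
    # spark.sql(f"""
    #     MERGE INTO {target} AS t
    #     USING ({source_query}) AS s
    #     ON t.ID = s.ID
    #     WHEN MATCHED THEN UPDATE SET *
    #     WHEN NOT MATCHED THEN INSERT *
    # """)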

AmanSehgal
Honored Contributor III

Following on from @Werner Stinckens's response: if you can give an example, that would be good.

Ideally you can read each row from the Excel file in Python and pass each column as a parameter to a function.

E.g. def apply_mapping_logic(SourceTable, SourceColumn, TargetTable, TargetColumn, MappingLogic)

Within this function you can define what you would like to do with the mapping logic.

Again, to do this you'll have to come up with logic that covers the different types of mapping logic you have in your Excel file.
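
One hedged sketch of what that function might look like (the body is just one possible interpretation; it assumes each MappingLogic value is a runnable SELECT and that the tables or views it references already exist in the session):

def apply_mapping_logic(SourceTable, SourceColumn, TargetTable, TargetColumn, MappingLogic):
    # Run the mapping SQL from the sheet (assumes it is a complete SELECT)
    df_mapped = spark.sql(MappingLogic)
    # Keep the mapped column and rename it to the target column name
    df_mapped = df_mapped.selectExpr(f"{SourceColumn} AS {TargetColumn}")
    # Append into the target Delta table (merge or overwrite are alternatives)
    df_mapped.write.format("delta").mode("append").saveAsTable(TargetTable)

# Drive it from the mapping sheet, one row at a time
for row in df_mapping.collect():
    apply_mapping_logic(row["SourceTable"], row["SourceColumn"],
                        row["TargetTable"], row["TargetColumn"],
                        row["MappingLogic"])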

anonymous1
New Contributor III

@Aman Sehgal

Thanks for your response, I updated the sample mapping example.

I already have the mapping logic in the mapping sheet, so do I still need this extra function now ... can I directly store this SQL logic in a variable and execute it like below?

PySpark code:

variable = df_mapping.select("MappingLogic").collect()[0][0]  # [0][0] returns the string itself, not a Row

df_spark_sql = spark.sql(variable)

And after that, if I want to perform any further operations, I can easily do them on the df_spark_sql DataFrame.
