
How to implement Source to Target ETL Mapping sheet in PySpark using Delta tables

thewfhengineer — Thu, 01 Sep 2022 23:46:06 GMT

Schema Design :

Source : Multiple CSV files like (SourceFile1, SourceFile2)

Target : Delta Table like (Target_Table)

Excel File : ETL_Mapping_Sheet

File Columns : SourceTable ,SourceColumn, TargetTable, TargetColum , MappingLogic

The MappingLogic column contains SQL statements such as

SELECT * FROM TABLE

or

SELECT * FROM SourceFile1 A LEFT JOIN SourceFile2 B

ON A.ID = B.ID

Que : How can I use the MappingLogic column values in a dataframe to build the mapping logic?

Can I directly execute a SQL statement taken from the column values?

My Approach :

Load the Excel file into a dataframe (df_mapping)
Assign the values of the MappingLogic column (SQL SELECT statements) to a variable
Call spark.sql(variablename) to execute the SQL query -- not 100% sure how to do this
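The three steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function names `register_csv_views` and `run_mapping_queries` are hypothetical, `spark` is assumed to be an existing SparkSession, and `df_mapping` is assumed to be a Spark DataFrame that already holds the mapping sheet. The CSV files are registered as temp views first so that names such as SourceFile1 inside the SQL text resolve.

```python
def register_csv_views(spark, csv_paths):
    """Register each CSV file as a temp view so the SQL text can refer to it by name.

    csv_paths is a hypothetical dict such as {"SourceFile1": "/path/to/SourceFile1.csv"}.
    """
    for view_name, path in csv_paths.items():
        df = spark.read.option("header", True).csv(path)
        df.createOrReplaceTempView(view_name)


def run_mapping_queries(spark, df_mapping, column="MappingLogic"):
    """Execute every non-empty SQL statement stored in the mapping DataFrame."""
    results = []
    # collect() brings the (small) mapping sheet to the driver as Row objects;
    # row[column] extracts the SQL string from each Row.
    for row in df_mapping.select(column).collect():
        sql_text = row[column]
        if sql_text and sql_text.strip():
            results.append(spark.sql(sql_text))
    return results
```

One way to get `df_mapping` in the first place is to read the sheet with pandas (`pandas.read_excel`, which needs an Excel engine such as openpyxl installed) and pass the result to `spark.createDataFrame`. Note that whoever can edit the sheet can then run arbitrary SQL through this loop.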

Updated with sample rows from the ETL mapping sheet :

Re: How to implement Source to Target ETL Mapping sheet in PySpark using Delta tables

thewfhengineer — Mon, 19 Sep 2022 22:48:46 GMT

@Aman Sehgal @Hubert Dudek @Piper Wilson @Werner Stinckens

Can someone please check this query?

Re: How to implement Source to Target ETL Mapping sheet in PySpark using Delta tables

-werners- — Tue, 20 Sep 2022 07:46:39 GMT

I struggle to understand the question, so please correct me here:

If I understand correctly, you have an Excel file filled with SQL expressions (or source-to-sink field mappings) and want to use the content of that file in your Spark code?

Technically I think it is possible: you could read the Excel file into Python or into a Spark DF and extract the values (e.g. with the collect() function).

But is this really the way you want to go? You would basically be putting your mapping logic into an Excel file, which is opening the gates to hell IMO.

I would rather go for a selectExpr() expression. That way the mappings reside in code, and you can check them into git, have versioning, etc.
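To illustrate the selectExpr() idea, here is a minimal sketch. The mapping dictionary, its column names and the two helper functions are all hypothetical; only `DataFrame.selectExpr` is the real PySpark call. The mappings live in code as "target column -> SQL expression" pairs and are applied in a single projection.

```python
# Hypothetical mappings kept in version-controlled code:
# target column name -> SQL expression over the source columns.
CUSTOMER_MAPPINGS = {
    "customer_id": "CAST(ID AS BIGINT)",
    "full_name": "concat_ws(' ', FIRST_NAME, LAST_NAME)",
}


def build_select_exprs(mappings):
    """Turn {target: expression} into 'expression AS target' strings."""
    return [f"{expr} AS {target}" for target, expr in mappings.items()]


def apply_mappings(source_df, mappings):
    """Project the source DataFrame onto the target columns with selectExpr()."""
    return source_df.selectExpr(*build_select_exprs(mappings))
```

With a real source DataFrame this would be called as `apply_mappings(source_df, CUSTOMER_MAPPINGS)`, and any change to a mapping shows up as an ordinary diff in git.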

Re: How to implement Source to Target ETL Mapping sheet in PySpark using Delta tables

AmanSehgal — Tue, 20 Sep 2022 07:55:34 GMT

Following on from @Werner Stinckens' response, it would be good if you could give an example.

Ideally you can read each row from the Excel file in Python and pass each column as a parameter to a function.

E.g. def apply_mapping_logic(SourceTable, SourceColumn, TargetTable, TargetColum, MappingLogic)

Within this function you can define what you would like to do with the mapping logic.

Again, to do this you'll have to come up with logic that handles the different types of mapping logic you have in your Excel file.
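One possible shape for such a function is sketched below. The three cases it distinguishes (empty logic, a full SELECT statement, a bare column expression) are an assumption about what the sheet might contain, not something taken from the actual mapping file, and the snake_case parameter names are illustrative.

```python
def apply_mapping_logic(source_table, source_column, target_table,
                        target_column, mapping_logic):
    """Return (target_table, select_statement) for one row of the mapping sheet."""
    logic = (mapping_logic or "").strip()
    if not logic:
        # Assumed case 1: no logic given, so copy the column straight across.
        query = f"SELECT {source_column} AS {target_column} FROM {source_table}"
    elif logic.upper().startswith("SELECT"):
        # Assumed case 2: the cell already holds a full SQL statement.
        query = logic
    else:
        # Assumed case 3: the cell holds an expression over the source table.
        query = f"SELECT {logic} AS {target_column} FROM {source_table}"
    return target_table, query
```

The returned statement can then be passed to spark.sql(), and the target table name tells the caller where the result should be written.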

Re: How to implement Source to Target ETL Mapping sheet in PySpark using Delta tables

thewfhengineer — Tue, 20 Sep 2022 10:48:20 GMT

Thanks for your response.

Your understanding is correct.

I updated the sample ETL mapping in the question.

As you can see, this mapping sheet contains SQL statements to get the target values, and I have 500 mappings like this, so I was thinking of using the mapping sheet directly for the logic.

Don't you think that would be a good approach?

Re: How to implement Source to Target ETL Mapping sheet in PySpark using Delta tables

thewfhengineer — Tue, 20 Sep 2022 10:55:39 GMT

@Aman Sehgal

Thanks for your response, I updated the sample mapping example.

I already have the mapping logic in the mapping sheet, so do I still need this extra function? Can I store the SQL logic in a variable and execute it directly, like below?

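PySpark code :

# collect() returns Row objects, so take the string out of the first Row by column name

variable = df.select("mappinglogic").collect()[0]["mappinglogic"]

df_spark_sql = spark.sql(variable)

After that, if I want to perform any further operations, I can easily do them on the df_spark_sql dataframe.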

Re: How to implement Source to Target ETL Mapping sheet in PySpark using Delta tables

-werners- — Tue, 20 Sep 2022 10:57:54 GMT

A wise man once said: violence and Excel are never the answer 🙂

The issue with the Excel approach is that it will be hard to figure out the data lineage.

You also have to consult two locations: the notebook and the Excel file.

Also, what if someone else has the Excel file open and you have to edit it? Stuff like that.

IMO Excel is good for data analysis; it does not belong in data engineering.

Re: How to implement Source to Target ETL Mapping sheet in PySpark using Delta tables

Hubert-Dudek — Tue, 20 Sep 2022 11:01:05 GMT

I think you can construct the SQL queries in a loop and fill in the values from your code.

spark.sql(f"INSERT INTO {Target} ....

Or even better, use MERGE INTO
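A rough sketch of the MERGE INTO variant follows. The helper names, the `key_column` parameter and its default of ID are assumptions (the mapping sheet described in the question has no key column), and `UPDATE SET *` / `INSERT *` only work when the source query returns the same column names as the Delta target.

```python
def build_merge_sql(target_table, source_query, key_column):
    """Build a Delta MERGE statement that upserts the query result into the target."""
    return (
        f"MERGE INTO {target_table} AS t "
        f"USING ({source_query}) AS s "
        f"ON t.{key_column} = s.{key_column} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def merge_all(spark, mappings, key_column="ID"):
    """Loop over (target_table, source_query) pairs and run one MERGE per pair.

    spark is assumed to be a SparkSession with Delta Lake available.
    """
    for target_table, source_query in mappings:
        spark.sql(build_merge_sql(target_table, source_query, key_column))
```

Compared with a plain INSERT INTO, re-running the loop does not duplicate rows that already exist in the target, as long as the key column really identifies a row.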