09-01-2022 04:46 PM
Schema Design:
Source: multiple CSV files (e.g. SourceFile1, SourceFile2)
Target: Delta table (e.g. Target_Table)
Excel file: ETL_Mapping_Sheet
File columns: SourceTable, SourceColumn, TargetTable, TargetColumn, MappingLogic
The MappingLogic column contains SQL statements such as
SELECT * FROM TABLE or
SELECT * FROM SourceFile1 A LEFT JOIN SourceFile2 B ON A.ID = B.ID
Question: How can I use the MappingLogic column values in a DataFrame to build the mapping logic?
Can I directly execute SQL statements using the column values?
My Approach:
Updated sample rows from the ETL mapping sheet:
09-19-2022 03:48 PM
@Aman Sehgal @Hubert Dudek @Piper Wilson @Werner Stinckens
Can someone please check this query?
09-20-2022 12:46 AM
I struggle to understand the question, so please correct me here:
If I understand correctly, you have an Excel file filled with SQL expressions (or source-to-sink field mappings) and want to use the content of that Excel file in your Spark code?
Technically I think it is possible: you could read the Excel file into Python or into a Spark DataFrame and extract the values (e.g. with the collect() function), something like the sketch below.
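Just a sketch of that idea; it assumes the workbook is reachable from the driver, pandas and openpyxl are available, and the path is made up:

import pandas as pd

# Hypothetical path; point this at wherever ETL_Mapping_Sheet actually lives (DBFS, a mount, a volume)
mapping_pdf = pd.read_excel("/dbfs/FileStore/ETL_Mapping_Sheet.xlsx")

# Turn it into a Spark DataFrame so the mappings can be inspected and filtered like any other table
mapping_df = spark.createDataFrame(mapping_pdf)

# Bring the rows back to the driver; each Row holds SourceTable, SourceColumn, TargetTable, TargetColumn, MappingLogic
mapping_rows = mapping_df.collect()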
But is this really the way you want to go? Because basically you put your mapping logic into an Excel file, which is opening the gates to hell IMO.
I would rather go for a selectExpr() expression, along the lines of the snippet below. That way the mappings reside in code, you can check them into git, have versioning, etc.
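For example (a sketch with a made-up file path and column names):

# Mappings kept in code (and therefore in git) instead of in Excel
source_df = spark.read.option("header", "true").csv("/path/to/SourceFile1.csv")

target_df = source_df.selectExpr(
    "ID AS Target_ID",             # straight rename
    "UPPER(Name) AS Target_Name"   # expression-based mapping
)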
09-20-2022 03:48 AM
Thanks for your response.
Your understanding is correct.
I updated the sample ETL mapping in the question.
As you can see, this mapping sheet contains SQL statements to get the target values, and I have 500 mappings like this, so I was thinking of using the mapping sheet directly for the logic.
Don't you think that would be a good approach?
09-20-2022 03:57 AM
A wise man once said: violence and Excel are never the answer 🙂
The issue with the Excel approach is that it will be hard to figure out the data lineage.
You also have to consult two locations: the notebook and the Excel file.
Also, what if someone else has the Excel file open when you have to edit it? Stuff like that.
IMO Excel is good for data analysis; it does not belong in data engineering.
09-20-2022 04:01 AM
I think you can construct the SQL queries in a loop and fill them in from your code, for example:
spark.sql(f"INSERT INTO {Target} ....
Or even better, use MERGE INTO
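Roughly like this (just a sketch; it assumes the mapping sheet is already loaded into a DataFrame called mapping_df, that every MappingLogic value is a complete SELECT statement over registered views, and that ID is the business key):

# Loop over the mapping rows from the sheet (column names are illustrative)
for row in mapping_df.collect():
    target_table = row["TargetTable"]
    mapping_sql = row["MappingLogic"]  # a full SELECT statement taken from the sheet

    # Simple load: append the mapped result into the target Delta table
    # spark.sql(f"INSERT INTO {target_table} {mapping_sql}")

    # Better for reruns: MERGE INTO so existing rows are updated instead of duplicated
    spark.sql(f"""
        MERGE INTO {target_table} t
        USING ({mapping_sql}) s
        ON t.ID = s.ID
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)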
09-20-2022 12:55 AM
Following on @Werner Stinckens' response, it would be good if you could give an example.
Ideally you can read each row from the Excel file in Python and pass each column as a parameter to a function.
E.g. def apply_mapping_logic(SourceTable, SourceColumn, TargetTable, TargetColumn, MappingLogic)
Within this function you can define what you would like to do with the mapping logic.
Again, to do this you'll have to come up with logic based on the different types of mapping logic you have in your Excel file.
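A rough sketch of what such a function could look like (it assumes each MappingLogic value is a full SELECT statement, that the source files are registered as temp views, and that appending to the target Delta table is acceptable):

def apply_mapping_logic(SourceTable, SourceColumn, TargetTable, TargetColumn, MappingLogic):
    """Apply one row of the mapping sheet."""
    if MappingLogic and MappingLogic.strip().upper().startswith("SELECT"):
        # The sheet already carries the full SQL, so just run it
        result_df = spark.sql(MappingLogic)
    else:
        # Fallback for simple column-to-column mappings without an explicit SQL statement
        result_df = spark.table(SourceTable).selectExpr(f"{SourceColumn} AS {TargetColumn}")

    # Write mode is just an example; overwrite or merge may fit better depending on the load pattern
    result_df.write.format("delta").mode("append").saveAsTable(TargetTable)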
09-20-2022 03:55 AM
@Aman Sehgal
Thanks for your response, I updated the sample mapping example.
I already have the mapping logic in the mapping sheet, so do I still need this extra function? Can I directly store this SQL logic in a variable and execute it like below?
PySpark code:
variable = df.select("mappinglogic").collect()[0][0]  # collect() returns Rows, so index into the Row to get the SQL string
df_spark_sql = spark.sql(variable)
And after that, if I want to perform any further operations, I can easily do them on the df_spark_sql DataFrame.
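For what it's worth, extending that to the whole sheet could look roughly like this (a sketch; paths and column names are illustrative, and the source CSVs have to be registered as temp views first so the SQL inside mappinglogic can resolve them):

# Register the source CSVs as temp views so the SQL in the mapping sheet can reference them by name
spark.read.option("header", "true").csv("/path/to/SourceFile1.csv").createOrReplaceTempView("SourceFile1")
spark.read.option("header", "true").csv("/path/to/SourceFile2.csv").createOrReplaceTempView("SourceFile2")

# Run every mapping from the sheet, not just the first row
for row in df.select("TargetTable", "mappinglogic").collect():
    result_df = spark.sql(row["mappinglogic"])
    result_df.write.format("delta").mode("append").saveAsTable(row["TargetTable"])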