cancel
Showing results for 
Search instead for 
Did you mean: 
Get Started Discussions
cancel
Showing results for 
Search instead for 
Did you mean: 

cross join issue generating surrogate keys in delta table

Ramacloudworld
New Contributor

I used below code to populate target table, it is working as expected but expect for surrogatekey column. After I inserted dummy entry -1 and did run merge code, it is generating the numbers in surrogatekey column like this 1,3,5,7,9 odd numbers, it is really strange, but if I don't insert -1 entry then it is generating like this 1,2,3,4,5,6,7 but I need dummy entry for all dimensions to capture unknow data. Please help me on this. 

 

is cross join causing issue? but not sure

 

%sql
CREATE TABLE xxxx.target_table (
general_sk BIGINT NOT NULL GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1),
ethnicity_code STRING,
ethnicity_description STRING,
indicator_code STRING,
indicator_description STRING,
effective_from DATE,
effective_to DATE)

%sql
insert into xxxx.target_table(gneral_sk,ethnicity_code,ethnicity_description,indicator_code,indicator_description,effective_from,effective_to)
SELECT -1,-1,'Missing (Not Applicable/ Not Known)',-1,'Missing (Not Applicable/ Not Known)','1900-01-01','9999-12-31'

 

df = spark.sql(f""" select distinct
et.code as ethnicity_code,
et.description as ethnicity_description,
hl.code as indicator_code,
hl.description as indicator_description
from
(select * from xxx.lookup where tablename in ('ethnicity') ) as et
cross join (select * from xxx.lookup where tablename in ('indicator')) as hl
""")

df.createOrReplaceTempView("updated_rows")

merge into xxxx.target_table trg
using (
select distinct
md5(concat(ethnicity_code,indicator_code)) as merge_key,
ethnicity_code,
ethnicity_description,
indicator_code,
indicator_description

from
updated_rows st
union all
select distinct
null as merge_key,
st.ethnicity_code,
st.ethnicity_description,
st.indicator_code,
st.indicator_description
from
updated_rows st
join
xxxx.target_table trg
on
md5(concat(trg.ethnicity_code,trg.indicator_code)) = md5(concat(st.ethnicity_code,st.indicator_code))
where coalesce(trg.ethnicity_description,'-1') != coalesce(st.ethnicity_description,'-1')
and coalesce(trg.indicator_description,'-1') != coalesce(st.indicator_description,'-1')

) st on md5(concat(coalesce(trg.ethnicity_code,'-1') , coalesce(trg.indicator_code,'-1'))) = st.merge_key
when matched and coalesce(trg.ethnicity_description,'-1') != coalesce(st.ethnicity_description,'-1')
and coalesce(trg.indicator_description,'-1') != coalesce(st.indicator_description,'-1') then
update set trg.effective_to = current_date()-1
when not matched then
insert (ethnicity_code,ethnicity_description,indicator_code,indicator_description,effective_from,effective_to)
values (st.ethnicity_code,st.ethnicity_description,st.indicator_code,st.indicator_description,current_date(),'9999-12-31')

1 REPLY 1

Kaniz
Community Manager
Community Manager

Hi @RamacloudworldIt seems you’re encountering an issue with the surrogate key generation in your merge code. Let’s break it down and address the problem.

  1. Surrogate Keys:

    • A surrogate key is a system-assigned unique value used to identify an entity occurrence. It’s typically an artificial key (e.g., an auto-incrementing integer or a GUID) that has no business meaning.
    • In your case, the general_sk column serves as the surrogate key for your target_table.
  2. Natural Keys:

    • A natural key is what the business uses to identify an entity occurrence. It’s based on meaningful business attributes (e.g., SSN, birthdate).
    • You’ve identified the natural key for your data as the combination of ethnicity_code and indicator_code.
  3. Issue Description:

    • When you insert a dummy entry with -1 values and then run the merge code, the surrogate key values become odd numbers (1, 3, 5, 7, 9).
    • However, if you omit the dummy entry, the surrogate key values follow a sequential pattern (1, 2, 3, 4, 5, 6, 7).
  4. Possible Causes:

    • The behavior you’re observing could be due to how the database system generates surrogate keys.
    • Cross joins (also known as Cartesian products) can indeed affect the order of generated keys.
  5. Recommendations:

    • Consider using a consistent approach for generating surrogate keys:
      • If you want sequential integers, avoid cross joins and ensure that the dummy entry doesn’t disrupt the sequence.
      • Alternatively, use a hash-based approach (e.g., MD5) to generate surrogate keys consistently across different source systems.
    • Verify that the merge logic correctly handles the dummy entry and existing data.
  6. Example Approach:

    • Create a separate table (e.g., Person_Key) in your staging area to map source keys to surrogate keys.
    • Use the natural key (SSN + birthdate) to identify individuals across different source systems.
    • When integrating new data (e.g., from System D), look up the natural key in the Person_Key table to find the corresponding surrogate key.

Remember that surrogate keys should be stable and consistent, regardless of the source system. If you encounter further issues, consider adjusting your approach based on the specific requirements of your data integration process. 🤝

 
Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.