Data Engineering
I'd like to start a discussion about techniques for joining two relatively large tables of roughly equal size on a daily basis. I realize this may be a bit of a conundrum with Databricks, but please review the details.

Michael42
New Contributor III

Input Data:

  • One batch load of a daily dataset, roughly 10 million items a day of transactions.
  • Another daily batch load of roughly the same size.
  • Each row in one dataset should have a corresponding row in the other dataset.

Problem to solve:

  • The problem is to identify which rows in each dataset are related.
  • This is a matching problem that could be described as a series of one-to-one inner joins between the two datasets.
  • Join columns are always account/date/and amount.
  • However, there are also varying matching rules on additional columns of each dataset that need to be explored when multiple rows match on either side under the basic rule above — for example, when multiple rows on each side have the same account and date, but carry amount values that are very common.
  • In addition, sometimes a matching item from one dataset on any given day may not arrive in the other dataset until the following day, or perhaps even several days later.
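Ignoring Spark specifics for a moment, the core of the basic rule above can be sketched in plain Python (column names `account`/`date`/`amount` come from the description; the handling of ambiguous groups is an assumption — here they are simply deferred to the additional tie-break rules):

```python
from collections import defaultdict

def match_one_to_one(left, right):
    """Match rows on (account, date, amount); rows are dicts.
    Returns (matched_pairs, left_unmatched, right_unmatched).
    Groups with multiple candidates on either side are left
    unmatched here -- they need the additional matching rules."""
    def group(rows):
        g = defaultdict(list)
        for r in rows:
            g[(r["account"], r["date"], r["amount"])].append(r)
        return g

    lg, rg = group(left), group(right)
    matched, l_rest, r_rest = [], [], []
    for key, lrows in lg.items():
        rrows = rg.get(key, [])
        if len(lrows) == 1 and len(rrows) == 1:
            matched.append((lrows[0], rrows[0]))
        else:
            l_rest.extend(lrows)
            r_rest.extend(rrows)
    # right-side keys never seen on the left stay unmatched
    for key, rrows in rg.items():
        if key not in lg:
            r_rest.extend(rrows)
    return matched, l_rest, r_rest
```

In Spark this would be an inner join on the three columns plus a window count to detect ambiguous groups, but the logic is the same.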

Output:

  • Matched rows: Once a unique set of one-to-one matches is found, it can be offloaded to a result set of matched rows.
  • Unmatched rows: remaining unmatched rows from either side should be added to tomorrow's dataset for subsequent matching on a later day.
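The carry-forward described above can be as simple as pooling yesterday's leftovers into today's batch before matching. A minimal sketch (the greedy exact-key matcher is a placeholder for the real rules; in Databricks the pools would likely be Delta tables rather than lists):

```python
def simple_matcher(left, right):
    # Greedily pair rows on an exact key; stands in for the real rules.
    matched, r_pool, l_rest = [], list(right), []
    for l in left:
        hit = next((r for r in r_pool if r["key"] == l["key"]), None)
        if hit is not None:
            matched.append((l, hit))
            r_pool.remove(hit)
        else:
            l_rest.append(l)
    return matched, l_rest, r_pool

def run_day(todays_left, todays_right, carry_left, carry_right):
    """One daily cycle: pool carried-over unmatched rows with today's
    batch, match, and return matches plus the new carry-over sets."""
    return simple_matcher(carry_left + todays_left,
                          carry_right + todays_right)
```

A row that arrives on day 1 with no counterpart simply rides along in the carry-over until its match shows up days later.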

Curious what techniques someone might use to solve this with Databricks.

I am new to Databricks, having only recently received a Data Engineering Associate cert.

There are additional nuances to the problem I can get into if the discussion proceeds further. Apologies if this discussion seems inappropriate for this forum.

2 REPLIES

Lennart
New Contributor II

I've dealt with something similar in the past.

There was an order system with order items that were supposed to be matched up against corresponding products in another system, which acted as a master and handled invoicing.

As for unique considerations for doing this with Databricks, maybe see if you can get the rows that need to be joined onto the same worker, so you can avoid shuffling after the initial read?
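In Spark terms this co-location amounts to repartitioning both DataFrames on the join key (e.g. `df.repartition(n, "account")`) so that candidate rows land in the same partition before matching. The effect can be sketched in plain Python with hash-bucketing, which is essentially what Spark's hash partitioner does:

```python
def bucket_by_key(rows, n_buckets, key="account"):
    """Assign each row to a bucket by hashing its join key.
    Rows that could ever match each other always share a bucket,
    so each bucket can be matched independently, with no shuffle."""
    buckets = [[] for _ in range(n_buckets)]
    for r in rows:
        buckets[hash(r[key]) % n_buckets].append(r)
    return buckets
```

Bucketing both tables on write (or using a well-chosen partitioning column in Delta) gets the same benefit persistently across the daily runs.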

Another thing to watch out for with this kind of problem is rules where you have multiple candidates for matching, but where what gets matched depends on whether a record is already matched.

In that case you may have some kind of recursive problem on your hands, one that can keep producing additional matches when you run the same operation multiple times on the same dataset.
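One way to tame that recursive flavour is to loop the rule passes until a full pass produces no new matches. A sketch, with a "match only when the key is unique on both sides" rule as an illustrative example (the rule set and its ordering are assumptions):

```python
from collections import defaultdict

def unique_match_on(fields):
    """Build a rule that pairs rows whose values on `fields`
    occur exactly once on each side; everything else passes through."""
    def rule(left, right):
        def grp(rows):
            g = defaultdict(list)
            for r in rows:
                g[tuple(r[f] for f in fields)].append(r)
            return g
        lg, rg = grp(left), grp(right)
        pairs, l_rest, r_rest = [], [], []
        for k, ls in lg.items():
            rs = rg.get(k, [])
            if len(ls) == 1 and len(rs) == 1:
                pairs.append((ls[0], rs[0]))
            else:
                l_rest.extend(ls)
                r_rest.extend(rs)
        for k, rs in rg.items():
            if k not in lg:
                r_rest.extend(rs)
        return pairs, l_rest, r_rest
    return rule

def match_to_fixpoint(left, right, rules):
    """Re-run the whole rule sequence until a complete pass adds
    no matches -- earlier matches can unblock later ones."""
    all_matched = []
    while True:
        progress = False
        for rule in rules:
            pairs, left, right = rule(left, right)
            if pairs:
                all_matched.extend(pairs)
                progress = True
        if not progress:
            return all_matched, left, right
```

Because each pass only removes rows, the loop is guaranteed to terminate.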

As for how to deal with this, at least try to give all the rows unique keys to reduce the complexity of the queries and the cognitive overhead when dealing with the business rules.

Also consider making a link/junction table and storing the matched key pairs with useful metadata like timestamp, matching rule used, etc. Keys missing from that table would automatically be eligible for future matching, and the table can be used to join the two tables back together in other contexts without reproducing the matching logic.
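A sketch of what one row in such a link table might carry (the column names here are illustrative, not a prescribed schema):

```python
from datetime import datetime, timezone

def link_record(left_key, right_key, rule_name):
    """One row for a match-link table: the two source keys plus audit
    metadata. Rebuilding matched views is then just a join through
    this table; keys absent from it remain eligible for matching."""
    return {
        "left_key": left_key,
        "right_key": right_key,
        "rule": rule_name,  # which matching rule produced the pair
        "matched_at": datetime.now(timezone.utc).isoformat(),
    }
```

In Databricks this would naturally be an append-only Delta table, with an anti-join against it selecting the still-unmatched rows each day.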

Michael42
New Contributor III

Thank you Lennart,

This was helpful.
