Would like to start a discussion regarding techniques for joining two relatively large tables of roughly equal size on a daily basis. I realize this may be a bit of a conundrum with databricks, but review the details.

Michael42
New Contributor III

Input Data:

  • One batch load of a daily dataset, roughly 10 million items a day of transactions.
  • Another daily batch load of roughly the same size.
  • Each row in one dataset should have a corresponding row in the other dataset.

Problem to solve:

  • The problem is to identify which rows in each dataset that are related.
  • This is a matching problem that could be describe as a series of one-to-one inner joins between the two datasets.
  • Join columns are always account/date/and amount.
  • However there are also varying matching rules, on additional columns of each dataset, that need to be explored if multiple rows are found to match up on either side after applying the basic rule above. For example, when multiple rows on each side have the same account and date, but that have amount values that are very common.
  • In addition, sometimes a matching item from one dataset on any given day may not arrive in the other dataset until the following day, or perhaps even several days later.

Output:

  • Matched rows: Once a unique set of one-to-one matches are found, they can be offloaded to a result set of matched rows.
  • Unmatched rows: remaining unmatched rows from either side should be added to tomorrow's dataset for subsequent matching on a later day.

Curious about what techniques someone might use to go about solving this using databricks.

I am new to databricks, having only recently received a data engineering associates cert.

There are additional nuances to the problem I can get into if the discussion proceeds further. Apologies if this discussion seems inappropriate for this forum.