Hi
We have a couple of Unity Catalog catalogs. In each are schemas, some used, some old, some named temp_data and the like. Within those schemas are tables that follow the same erratic approach. The result of a couple of years of bad housekeeping (who hasn't experienced that before?).
We now have an initiative to tidy all this up as we merge a couple of departments.
The plan is to create a couple of new catalogs and new schemas within those, then cherry-pick the good stuff from the old mess. The first stage of this operation has been to identify the tables we want to keep and set out what the new schemas will be called.
The big question in front of us is how best to go about porting the old stuff to the new destination. As described, it's not a simple 1:1 lift-and-shift: table "xxx" may well become table "abc" and live in a different schema from its previous home.
The data in the old schemas is live and will continue to be updated while the new catalogs are created. There is a requirement to keep the old and new in sync. Maybe not instant synchronization like database replication would give, but perhaps the new is brought up to date by a refresh every day - or whenever we instruct it to do so!
By now it will probably be obvious to you that at some point all the notebooks and reports that use the current data will be switched to point at the new locations. Maybe this will be achieved by a migration branch of the code in git, with all the reference paths updated to point at the new tidy catalogs. At the requisite "go go go" moment, the new tables will be brought up to date, we'll merge the migration branch into master, and the new world will be born. (The old catalogs will rust and die, eventually to be deleted.)
So, I'm seeking advice on the best strategy to do all this. I could embark on a roll-my-own method: scripting out the tables, tweaking the scripts to the new names, and creating the new tables, then producing a set of SQL statements that truncate each new table and insert a fresh select from the old one in order to refresh the data.
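For a single table, that would boil down to something like this - a minimal sketch as run from a notebook, where all the catalog/schema/table names are invented for illustration:

```python
# `spark` is the SparkSession every Databricks notebook provides.
# Hypothetical names: old_cat.temp_data.xxx becomes new_cat.finance.abc.

# One-off: create the renamed table, copying structure only (WHERE 1 = 0
# selects no rows, so we get an empty table with the source's columns).
spark.sql("""
    CREATE TABLE IF NOT EXISTS new_cat.finance.abc
    AS SELECT * FROM old_cat.temp_data.xxx WHERE 1 = 0
""")

# Daily (or on-demand) refresh: wipe the copy and reload from the live source.
spark.sql("TRUNCATE TABLE new_cat.finance.abc")
spark.sql("""
    INSERT INTO new_cat.finance.abc
    SELECT * FROM old_cat.temp_data.xxx
""")
```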
Although - with care - this would work, it feels clunky and error-prone. So... my big question and pivotal point of learning is...
Is there a more elegant, "Databricks-approved/advised" approach to take?
Something where I would set up a configuration file/table to instruct:
"Databricks.. please replicate this table from here to here, and keep the two in sync daily at this time. until I instruct you to stop"
Please help before I roll my own solution and then find out there was a better way all along!