Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best practice to handle SQL table archives?

mkkao924
New Contributor II

Many of our source systems are set up so that the main table only keeps a small amount of data, and historical data is moved to a separate archive table with a very similar schema.

My goal is to have one table in Databricks, maybe with a flag to indicate whether the data is current or archived.

How should I set up my pipeline to support this, and is there a best practice I should follow?

3 REPLIES

Coffee77
Contributor III

Interesting question that (from my view) implies a change of mindset between the SQL transactional/relational and Lakehouse worlds. I have ended up working in both, but I came from the SQL relational world, so my first piece of advice is this: do not treat Delta tables as if they were relational tables. I could explain in more detail if needed, but for now I'll go straight to the most important points.

In SQL it is very common to have multiple tables to store current or historical data, materialized views that split data by dates (or other fields), tables partitioned across different file groups, etc., as a way of keeping daily transactional operations healthy and balancing writes against reads. So it makes a lot of sense to move historical data to other tables, databases, etc. once scalability is compromised. Those tables can keep roughly the same schemas, and from them you could even run ETLs to feed data warehouses.

With Delta tables, scalability is one of the main advantages as long as you define partitions or liquid clustering correctly. In other words, there shouldn't be a need to separate data into different tables unless your business has specific requirements. Databricks also offers very useful options for dealing with historical changes, such as change data capture (CDC), time travel, etc.; take a look at the docs.
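
To make that concrete, here is a minimal PySpark sketch of a single consolidated Delta table that uses liquid clustering instead of a separate archive table, plus a time travel query. Table and column names are just assumptions for illustration:

```python
# Minimal sketch -- table and column names are assumptions, not the poster's
# actual schema. One consolidated Delta table with liquid clustering replaces
# the "current" + "archive" split, and time travel reads earlier versions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders (
        order_id   BIGINT,
        order_date DATE,
        amount     DECIMAL(18, 2),
        is_current BOOLEAN
    )
    CLUSTER BY (order_date, is_current)  -- liquid clustering instead of manual archiving
""")

# Time travel: read the table as of an earlier version (assumes version 1 exists).
previous_state = spark.sql("SELECT * FROM main.sales.orders VERSION AS OF 1")
```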

Having said that, if you still want to isolate data in different objects, I'd go with separate managed tables, or even separate catalogs so you can back that data with different storage accounts. Then you can define views to join, union, or consolidate your data. It's true that the less data you keep in a table, the faster queries and "optimization" jobs against it will run, but the penalty is far smaller in a lakehouse scenario than in the SQL relational world.
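
If you do go the separate-tables route, a view can hide the split from consumers. A minimal sketch, again with assumed catalog and table names:

```python
# Minimal sketch (assumed catalog and table names) of consolidating separate
# "current" and "archive" managed tables behind one view with a flag column.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CREATE OR REPLACE VIEW main.sales.orders_all AS
    SELECT *, true  AS is_current FROM main.sales.orders_current
    UNION ALL
    SELECT *, false AS is_current FROM main.sales.orders_archive
""")
```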

I hope this helps. Happy to get other opinions.


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData

mkkao924
New Contributor II

Thank you @Coffee77 

This is exactly what I want to achieve: while my relational database has multiple tables (current + historical), I want to set up a pipeline so that only one table exists in Databricks for all the records, but I am not sure what the best practice is to achieve this.

Furthermore, when a record is "archived" in my relational database, CDC sees it as "delete from the current table, then insert into the historical table", and that totally messes up my current pipeline. Ideally, I would like my Databricks table to have a column indicating whether the record is "current" or "historical".

Coffee77
Contributor III

I would need to dive deeper into your scenario, but it sounds to me like a strategy could be:

1) Create a view in your SQL Server database with "current data" UNION "historical data". You can add a boolean field that is set to True in the first query and False in the second.

2) Create a federated catalog to your SQL Server so you can run this query, OR simply use the SQL Server Spark connector from your pipeline. Try to limit the retrieved rows by pre-detecting unchanged rows.

3) Design and create a target Delta table matching the previous query, along with any other fields that make sense in your scenario.

4) In the pipeline, use this query to create a DataFrame and use the MERGE command to insert, update, or delete target rows in your destination Delta table (see the sketch below). Consider whether you need additional fields in your source to mark each row as inserted, updated, or deleted. Otherwise, set a default policy: insert if not present, update if present (the easy part), and as for "delete", it will depend on how you implement this CDC source!
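
A rough PySpark sketch of steps 2-4. All catalog, table, and column names here are assumptions, and the merge key will differ in your scenario:

```python
# Rough sketch of steps 2-4. Catalog, table, and column names are assumptions;
# adjust the merge key and flag handling to the real schema.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Step 2: read the consolidated SQL Server view through a federated catalog
# (it appears as a regular table once Lakehouse Federation is configured).
source_df = spark.read.table("sqlserver_fed.dbo.orders_with_history")

# Steps 3-4: MERGE into the target Delta table created beforehand, keyed on
# the business key, so archived rows simply flip their is_current flag.
target = DeltaTable.forName(spark, "main.sales.orders_consolidated")

(
    target.alias("t")
    .merge(source_df.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()      # update existing rows, including the flag
    .whenNotMatchedInsertAll()   # insert rows not yet present in the target
    .execute()
)
```

The whenMatched/whenNotMatched pair implements the default "insert if not present, update if present" policy; handling hard deletes would need something like whenNotMatchedBySourceDelete (available in newer Delta Lake releases) or explicit delete markers, depending on how the CDC source is implemented.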

This is only a high-level approach. I hope it helps.


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData