
ETL Advice for Large Transactional Database

lwoodward
New Contributor II

I have a SQL Server transactional database on an EC2 instance, and an AWS Glue job that pulls full tables as Parquet files into an S3 bucket. One very large table has 44 million rows, and records are added, updated, and deleted in it. How would I set up an ETL process into Databricks to keep this table updated on a daily basis?

I have been researching ETL using DLT, but I'm not sure how that would work with the current setup. In more detail: Auto Loader seems only able to re-append each day's full extract of this huge table to a table in Databricks, which blows it up with duplicates. What part of the setup is wrong here? Or is the concept of the gold table that it would be this big messy thing full of duplicate data? And if so, how does Databricks account for the fact that this table would grow into the trillions of rows?

ACCEPTED SOLUTION

ScottSmithDB
Valued Contributor

If you have a CDC stream capability, you can use the APPLY CHANGES INTO API to perform SCD1 or SCD2 into a Delta Lake table in Databricks. You can find more information in the Delta Live Tables documentation. This is the best way to go if CDC is a possibility.
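As a rough illustration, here is a minimal sketch of the Python side of that API (dlt.apply_changes, the Python counterpart of APPLY CHANGES INTO) inside a DLT pipeline. The CDC feed path, source view name, key column, sequence column, and delete marker are all placeholders for whatever your capture tool actually produces:

```python
import dlt
from pyspark.sql.functions import col, expr

# Hypothetical CDC feed landed in S3 (path and column names are placeholders).
@dlt.view
def big_table_cdc():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("s3://your-bucket/cdc/big_table/")
    )

# Target streaming table kept in sync from the change feed.
dlt.create_streaming_table("big_table")

dlt.apply_changes(
    target="big_table",
    source="big_table_cdc",
    keys=["id"],                                    # primary key column(s) -- placeholder
    sequence_by=col("change_timestamp"),            # ordering column from the CDC feed -- placeholder
    apply_as_deletes=expr("operation = 'DELETE'"),  # delete marker in the feed -- placeholder
    stored_as_scd_type=1,                           # use 2 to keep SCD2 history
)
```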

If you do not have a CDC source, MERGE INTO can handle appends, updates, and deletes. If the table size is several terabytes and the merge conditions are very complex, other techniques may work, such as: partitioning at the level of granularity at which the data changes, change detection with logic to delete entire partitions (if deletes are a possibility), and overwriting the changed or new partitions while dropping those that no longer appear in the Parquet directory.
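For the non-CDC path, a hedged sketch of what a daily MERGE from the Glue-landed Parquet snapshot might look like. The table name, S3 path, and `id` key column are placeholders, and the whenNotMatchedBySourceDelete clause assumes the daily extract is a full snapshot, so keys missing from it represent deletes:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Today's full-table extract written to S3 by the Glue job (placeholder path).
snapshot_df = spark.read.parquet("s3://your-bucket/exports/big_table/latest/")

# Existing Delta table in Databricks (placeholder name).
target = DeltaTable.forName(spark, "bronze.big_table")

(
    target.alias("t")
    .merge(snapshot_df.alias("s"), "t.id = s.id")  # join on the primary key
    .whenMatchedUpdateAll()                        # apply updates
    .whenNotMatchedInsertAll()                     # apply inserts
    .whenNotMatchedBySourceDelete()                # rows absent from the full snapshot are deletes
    .execute()
)
```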

I may not be covering all of the practical possibilities here, but without more specifics about your data types, volume, complexity, rate of change, etc., it is difficult to give prescriptive options if APPLY CHANGES INTO and MERGE INTO do not work for your use case.

I hope this is helpful.


 
