
ETL Advice for Large Transactional Database

lwoodward
New Contributor II

I have a SQL Server transactional database on an EC2 instance, and an AWS Glue job that pulls full tables as Parquet files into an S3 bucket. One very large table has 44 million rows, and records are added, updated, and deleted in it. How would I set up an ETL process into Databricks to keep this table updated on a daily basis?

I have been researching ETL using DLT, but I'm not sure how that would work with the current setup. In more detail: Auto Loader seems only able to re-append each day's full extract of this huge table to a table in Databricks, which blows it up with duplicates. What part of the setup is wrong here? Or is the concept of the gold table that it would be this big messy thing full of duplicate data? And if so, how does Databricks account for the fact that this table would grow into the trillions of rows?

ACCEPTED SOLUTION

ScottSmithDB
Valued Contributor

If you have a CDC stream capability, you can use the APPLY CHANGES INTO API to perform SCD1 or SCD2 into a Delta Lake table in Databricks. You can find more information in the Delta Live Tables documentation. This is the best way to go if CDC is a possibility.
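As a rough illustration, here is a minimal sketch of the Python side of that API (dlt.apply_changes, the Python counterpart of APPLY CHANGES INTO) inside a DLT pipeline. The CDC feed path, source view name, key column, sequence column, and delete marker are all placeholders for whatever your capture tool actually produces:

```python
import dlt
from pyspark.sql.functions import col, expr

# Hypothetical CDC feed landed in S3 (path and column names are placeholders).
@dlt.view
def big_table_cdc():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("s3://your-bucket/cdc/big_table/")
    )

# Target streaming table kept in sync from the change feed.
dlt.create_streaming_table("big_table")

dlt.apply_changes(
    target="big_table",
    source="big_table_cdc",
    keys=["id"],                                    # primary key column(s) -- placeholder
    sequence_by=col("change_timestamp"),            # ordering column from the CDC feed -- placeholder
    apply_as_deletes=expr("operation = 'DELETE'"),  # delete marker in the feed -- placeholder
    stored_as_scd_type=1,                           # use 2 to keep SCD2 history
)
```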

If you do not have a CDC source, MERGE INTO can handle appends, updates, and deletes. If the table size is several terabytes and the merge conditions are very complex, other techniques may work, such as: partitioning at the level of granularity at which the data changes, change detection with logic to delete entire partitions (if deletes are a possibility), and overwriting the changed or new partitions while dropping those that no longer appear in the Parquet directory.
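For the non-CDC path, a hedged sketch of what a daily MERGE from the Glue-landed Parquet snapshot might look like. The table name, S3 path, and `id` key column are placeholders, and the whenNotMatchedBySourceDelete clause assumes the daily extract is a full snapshot, so keys missing from it represent deletes:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Today's full-table extract written to S3 by the Glue job (placeholder path).
snapshot_df = spark.read.parquet("s3://your-bucket/exports/big_table/latest/")

# Existing Delta table in Databricks (placeholder name).
target = DeltaTable.forName(spark, "bronze.big_table")

(
    target.alias("t")
    .merge(snapshot_df.alias("s"), "t.id = s.id")  # join on the primary key
    .whenMatchedUpdateAll()                        # apply updates
    .whenNotMatchedInsertAll()                     # apply inserts
    .whenNotMatchedBySourceDelete()                # rows absent from the full snapshot are deletes
    .execute()
)
```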

I may not be covering all of the practical possibilities here, but without more specifics about your data types, volume, complexity, rate of change, etc., it is difficult to give prescriptive options if APPLY CHANGES INTO and MERGE INTO do not work for your use case.

I hope this is helpful.


 
