I have a SQL Server transactional database on an EC2 instance, and an AWS Glue job that pulls full tables as Parquet files into an S3 bucket. One table is very large (about 44 million rows), and records are added to, updated in, and deleted from it. How would I set up an ETL process into Databricks to keep this table updated on a daily basis?
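To make the goal concrete, this is roughly the daily behaviour I'm after, sketched as a plain Delta MERGE outside of DLT. The bucket path, catalog/table name, and `id` key are placeholders, and I don't know whether this is even a sensible pattern at this scale:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path: today's full export of the big table written by the Glue job
snapshot = spark.read.parquet("s3://my-bucket/glue-output/big_table/latest/")

# Placeholder target table and key column
target = DeltaTable.forName(spark, "analytics.big_table")

(
    target.alias("t")
    .merge(snapshot.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()          # rows changed in SQL Server get updated
    .whenNotMatchedInsertAll()       # new rows get inserted
    .whenNotMatchedBySourceDelete()  # rows deleted in SQL Server get removed
                                     # (needs a fairly recent Delta/DBR version, I believe)
    .execute()
)
```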
I have been researching ETL with DLT (Delta Live Tables), but I'm not sure how it fits the current setup. Specifically, Auto Loader only seems able to append each day's full export of this huge table to a table in Databricks, which blows it up with duplicates. What part of my setup is wrong here? Or is the idea of the gold table that it would be this big, messy thing full of duplicate data? And if so, how does Databricks account for the fact that this table would grow into the trillions of rows?
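For reference, my current DLT attempt looks roughly like this (path and table name simplified); it streams each new Parquet export into the table, which is where the duplicates come from:

```python
import dlt

@dlt.table(name="bronze_big_table", comment="Raw daily Parquet exports from the Glue job")
def bronze_big_table():
    # Auto Loader picks up every new Parquet file the Glue job drops in S3
    # and appends it, so each daily full export is stacked on top of the last one.
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("s3://my-bucket/glue-output/big_table/")  # placeholder path
    )
```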