Hi All,
We are facing a situation where our data source is Snowflake, and the data is exported to a storage location (ADLS) in Parquet format. However, the tables lack a date column or any other incremental column that could be used for incremental loads into Databricks.
We have come up with two approaches to incrementally load/refresh the tables in Databricks:
1) Currently, the data is stored in an ADLS folder as a single Parquet file containing the entire source table. We process the Parquet file to create a hash key for each row, and we also read the existing Delta table and generate hash keys for its rows. We then compare the hash keys from the Parquet file and the Delta table and perform delete and insert operations. However, we cannot perform updates because of duplicate rows. A rough sketch of what we are doing is shown below.
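For reference, this is roughly the logic we follow (a minimal PySpark sketch; the path, table name, and column handling are placeholders, not our actual code):

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

# Placeholder locations -- replace with the real ADLS path and table name.
parquet_path = "abfss://container@storageacct.dfs.core.windows.net/source/table1/"
delta_table_name = "bronze.table1"

# Read the latest full extract from ADLS and the existing Delta table.
src = spark.read.parquet(parquet_path)
tgt = spark.table(delta_table_name)

# Add a deterministic hash over every column of each row.
def with_hash(df):
    cols = [F.coalesce(F.col(c).cast("string"), F.lit("")) for c in df.columns]
    return df.withColumn("row_hash", F.sha2(F.concat_ws("||", *cols), 256))

src_h = with_hash(src)
tgt_h = with_hash(tgt)

# Rows present in the target but missing from the new extract -> delete.
stale = tgt_h.select("row_hash").subtract(src_h.select("row_hash"))

# Rows present in the new extract but missing from the target -> insert.
new_rows = src_h.join(tgt_h.select("row_hash"), on="row_hash", how="left_anti")

# Delete stale rows by recomputing the hash on the target side in the merge condition.
hash_expr = "sha2(concat_ws('||', " + ", ".join(
    f"coalesce(cast(t.{c} as string), '')" for c in tgt.columns) + "), 256)"

(DeltaTable.forName(spark, delta_table_name).alias("t")
    .merge(stale.alias("s"), f"{hash_expr} = s.row_hash")
    .whenMatchedDelete()
    .execute())

# Append the genuinely new rows, dropping the helper column.
new_rows.drop("row_hash").write.format("delta").mode("append").saveAsTable(delta_table_name)
```

Note that exact duplicate rows hash to the same value, so the set-based comparison collapses them, which is why we cannot safely perform updates.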
2) The second method involves reading the CSV or Parquet file and creating another Delta table in Databricks, then cloning that table using zero-copy clone (see the sketch after this paragraph). This eliminates the need to generate hash keys, perform comparisons, or copy the data. Note: the clone feature we are looking at is currently in public preview, and clients may not want to use it in production because it is not generally available. Will we encounter any problems if the feature remains in public preview?
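The second approach would look roughly like this (again a sketch with placeholder names; the choice between SHALLOW and DEEP CLONE depends on whether we want zero-copy metadata-only cloning or a full copy of the data files):

```python
# Placeholder table names -- replace with the real catalog/schema/table names.
staging_table = "staging.table1_parquet"
target_table = "bronze.table1"

# 1) Load the latest full extract from ADLS into a staging Delta table (full overwrite).
(spark.read.parquet("abfss://container@storageacct.dfs.core.windows.net/source/table1/")
    .write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(staging_table))

# 2) Recreate the target table as a clone of the staging table.
#    SHALLOW CLONE copies only metadata (zero-copy); DEEP CLONE also copies the data files.
spark.sql(f"CREATE OR REPLACE TABLE {target_table} SHALLOW CLONE {staging_table}")
```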
Could you please recommend a method for doing incremental loads, especially for tables that have a large amount of data but no keys?
Regards
Phani