Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

incremental loads without date column

Phani1
Valued Contributor II

Hi All,

We are facing a situation where our data source is Snowflake, and the data is saved to a storage location (ADLS) in Parquet format. However, the tables lack a date column or any other incremental column for performing incremental loads into Databricks.

We have come up with two solutions for incrementally loading/refreshing the tables in Databricks:
1) Currently, the data is stored in an ADLS folder that contains the entire source table in a single Parquet file. We process the Parquet file and create a hash key for each row. We also read the existing Delta table and generate hash keys for its rows. We then compare the hash keys from the Parquet file and the Delta table and perform delete and insert operations. However, we cannot perform updates because of duplicates. (See the first sketch after this list.)

2) The second method involves reading the CSV or Parquet file and creating another Delta table in Databricks, then cloning that table with zero-copy cloning. This eliminates the need to generate hash keys, perform comparisons, or copy the data again. Note: the clone feature is currently in public preview, and clients may not want to use it in production because it is not generally available. Will we encounter any problems if the feature remains in public preview? (See the second sketch below.)
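
For illustration, here is a minimal PySpark sketch of option 1. The path and table names (source_path, main.bronze.my_table) are hypothetical, and it assumes the target Delta table already stores a row_hash column and that the full set of columns identifies a row:

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations -- adjust to your environment.
source_path = "abfss://landing@youraccount.dfs.core.windows.net/my_table/"
target_table = "main.bronze.my_table"

def with_row_hash(df):
    # Hash the whole row; assumes the full set of columns identifies it.
    cols = [F.coalesce(F.col(c).cast("string"), F.lit(""))
            for c in sorted(df.columns) if c != "row_hash"]
    return df.withColumn("row_hash", F.sha2(F.concat_ws("||", *cols), 256))

src = with_row_hash(spark.read.parquet(source_path))
tgt = spark.table(target_table)  # assumed to already carry row_hash

# Rows in the source that the target does not have yet -> insert.
to_insert = src.join(tgt.select("row_hash"), "row_hash", "left_anti")

# Rows in the target that vanished from the source -> delete.
to_delete = (tgt.join(src.select("row_hash"), "row_hash", "left_anti")
                .select("row_hash").dropDuplicates())

# Delete first, then append; updates are skipped because duplicate
# rows make a one-to-one match impossible (as noted above).
(DeltaTable.forName(spark, target_table).alias("t")
    .merge(to_delete.alias("d"), "t.row_hash = d.row_hash")
    .whenMatchedDelete()
    .execute())

to_insert.write.format("delta").mode("append").saveAsTable(target_table)
```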
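
And a sketch of option 2 under the same hypothetical names. On Databricks the zero-copy variant is SHALLOW CLONE, which copies only table metadata and keeps referencing the staging table's data files:

```python
# Rebuild a staging Delta table from the latest full extract (hypothetical names).
spark.sql("""
    CREATE OR REPLACE TABLE main.staging.my_table AS
    SELECT * FROM parquet.`abfss://landing@youraccount.dfs.core.windows.net/my_table/`
""")

# Zero-copy clone: only the metadata is copied, no data files are rewritten.
spark.sql("""
    CREATE OR REPLACE TABLE main.bronze.my_table
    SHALLOW CLONE main.staging.my_table
""")
```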

Please recommend whether there is any other method for doing incremental loads, especially for tables that hold a large amount of data but lack keys.

Regards

Phani

 

1 REPLY

-werners-
Esteemed Contributor III

Ideally you would have some change-tracking system (CDC, for example) on the source tables; in the case of Snowflake that would be Streams (Introduction to Streams | Snowflake Documentation).
But that is not the case, so I think your approach is OK. You cannot track what is not registered.
All you can do is some kind of comparison between source and target.
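
For completeness, a minimal sketch of what such change tracking could look like on the Snowflake side, using a stream over a hypothetical table my_table (all connection parameters are placeholders):

```python
import snowflake.connector

# Hypothetical connection parameters.
conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    warehouse="your_wh", database="your_db", schema="public",
)
cur = conn.cursor()

# A stream records inserts, updates and deletes on the table from this point on.
cur.execute("CREATE STREAM IF NOT EXISTS my_table_stream ON TABLE my_table")

# Later, read only the changes; METADATA$ACTION distinguishes INSERT from DELETE.
# (The stream offset only advances when the stream is consumed inside a DML
# statement, e.g. INSERT INTO ... SELECT * FROM my_table_stream.)
cur.execute("SELECT *, METADATA$ACTION FROM my_table_stream")
for row in cur.fetchall():
    print(row)
```

With something like that in place, the export to ADLS could carry only the changed rows instead of the full table.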


About the cloning feature: if something is in public preview, it will become GA and is already stable, so there is no issue in using it.
