Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

incremental loads without date column

Phani1
Valued Contributor II

Hi All,

We are facing a situation where our data source is Snowflake, and the data is saved to a storage location (ADLS) in Parquet format. However, the tables lack a date column or any other incremental column for performing incremental loads into Databricks.

We have come up with two solutions for incrementally loading/refreshing the tables in Databricks:
1) Currently, the data is stored in an ADLS folder that contains the entire source table in a single Parquet file. We process the Parquet file and create a hash key for each row. We also read the existing Delta table and generate hash keys for its rows. We then compare the hash keys from the Parquet file and the Delta table and perform delete and insert operations. However, we cannot perform updates because of duplicates. (See the first sketch after this list.)

2) The second method involves reading the CSV or Parquet file and creating another Delta table in Databricks, then cloning that table with zero-copy cloning. This eliminates the need to generate hash keys, perform comparisons, or copy the data again. Note: the clone feature is currently in public preview, and clients may not want to use it in production because it is not generally available. Will we encounter any problems if the feature remains in public preview? (See the second sketch below.)
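
For illustration, here is a minimal PySpark sketch of option 1. The path and table names (source_path, main.bronze.my_table) are hypothetical, and it assumes the target Delta table already stores a row_hash column and that the full set of columns identifies a row:

```python
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations -- adjust to your environment.
source_path = "abfss://landing@youraccount.dfs.core.windows.net/my_table/"
target_table = "main.bronze.my_table"

def with_row_hash(df):
    # Hash the whole row; assumes the full set of columns identifies it.
    cols = [F.coalesce(F.col(c).cast("string"), F.lit(""))
            for c in sorted(df.columns) if c != "row_hash"]
    return df.withColumn("row_hash", F.sha2(F.concat_ws("||", *cols), 256))

src = with_row_hash(spark.read.parquet(source_path))
tgt = spark.table(target_table)  # assumed to already carry row_hash

# Rows in the source that the target does not have yet -> insert.
to_insert = src.join(tgt.select("row_hash"), "row_hash", "left_anti")

# Rows in the target that vanished from the source -> delete.
to_delete = (tgt.join(src.select("row_hash"), "row_hash", "left_anti")
                .select("row_hash").dropDuplicates())

# Delete first, then append; updates are skipped because duplicate
# rows make a one-to-one match impossible (as noted above).
(DeltaTable.forName(spark, target_table).alias("t")
    .merge(to_delete.alias("d"), "t.row_hash = d.row_hash")
    .whenMatchedDelete()
    .execute())

to_insert.write.format("delta").mode("append").saveAsTable(target_table)
```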
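
And a sketch of option 2 under the same hypothetical names. On Databricks the zero-copy variant is SHALLOW CLONE, which copies only table metadata and keeps referencing the staging table's data files:

```python
# Rebuild a staging Delta table from the latest full extract (hypothetical names).
spark.sql("""
    CREATE OR REPLACE TABLE main.staging.my_table AS
    SELECT * FROM parquet.`abfss://landing@youraccount.dfs.core.windows.net/my_table/`
""")

# Zero-copy clone: only the metadata is copied, no data files are rewritten.
spark.sql("""
    CREATE OR REPLACE TABLE main.bronze.my_table
    SHALLOW CLONE main.staging.my_table
""")
```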

Please recommend whether there is any other method for doing incremental loads, especially for tables that hold a large amount of data but lack keys.

Regards

Phani

 

1 REPLY

-werners-
Esteemed Contributor III

Ideally you would have some change-tracking system (CDC, for example) on the source tables; in the case of Snowflake that would be Streams (Introduction to Streams | Snowflake Documentation).
But that is not the case, so I think your approach is OK. You cannot track what is not registered.
All you can do is some kind of comparison between source and target.
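
For completeness, a minimal sketch of what such change tracking could look like on the Snowflake side, using a stream over a hypothetical table my_table (all connection parameters are placeholders):

```python
import snowflake.connector

# Hypothetical connection parameters.
conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    warehouse="your_wh", database="your_db", schema="public",
)
cur = conn.cursor()

# A stream records inserts, updates and deletes on the table from this point on.
cur.execute("CREATE STREAM IF NOT EXISTS my_table_stream ON TABLE my_table")

# Later, read only the changes; METADATA$ACTION distinguishes INSERT from DELETE.
# (The stream offset only advances when the stream is consumed inside a DML
# statement, e.g. INSERT INTO ... SELECT * FROM my_table_stream.)
cur.execute("SELECT *, METADATA$ACTION FROM my_table_stream")
for row in cur.fetchall():
    print(row)
```

With something like that in place, the export to ADLS could carry only the changed rows instead of the full table.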


About the cloning feature: if something is in public preview, it will become GA and is already stable, so there is no issue in using it.
