How is idempotency ensured for the COPY INTO command?

User16869510359
Esteemed Contributor
 
1 ACCEPTED SOLUTION


User16869510359
Esteemed Contributor

The COPY INTO command internally uses a key-value store (RocksDB) to record details of the input files it has loaded. This information is stored inside the Delta table's log directory and acts like the checkpoint information of a streaming query. The next time a COPY INTO command is triggered on the same table, the state in RocksDB is loaded first and compared against the input files, and deduplication logic is applied under the hood to ensure idempotency.

More details here: 

https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-copy-into.html

With COPY_OPTIONS, if the force option is set to 'true', idempotency is disabled and files are loaded regardless of whether they've been loaded before.
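For illustration, here is a minimal sketch of both behaviors; the table name my_events, the source path, and the format options are hypothetical and depend on your own files:

COPY INTO my_events
FROM '/landing/events/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true');

-- Re-running the exact same statement loads no new data: the files under
-- '/landing/events/' are already recorded in the file-tracking state, so
-- they are deduplicated away.

COPY INTO my_events
FROM '/landing/events/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('force' = 'true');   -- disables idempotency and reloads every file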


REPLIES


N_M
New Contributor III

How does COPY INTO work with table RESTORE?

I ran some tests, and RESTORE does NOT roll back the key-value store of the target table to the chosen version, which means that data that arrived after that version cannot be inserted again (unless forced).

Is this behavior intended?
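A sketch of the kind of sequence in question; the table name my_events, the source path, and the version number are hypothetical:

-- 1) Initial load
COPY INTO my_events
FROM '/landing/events/'
FILEFORMAT = CSV;

-- 2) A later load picks up newly arrived files
COPY INTO my_events
FROM '/landing/events/'
FILEFORMAT = CSV;

-- 3) Roll the table back to its state after the first load
RESTORE TABLE my_events TO VERSION AS OF 1;

-- 4) Re-running COPY INTO does not re-insert the files from step 2,
--    because the file-tracking state still lists them as loaded;
--    only 'force' = 'true' brings them back.
COPY INTO my_events
FROM '/landing/events/'
FILEFORMAT = CSV
COPY_OPTIONS ('force' = 'true');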
