
How is idempotency ensured for the COPY INTO command?

brickster_2018
Databricks Employee
Accepted Solutions

brickster_2018
Databricks Employee

The COPY INTO command internally uses a key-value store, RocksDB, to record details of the input files. This information is stored inside the Delta table's log directory and acts like the checkpoint information of a streaming query. The next time COPY INTO is triggered on the same table, the first step is to load the data from RocksDB and compare it against the input files. Under the hood, dedupe logic is then applied to those files, which is what ensures idempotency.
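As a rough illustration of the load-and-compare step described above, here is a toy Python model. It is only a sketch: a plain `set` stands in for the RocksDB store, the class and method names are made up for this example, and keying on the file path alone is an assumption (the real store may record more about each file).

```python
# Toy model only: a Python set stands in for the RocksDB store kept in the
# Delta log directory. Keying on the file path alone is an assumption.
class ToyCopyIntoTable:
    def __init__(self):
        self.rows = []             # data "loaded" into the table
        self.loaded_files = set()  # stand-in for the key-value store

    def copy_into(self, source_files):
        """Load only files not recorded by an earlier run; return their paths."""
        new_files = [p for p in sorted(source_files) if p not in self.loaded_files]
        for path in new_files:
            self.rows.extend(source_files[path])
            self.loaded_files.add(path)
        return new_files


source = {"a.csv": [1, 2], "b.csv": [3]}
table = ToyCopyIntoTable()
first = table.copy_into(source)
second = table.copy_into(source)  # rerun over the same files
```

The second call finds every path already recorded, so it loads nothing and the table contents are unchanged.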

More details here: 

https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-copy-into.html

In COPY_OPTIONS, setting the force parameter to 'true' disables idempotency: files are loaded regardless of whether they've been loaded before.
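For reference, a small Python helper that assembles such a statement might look like the sketch below. The helper name, table name, and source path are hypothetical, and the helper does no quoting or escaping; only the `COPY_OPTIONS ('force' = 'true')` clause reflects the option discussed here.

```python
def build_copy_into(table, source_path, file_format, force=False):
    """Return a COPY INTO statement as a string (hypothetical helper, no escaping)."""
    sql = f"COPY INTO {table}\nFROM '{source_path}'\nFILEFORMAT = {file_format}"
    if force:
        # Disables the idempotency check: files are loaded even if seen before.
        sql += "\nCOPY_OPTIONS ('force' = 'true')"
    return sql


# Table name and path below are placeholders, not taken from the thread.
statement = build_copy_into("my_table", "/mnt/landing/events", "JSON", force=True)
print(statement)
```

The resulting string could then be run in a notebook, for example through `spark.sql(statement)`.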


N_M
Contributor

How does COPY INTO work with table restore?

I ran some tests, and restoring the table does NOT restore the key-value store of the target to its state at the chosen version. As a result, data that was loaded after the chosen version cannot be loaded again (unless forced).

Is this behavior intended?
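The scenario can be modeled with a toy Python sketch. It only mirrors the behavior reported in these tests and is not a description of the actual implementation; the class, the list of row snapshots standing in for table versions, and the path-keyed `set` are all assumptions made for illustration.

```python
class ToyTable:
    def __init__(self):
        self.versions = [[]]       # row snapshots, one per table version
        self.loaded_files = set()  # stand-in for the file-tracking store

    def copy_into(self, source_files, force=False):
        rows = list(self.versions[-1])
        loaded = []
        for path in sorted(source_files):
            if force or path not in self.loaded_files:
                rows.extend(source_files[path])
                self.loaded_files.add(path)
                loaded.append(path)
        self.versions.append(rows)
        return loaded

    def restore(self, version):
        # Only the data is rolled back; loaded_files is left untouched,
        # which is the behavior observed in the tests above.
        self.versions.append(list(self.versions[version]))


table = ToyTable()
table.copy_into({"a.csv": [1]})                # version 1
table.copy_into({"a.csv": [1], "b.csv": [2]})  # version 2 loads b.csv
table.restore(1)                               # data back to version 1
skipped = table.copy_into({"a.csv": [1], "b.csv": [2]})
forced = table.copy_into({"b.csv": [2]}, force=True)
```

In this model `b.csv` is skipped after the restore because it is still recorded as loaded, and it only comes back once the load is forced.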
