
How is idempotency ensured for the COPY INTO command?

brickster_2018
Databricks Employee

Accepted Solutions

brickster_2018
Databricks Employee

The COPY INTO command internally uses a key-value store, RocksDB, to record details of the input files. This information is stored inside the Delta table's log directory and acts like the checkpoint information of a streaming query. The next time a COPY INTO command is run against the same table, its first step is to load the data from RocksDB and compare it against the input files. Under the hood, deduplication logic then ensures idempotency.

More details here: 

https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-copy-into.html

In COPY_OPTIONS, setting the force parameter to 'true' disables idempotency: files are loaded regardless of whether they've been loaded before.
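
To make the idea concrete, here is a toy Python model of the behavior described above. It is only a sketch and not how Databricks implements COPY INTO: `ToyTable`, `copy_into`, and the in-memory set standing in for the RocksDB store are all hypothetical, and identifying a file by its path alone is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for a Delta table plus its loaded-file record."""
    rows: list = field(default_factory=list)
    # Plays the role of the key-value store kept in the table's log directory.
    loaded_files: set = field(default_factory=set)


def copy_into(table, input_files, force=False):
    """Load each input file at most once, unless force is True.

    input_files maps a file path to the rows that file contains.
    Returns the list of paths that were actually loaded in this run.
    """
    loaded_now = []
    for path, rows in input_files.items():
        # Assumption: a file is identified by its path alone.
        if not force and path in table.loaded_files:
            continue  # seen in an earlier run, so skip it
        table.rows.extend(rows)
        table.loaded_files.add(path)
        loaded_now.append(path)
    return loaded_now


table = ToyTable()
files = {"a.csv": [1, 2], "b.csv": [3]}

first = copy_into(table, files)                # loads both files
second = copy_into(table, files)               # loads nothing, files already recorded
forced = copy_into(table, files, force=True)   # reloads both and duplicates the rows
```

In this model the second run is a no-op because both paths are already recorded, while the forced run bypasses the check and appends the same rows again.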


N_M
Contributor

How does COPY INTO work with table restore?

I ran some tests, and the restore operation does NOT restore the target's key-value store entries to the chosen version. This means that data loaded after the chosen version cannot be inserted again (unless forced).

Is this behavior intended?
