How is idempotency ensured for the COPY INTO command?

User16869510359
Esteemed Contributor
 
1 ACCEPTED SOLUTION


User16869510359
Esteemed Contributor

The COPY INTO command internally uses a key-value store (RocksDB) to record details of the input files it has loaded. This information is stored inside the Delta table's log directory and acts like the checkpoint information of a streaming query. The next time a COPY INTO command is triggered on the same table, the state in RocksDB is loaded first and compared against the input files, and deduplication logic is applied under the hood to ensure idempotency.

More details here: 

https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-copy-into.html

With COPY_OPTIONS, if the force option is set to 'true', idempotency is disabled and files are loaded regardless of whether they've been loaded before.
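For illustration, here is a minimal sketch of both behaviors; the table name my_events, the source path, and the format options are hypothetical and depend on your own files:

COPY INTO my_events
FROM '/landing/events/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true');

-- Re-running the exact same statement loads no new data: the files under
-- '/landing/events/' are already recorded in the file-tracking state, so
-- they are deduplicated away.

COPY INTO my_events
FROM '/landing/events/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true')
COPY_OPTIONS ('force' = 'true');   -- disables idempotency and reloads every file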


REPLIES


N_M
New Contributor III

How does COPY INTO work with table RESTORE?

I ran some tests, and RESTORE does NOT roll back the key-value store of the target table to the chosen version, which means that data that arrived after that version cannot be inserted again (unless forced).

Is this behavior intended?
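A sketch of the kind of sequence in question; the table name my_events, the source path, and the version number are hypothetical:

-- 1) Initial load
COPY INTO my_events
FROM '/landing/events/'
FILEFORMAT = CSV;

-- 2) A later load picks up newly arrived files
COPY INTO my_events
FROM '/landing/events/'
FILEFORMAT = CSV;

-- 3) Roll the table back to its state after the first load
RESTORE TABLE my_events TO VERSION AS OF 1;

-- 4) Re-running COPY INTO does not re-insert the files from step 2,
--    because the file-tracking state still lists them as loaded;
--    only 'force' = 'true' brings them back.
COPY INTO my_events
FROM '/landing/events/'
FILEFORMAT = CSV
COPY_OPTIONS ('force' = 'true');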
