cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Access historical injected data of COPY INTO command

N_M
New Contributor III

Dear Community,

I'm using the COPY INTO command to automate the staging of files that I get in an S3 bucket into specific delta tables (with some transformation on the fly).

The command works smoothly, and files are indeed inserted only once (writing idempotency works fine). The documentation says that filenames are written in a form of key:value in some RockDB.

The fact is that I need to access the (new) staged filenames in the workflow, and the idea is to look into the metadata or transactional logs rather than in the table itself (that is huge). Unfortunately, the table history does not contain this information. So my questions are:

  • is it possible to access the inserted filenames history metadata?
  • how long is the retention period of such information? (I'm asking because apparently SNOWFLAKE has the same COPY INTO command with identical features, but the documentation clearly says that historical information is stored for 64 days, after that, they are forgotten and COPY INTO will re-stage the files, if found...)

Can you help me?

1 REPLY 1

Kaniz
Community Manager
Community Manager

Hi @N_M Let’s dive into your questions regarding the COPY INTO command and the associated metadata in Snowflake:

  1. Accessing Inserted Filenames History Metadata:

    • When using the COPY INTO command in Snowflake, the filenames of staged files are indeed stored in metadata. However, accessing this information directly from the metadata or transactional logs can be a bit tricky.
    • The primary way to access the filenames is by querying the table itself after the data has been loaded. Unfortunately, as you mentioned, querying a large table might not be ideal.
    • Snowflake does not provide direct access to historical filenames metadata beyond what’s available in the table history.
  2. Retention Period for Filenames Information:

  3. Comparison with SNOWFLAKE:

    • It’s interesting to note that SNOWFLAKE (the COPY INTO command) retains historical information for 64 days before forgetting it. During this period, if the same files are encountered, they won’t be re-staged.
    • In contrast, Snowflake’s retention period is shorter (14 days), but it still provides a window for managing the loading process.

In summary, while direct access to filenames history metadata isn’t readily available, querying the table or using the LIST command within the 14-day retention period should help you monitor and manage your data loading workflow. If you have any further questions or need assistance, feel free to ask! 😊