Ingesting Files - Same file name, modified content

data-grassroots
New Contributor II

We have a data feed whose filenames stay the same while the contents change over time (brand_a.csv, brand_b.csv, brand_c.csv, ...).

COPY INTO seems to ignore the files when their contents change.

If we set the force option to true and run it, we end up with duplicates.

If we truncate the table and attempt to re-load (without setting force to true), Databricks doesn't re-copy the records.

So, behind the scenes, Databricks is holding some kind of load-history information: it's smart enough not to re-load files it has already seen, but not quite smart enough to know that a file has been updated and new data is available.

We're pulling in the _metadata column, which lists filename, size, modification date, etc. Databricks may not be using it for this, but it clearly has enough information at some point to know when an input file has changed.

This particular dataset is small enough that dropping the table and recreating it isn't a big deal, but I'm interested in the general case.

Does either Auto Loader or Delta Live Tables handle that sort of situation differently, or do I need to load into a fresh staging table and run MERGE INTO to handle it?

FWIW, we're using Unity Catalog and Delta.


-werners-
Esteemed Contributor III

Can you elaborate on this 'data feed'?
The behavior you mention sounds like you are using Auto Loader.
Overwriting files in general does not work well in incremental setups, since immutability is one of the key concepts.

data-grassroots
New Contributor II

Hey werners - thanks for your reply.

Sure - the data feed is ~1,000 CSV files from another party that get dumped into a cloud storage bucket nightly. The filenames remain the same and look something like acme_corp.csv. The files contain unique IDs along with a bunch of other product data, as well as a last-updated-type column. We'll see new records and updated records, and old records may drop out of the feed (which we don't care about; our intent is to retain history).

The example I'm asking about is definitely COPY INTO. I have seen similar results with Auto Loader, although I've just started exploring that direction and Auto Loader has tons of options I haven't gotten around to yet.

The COPY INTO statement looks something like this:

 

COPY INTO my_table
FROM (
  SELECT
    id::int,
    name,
    revised_date,
    ... ,
    _metadata
  FROM '/Volumes/test/sources/test/'
)
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'mergeSchema' = 'true',
  'inferSchema' = 'true',
  'header' = 'true',
  'ignoreLeadingWhiteSpace' = 'true',
  'ignoreTrailingWhiteSpace' = 'true',
  'mode' = 'DROPMALFORMED'  -- other modes: 'PERMISSIVE', 'FAILFAST'
)
COPY_OPTIONS ('mergeSchema' = 'true', 'force' = 'false');

-werners-
Esteemed Contributor III

The COPY INTO SQL command lets you load data from a file location into a Delta table. It is a retriable and idempotent operation: files in the source location that have already been loaded are skipped.
If you want to re-read everything, I suggest a method other than COPY INTO, basically something that just reads the whole directory (as we did all the time in the past) and then merges or appends.
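
For a concrete picture (my sketch, not something from the post), reading the whole directory with DataFrames could look roughly like the following. The volume path and table name are assumptions carried over from the COPY INTO example above.

# Sketch only: re-read the entire source directory on every run.
# Path and table name are assumptions borrowed from the COPY INTO example above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

full_feed = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/Volumes/test/sources/test/")
)

# Simplest option: full refresh of the target table.
# Note this does not retain rows that later drop out of the feed;
# the MERGE approach in the accepted solution below handles that.
full_feed.write.format("delta").mode("overwrite").saveAsTable("my_table")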

data-grassroots
New Contributor II

That's the question: short of treating the initial COPY INTO target as a staging table and then executing a MERGE into another table where we can do the insert/update-type operations, is there another option (with COPY INTO, Auto Loader, or DLT) that would automatically handle changes within a source file when that file's modified date changes?

We're open to modifying the process; we're just seeking to understand the options.

-werners-
Esteemed Contributor III (Accepted Solution)

If you do not have control over the content of the files, I suggest the following:
Each day you get new files/data (I assume these are not incremental). These files contain new, updated, and deleted data, and they are overwritten in place.
Because of that, Auto Loader will not work (and neither will COPY INTO).
So the easiest way is to read the files and merge them into a Delta Lake table. The MERGE statement will apply an insert, update, or delete depending on your merge condition.
Because you do not use any fancy file-detection mechanism, this will work.
For Auto Loader to work you need unique filenames, i.e. no overwrites.
DLT might be able to do this too, but I don't use it, so I can't say much about it.

PS. We also never use SQL for engineering loads, only for analytical purposes. DataFrames are a bliss, in our opinion 🙂
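
To make the suggestion concrete, here is a minimal DataFrame-plus-MERGE sketch. It assumes the unique key column is called id and that a Delta target table my_table already exists with a compatible schema; those names and the path come from the COPY INTO example earlier in the thread, not from the post above, so adjust them to your feed.

# Sketch only: re-read the whole feed, then MERGE so updated files update rows in place.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Source: every CSV in the feed directory, plus the file metadata column
# (filename, size, modification time) as in the COPY INTO example.
source = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/Volumes/test/sources/test/")
    .select("*", "_metadata")
)

target = DeltaTable.forName(spark, "my_table")  # existing Delta target table

(
    target.alias("t")
    .merge(source.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()      # changed file -> update rows already loaded
    .whenNotMatchedInsertAll()   # new rows -> insert
    .execute()                   # no delete clause, so dropped rows stay as history
)

With id as the merge key, re-running the load after a file changes updates the matching rows instead of duplicating them, which is the behavior asked about above.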

data-grassroots
New Contributor II

Thanks for the validation, werners! That's the path we've been heading down (copy + merge). I still have some DLT experiments planned, but - at least for this situation - copy + merge works just fine.
