Ingesting Files - Same file name, modified content

data-grassroots
New Contributor II

We have a data feed whose filenames stay the same while the contents change over time (brand_a.csv, brand_b.csv, brand_c.csv, ...).

COPY INTO seems to ignore the files when their contents change.

If we set the force option to true and run it, we end up with duplicates.

If we truncate the table and attempt to re-load (without setting force to true), Databricks doesn't re-copy the records.

So, behind the scenes, Databricks is holding some kind of load-history information: it's smart enough not to re-load files it has already seen, but not quite smart enough to know that a file has been updated and new data is available.

We're pulling in the _metadata column, which lists filename, size, modification date, etc. Databricks may not be using it for this, but it clearly has enough information at some point to know when an input file has changed.

This particular dataset is small enough that dropping the table and recreating it isn't a big deal, but I'm interested in the general case.

Does either Auto Loader or Delta Live Tables handle that sort of situation differently, or do I need to load into a fresh staging table and run MERGE INTO to handle it?

FWIW, we're using Unity Catalog and Delta.


-werners-
Esteemed Contributor III

Can you elaborate on this 'data feed'?
The behavior you mention sounds like you are using Auto Loader.
Overwriting files in general does not work well in incremental setups, since immutability is one of the key concepts.

data-grassroots
New Contributor II

Hey werners - thanks for your reply.

Sure - the data feed is ~1,000 CSV files from another party that get dumped into a cloud storage bucket nightly. The filenames remain the same and look something like acme_corp.csv. The files contain unique IDs along with a bunch of other product data, as well as a last-updated-type column. We'll see new records and updated records, and old records may drop out of the feed (which we don't care about; our intent is to retain history).

The example I'm asking about is definitely COPY INTO. I have seen similar results with Auto Loader, although I've just started exploring that direction and Auto Loader has tons of options I haven't gotten around to yet.

The COPY INTO statement looks something like this:

 

COPY INTO my_table
FROM (
  SELECT
    id::int,
    name,
    revised_date,
    ... ,
    _metadata
  FROM '/Volumes/test/sources/test/'
)
FILEFORMAT = CSV
FORMAT_OPTIONS (
  'mergeSchema' = 'true',
  'inferSchema' = 'true',
  'header' = 'true',
  'ignoreLeadingWhiteSpace' = 'true',
  'ignoreTrailingWhiteSpace' = 'true',
  'mode' = 'DROPMALFORMED'  -- other modes: 'PERMISSIVE', 'FAILFAST'
)
COPY_OPTIONS ('mergeSchema' = 'true', 'force' = 'false');

-werners-
Esteemed Contributor III

The COPY INTO SQL command lets you load data from a file location into a Delta table. It is a retriable and idempotent operation: files in the source location that have already been loaded are skipped.
If you want to re-read everything, I suggest a method other than COPY INTO, basically something that just reads the whole directory (as we did all the time in the past) and then merges or appends.
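
For a concrete picture (my sketch, not something from the post), reading the whole directory with DataFrames could look roughly like the following. The volume path and table name are assumptions carried over from the COPY INTO example above.

# Sketch only: re-read the entire source directory on every run.
# Path and table name are assumptions borrowed from the COPY INTO example above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

full_feed = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/Volumes/test/sources/test/")
)

# Simplest option: full refresh of the target table.
# Note this does not retain rows that later drop out of the feed;
# the MERGE approach in the accepted solution below handles that.
full_feed.write.format("delta").mode("overwrite").saveAsTable("my_table")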

data-grassroots
New Contributor II

That's the question: short of treating the initial COPY INTO target as a staging table and then executing a MERGE into another table where we can do the insert/update-type operations, is there another option (with COPY INTO, Auto Loader, or DLT) that would automatically handle changes within a source file when that file's modified date changes?

We're open to modifying the process; we're just seeking to understand the options.

-werners-
Esteemed Contributor III (Accepted Solution)

If you do not have control over the content of the files, I suggest the following:
Each day you get new files/data (I assume these are not incremental). These files contain new, updated, and deleted data, and they are overwritten in place.
Because of that, Auto Loader will not work (and neither will COPY INTO).
So the easiest way is to read the files and merge them into a Delta Lake table. The MERGE statement will apply an insert, update, or delete depending on your merge condition.
Because you do not use any fancy file-detection mechanism, this will work.
For Auto Loader to work you need unique filenames, i.e. no overwrites.
DLT might be able to do this too, but I don't use it, so I can't say much about it.

PS. We also never use SQL for engineering loads, only for analytical purposes. DataFrames are a bliss, in our opinion 🙂
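
To make the suggestion concrete, here is a minimal DataFrame-plus-MERGE sketch. It assumes the unique key column is called id and that a Delta target table my_table already exists with a compatible schema; those names and the path come from the COPY INTO example earlier in the thread, not from the post above, so adjust them to your feed.

# Sketch only: re-read the whole feed, then MERGE so updated files update rows in place.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Source: every CSV in the feed directory, plus the file metadata column
# (filename, size, modification time) as in the COPY INTO example.
source = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/Volumes/test/sources/test/")
    .select("*", "_metadata")
)

target = DeltaTable.forName(spark, "my_table")  # existing Delta target table

(
    target.alias("t")
    .merge(source.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()      # changed file -> update rows already loaded
    .whenNotMatchedInsertAll()   # new rows -> insert
    .execute()                   # no delete clause, so dropped rows stay as history
)

With id as the merge key, re-running the load after a file changes updates the matching rows instead of duplicating them, which is the behavior asked about above.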

data-grassroots
New Contributor II

Thanks for the validation, werners! That's the path we've been heading down (copy + merge). I still have some DLT experiments planned, but - at least for this situation - copy + merge works just fine.
