09-25-2024 05:48 AM
09-25-2024 06:58 AM
Your pattern with the date folders is the classic way of processing data in data lakes.
But with Autoloader you can get rid of them.
Create a folder per table, e.g. /bronze/files/table1, and read that folder using Autoloader.
Your filenames can be anything as long as they can be lexicographically sorted. I use <epoch>_filename...
Autoloader keeps track of what has been processed and what has not.
That way you have one Autoloader stream per table.
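For example, a minimal sketch of one such stream in Python/PySpark (the schema and checkpoint locations and the target table name here are just placeholders):

# Minimal sketch; adjust paths, format and table name to your setup
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.schemaLocation", "/bronze/_schemas/table1")   # placeholder
      .load("/bronze/files/table1"))

(df.writeStream
   .option("checkpointLocation", "/bronze/_checkpoints/table1")         # placeholder
   .trigger(availableNow=True)   # pick up all new files, then stop
   .toTable("bronze.table1"))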
Does that make sense?
09-25-2024 07:36 AM
Hey,
Thanks for the swift reply. I don't have control over the way the source data is loaded into the blob store, so as I understand it, your answer would be to go with the first option I listed:
have an intermediary service that is triggered by the Kafka message and copies the files for the completed batch over to another blob storage account, putting them in a different folder structure, probably /table_name/{timestamp}.parquet, so that each Autoloader stream only has to look at the directory that corresponds to its table.
I have to copy the data across to a new storage container both for the efficiency gains for each Autoloader stream and for control over when Autoloader runs, knowing that the container will only ever contain complete batches of data.
You are right that Autoloader will keep track of which data it has picked up, but it is also important for me to know which batches it has picked up, so that I can be sure my downstream tables, which join and aggregate many of these source tables, have processed all of the incoming batches.
Currently I was planning to do this using the "BATCH_ID" column, which is the timestamp at which the batch was uploaded by the source system. Basically, after running Autoloader on the tables, I would join all of those dataframes together and get a distinct list of the "BATCH_ID" values, which I can then record in another table to say that all of these source tables have been processed up to the highest "BATCH_ID", roughly as sketched below.
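Roughly what I have in mind (table and column names are placeholders; the intersect is standing in for the join on "BATCH_ID" I described):

from pyspark.sql import functions as F

# Keep only the BATCH_IDs that are present in every source table
batch_ids = None
for t in ["bronze.table1", "bronze.table2", "bronze.table3"]:   # hypothetical tables
    ids = spark.table(t).select("BATCH_ID").distinct()
    batch_ids = ids if batch_ids is None else batch_ids.intersect(ids)

# Highest BATCH_ID that all source tables have fully processed
high_watermark = batch_ids.agg(F.max("BATCH_ID")).collect()[0][0]

# Record it in a (hypothetical) control table
spark.createDataFrame([(high_watermark,)], ["processed_up_to"]) \
    .write.mode("append").saveAsTable("control.batch_watermark")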
Is that what you would do? Or is there a way of getting that information out of Autoloader?
09-26-2024 06:47 AM
There is an easier way to see what has been processed:
SELECT * FROM cloud_files_state('path/to/checkpoint');
https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html
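For example, from a notebook you could inspect a given stream's checkpoint like this (the checkpoint path is a placeholder, and the exact columns returned depend on your Databricks Runtime version):

files = spark.sql("SELECT * FROM cloud_files_state('/bronze/_checkpoints/table1')")
display(files)   # one row per file Autoloader has discovered/ingested for that stream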