Hi, I'm looking for some guidance on how to handle the following situation. I have a call center that generates call log files in JSON format, which are sent to an S3 bucket. Some of the raw files contain more than one call log object and are not valid JSON: each file is one blob with the objects concatenated right after each other, sometimes up to 8 objects per file.
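For illustration only (the field names here are made up, not my real schema), a single raw file can look something like this:

```json
{"call_id": "a1", "caller_number": "555-0101", "start_time": "2024-03-01T09:00:00Z"}{"call_id": "a2", "caller_number": "555-0102", "start_time": "2024-03-01T09:05:00Z"}
```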
I want to create a Delta Live Tables (DLT) pipeline that will (a rough sketch of what I'm imagining is below the list):
1. Ingest new files twice daily
2. Separate the individual JSON objects so there is one record per object
3. Create a table from the JSON objects
4. Clean the data to remove incomplete records
5. Add some derived columns, such as flagging callers who hung up and immediately called back
6. Produce aggregated statistics about the calls
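To make the question more concrete, here is roughly what I have in mind. This is only a sketch: the path, the schema and field names, and the completeness rule are placeholders, the `}{`-splitting trick is just a heuristic for pulling apart concatenated objects, and I'm not certain that passing `wholeText` through Auto Loader is the right way to keep each file in one row.

```python
import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Placeholder path to the mounted bucket and a made-up call-log schema.
RAW_PATH = "/mnt/call-logs/raw/"

call_schema = StructType([
    StructField("call_id", StringType()),
    StructField("caller_number", StringType()),
    StructField("start_time", TimestampType()),
    StructField("end_time", TimestampType()),
    StructField("disposition", StringType()),
])

@dlt.table(comment="Raw call-log files ingested with Auto Loader, one row per file.")
def bronze_call_logs_raw():
    # Read each file as a single text blob so the concatenated objects stay together.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "text")
        .option("wholeText", "true")  # not sure this is the right option here
        .load(RAW_PATH)
        .withColumnRenamed("value", "raw_payload")
        .withColumn("source_file", F.col("_metadata.file_path"))
    )

@dlt.table(comment="One row per individual JSON call-log object.")
def silver_call_logs():
    # Split the blob on "}{" object boundaries, then parse each piece with the schema.
    exploded = (
        dlt.read_stream("bronze_call_logs_raw")
        .withColumn(
            "json_obj",
            F.explode(
                F.split(
                    F.regexp_replace("raw_payload", r"\}\s*\{", "}\u0001{"),
                    "\u0001",
                )
            ),
        )
    )
    return exploded.select(
        "source_file",
        F.from_json("json_obj", call_schema).alias("call"),
    ).select("source_file", "call.*")

@dlt.table(comment="Cleaned call logs with incomplete records dropped.")
@dlt.expect_or_drop(
    "complete_record",
    "call_id IS NOT NULL AND start_time IS NOT NULL AND end_time IS NOT NULL",
)
def silver_call_logs_clean():
    return dlt.read_stream("silver_call_logs")

@dlt.table(comment="Daily call volume and average duration per disposition.")
def gold_call_stats_daily():
    return (
        dlt.read("silver_call_logs_clean")
        .groupBy(F.to_date("start_time").alias("call_date"), "disposition")
        .agg(
            F.count("*").alias("num_calls"),
            F.avg(
                F.unix_timestamp("end_time") - F.unix_timestamp("start_time")
            ).alias("avg_duration_sec"),
        )
    )
```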
Some of the questions I have are:
1. Should I ingest the raw JSON files into a bronze table without separating them out, or separate them before adding them to the bronze layer?
2. My files live in an S3 bucket that is mounted to my Databricks workspace. Does this mean they are already at the "bronze" level, or do I need to copy them from the mounted directory into the DLT pipeline?