
Guidance for creating a DLT pipeline with raw JSON call logs

chrisf_sts
New Contributor II

Hi, I am looking for some support on how to handle the following situation. I have a call center that generates call log files in JSON format, which are sent to an S3 bucket. Some of the raw files contain more than one call log object and are not valid JSON as a whole: each is one large file with objects placed back to back, up to 8 objects per file.
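For reference, Python's incremental JSON decoder can pull back-to-back objects apart. This is a minimal sketch under that assumption; the file name is hypothetical:

```python
# Minimal sketch: split one raw file holding several concatenated JSON
# objects ({...}{...}{...}) into a list of dicts. "call_log_raw.json" is a
# hypothetical file name, not from the original post.
import json

def split_concatenated_json(text: str) -> list[dict]:
    """Decode back-to-back JSON objects from a single string."""
    decoder = json.JSONDecoder()
    objects, idx = [], 0
    while idx < len(text):
        # Skip any whitespace between objects.
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        obj, idx = decoder.raw_decode(text, idx)
        objects.append(obj)
    return objects

with open("call_log_raw.json") as f:
    call_logs = split_concatenated_json(f.read())
print(f"found {len(call_logs)} call log objects")
```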

I want to create a Delta Live Tables (DLT) pipeline that will (a rough sketch of this flow follows the list):
1. ingest new files twice daily
2. separate the individual JSON objects so there is one record per object
3. create a table from the JSON objects
4. clean the data to get rid of incomplete records
5. add derived columns, such as flagging callers who hung up and immediately called back
6. produce aggregated statistics about the calls
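Not an answer, but roughly what that flow could look like as a DLT pipeline. This is a sketch built on assumptions: Auto Loader reads each file whole (text format with wholetext), the paths and table names are made up, and the call-log fields (call_id, caller, started_at) are invented for illustration. The twice-daily cadence would come from scheduling the pipeline as a triggered job rather than from code.

```python
# Sketch of a bronze/silver/gold DLT flow for concatenated-JSON call logs.
# Runs inside a DLT pipeline, where `spark` and `dlt` are available.
import json

import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

RAW_PATH = "s3://my-call-center-bucket/raw/"  # hypothetical source location

@F.udf(returnType=ArrayType(StringType()))
def split_json_objects(text):
    # Decode back-to-back JSON objects and re-serialize each one.
    decoder, out, idx = json.JSONDecoder(), [], 0
    while idx < len(text):
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        obj, idx = decoder.raw_decode(text, idx)
        out.append(json.dumps(obj))
    return out

@dlt.table(comment="Raw call log files, one row per file (bronze)")
def calls_bronze():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "text")
            .option("wholetext", "true")  # keep each file as a single row
            .load(RAW_PATH))

@dlt.table(comment="One row per call log object (silver)")
@dlt.expect_or_drop("has_call_id", "call_id IS NOT NULL")  # step 4: drop incomplete records
def calls_silver():
    exploded = (dlt.read_stream("calls_bronze")
                .select(F.explode(split_json_objects("value")).alias("raw_json")))
    # Hypothetical fields; replace with the real call log schema.
    return exploded.select(
        F.get_json_object("raw_json", "$.call_id").alias("call_id"),
        F.get_json_object("raw_json", "$.caller").alias("caller"),
        F.get_json_object("raw_json", "$.started_at").cast("timestamp").alias("started_at"),
    )

@dlt.table(comment="Daily call statistics (gold)")
def calls_gold():
    return (dlt.read("calls_silver")
            .groupBy(F.to_date("started_at").alias("call_date"))
            .agg(F.count("*").alias("total_calls")))
```

The call-back flag from step 5 would sit in the silver layer as a lag/window computation over caller and started_at; it is omitted here to keep the sketch short.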

Some of the questions I have are:
1. Should I ingest the raw JSON files into a bronze table without separating them out, or separate them before adding them to the bronze level?
2. My files live in an S3 bucket that is mounted to my Databricks workspace. Does this mean they are already at the "bronze" level, or do I need to copy them from the mounted directory into the DLT pipeline? (A quick check of the mount is sketched below.)
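For what it's worth, a quick way to confirm the mounted files are visible before pointing a pipeline at them; the mount point is hypothetical:

```python
# List the mounted directory from a notebook; "/mnt/call-center" is a
# hypothetical mount point standing in for the real one.
display(dbutils.fs.ls("/mnt/call-center/raw/"))
```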

