Administration & Architecture

Guidance for creating DLT pipeline with raw json call logs

chrisf_sts
New Contributor II

Hi, I am looking for some support on how to handle the following situation. I have a call center that generates call log files in JSON format that are sent to an S3 bucket. Some of the raw files contain more than one call log object and are not valid JSON, i.e. they are one large file with the objects placed back to back, and a single file may contain up to 8 of these objects.

I want to create a Delta Live Tables data processing pipeline that will:
1. ingest new files twice daily
2. separate the individual JSON objects into one file for each object
3. create a table with the JSON objects
4. clean the data to get rid of incomplete records
5. add some derived columns, such as flagging callers who hung up and immediately called back
6. produce aggregated statistics about the calls

Some of the questions I have are:
1. Should I ingest the raw JSON files into a bronze table without separating them out, or separate them before adding them to the bronze level?
2. My files live in an S3 bucket that is mounted to my Databricks workspace; does this mean they are already at the "bronze" level, or do I need to copy them from the mounted directory into the DLT pipeline?

1 ACCEPTED SOLUTION


Kaniz
Community Manager

Hi @chrisf_sts, handling call log files in JSON format and creating a Delta Live Tables data processing pipeline involves several steps.

 

Let’s break it down:

 

Ingestion:

  • You can ingest the raw JSON files directly into a bronze table without separating them. This approach keeps the raw data intact and defers further processing downstream (a minimal ingestion sketch follows this list).
  • Alternatively, you can separate the individual JSON objects before adding them to the bronze level. This would involve splitting the large JSON file into smaller files, each containing a single call log object.
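For the first option, a minimal bronze-table sketch might look like the following. It assumes a hypothetical mount point of /mnt/call-logs/raw and table name call_logs_bronze, that Auto Loader's text format honours the wholeText option, and that the _metadata.file_path column is available in your runtime; adjust names and paths to your environment.

```python
import dlt
from pyspark.sql import functions as F

# Hypothetical path -- adjust to your mounted bucket location.
RAW_PATH = "/mnt/call-logs/raw/"

@dlt.table(comment="Raw call log files, one row per file (bronze).")
def call_logs_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "text")
        .option("wholeText", "true")  # keep each file as a single row
        .load(RAW_PATH)
        .select(
            F.col("value").alias("raw_json"),
            F.col("_metadata.file_path").alias("source_file"),  # assumes _metadata is available
            F.current_timestamp().alias("ingested_at"),
        )
    )
```

Reading each file as a single row keeps the concatenated objects intact, so nothing is lost before you split them out downstream.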

Bronze Level:

  • If your S3 bucket is mounted to your Databricks workspace, the files are accessible from within the workspace. However, this doesn’t automatically place them in the “bronze” level.
  • To create a Delta Live Tables (DLT) pipeline, you’ll need to define the pipeline settings, including the source (your S3 bucket), transformations, and destination (tables).
  • Consider creating a DLT pipeline that reads the raw JSON files from the mounted directory and processes them according to your requirements; one way to split the concatenated objects is sketched below.
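Whichever layer you split at, one approach (sketched below, carrying over the hypothetical call_logs_bronze table from the ingestion sketch) is to walk each file's text with json.JSONDecoder.raw_decode and explode the result into one row per call log object:

```python
import json
import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

def split_concatenated_json(blob):
    """Split a string of back-to-back top-level JSON objects into a list of JSON strings."""
    decoder = json.JSONDecoder()
    objects, idx, blob = [], 0, (blob or "").strip()
    while idx < len(blob):
        try:
            obj, end = decoder.raw_decode(blob, idx)
        except ValueError:
            break  # stop at malformed trailing content instead of failing the stream
        objects.append(json.dumps(obj))
        idx = end
        while idx < len(blob) and blob[idx].isspace():
            idx += 1  # skip whitespace/newlines between objects
    return objects

split_udf = F.udf(split_concatenated_json, ArrayType(StringType()))

@dlt.table(comment="One row per call log object, still as a raw JSON string.")
def call_logs_bronze_split():
    return (
        dlt.read_stream("call_logs_bronze")
        .withColumn("call_json", F.explode(split_udf("raw_json")))
        .drop("raw_json")
    )
```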

Data Transformation:

  • Once ingested, you can create a table with the JSON objects. Define the schema for the table based on the call log structure.
  • Use DLT expectations to validate and clean the data. For example, filter out incomplete records or handle missing values.
  • Add derived columns, such as a flag identifying callers who hung up and immediately called back (see the sketch after this list).
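A silver-layer sketch along those lines is shown below. The schema, the 60-second callback threshold, and the column names are assumptions for illustration; replace them with the real structure of your call logs.

```python
import dlt
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from pyspark.sql.window import Window

# Hypothetical call log schema -- replace with the real structure of your logs.
call_schema = StructType([
    StructField("call_id", StringType()),
    StructField("caller_number", StringType()),
    StructField("start_time", TimestampType()),
    StructField("end_time", TimestampType()),
    StructField("disposition", StringType()),
])

@dlt.table(comment="Parsed, validated call logs with derived columns (silver).")
@dlt.expect_or_drop("has_call_id", "call_id IS NOT NULL")
@dlt.expect_or_drop("has_timestamps", "start_time IS NOT NULL AND end_time IS NOT NULL")
def call_logs_silver():
    parsed = (
        dlt.read("call_logs_bronze_split")
        .select(F.from_json("call_json", call_schema).alias("c"), "source_file")
        .select("c.*", "source_file")
    )
    # Flag callers whose new call starts within 60 seconds of their previous call ending.
    w = Window.partitionBy("caller_number").orderBy("start_time")
    prev_end = F.lag("end_time").over(w)
    return parsed.withColumn(
        "immediate_callback",
        F.when(F.col("start_time").cast("long") - prev_end.cast("long") <= 60, True)
         .otherwise(False),
    )
```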

Aggregated Statistics:

  • To produce aggregated statistics about the calls, create another table (e.g., a “summary” or gold table) where you aggregate relevant metrics.
  • You can use SQL queries or PySpark aggregations to calculate statistics such as call duration, call frequency, etc.; a sketch of such a summary table follows this list.
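For example, a gold-level daily summary built on the hypothetical silver table above might look like this:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Daily call statistics (gold).")
def call_stats_daily():
    calls = dlt.read("call_logs_silver")
    return (
        calls.groupBy(F.to_date("start_time").alias("call_date"))
        .agg(
            F.count("*").alias("total_calls"),
            F.countDistinct("caller_number").alias("unique_callers"),
            F.avg(
                F.col("end_time").cast("long") - F.col("start_time").cast("long")
            ).alias("avg_duration_seconds"),
            F.sum(F.col("immediate_callback").cast("int")).alias("immediate_callbacks"),
        )
    )
```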

Remember that DLT provides a user-friendly interface for configuring pipelines, but you can also define and manage them in code from notebooks. Familiarize yourself with the UI and explore the available features to tailor your pipeline to your specific use case.

 

Feel free to ask if you need further assistance or have additional questions!


