Hi @chrisf_sts, handling call log files in JSON format and building a Delta Live Tables (DLT) pipeline around them involves several steps.
Let’s break it down:
Ingestion:
- You can ingest the raw JSON files directly into a bronze table without separating them. This approach allows you to keep the raw data intact and perform further processing downstream.
- Alternatively, you can separate the individual JSON objects before adding them to the bronze level. This would involve splitting the large JSON file into smaller files, each containing a single call log object. Either way, a quick sanity-check read (sketched after this list) helps confirm how Spark parses your files.
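To decide between the two approaches, it can help to inspect how Spark parses one of the raw files first. Here is a minimal sketch, assuming each file is a single JSON array of call objects; the path and file name are placeholders for your mounted bucket:

```python
# Quick sanity check outside the pipeline: how does Spark parse one raw file?
# In a Databricks notebook `spark` is already provided; the builder line just
# keeps the snippet self-contained.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_df = (
    spark.read
    .option("multiLine", "true")  # required when one file holds a JSON array of many objects
    .json("/mnt/call-logs/raw/sample_call_log.json")  # placeholder path
)

raw_df.printSchema()            # confirm the inferred call log fields
raw_df.show(5, truncate=False)  # eyeball a few records
```

If the inferred schema looks right, you can usually point the pipeline at the whole directory without splitting the files.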
Bronze Level:
- If your S3 bucket is mounted to your Databricks workspace, the files are accessible from within the workspace. However, this doesn’t automatically place them in the “bronze” level.
- To create a Delta Live Tables (DLT) pipeline, you’ll need to define the pipeline settings, including the source (your S3 bucket), transformations, and destination (tables).
- Consider creating a DLT pipeline that reads the raw JSON files from the mounted directory and processes them according to your requirements; a sketch of such a bronze table follows this list.
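For the bronze table itself, a common pattern is to let Auto Loader pick up new files incrementally. This is a minimal sketch, assuming the mounted path /mnt/call-logs/raw/ and multi-line JSON files; both are assumptions to adjust to your setup:

```python
import dlt
from pyspark.sql.functions import current_timestamp, input_file_name

# `spark` is provided by the DLT runtime; the table name and path are placeholders.
@dlt.table(
    name="call_logs_bronze",
    comment="Raw call log JSON ingested as-is from the mounted S3 bucket.",
)
def call_logs_bronze():
    return (
        spark.readStream.format("cloudFiles")          # Auto Loader: incremental file discovery
        .option("cloudFiles.format", "json")
        .option("multiLine", "true")                   # one file = one JSON array of call objects
        .load("/mnt/call-logs/raw/")
        .withColumn("ingest_time", current_timestamp())
        .withColumn("source_file", input_file_name())  # keep lineage back to the raw file
    )
```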
Data Transformation:
- Once ingested, you can create a table from the parsed JSON objects. Define the schema for the table based on the call log structure.
- Use DLT expectations to validate and clean the data. For example, filter out incomplete records or handle missing values.
- Add derived columns, such as a flag identifying callers who hung up and immediately called back (see the sketch after this list).
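Here is one way a silver table could combine expectations with that derived flag. The column names (caller_id, call_start, call_end, duration_seconds) and the 60-second call-back window are assumptions about your call log schema, not fixed requirements:

```python
import dlt
from pyspark.sql import functions as F
from pyspark.sql.window import Window

@dlt.table(
    name="call_logs_silver",
    comment="Cleaned call logs with an immediate-call-back flag.",
)
@dlt.expect_or_drop("valid_caller", "caller_id IS NOT NULL")
@dlt.expect_or_drop("valid_duration", "duration_seconds >= 0")
def call_logs_silver():
    calls = dlt.read("call_logs_bronze")

    # Compare each call's start time with the same caller's previous call end time;
    # flag it when the gap is 60 seconds or less (an assumed threshold).
    w = Window.partitionBy("caller_id").orderBy("call_start")
    return (
        calls
        .withColumn("prev_call_end", F.lag("call_end").over(w))
        .withColumn(
            "immediate_callback",
            F.coalesce(
                (F.col("call_start").cast("long") - F.col("prev_call_end").cast("long")) <= 60,
                F.lit(False),  # first call per caller has no previous call to compare against
            ),
        )
    )
```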
Aggregated Statistics:
- To produce aggregated statistics about the calls, create another table (e.g., a “summary” table) where you aggregate relevant metrics.
- You can use SQL or PySpark aggregations to calculate statistics like average call duration, call counts per caller, etc., and DLT expectations to validate the inputs. A sketch of such a summary table follows this list.
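As a sketch, a gold-level summary table built on the silver table above might look like this (again, the column names are assumptions):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(
    name="call_stats_gold",
    comment="Daily per-caller call statistics.",
)
def call_stats_gold():
    calls = dlt.read("call_logs_silver")
    return (
        calls
        .groupBy(F.to_date("call_start").alias("call_date"), "caller_id")
        .agg(
            F.count("*").alias("call_count"),
            F.avg("duration_seconds").alias("avg_duration_seconds"),
            F.sum(F.col("immediate_callback").cast("int")).alias("immediate_callbacks"),
        )
    )
```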
Remember that DLT provides a user-friendly interface for configuring pipelines, but you can also work in code, defining the tables in notebooks with the DLT Python or SQL APIs as sketched above. Familiarize yourself with the UI and explore the available features to tailor your pipeline to your specific use case.
Feel free to ask if you need further assistance or have additional questions!