07-11-2025 12:47 AM
How AI-powered development accelerated my data engineering workflow
🎬 Watch the Complete Development Process
📺 YouTube Video: See the entire 30-minute development session
This is a screen recording without voice narration showing the complete development process, including real debugging, problem-solving, and deployment with Amazon Q Developer and Databricks Free Edition.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🚀 The Challenge
As a Senior Data Engineer, I needed to quickly build a streaming Delta Live Tables (DLT) pipeline that could read JSON data and write it to a bronze layer table. Instead of spending hours researching documentation and writing boilerplate code, I decided to leverage Amazon Q Developer to accelerate the development process.
The Goal: Create a production-ready streaming DLT pipeline with comprehensive monitoring and data quality validation.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
💬 The Conversation That Started It All
Here's the exact prompt I used to kick off the project:
My Prompt: "You are senior data engineer, your job is to write streaming DLT pipeline that will read data from a json file and write it to the bronze layer table. use DLT SQL, put the notebook into the src folder, add pipeline into resources folder."
What happened next was remarkable. Amazon Q Developer didn't just provide a basic solution—it delivered a comprehensive, production-ready streaming pipeline architecture.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🛠️ What Amazon Q Developer Built
Within minutes, Q Developer had created:
📁 File Structure
src/
├── bronze_streaming_pipeline.sql # Basic streaming pipeline
├── custom_bronze_streaming.sql # Production-ready pipeline with advanced features
└── dlt_pipeline.ipynb # Original pipeline
resources/
├── bronze_streaming.pipeline.yml # Basic pipeline config
├── custom_bronze_streaming.pipeline.yml # Advanced pipeline config
├── dlt_101.pipeline.yml # Original config
└── dlt_101.job.yml # Job configuration
🎯 Key Features Delivered
• Streaming ingestion with continuous processing
• Data quality validation with built-in constraints
• Comprehensive monitoring with quality metrics
• Unity Catalog compatibility
• Production-ready configurations
• Alerting system
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔥 The Code That Emerged
Here's a snippet of the streaming DLT SQL that Q Developer generated:
CREATE OR REFRESH STREAMING LIVE TABLE bronze_json_raw (
  -- Data quality constraints
  CONSTRAINT valid_data EXPECT (_rescued_data IS NULL OR _rescued_data = '') ON VIOLATION DROP ROW,
  CONSTRAINT non_empty_record EXPECT (
    DOLocationID IS NOT NULL OR
    PULocationID IS NOT NULL OR
    fare_amount IS NOT NULL OR
    total_amount IS NOT NULL
  ) ON VIOLATION DROP ROW
)
COMMENT "Bronze layer: Raw JSON data with streaming ingestion and data quality validation"
TBLPROPERTIES (
  "quality" = "bronze",
  "layer" = "bronze",
  "delta.autoOptimize.optimizeWrite" = "true",
  "delta.autoOptimize.autoCompact" = "true"
)
AS SELECT
  *,
  current_timestamp() as ingestion_timestamp,
  _metadata.file_path as source_file_name,
  regexp_extract(_metadata.file_path, '([^/]+)$', 1) as file_name,
  'streaming_dlt_pipeline' as ingestion_method,
  date(current_timestamp()) as ingestion_date
FROM STREAM(
  read_files(
    "/databricks-datasets/nyctaxi/sample/json/",
    format => "json",
    header => "false",
    multiLine => "true",
    rescuedDataColumn => "_rescued_data"
  )
);
What impressed me most: Q Developer included advanced features like data quality constraints, metadata enrichment, and auto-optimization—things that would typically require extensive experience to implement correctly.
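For readers new to DLT expectations, the constraint syntax supports three violation actions: warn (the default), drop the row, or fail the update. The sketch below is illustrative only; the table name and the pickup-time column are my assumptions and are not part of the generated pipeline:
-- Illustrative only: the three expectation actions DLT SQL supports.
-- Table name and pickup-time column are assumptions, not from the generated pipeline.
CREATE OR REFRESH STREAMING LIVE TABLE bronze_expectation_demo (
  -- Default: keep the offending row, but record the violation in pipeline metrics
  CONSTRAINT has_pickup_time EXPECT (tpep_pickup_datetime IS NOT NULL),
  -- DROP ROW: silently discard offending rows (the action used in the generated pipeline)
  CONSTRAINT valid_fare EXPECT (fare_amount >= 0) ON VIOLATION DROP ROW,
  -- FAIL UPDATE: abort the pipeline update when any row violates the constraint
  CONSTRAINT non_negative_total EXPECT (total_amount >= 0) ON VIOLATION FAIL UPDATE
)
AS SELECT * FROM STREAM(LIVE.bronze_json_raw);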
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🚧 Challenges Encountered
Challenge 1: Deployment Configuration Issues
The Error:
Warning: unknown field: expectations
Error: expected a file for "resources.pipelines.bronze_streaming_pipeline.libraries[0].file.path" but got a notebook
My Prompt: "databricks bundle deploy" (I simply ran the command and shared the error)
Q Developer's Solution:
Q Developer immediately identified two issues:
1. The expectations field wasn't valid in pipeline configuration
2. SQL files needed to be referenced as notebook: instead of file:
It automatically fixed both configuration files:
libraries:
  - notebook:  # Changed from 'file:'
      path: ../src/bronze_streaming_pipeline.sql
Challenge 2: Unity Catalog Compatibility
The Error:
The command(s): input_file_name are not supported in Unity Catalog.
Please use _metadata.file_path instead.
My Prompt: I simply pasted the error message.
Q Developer's Response:
Without any additional context, Q Developer:
1. Identified the root cause: Unity Catalog doesn't support input_file_name()
2. Provided the fix: Replace with _metadata.file_path
3. Updated both pipeline files automatically
4. Redeployed the solution
-- Before (causing error)
input_file_name() as source_file_name
-- After (Unity Catalog compatible)
_metadata.file_path as source_file_name
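As an aside, the _metadata column exposes more than the file path, so richer lineage can be captured the same way. A minimal sketch follows; the aliases are my own, and _metadata.file_name could also replace the regexp_extract used in the pipeline above:
SELECT
  *,
  _metadata.file_path AS source_file_path,                  -- full path to the source file
  _metadata.file_name AS source_file_name,                  -- file name only
  _metadata.file_size AS source_file_size_bytes,            -- size in bytes
  _metadata.file_modification_time AS source_file_modified  -- last-modified timestamp
FROM STREAM(
  read_files(
    "/databricks-datasets/nyctaxi/sample/json/",
    format => "json"
  )
);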
Challenge 3: Critical SQL Syntax Error - The struct(*) Wildcard Issue
The Error:
"class_name": "org.apache.spark.sql.catalyst.ExtendedAnalysisException",
"message": "[INVALID_USAGE_OF_STAR_OR_REGEX] Invalid usage of '*' in CollectMetrics. SQLSTATE: 42000"
My Prompt: "i have an error in my custom pipeline: [error message]"
The Problem:
The original data quality constraint used a complex struct(*) pattern that Spark SQL's catalyst optimizer couldn't resolve:
CONSTRAINT non_empty_record EXPECT (size(map_keys(from_json(to_json(struct(*)), 'map<string,string>'))) > 0) ON VIOLATION DROP ROW
Q Developer's Analysis & Solution:
Q Developer immediately recognized this as a Spark SQL limitation and provided multiple solution approaches:
1. Root Cause Identification: The * wildcard cannot be used directly within the struct() function in this context
2. Recommended Solution: Simplify the approach with explicit field validation
3. Implementation: Replace complex struct validation with straightforward null checks
The Fix:
CONSTRAINT non_empty_record EXPECT (
  DOLocationID IS NOT NULL OR
  PULocationID IS NOT NULL OR
  fare_amount IS NOT NULL OR
  total_amount IS NOT NULL
) ON VIOLATION DROP ROW
Why This Solution is Better:
• Cleaner and more readable - easier to understand and maintain
• More reliable - uses standard DLT expectation patterns
• Easier to debug - clear, straightforward conditions
Challenge 4: Implementing Critical Quality Enforcement
The Requirement: "lets' dot his: Option C: Use Delta Live Tables Expectations... and update documentation and release notes"
Q Developer's Implementation:
Building on the existing monitoring system, Q Developer added a critical quality enforcement mechanism:
CREATE OR REFRESH LIVE TABLE bronze_quality_enforcement (
  CONSTRAINT critical_quality EXPECT (
    current_quality_percentage >= 95
  ) ON VIOLATION FAIL UPDATE
)
AS SELECT
  current_date() as enforcement_date,
  current_timestamp() as enforcement_timestamp,
  count(*) as total_records,
  count_if(_rescued_data IS NOT NULL AND _rescued_data != '') as rescued_records,
  round((count(*) - count_if(_rescued_data IS NOT NULL AND _rescued_data != '')) * 100.0 / count(*), 2) as current_quality_percentage,
  'QUALITY_CHECK_PASSED' as status
FROM LIVE.bronze_json_raw;
What This Achieves:
• Automatic Pipeline Failure: the update fails when the calculated quality percentage drops below 95%
• Built-in Notifications: DLT automatically sends email alerts on pipeline failures
• Quality Gate: Prevents bad data from progressing through the pipeline
• Real-time Enforcement: Quality checks happen during data ingestion
• Proper Column Reference: Constraint references calculated columns, not source table columns
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🎯 The Results
⚡ Development Speed
• Traditional approach: 4-6 hours of research, coding, and testing
• With Amazon Q Developer: 30 minutes from prompt to production-ready pipeline
🏗️ Architecture Quality
Q Developer delivered enterprise-grade features:
• Comprehensive error handling
• Data quality monitoring
• Production-ready configurations
• Complete documentation
• Unity Catalog compatibility
📚 Documentation Excellence
Beyond just code, Q Developer created:
• Detailed usage instructions
• Configuration guides
• Troubleshooting documentation
• Release notes
• Best practices recommendations
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
💡 Key Insights from This Experience
1. Prompt Engineering Matters
Starting with a clear, role-based prompt ("You are a senior data engineer...") set the context for production-quality output.
2. Error-Driven Development
When errors occurred, simply sharing the error message was enough for Q Developer to provide targeted fixes. This was especially evident with the struct(*) syntax error—Q Developer immediately identified the Spark SQL limitation and provided multiple solution approaches.
3. Iterative Improvement
Each interaction built upon the previous work, creating increasingly sophisticated solutions. The progression from basic pipeline → Unity Catalog compatibility → SQL syntax fixes shows how AI-assisted development can evolve rapidly.
4. Context Awareness
Q Developer understood the project structure and maintained consistency across multiple files and configurations, even when fixing complex SQL syntax issues.
5. Solution Quality Over Quick Fixes
Rather than providing a minimal fix for the struct(*) error, Q Developer recommended the most maintainable and performant solution, explaining why simpler approaches are often better than complex ones.
6. Complete Problem Resolution
Q Developer didn't just fix the code; it guided me through the entire deployment process, from code changes to successful pipeline execution, ensuring the solution actually worked in production.
7. Iterative Feature Enhancement
When asked to implement critical quality enforcement, Q Developer seamlessly integrated the new feature with existing monitoring infrastructure, updated all documentation, and provided comprehensive explanations of the benefits.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🔧 Technical Deep Dive
What Made This Special
Streaming Architecture:
JSON Files → Auto-Ingestion → Bronze Tables → Quality Monitoring → Alerts
Data Quality Pipeline:
• Constraint-based validation for malformed records
• Quality percentage tracking with automated alerts (see the event log sketch after this list)
• Metadata enrichment for complete data lineage
• Monitoring dashboards for operational visibility
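The per-expectation pass/fail counts behind these metrics are recorded in the pipeline's event log. Here is a minimal sketch of how they could be pulled out, assuming a Unity Catalog pipeline whose target table is main.default.bronze_json_raw (substitute your own catalog and schema):
SELECT
  exp.name                AS expectation,
  SUM(exp.passed_records) AS passed_records,
  SUM(exp.failed_records) AS failed_records
FROM (
  SELECT explode(
    from_json(
      details:flow_progress.data_quality.expectations,
      'array<struct<name: string, dataset: string, passed_records: bigint, failed_records: bigint>>'
    )
  ) AS exp
  FROM event_log(TABLE(main.default.bronze_json_raw))
  WHERE event_type = 'flow_progress'
)
GROUP BY exp.name;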
Production Features:
• Serverless compute for cost efficiency
• Auto-optimization for performance
• Unity Catalog integration for governance
• Environment-specific configurations for dev/prod
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🚀 Try It Yourself
Repository
Check out the complete implementation: dlt_101 Streaming Pipeline Project
Quick Start
# Clone the repository
git clone https://github.com/cloud-data-engineer/data.git
cd dlt_101
# Deploy the streaming pipeline
databricks bundle deploy --target dev
# Run the pipeline
databricks bundle run custom_bronze_streaming_pipeline
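Once the run completes, a quick sanity check against the bronze table might look like this (the three-level name is a placeholder; point it at whatever catalog and schema your bundle targets):
-- Placeholder catalog/schema; match it to your bundle's target
SELECT
  ingestion_date,
  count(*)                  AS records_ingested,
  count(DISTINCT file_name) AS source_files
FROM main.default.bronze_json_raw
GROUP BY ingestion_date
ORDER BY ingestion_date DESC;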
Key Files to Explore
• src/custom_bronze_streaming.sql - Production-ready streaming pipeline with advanced features
• resources/custom_bronze_streaming.pipeline.yml - Pipeline configuration
• STREAMING_PIPELINES.md - Comprehensive usage guide
• RELEASE_NOTES.md - Complete feature documentation
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🎯 Lessons Learned
1. AI-powered tooling can dramatically accelerate development
2. Clear prompts lead to better, more targeted solutions
3. Error messages become learning opportunities with AI assistance
4. Production quality is achievable from the first iteration
5. Complex doesn't mean better - Q Developer consistently recommended simpler, more maintainable solutions over complex ones
6. End-to-end problem solving - AI assistance extends beyond code generation to include deployment and troubleshooting
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
💬 Final Thoughts
Amazon Q Developer didn't just help me write code—it became my pair programming partner throughout the entire development lifecycle. From initial pipeline creation to complex SQL syntax debugging, Q Developer consistently provided:
• Rapid problem diagnosis - Immediately identifying root causes of complex errors
• Multiple solution approaches - Offering various options with clear trade-offs
• Best practice guidance - Recommending maintainable solutions over quick fixes
• Complete problem resolution - Following through from code fix to successful deployment
Key Takeaway: AI-powered development isn't just about speed—it's about elevating the quality and maintainability of solutions while reducing the cognitive load of debugging complex systems.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
About This Project
• Technology Stack: Databricks, Delta Live Tables, Unity Catalog
• Development Time: 30 minutes with Amazon Q Developer
• Lines of Code: 200+ lines of production-ready SQL and YAML
• Documentation: 5 comprehensive markdown files
• Features: Streaming ingestion, data quality monitoring, automated alerting
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Built with ❤️ and Databricks Free Edition
07-21-2025 02:09 AM
This is super insightful @Pat, thanks for sharing this with the Community!