Hi Community,
As a senior data engineer migrating ETL workloads to Databricks (with Unity Catalog and Delta Lake), I'm building a cost-effective pipeline to ingest data from a REST API. Goals: minimize DBU costs, handle incremental loads, ensure scalability, and follow medallion architecture (bronze/silver/gold).
Current thinking:
Use Python notebooks/Workflows with requests/asyncio for parallel API calls, writing raw JSON/Parquet to ADLS (abfss:// paths).
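For the extraction step, here's roughly what I'm sketching — the endpoint, the `?page=N` pagination contract, and the "empty results list means done" convention are placeholders for our actual API, and in production I'd fan pages out with asyncio or a thread pool rather than loop sequentially:

```python
import json
import time
import urllib.error
import urllib.request

API_URL = "https://api.example.com/v1/events"  # hypothetical endpoint

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff delay in seconds for retry `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))

def fetch_page(url, page, attempt=0, max_retries=5):
    """Fetch one page of results, retrying with backoff on HTTP 429."""
    try:
        with urllib.request.urlopen(f"{url}?page={page}") as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 429 and attempt < max_retries:
            # Honor Retry-After if the API sends it, else back off exponentially.
            delay = float(err.headers.get("Retry-After", backoff_delay(attempt)))
            time.sleep(delay)
            return fetch_page(url, page, attempt + 1, max_retries)
        raise

def fetch_all(url, max_pages=100):
    """Page until the API returns an empty result list (assumed contract)."""
    out = []
    for page in range(1, max_pages + 1):
        rows = fetch_page(url, page).get("results", [])
        if not rows:
            break
        out.extend(rows)
    return out

if __name__ == "__main__":
    records = fetch_all(API_URL)
    # On Databricks I'd then land the raw JSON in the bronze path, e.g.:
    # dbutils.fs.put("abfss://bronze@<account>.dfs.core.windows.net/raw/batch.json",
    #                json.dumps(records))
```

Is recursive retry-with-backoff like this reasonable at scale, or do people prefer a token-bucket limiter in front of the pool?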
Auto Loader or Structured Streaming for incremental bronze ingestion.
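My current Auto Loader sketch for bronze looks like the below — the landing-zone path, schema location, and table name are placeholders, and I'm leaning on `trigger(availableNow=True)` so the job drains the backlog and stops, which seems like the cheapest fit for sporadic loads:

```python
# Auto Loader options for a JSON landing zone; the paths are placeholders.
AUTOLOADER_OPTIONS = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": "abfss://bronze@<account>.dfs.core.windows.net/_schemas/api",
    "cloudFiles.inferColumnTypes": "true",
}

def start_bronze_ingest(spark, source_path, target_table, checkpoint_path):
    """Incrementally load new files from the landing zone into a bronze Delta table.

    trigger(availableNow=True) processes everything new and then stops, so
    compute only runs while there is data to ingest.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**AUTOLOADER_OPTIONS)
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

Does schema inference plus `schemaLocation` behave well when the API adds fields, or should I pin an explicit schema in bronze?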
DLT pipelines with serverless compute for transformations (Databricks cites up to 5x better price-performance from serverless incremental processing).
Optimize with autoscaling, auto-termination, and Predictive Optimization.
Challenges:
Rate limiting/pagination on the API.
Cost monitoring via system.billing tables.
Best cluster sizing for sporadic loads.
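On the cost-monitoring point, this is the kind of query I'm planning against `system.billing.usage` (system tables enabled in Unity Catalog); the job_id filter and 14-day window are just illustrative:

```python
# Sketch of per-job DBU tracking against the system.billing.usage table.
# Requires Unity Catalog system tables to be enabled; the filter values
# below are illustrative assumptions.
DBU_QUERY = """
    SELECT usage_date,
           sku_name,
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= current_date() - INTERVAL 14 DAYS
      AND usage_metadata.job_id = '{job_id}'
    GROUP BY usage_date, sku_name
    ORDER BY usage_date
"""

def job_dbu_usage(spark, job_id):
    """Return a DataFrame of daily DBU consumption for one job."""
    return spark.sql(DBU_QUERY.format(job_id=job_id))
```

Would you join this against `system.billing.list_prices` for dollar amounts, or is there a simpler built-in view for job-level cost?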
What's the most efficient approach? Direct API calls in Spark UDFs, external functions (e.g., Azure Functions landing to storage), or Lakeflow Declarative Pipelines? Any code samples or pitfalls from production pipelines?
Thanks! Sachi.