Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Cost-Effective Databricks Pipeline for API Ingestion - Best Practices?

TheDataMaverick
New Contributor

Hi Community,

As a senior data engineer migrating ETL workloads to Databricks (with Unity Catalog and Delta Lake), I'm building a cost-effective pipeline to ingest data from a REST API. Goals: minimize DBU costs, handle incremental loads, ensure scalability, and follow medallion architecture (bronze/silver/gold).

Current thinking:

  • Use Python notebooks/workflows with requests (or aiohttp with asyncio for concurrent calls) to query the API in parallel, writing raw JSON/Parquet to ADLS via abfss://.

  • Auto Loader or Structured Streaming for incremental bronze ingestion.

  • DLT pipelines with serverless compute for transformations (leveraging incremental processing, which Databricks advertises as delivering up to 5x better price-performance).

  • Optimize with autoscaling, auto-termination, and Predictive Optimization.
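The parallel-fetch-and-land step above can be sketched as follows. This is a minimal illustration only: `fetch` is a hypothetical async callable (in production it would wrap an aiohttp/httpx client), and the landing path would be an abfss:// URI rather than a local file.

```python
import asyncio
import json

async def fetch_all(urls, fetch, max_concurrency=8):
    """Fetch many API pages in parallel, bounded by a semaphore.

    `fetch` is any coroutine taking a URL and returning a dict
    (hypothetical -- plug in your aiohttp/httpx client here).
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with sem:  # cap in-flight requests to respect API limits
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

def write_landing(records, path):
    """Write raw records as JSON Lines -- the shape Auto Loader picks up
    with cloudFiles.format = json. In production `path` is an abfss:// URI."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

Bounding concurrency with a semaphore (rather than firing all requests at once) is what keeps this compatible with typical API rate limits.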

Challenges:

  • Rate limiting/pagination on API.

  • Cost monitoring via system.billing tables.

  • Best cluster sizing for sporadic loads.
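For the rate-limiting/pagination challenge, one common pattern is a cursor-driven page loop with exponential backoff on HTTP 429. A minimal sketch, assuming a hypothetical `fetch_page(cursor) -> (items, next_cursor)` contract -- adapt it to the real API's pagination scheme:

```python
import time

class RateLimitError(Exception):
    """Raised by fetch_page when the API returns HTTP 429."""

def paginate(fetch_page, max_retries=5, base_backoff_s=1.0):
    """Yield every item from a cursor-paginated API, retrying
    rate-limited pages with exponential backoff.

    `fetch_page(cursor)` returns (items, next_cursor);
    next_cursor is None on the last page.
    """
    cursor = None
    while True:
        for attempt in range(max_retries):
            try:
                items, cursor = fetch_page(cursor)
                break
            except RateLimitError:
                # 1s, 2s, 4s, ... between retries of the same page
                time.sleep(base_backoff_s * (2 ** attempt))
        else:
            raise RuntimeError("gave up after repeated 429 responses")
        yield from items
        if cursor is None:
            return
```

Because it is a generator, downstream code can stream pages to storage as they arrive instead of buffering the whole extract in memory.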

What's the most efficient approach? Direct API in Spark UDFs, external functions (e.g., Azure Functions to storage), or Lakeflow Declarative Pipelines? Any code samples or pitfalls from production pipelines?
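On the cost-monitoring point: DBU consumption can be pulled from the system.billing.usage system table (requires system tables to be enabled in Unity Catalog). A sketch of a daily-DBU-by-SKU query, run via spark.sql from a notebook or job:

```python
# Aggregate the last 30 days of DBU usage by day and SKU.
# Column names follow the documented system.billing.usage schema.
COST_QUERY = """
SELECT usage_date,
       sku_name,
       SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date >= current_date() - INTERVAL 30 DAYS
GROUP BY usage_date, sku_name
ORDER BY usage_date, dbus DESC
"""

def daily_dbu_by_sku(spark):
    """Run inside Databricks where `spark` is the active SparkSession."""
    return spark.sql(COST_QUERY)
```

Scheduling this as a cheap serverless SQL query and alerting on spikes is a common way to catch runaway cluster costs early.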

Thanks! Sachi.

1 ACCEPTED SOLUTION

Accepted Solutions

Pat
Esteemed Contributor

Hi @TheDataMaverick,
The most efficient approach for your REST API ingestion pipeline on Databricks is to use an external service like Azure Functions (or AWS Lambda) to handle API calls, then land raw JSON/Parquet in ADLS/S3 for Auto Loader ingestion into bronze.
After external API landing and bronze ingestion via Auto Loader, the next step for silver/gold can be either SDP (Spark Declarative Pipelines) or regular Spark jobs/workflows, depending on requirements. SDP suits declarative, incremental medallion transforms with auto-orchestration and serverless efficiency, while jobs fit custom/complex logic.
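The Auto Loader leg of this pattern can be sketched as below. Paths and table names are hypothetical -- adjust to your workspace; `trigger(availableNow=True)` keeps compute cost low for sporadic loads by draining all newly landed files and then stopping.

```python
# Hypothetical landing path and target table -- adjust to your workspace.
LANDING = "abfss://raw@yourstorage.dfs.core.windows.net/api/orders/"

def ingest_bronze(spark, landing_path=LANDING, table="main.bronze.orders"):
    """Incrementally load newly landed JSON files into a bronze Delta table.

    Call from a Databricks job; `spark` is the active SparkSession.
    """
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", landing_path + "_schema/")
        .option("cloudFiles.inferColumnTypes", "true")
        .load(landing_path)
        .writeStream
        .option("checkpointLocation", landing_path + "_checkpoint/")
        .trigger(availableNow=True)  # process-all-then-stop: cheapest for sporadic data
        .toTable(table)
    )
```

The checkpoint and schema locations are what make reruns incremental: already-ingested files are skipped, so the job can be scheduled as often as needed without reprocessing.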

