Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Ingest data from REST endpoint into Databricks

RodrigoE
New Contributor III

Hello,

I'm looking for the best option to retrieve between 1-1.5TB of data per day from a REST API into Databricks.

Thank you,

Rodrigo Escamilla

1 ACCEPTED SOLUTION

Accepted Solutions

Ashwin_DSA
Databricks Employee

Hi @RodrigoE,

A few more details would help narrow down the best option for your scenario:

  • Who owns the REST API?
  • Is that in your control? 
  • Can the source push data to Databricks, or should you pull on a schedule?

If the source can push the data, consider Zerobus. This is the cleanest, most scalable Databricks-native pattern if the producer is under your control.

If you have no control over the source, you can build a custom Python data source wrapping their REST API and run it as a Databricks job/stream. While the pattern will work for your volumes, the bottleneck is usually the API’s own throughput/limits, not Databricks.
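
To make the pull pattern concrete, here is a minimal sketch of cursor pagination with exponential backoff. The `items` / `next_cursor` field names, the retry count, and the backoff schedule are all assumptions; adapt them to the actual API's contract:

```python
import time
from typing import Callable, Iterator, Optional

def fetch_with_backoff(fetch: Callable[[], dict], max_retries: int = 5,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call `fetch`, retrying with exponential backoff on transient errors."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise
            sleep(2 ** attempt)  # 1s, 2s, 4s, ... between retries
    raise RuntimeError("unreachable")

def paginate(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[list]:
    """Walk a cursor-paginated endpoint until no cursor is returned.
    `items` / `next_cursor` are placeholder field names for whatever
    the real API calls them."""
    cursor: Optional[str] = None
    while True:
        page = fetch_with_backoff(lambda c=cursor: fetch_page(c))
        yield page.get("items", [])
        cursor = page.get("next_cursor")
        if cursor is None:
            return
```

Each yielded batch can then be written to a Delta table; the injectable `sleep` and `fetch` callables also make the retry logic easy to unit test.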

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

 

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***


4 REPLIES


rohan22sri
New Contributor III

Hi Rodrigo,

One simple approach I’ve used is calling the REST API directly from a Databricks notebook using standard Python libraries—no extra setup or tools required.

The idea is to keep it minimal: generate the API signature, call the endpoint, and load the response. Here’s a very simplified example:


import time
import hashlib
import requests

# Generate the API signature (MD5 over key + secret + current Unix time;
# adjust to match your provider's signing scheme)
def generate_signature(api_key: str, secret: str) -> str:
    raw = api_key + secret + str(int(time.time()))
    return hashlib.md5(raw.encode()).hexdigest()

# Call the API
def fetch_data():
    api_key = "<YOUR_API_KEY>"
    secret = "<YOUR_SECRET>"
    endpoint = "your-endpoint"

    sig = generate_signature(api_key, secret)
    url = f"https://api.example.com/v3/{endpoint}?apiKey={api_key}&sig={sig}"

    response = requests.get(url, timeout=60)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return response.json()

# Run
data = fetch_data()

That’s really all you need to get started. From there, you can store the data in DBFS or a table.

If you need more throughput, you can later add parallel calls or pagination—but for smaller payloads, this works well and is very easy to maintain.
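
For example, parallel calls can be sketched with a thread pool. The `fetch_page` helper here is hypothetical, standing in for a per-page version of `fetch_data` above:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_pages_parallel(fetch_page, pages, max_workers=8):
    """Fetch several pages concurrently; results come back in the
    same order as `pages`. `fetch_page` is a stand-in for a function
    that wraps the requests.get call for a single page."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, pages))
```

Since the work is I/O-bound (waiting on HTTP responses), threads are a reasonable first step before reaching for Spark-level parallelism.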

Best regards,
Rohan


Hi @rohan22sri,

This pattern is great for initial testing or low-volume pulls, but it won’t scale to the 1-1.5 TB/day @RodrigoE is targeting. A few reasons:

A single requests.get loop from one notebook driver will hit API and cluster limits long before you reach TB/day. You need partitioned/paginated reads and fan-out across workers (e.g., via mapInPandas, foreachBatch, or a Python Data Source), not a single-threaded client. At this volume, you must handle rate limits, exponential backoff, and idempotent retries systematically; bake that into a reusable ingestion component, not inline notebook code.
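
As an illustration of partitioned reads, one hedged sketch is to split the day into fixed time windows and hand each window spec to a Spark task (the window size and field names are illustrative, and assume the API supports time-range filters):

```python
from datetime import datetime, timedelta

def plan_partitions(start: datetime, end: datetime, window_minutes: int = 15):
    """Split [start, end) into fixed windows, one API query per window.
    Each spec can then be distributed to a Spark task (e.g. via
    mapInPandas over a DataFrame of specs) so fetches run in parallel
    across workers instead of on the driver."""
    specs, cursor = [], start
    step = timedelta(minutes=window_minutes)
    while cursor < end:
        specs.append({"from": cursor, "to": min(cursor + step, end)})
        cursor += step
    return specs
```

Keeping the planning logic separate from the fetching logic also makes backfills trivial: replay is just re-planning an old time range.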

Also, at daily TB scale you can’t keep re-pulling everything. You need a robust cursor strategy (timestamps/IDs), checkpointing, and the ability to replay/backfill safely.
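
A minimal sketch of cursor checkpointing (a local JSON file here purely for illustration; in production you would persist the cursor in a Delta table or rely on the stream's own checkpoint):

```python
import json
from pathlib import Path
from typing import Optional

def load_cursor(path: str) -> Optional[str]:
    """Return the last committed cursor, or None on first run."""
    p = Path(path)
    if not p.exists():
        return None
    return json.loads(p.read_text())["cursor"]

def commit_cursor(path: str, cursor: str) -> None:
    """Persist the cursor only after a batch has landed successfully,
    so a failed or rerun job resumes (and can safely replay) from the
    last committed point instead of re-pulling everything."""
    Path(path).write_text(json.dumps({"cursor": cursor}))
```

The key property is that the commit happens after the write, which gives you at-least-once delivery; dedup downstream (e.g. MERGE on a business key) then makes the pipeline effectively idempotent.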

Lastly, you’ll need scheduled workflows, monitoring (lag, error rate, API quota usage), and alerting. A one-off notebook with requests is hard to industrialise and support.

That’s why, if the producer can push, I would recommend writing data directly into Databricks via Zerobus Ingest, which is designed for high-throughput, push-based ingestion into Delta tables. For a pull model, build a custom Python data source for this REST API and run it as a Databricks job / structured stream, so Spark handles parallelism and retries. Your minimal requests example is still a useful starting point to validate auth and payload shape, but treat it as a spike, not the production architecture.

Another thing that I wanted to call out is the use of DBFS. For production ingestion at this scale we wouldn’t land the data in DBFS. DBFS is really a legacy workspace file system and best for scratch / notebooks, not for 1-1.5 TB/day of source data. For long-term pipelines you should consider landing into Unity Catalog volumes and Delta tables, so you get proper governance (row/column ACLs), lineage, discovery, and all the newer features (Lakeflow, Zerobus, Auto Loader, etc.) that don’t integrate with DBFS.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

@Ashwin_DSA I agree with your points. I’ve been using a similar Python-based approach in Databricks to download a few GBs of data, and it has worked reliably so far.

Rohan