Thursday
Hello,
I'm looking for the best option to retrieve 1-1.5 TB of data per day from a REST API into Databricks.
Thank you,
Rodrigo Escamilla
Thursday
Hi @RodrigoE,
It would be helpful to have additional information to recommend the best options for your scenario.
If the source can push the data, consider Zerobus. This is the cleanest, most scalable Databricks-native pattern if the producer is under your control.
If you have no control over the source, you can build a custom Python data source wrapping their REST API and run it as a Databricks job/stream. While the pattern will work for your volumes, the bottleneck is usually the API’s own throughput/limits, not Databricks.
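To make the pull option more concrete, here is a minimal sketch of a custom Python data source wrapping a paginated REST endpoint. It assumes a runtime that supports the PySpark Python Data Source API (Spark 4.0 / recent Databricks runtimes); the base URL, page count, schema, and lack of auth are placeholders for your actual API, not a finished implementation:

import requests
from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition

class RestApiReader(DataSourceReader):
    def __init__(self, options):
        self.base_url = options.get("baseUrl", "https://api.example.com/v3/records")
        self.num_pages = int(options.get("numPages", "100"))

    def partitions(self):
        # One partition per page so Spark fans the HTTP calls out across workers
        return [InputPartition(p) for p in range(self.num_pages)]

    def read(self, partition):
        # Each task pulls one page; add auth headers, rate-limit handling and retries here
        resp = requests.get(self.base_url, params={"page": partition.value}, timeout=60)
        resp.raise_for_status()
        for record in resp.json():
            yield (str(record.get("id")), str(record))

class RestApiDataSource(DataSource):
    @classmethod
    def name(cls):
        return "rest_api"

    def schema(self):
        return "id STRING, payload STRING"

    def reader(self, schema):
        return RestApiReader(self.options)

spark.dataSource.register(RestApiDataSource)
df = (spark.read.format("rest_api")
          .option("baseUrl", "https://api.example.com/v3/records")
          .option("numPages", "1000")
          .load())

Because the reader exposes one partition per page, Spark distributes the API calls across the cluster instead of pulling everything through the driver.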
If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.
Monday - last edited Monday
Hi Rodrigo,
One simple approach I’ve used is calling the REST API directly from a Databricks notebook using standard Python libraries—no extra setup or tools required.
The idea is to keep it minimal: generate the API signature, call the endpoint, and load the response. Here’s a very simplified example:
import time
import hashlib
import requests

# Generate API signature
def generate_signature(api_key, secret):
    raw = api_key + secret + str(int(time.time()))
    return hashlib.md5(raw.encode()).hexdigest()

# Call API
def fetch_data():
    api_key = "<YOUR_API_KEY>"
    secret = "<YOUR_SECRET>"
    endpoint = "your-endpoint"
    sig = generate_signature(api_key, secret)
    url = f"https://api.example.com/v3/{endpoint}?apiKey={api_key}&sig={sig}"
    response = requests.get(url)
    return response.json()

# Run
data = fetch_data()
That’s really all you need to get started. From there, you can store the data in DBFS or a table.
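For example (a rough sketch, assuming the response is a list of flat JSON records and using an illustrative table name), you could turn the result into a DataFrame and save it as a table:

from pyspark.sql import Row

# Convert the list of JSON records into a DataFrame and append it to a table
df = spark.createDataFrame([Row(**record) for record in data])
df.write.mode("append").saveAsTable("main.default.api_raw")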
If you need more throughput, you can later add parallel calls or pagination—but for smaller payloads, this works well and is very easy to maintain.
Best regards,
Rohan
Monday
Hi @rohan22sri,
This pattern is great for initial testing or low-volume pulls, but it won’t scale to the 1-1.5 TB/day @RodrigoE is targeting, for a few reasons:
A single requests.get loop on one notebook driver will hit API and cluster limits long before you reach TB/day. You need partitioned/paginated reads fanned out across workers (e.g., via mapInPandas, foreachBatch, or a Python Data Source), not a single-threaded client. At this volume you also have to handle rate limits, exponential backoff, and idempotent retries systematically, which means baking them into a reusable ingestion component rather than inline notebook code (see the sketch below).
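As a rough illustration of that fan-out pattern (the endpoint, parameters, and page count are placeholders, and the retry logic is deliberately simplistic): a DataFrame of page numbers is distributed across the workers, and each task calls the API with a basic exponential-backoff retry.

import time
import requests
import pandas as pd

def fetch_pages(batches):
    # Runs on the workers: each task receives a pandas DataFrame of page numbers
    for pdf in batches:
        rows = []
        for page in pdf["page"]:
            for attempt in range(5):
                resp = requests.get("https://api.example.com/v3/records",
                                    params={"page": int(page)}, timeout=60)
                if resp.status_code == 429:        # rate limited: back off, then retry
                    time.sleep(2 ** attempt)
                    continue
                resp.raise_for_status()
                rows.extend({"page": int(page), "payload": str(r)} for r in resp.json())
                break
        yield pd.DataFrame(rows, columns=["page", "payload"])

pages = spark.range(0, 10_000).withColumnRenamed("id", "page")
raw = pages.mapInPandas(fetch_pages, schema="page LONG, payload STRING")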
Also, at daily TB scale you can’t keep re-pulling everything. You need a robust cursor strategy (timestamps/IDs), checkpointing, and the ability to replay/backfill safely.
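One simple way to keep such a cursor (a sketch; the table and column names are made up, and the cursor table is assumed to already exist): store the last successfully ingested timestamp per source in a small Delta table, read it at the start of each run, and advance it only after the batch has been committed.

from pyspark.sql import functions as F

CURSOR_TABLE = "main.ops.api_ingest_cursor"   # assumed schema: (source STRING, last_ingested_ts TIMESTAMP)

def get_cursor(source):
    # Latest committed watermark; None on the very first run means a full backfill
    row = (spark.table(CURSOR_TABLE)
                .where(F.col("source") == source)
                .agg(F.max("last_ingested_ts").alias("ts"))
                .collect()[0])
    return row["ts"]

def advance_cursor(source, new_ts):
    # Call only after the batch has been written successfully, so a failed run can be replayed
    (spark.createDataFrame([(source, new_ts)], "source STRING, last_ingested_ts TIMESTAMP")
          .write.mode("append").saveAsTable(CURSOR_TABLE))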
Lastly, you’ll need scheduled workflows, monitoring (lag, error rate, API quota usage), and alerting. A one-off notebook with requests is hard to industrialise and support.
That's why I would recommend having the source push data directly into Databricks via Zerobus Ingest, which is designed for high-throughput, push-based ingestion into Delta tables, whenever the producer is under your control. If you have to pull, build a custom Python data source for this REST API and run it as a Databricks job / structured stream, so Spark handles parallelism and retries. Your minimal requests example is still useful as a starting point to validate auth and payload shape, but treat it as a spike, not the production architecture.
Another thing I wanted to call out is the use of DBFS. For production ingestion at this scale you shouldn’t land the data in DBFS: it is really a legacy workspace file system, best suited to scratch work and notebooks, not 1-1.5 TB/day of source data. For long-term pipelines, land the data in Unity Catalog volumes and Delta tables so you get proper governance (row/column ACLs), lineage, discovery, and the newer features (Lakeflow, Zerobus, Auto Loader, etc.) that don’t integrate with DBFS.
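As a rough sketch of that landing pattern (paths and table names are illustrative): have the ingestion job drop raw API responses as files into a UC volume, then let Auto Loader stream them incrementally into a bronze Delta table.

vol = "/Volumes/main/bronze/api_landing"        # illustrative UC volume path

(spark.readStream
      .format("cloudFiles")                      # Auto Loader
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", f"{vol}/checkpoints/schema")
      .load(f"{vol}/raw")                        # where the ingestion job writes raw API responses
      .writeStream
      .option("checkpointLocation", f"{vol}/checkpoints/stream")
      .trigger(availableNow=True)
      .toTable("main.bronze.api_raw"))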
If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.
Monday - last edited Monday
@Ashwin_DSA I agree with your points. I’ve been using a similar Python-based solution in Databricks to download a few GBs of data, and it has worked reliably so far.