Tuesday
Hi all,
I’ve been looking into the Python Data Source API and wanted to get some feedback from others who may be experimenting with it.
One of the more common challenges I run into is working with applications that expose APIs but don’t have out-of-the-box connectors.
In most cases, the pattern ends up being:
Write Python code to call the API
Handle authentication, pagination, and rate limits
Transform the response
Land the data
Schedule and maintain the pipeline
It works, but over time it turns into a collection of custom ingestion pipelines that are difficult to standardize.
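For concreteness, the pagination piece of that hand-rolled pattern usually looks something like the sketch below. Everything here is illustrative: `fetch_page` stands in for a real HTTP call, and the `items`/`nextPageToken` contract mirrors common REST APIs (Google's among them) rather than any specific client.

```python
def fetch_all(fetch_page):
    """Iterate every item from a paginated API.

    fetch_page(token) is assumed to return a dict like
    {"items": [...], "nextPageToken": "..."} -- a common REST
    pagination shape, not tied to any real API client.
    """
    token = None
    while True:
        page = fetch_page(token)
        # Emit this page's items, then follow the cursor if present.
        yield from page.get("items", [])
        token = page.get("nextPageToken")
        if not token:
            return
```

Each team ends up writing some variant of this loop, plus auth and retries around it, which is exactly the duplication in question.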
The Python Data Source API seems like an attempt to address this by allowing you to define a data source directly in Python and integrate it more natively into your workflows.
Conceptually, it feels like a step toward treating API-based data as a first-class source rather than something that always requires custom ingestion logic.
That said, I’m trying to better understand where this fits in practice.
For example, in a typical ingestion pattern, I might:
Land raw JSON from an API into storage
Use Auto Loader / streaming to process it into bronze
Continue through the medallion layers from there
With the Python Data Source API, it seems like you could instead define the source and write more directly into your pipeline.
So I’m curious how others are thinking about this:
In what scenarios would you use the Python Data Source API?
Are you using it in place of landing raw data first, or alongside existing ingestion patterns?
Does it simplify your pipelines, or just shift complexity elsewhere?
Appreciate any thoughts or examples others are willing to share.
Thanks!
Tuesday
Hi @beaglerot! Good question... I’ve been thinking about this as well.
In my view, the Python Data Source API really shines when you're building reusable ingestion patterns. If multiple teams are hitting the same API with slight variations, wrapping that logic into a single source definition centralizes concerns like authentication, pagination, retries, and schema handling.
For one-off pipelines, though, the overhead may not justify it. A well-structured notebook or ingestion script is often simpler and easier to debug.
Regarding medallion architecture, I’d still land raw first in most cases. The Python Data Source API doesn’t remove the need for a raw layer. Keeping the original payload is critical for reprocessing, auditing, and handling upstream changes. Where the API helps is in making the transition from raw to bronze more structured, especially when used within DLT pipelines.
I think one of the biggest practical benefits is standardizing error handling and retry logic: instead of duplicating it across pipelines, you define it once in the source abstraction. That said, it doesn't eliminate complexity so much as shift it, from distributed pipelines into a centralized abstraction layer. That makes sense at scale, but can be overkill for smaller use cases.
Out of curiosity, are you evaluating this specifically for DLT, or more general batch ingestion? That context can shift the trade-offs quite a bit.
Tuesday
My use case is for a personal project. I'm pulling all my contacts into Databricks from the Google People API. I don't have a huge list of contacts and they don't change very often, so using the Python Data Source API and landing the data directly into a bronze table is fine for me.
This is how I've implemented it for my purposes. (Note: the googlepeopledatasource class itself is not shown below.)
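Since the data source class itself isn't shown, here is a hedged sketch of the shape its read path might take: flattening a Google People API `people.connections.list` response into rows. The field paths (`connections`, `names[].displayName`, `emailAddresses[].value`) match the documented People API JSON layout; the helper name is illustrative.

```python
def connections_to_rows(payload):
    """Flatten a People API connections.list response into tuples.

    Assumes the documented People API shape: each person has optional
    "names" and "emailAddresses" lists. Helper name is illustrative.
    """
    for person in payload.get("connections", []):
        names = person.get("names", [])
        emails = person.get("emailAddresses", [])
        yield (
            person.get("resourceName"),
            names[0].get("displayName") if names else None,
            emails[0].get("value") if emails else None,
        )
```

A reader's `read()` method would call the API page by page and `yield from connections_to_rows(page)`.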
However, I want to know where and when it can be used against an API with frequently changing data with much more volume.
Would I use the API data source to write to a JSON file in a volume?
If so, how would I structure the data? One timestamped file per pull?
Would I then use autoloader and readStream to load that into Bronze?
Wednesday
Adding on to @edonaire's points, which are accurate.
@beaglerot , your contacts project is the right use case for the pattern you have. Small data, infrequent changes, direct read into bronze. That works. The real question you're asking is what happens when the data gets bigger and changes faster. Here's how I'd think about it.
There are two viable patterns, and the right one depends on what you need from your raw layer.
Option A: Direct to bronze via the Data Source API (no JSON landing zone)
If you're comfortable treating your bronze Delta table as the system of record, skip the intermediate JSON files entirely. Include the full API payload (or a raw_payload column) in your bronze table so you still have an immutable representation of what the API returned. Recovery and reprocessing come from Delta time travel and cloning rather than re-reading raw files.
from pyspark import pipelines as dp  # Declarative Pipelines; older DLT runtimes use `import dlt` and @dlt.table
from pyspark.sql.functions import current_timestamp

@dp.table(
    name="contacts_bronze",
    comment="Raw contacts from Google People API"
)
def contacts_bronze():
    return (
        spark.readStream
        .format("google_people")
        .option("scope", "google-api")
        .load()
        .withColumn("ingested_at", current_timestamp())
    )
This is the simplest mental model and fewest moving parts. It works well when your bottleneck is API throughput rather than storage, and when you're wiring into DLT or Workflows.
One note: this requires your custom data source to implement the SimpleDataSourceStreamReader class. If your source only supports batch reads, you'd use spark.read with a scheduled job instead of readStream.
Option B: Land raw JSON to a volume, then Auto Loader into bronze
If you need a file-level audit trail, expect upstream schema changes, or have multiple downstream teams consuming the same raw feed in different ways, land the data as files first.
from pyspark.sql.functions import current_timestamp, to_date

df = (
    spark.read.format("google_people")
    .option("scope", "google-api")
    .load()
    .withColumn("ingested_at", current_timestamp())
    .withColumn("ingest_date", to_date("ingested_at"))
)

(df.write
    .mode("append")
    .partitionBy("ingest_date")
    .format("json")
    .save("/Volumes/raw/google_people"))
Then point Auto Loader at that path:
bronze = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Auto Loader needs an explicit schema or a schema location for
    # inference and tracking; this path is illustrative.
    .option("cloudFiles.schemaLocation", "/Volumes/raw/_schemas/google_people")
    .load("/Volumes/raw/google_people")
)
A few things on file structure if you go this route:
Avoid one tiny file per pull. Aim for fewer, larger files (tens to hundreds of MB per batch) to keep Auto Loader and downstream queries efficient.
Partition by date using a path like /Volumes/raw/google_people/ingest_date=YYYY-MM-DD/ rather than encoding timestamps only in filenames.
Add ingestion metadata as columns: ingested_at, source_system, api_version, and optionally a batch_id for traceability.
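Those file-structure conventions can be captured in a small helper that builds the partitioned path and batch metadata for each pull. The `ingest_date=YYYY-MM-DD` layout follows the convention above; the helper name and the `source_system` value are illustrative.

```python
import uuid
from datetime import datetime, timezone

def batch_target(base_path, now=None):
    """Build the partitioned write path plus metadata for one batch.

    Follows the ingest_date=YYYY-MM-DD partition convention; batch_id
    gives each pull a traceable identity. Helper name is illustrative.
    """
    now = now or datetime.now(timezone.utc)
    meta = {
        "ingested_at": now.isoformat(),
        "ingest_date": now.strftime("%Y-%m-%d"),
        "batch_id": str(uuid.uuid4()),
        "source_system": "google_people",  # illustrative value
    }
    path = f"{base_path}/ingest_date={meta['ingest_date']}"
    return path, meta
```

You would attach `meta` as literal columns on the batch DataFrame before writing to the returned path.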
How to decide between them?
Use Option A when you want the simplest architecture and you're comfortable with Delta as your recovery mechanism. The question to ask yourself: "If something breaks downstream, can I reprocess from Delta time travel, or do I need the original API payloads sitting in storage?"
Use Option B when the answer to that question is "I need the raw payloads," or when you have strict audit requirements, or when you don't want to hit the API again to reprocess.
In both cases, the main win of the Python Data Source API is the same: standardizing the edge. One well-tested connector that handles auth, pagination, retries, and schema, instead of N slightly different notebooks all doing their own version of that logic. It doesn't replace your ingestion architecture. It standardizes the extraction layer that feeds into it.
Cheers, Lou
Wednesday
Thanks, @Louis_Frolio! This is very helpful...
I especially liked the framing around the role of the raw layer. That makes the decision much clearer:
Option A: if Delta is enough as the recovery mechanism
Option B: if file-level auditability and reprocessing from original payloads are required
To me, that reinforces the idea that the Python Data Source API is mostly about standardizing extraction, auth, pagination, retries, and schema handling at the edge, not replacing ingestion architecture decisions.
Really appreciate you building on this.