Python Data Source API — worth using?

beaglerot
Databricks Partner

Hi all,

I’ve been looking into the Python Data Source API and wanted to get some feedback from others who may be experimenting with it.

One of the more common challenges I run into is working with applications that expose APIs but don’t have out-of-the-box connectors.

In most cases, the pattern ends up being:

  • Write Python code to call the API

  • Handle authentication, pagination, and rate limits

  • Transform the response

  • Land the data

  • Schedule and maintain the pipeline
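For anyone who hasn't written one of these by hand, the pagination and rate-limit steps above usually end up looking something like the rough sketch below. It assumes a cursor-paginated API; `fetch_page`, `RateLimited`, and the `items` / `next` response keys are hypothetical names for illustration, not part of any real client library.

```python
import time
from typing import Callable, Iterator, Optional


class RateLimited(Exception):
    """Hypothetical error a page fetcher raises when the API answers HTTP 429."""

    def __init__(self, retry_after: float = 1.0):
        super().__init__("rate limited")
        self.retry_after = retry_after


def iter_records(
    fetch_page: Callable[[Optional[str]], dict],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield every record from a cursor-paginated API, backing off on rate limits.

    `fetch_page(cursor)` is assumed to return {"items": [...], "next": cursor_or_None}.
    """
    cursor = None
    while True:
        for attempt in range(max_retries + 1):
            try:
                page = fetch_page(cursor)
                break
            except RateLimited as exc:
                if attempt == max_retries:
                    raise  # give up after the last allowed retry
                # Exponential backoff scaled by the server's suggested delay.
                sleep(exc.retry_after * (2 ** attempt))
        yield from page["items"]
        cursor = page.get("next")
        if cursor is None:
            return  # no further pages
```

Passing `sleep` in as a parameter is only there so the backoff can be exercised without actually waiting; a real pipeline would also need authentication and error handling on top of this.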

It works, but over time it turns into a collection of custom ingestion pipelines that are difficult to standardize.

The Python Data Source API seems like an attempt to address this by letting you define a custom data source in Python and read from it through the standard Spark reader interface (spark.read.format(...)), so it integrates more natively into your workflows.

Conceptually, it feels like a step toward treating API-based data as a first-class source rather than something that always requires custom ingestion logic.
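To make that concrete, here is a minimal sketch of what a batch source might look like, as far as I understand the API. It uses the `DataSource` and `DataSourceReader` base classes from `pyspark.sql.datasource` (PySpark 4.0+). Everything else is an assumption for illustration: the `example_api` name, the `endpoint` option, the two-column schema, and `fetch_page`, which returns canned pages in place of a real HTTP call.

```python
from typing import Iterator, Optional, Tuple

try:
    # Available in PySpark 4.0+.
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Fallback so the reader logic below can still be run without PySpark.
    DataSource = DataSourceReader = object

# Canned pages standing in for a real paginated API (hypothetical data).
_FAKE_PAGES = {
    None: {"items": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}], "next": "p2"},
    "p2": {"items": [{"id": 3, "name": "c"}], "next": None},
}


def fetch_page(endpoint: str, cursor: Optional[str]) -> dict:
    """Hypothetical stand-in for an authenticated HTTP GET against `endpoint`."""
    return _FAKE_PAGES[cursor]


class ExampleApiReader(DataSourceReader):
    def __init__(self, options: dict):
        self.endpoint = options.get("endpoint", "")

    def read(self, partition) -> Iterator[Tuple[int, str]]:
        # Walk the pages and yield one tuple per row, matching the schema below.
        cursor = None
        while True:
            page = fetch_page(self.endpoint, cursor)
            for item in page["items"]:
                yield (item["id"], item["name"])
            cursor = page["next"]
            if cursor is None:
                break


class ExampleApiDataSource(DataSource):
    @classmethod
    def name(cls) -> str:
        return "example_api"  # short name used in spark.read.format(...)

    def schema(self) -> str:
        return "id int, name string"

    def reader(self, schema) -> ExampleApiReader:
        return ExampleApiReader(self.options)
```

With a Spark session available, this would be registered with `spark.dataSource.register(ExampleApiDataSource)` and read with `spark.read.format("example_api").option("endpoint", "...").load()`. Note that the authentication, pagination, and rate-limit logic still has to live somewhere, here inside `read`.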

That said, I’m trying to better understand where this fits in practice.

For example, in a typical ingestion pattern, I might:

  • Land raw JSON from an API into storage

  • Use Auto Loader / streaming to process it into bronze

  • Continue through the medallion layers from there
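For comparison, the "land raw JSON" step in that pattern is typically something small like the sketch below: write each batch as newline-delimited JSON under a date-partitioned folder and let Auto Loader pick up new files from there. The `land_raw_json` helper, the `ingest_date=` folder layout, and the file naming are my own assumptions, and it writes to a local path rather than cloud storage.

```python
import json
from datetime import date
from pathlib import Path
from typing import Iterable


def land_raw_json(
    records: Iterable[dict], root: Path, batch_id: str, ingest_date: date
) -> Path:
    """Write one batch of raw API records as newline-delimited JSON.

    Files land under <root>/ingest_date=YYYY-MM-DD/<batch_id>.json.
    """
    target_dir = root / f"ingest_date={ingest_date.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{batch_id}.json"
    with target.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")  # one JSON object per line
    return target
```

The appeal of this step is that the raw payload is kept untouched, so bronze can be rebuilt later without calling the API again.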

With the Python Data Source API, it seems like you could instead define the source once and read from it directly into your pipeline, without a separate step to land the raw files first.

So I’m curious how others are thinking about this:

  • In what scenarios would you use the Python Data Source API?

  • Are you using it in place of landing raw data first, or alongside existing ingestion patterns?

  • Does it simplify your pipelines, or just shift complexity elsewhere?

Appreciate any thoughts or examples others are willing to share.

Thanks!