03-31-2026 07:46 AM - edited 03-31-2026 07:46 AM
Hi @beaglerot! Good question... I’ve been thinking about this as well.
In my view, the Python Data Source API really shines when you're building reusable ingestion patterns. If multiple teams are hitting the same API with slight variations, wrapping that logic into a single source definition centralizes concerns like authentication, pagination, retries, and schema handling.
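To make the "single source definition" idea concrete, here is a rough sketch of a reader that owns pagination in one place. Everything in it is hypothetical: `PaginatedReader`, `fake_fetch`, and the `{"items": ..., "next": ...}` page shape are made up for illustration and are not part of any real API. It is plain Python so it runs anywhere; in an actual custom source, I'd expect a loop like this to live inside the reader's `read` method.

```python
from typing import Any, Callable, Dict, Iterator, Optional, Tuple

# Assumed page shape (hypothetical): {"items": [...], "next": <cursor or None>}
Page = Dict[str, Any]
FetchPage = Callable[[Optional[str]], Page]


class PaginatedReader:
    """Hypothetical reader that owns pagination for every team using the source."""

    def __init__(self, fetch_page: FetchPage, columns: Tuple[str, ...]):
        self.fetch_page = fetch_page  # injected, so auth and HTTP details stay in one place
        self.columns = columns        # fixed column order, i.e. the schema handling

    def read(self) -> Iterator[tuple]:
        cursor = None
        while True:
            page = self.fetch_page(cursor)
            for item in page["items"]:
                # Missing keys become None instead of raising
                yield tuple(item.get(c) for c in self.columns)
            cursor = page.get("next")
            if cursor is None:
                break


def fake_fetch(cursor: Optional[str]) -> Page:
    """Stand-in for a real HTTP call; returns two canned pages."""
    pages = {
        None: {"items": [{"id": 1, "name": "a"}], "next": "p2"},
        "p2": {"items": [{"id": 2, "name": "b"}], "next": None},
    }
    return pages[cursor]


rows = list(PaginatedReader(fake_fetch, ("id", "name")).read())
```

Each team would then only supply its own fetch function and column list, while the cursor handling stays shared.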
For one-off pipelines, though, the overhead may not justify it. A well-structured notebook or ingestion script is often simpler and easier to debug.
Regarding medallion architecture, I’d still land raw first in most cases. The Python Data Source API doesn’t remove the need for a raw layer. Keeping the original payload is critical for reprocessing, auditing, and handling upstream changes. Where the API helps is in making the transition from raw to bronze more structured, especially when used within DLT pipelines.
I think one of the biggest practical benefits is standardizing error handling and retry logic. Instead of duplicating that across pipelines, you define it once in the source abstraction. That said, it doesn’t eliminate complexity; it moves it out of the individual pipelines and into a centralized abstraction layer. That makes sense at scale, but it can be overkill for smaller use cases.
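As a minimal sketch of what "define it once" could look like, here is a retry helper with exponential backoff. The name `with_retries`, its parameters, and the choice of retryable exceptions are my own assumptions, not something the API prescribes. The `sleep` argument is injectable only so the example runs without actually waiting.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Hypothetical shared retry policy: exponential backoff on selected errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts, surface the original error
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1.0s, 2.0s, ...
    raise ValueError("max_attempts must be at least 1")


# Usage: a fake call that fails twice, then succeeds.
attempts = {"n": 0}
delays: List[float] = []


def flaky_call() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = with_retries(flaky_call, sleep=delays.append)
```

Every pipeline that goes through the source would get the same policy, which is also where the complexity ends up: changing the backoff means changing it for everyone.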
Out of curiosity, are you evaluating this specifically for DLT, or more general batch ingestion? That context can shift the trade-offs quite a bit.