If you've ever hacked together a one-off script to pull data from some random API into Spark, you're exactly who the new Python Data Source API is for.
Databricks has made this API generally available on Apache Spark™ 4.0 with Databricks Runtime 15.4 LTS+ and serverless environments. It basically turns "that Python script" into a first-class Spark connector, with governance, performance, and SQL support, without you touching JVM internals.
What is the Python Data Source API?
At a high level, it's a way to implement Spark data sources in pure Python:
- You write a small Python class that knows how to read (and optionally write) from some system: REST API, SaaS app, ML dataset, internal service, etc.
- Spark then treats it like any other data source: spark.read.format("your_source").option(...).load() or df.write.format("your_source").save().
- It supports batch and streaming reads, plus batch and streaming writes, so the same connector can power ETL jobs and real-time pipelines.
Under the hood, data moves using Apache Arrow, so you get a performant in-memory path between Python and Spark without hand-rolling serialization tricks.
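To make that concrete, here's a minimal sketch of a batch data source built on the pyspark.sql.datasource base classes. The "fake" name, the schema, and the hard-coded rows are placeholders, and it assumes an active SparkSession named spark:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class FakeDataSource(DataSource):
    """A toy data source that emits a couple of hard-coded rows."""

    @classmethod
    def name(cls):
        # The string you pass to spark.read.format(...)
        return "fake"

    def schema(self):
        # DDL string describing the rows read() will yield
        return "id int, value string"

    def reader(self, schema):
        return FakeReader()

class FakeReader(DataSourceReader):
    def read(self, partition):
        # Yield plain tuples matching the declared schema
        yield (0, "hello")
        yield (1, "world")

# Register once per session, then use it like any other format
spark.dataSource.register(FakeDataSource)
spark.read.format("fake").load().show()
```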
Why should you care?
A few common scenarios where this becomes a game-changer:
1. REST and SaaS APIs
Instead of:
- Calling an API with requests
- Dumping JSON to disk
- Loading it back with spark.read.json
…you can now read directly from the API into a Spark DataFrame via a custom Python data source. The Databricks team even ships example connectors for REST APIs and CSV variants as references.
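A rough sketch of what a REST-backed reader can look like; the "rest_example" name, the url option, and the id/name fields are all hypothetical, and a real connector would add paging and error handling:

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class RestDataSource(DataSource):
    @classmethod
    def name(cls):
        return "rest_example"

    def schema(self):
        return "id int, name string"

    def reader(self, schema):
        # self.options carries everything passed via .option(...)
        return RestReader(self.options)

class RestReader(DataSourceReader):
    def __init__(self, options):
        self.url = options["url"]

    def read(self, partition):
        # Import inside read() so the class pickles cleanly to executors
        import requests

        resp = requests.get(self.url, timeout=30)
        resp.raise_for_status()
        for item in resp.json():  # assumes the API returns a JSON array
            yield (item["id"], item["name"])
```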
2. ML-heavy workflows
Need a dataset from HuggingFace or another ML repository? There's already a HuggingFace connector built on this API that lets you pull curated ML datasets straight into a DataFrame and plug them into training pipelines.
3. Streaming from weird places
Got a streaming API (like aircraft data from OpenSky Network) that doesn't fit nicely into Kafka, Kinesis, etc.? The API lets you build custom streaming readers in Python, so Spark Structured Streaming can consume those feeds as if they were native sources.
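Here's a minimal sketch of the streaming side, using a synthetic counter feed in place of a real API (all names are placeholders). The stream reader tracks offsets, and Spark calls partitions() and read() once per micro-batch:

```python
from pyspark.sql.datasource import (
    DataSource,
    DataSourceStreamReader,
    InputPartition,
)

class RangePartition(InputPartition):
    def __init__(self, start, end):
        self.start = start
        self.end = end

class CounterStreamReader(DataSourceStreamReader):
    def initialOffset(self):
        # Where to start on the very first run of the query
        return {"offset": 0}

    def latestOffset(self):
        # A real connector would poll the external feed here;
        # this toy source just advances 10 rows per micro-batch.
        self.current = getattr(self, "current", 0) + 10
        return {"offset": self.current}

    def partitions(self, start, end):
        # One partition covering the new offset range
        return [RangePartition(start["offset"], end["offset"])]

    def read(self, partition):
        for i in range(partition.start, partition.end):
            yield (i,)

    def commit(self, end):
        # Clean up per-batch state once Spark has processed up to `end`
        pass

class CounterStreamSource(DataSource):
    @classmethod
    def name(cls):
        return "counter_stream"

    def schema(self):
        return "value int"

    def streamReader(self, schema):
        return CounterStreamReader()
```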
Governance and SQL come for free
Once your connector is wired up, you can:
- Register the data as Unity Catalog tables, so you get lineage, access control, and auditability on top of these "non-standard" sources.
- Query everything from Spark SQL, just like Delta or Parquet.
This is a big deal: data from Google Sheets, proprietary APIs, catalogs, ML datasets, or streaming feeds can now live in the same governed world as your warehouse tables.
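Assuming the hypothetical RestDataSource sketched earlier, landing that data in a governed table and querying it can look like this (the endpoint and the catalog, schema, and table names are made up):

```python
# Register the connector, land the data in a governed table, query with SQL
spark.dataSource.register(RestDataSource)

(spark.read.format("rest_example")
    .option("url", "https://api.example.com/items")  # hypothetical endpoint
    .load()
    .write.mode("overwrite")
    .saveAsTable("main.bronze.api_items"))

spark.sql("SELECT COUNT(*) FROM main.bronze.api_items").show()
```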
How it fits into pipelines
The Data Source API isn't just a "reader/writer trick." It plugs into the broader Databricks stack:
- Jobs & ETL: Use your connector in regular batch jobs to ingest from, or write to, external systems.
- Streaming: Use it in Structured Streaming queries as a source or sink.
- Declarative Pipelines: Implement sinks as Python data sources, so you can stream out to external services using the same abstraction.
So the same little Python connector can power both "hourly ingest from API X" and "real-time fan-out to system Y."
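For instance, the counter_stream source sketched earlier drops straight into a Structured Streaming query (the checkpoint path and table name are placeholders):

```python
spark.dataSource.register(CounterStreamSource)

(spark.readStream.format("counter_stream")
    .load()
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/counter")
    .toTable("main.bronze.counter_events"))
```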
A tiny mental model
Think of the Python Data Source API as:
"dbt adapter meets Spark connector, written in Python and governed by Unity Catalog."
You define how to:
- Connect and authenticate
- Discover schema / partitions
- Read data in batches or as a stream
- Optionally write data back out
Spark takes care of the rest: distribution, scalability, schema handling, and making it all accessible via DataFrames and SQL.
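The write path follows the same pattern. Here's a rough sketch of a batch sink that POSTs each row to an external service; the "http_sink" name and url option are hypothetical, and a real sink would batch requests and handle retries:

```python
from pyspark.sql.datasource import (
    DataSource,
    DataSourceWriter,
    WriterCommitMessage,
)

class HttpSinkDataSource(DataSource):
    @classmethod
    def name(cls):
        return "http_sink"

    def writer(self, schema, overwrite):
        return HttpSinkWriter(self.options)

class HttpSinkWriter(DataSourceWriter):
    def __init__(self, options):
        self.url = options["url"]

    def write(self, iterator):
        # Runs once per partition; `iterator` yields pyspark.sql.Row objects
        import requests

        for row in iterator:
            requests.post(self.url, json=row.asDict(), timeout=30)
        return WriterCommitMessage()

# Usage: df.write.format("http_sink").option("url", ...).mode("append").save()
```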
Getting started
If you want to play with it today:
- Spin up DBR 15.4 LTS+ or a Spark 4.0 environment on Databricks (or serverless).
- Check out the example connectors repo (REST, CSV, etc.) plus the HuggingFace connector for inspiration.
- Use the base classes in pyspark.sql.datasource as a template for your own implementation.
From there, it's "just Python."
Final thought
Most of us have at least one fragile glue job hiding in a repo somewhere, pulling data from "that one system" into Spark. The Python Data Source API is your chance to turn those hacks into reusable, governed, community-shareable connectors.