Python Data Source API

AbhaySingh
Databricks Employee

If you've ever hacked together a one-off script to pull data from some random API into Spark, you're exactly who the new Python Data Source API is for.

Databricks has made this API generally available on Apache Spark™ 4.0 with Databricks Runtime 15.4 LTS+ and serverless environments. It basically turns "that Python script" into a first-class Spark connector, with governance, performance, and SQL support, without you touching JVM internals.


What is the Python Data Source API?

At a high level, it's a way to implement Spark data sources in pure Python:

  • You write a small Python class that knows how to read (and optionally write) from some system: REST API, SaaS app, ML dataset, internal service, etc.
  • Spark then treats it like any other data source: spark.read.format("your_source").option(...).load() or df.write.format("your_source").save().
  • It supports batch and streaming reads, plus batch and streaming writes, so the same connector can power ETL jobs and real-time pipelines.

Under the hood, data moves using Apache Arrow, so you get a performant in-memory path between Python and Spark without hand-rolling serialization tricks.
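
To make that concrete, here is a minimal sketch of a batch-only source. The class, format name, and rows are invented for illustration; the base classes come from pyspark.sql.datasource (Spark 4.0 / DBR 15.4 LTS+).

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class GreetingDataSource(DataSource):
    """Toy source that emits a few hard-coded rows."""

    @classmethod
    def name(cls):
        # The string you pass to spark.read.format(...)
        return "greeting"

    def schema(self):
        # A DDL string (a StructType works too)
        return "id int, message string"

    def reader(self, schema):
        return GreetingReader()

class GreetingReader(DataSourceReader):
    def read(self, partition):
        # Yield plain tuples that match the declared schema
        yield (1, "hello")
        yield (2, "world")

# Register once per session, then use it like any built-in source
spark.dataSource.register(GreetingDataSource)
spark.read.format("greeting").load().show()
```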


Why should you care?

A few common scenarios where this becomes a game-changer:

1. REST and SaaS APIs

Instead of:

  1. Calling an API with requests
  2. Dumping JSON to disk
  3. Loading it back with spark.read.json

…you can now read directly from the API into a Spark DataFrame via a custom Python data source. The Databricks team even ships example connectors for REST APIs and CSV variants as references.
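
As a rough sketch of that pattern (the endpoint URL, option name, and field names below are hypothetical), the reader can call the API and yield rows directly, so no intermediate JSON files ever touch disk:

```python
import requests
from pyspark.sql.datasource import DataSource, DataSourceReader

class RestDataSource(DataSource):
    """Hypothetical connector that reads JSON records from a REST endpoint."""

    @classmethod
    def name(cls):
        return "rest_json"

    def schema(self):
        return "id int, name string"

    def reader(self, schema):
        return RestReader(self.options)

class RestReader(DataSourceReader):
    def __init__(self, options):
        # Supplied by the caller via .option("url", ...)
        self.url = options["url"]

    def read(self, partition):
        # One request here; a real connector would paginate and spread
        # pages across partitions for parallelism.
        for record in requests.get(self.url, timeout=30).json():
            yield (record["id"], record["name"])

spark.dataSource.register(RestDataSource)
df = (spark.read.format("rest_json")
          .option("url", "https://example.com/api/items")
          .load())
```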

2. MLโ€‘heavy workflows

Need a dataset from HuggingFace or another ML repository? There's already a HuggingFace connector built on this API that lets you pull curated ML datasets straight into a DataFrame and plug them into training pipelines.
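
If you use that connector, a read looks roughly like the snippet below; the format name and dataset id are illustrative, so check the connector's documentation for the exact registration step and options.

```python
# Assumes the Hugging Face connector is installed and registered in this
# session; the dataset id is just an example.
reviews = spark.read.format("huggingface").load("stanfordnlp/imdb")
reviews.show(5)
```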

3. Streaming from weird places

Got a streaming API (like aircraft data from OpenSky Network) that doesn't fit nicely into Kafka, Kinesis, etc.? The API lets you build custom streaming readers in Python, so Spark Structured Streaming can consume those feeds as if they were native sources.
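
A streaming source implements a stream reader that tracks offsets for Spark to checkpoint. Here is an illustrative skeleton that simulates a feed with a simple counter; a real reader would replace the counter and the generated events with calls to the upstream API.

```python
from pyspark.sql.datasource import DataSource, DataSourceStreamReader, InputPartition

class FeedDataSource(DataSource):
    @classmethod
    def name(cls):
        return "feed"

    def schema(self):
        return "event_id int, payload string"

    def streamReader(self, schema):
        return FeedStreamReader()

class FeedStreamReader(DataSourceStreamReader):
    def __init__(self):
        self._position = 0  # driver-side cursor into the simulated feed

    def initialOffset(self):
        # Offsets are plain dicts that Spark checkpoints on your behalf
        return {"position": 0}

    def latestOffset(self):
        # Pretend three new events arrived since the last microbatch
        self._position += 3
        return {"position": self._position}

    def partitions(self, start, end):
        # One partition for the new range; real sources can split it further
        return [InputPartition((start["position"], end["position"]))]

    def read(self, partition):
        lo, hi = partition.value
        for i in range(lo, hi):
            yield (i, f"event-{i}")

    def commit(self, end):
        # Invoked after a microbatch up to `end` completes successfully
        pass

spark.dataSource.register(FeedDataSource)
stream_df = spark.readStream.format("feed").load()
```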


Governance and SQL come for free

Once your connector is wired up, you can:

  • Register the data as Unity Catalog tables, so you get lineage, access control, and auditability on top of these "non-standard" sources.
  • Query everything from Spark SQL, just like Delta or Parquet.

This is a big deal: data from Google Sheets, proprietary APIs, catalogs, ML datasets, or streaming feeds can now live in the same governed world as your warehouse tables.
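
Concretely, a batch read from the custom source can land straight in a governed table and then be queried like anything else (the format, URL, and three-level table name below are placeholders continuing the earlier examples):

```python
# Ingest from the custom source into a Unity Catalog table
(spark.read.format("rest_json")
      .option("url", "https://example.com/api/items")
      .load()
      .write.mode("overwrite")
      .saveAsTable("main.bronze.api_items"))

# From here it behaves like any other governed table
spark.sql("SELECT name, count(*) AS n FROM main.bronze.api_items GROUP BY name").show()
```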


How it fits into pipelines

The Data Source API isn't just a "reader/writer trick." It plugs into the broader Databricks stack:

  • Jobs & ETL – Use your connector in regular batch jobs to ingest from / write to external systems.
  • Streaming – Use it in Structured Streaming queries as a source or sink.
  • Declarative Pipelines – Implement sinks as Python data sources, so you can stream out to external services using the same abstraction.

So the same little Python connector can power both "hourly ingest from API X" and "real-time fan-out to system Y."
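
For example, a source that implements both reader() and streamReader() can serve both modes with the same registration (the format name, checkpoint path, and target table are placeholders continuing the examples above):

```python
# Hourly batch ingest in a scheduled job
batch_df = spark.read.format("feed").load()

# Real-time consumption of the same source with Structured Streaming
(spark.readStream.format("feed")
      .load()
      .writeStream
      .option("checkpointLocation", "/Volumes/main/bronze/checkpoints/feed")
      .toTable("main.bronze.feed_events"))
```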


A tiny mental model

Think of the Python Data Source API as:

"dbt adapter meets Spark connector, written in Python and governed by Unity Catalog."

You define how to:

  • Connect and authenticate
  • Discover schema / partitions
  • Read data in batches or as a stream
  • Optionally write data back out

Spark takes care of the rest: distribution, scalability, schema handling, and making it all accessible via DataFrames and SQL.
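
For the optional write path, the pattern mirrors the reader. Here is a rough sketch of a batch writer that POSTs each partition of rows to an external service; the endpoint, option name, and payload shape are hypothetical.

```python
from dataclasses import dataclass
import requests
from pyspark.sql.datasource import DataSource, DataSourceWriter, WriterCommitMessage

@dataclass
class WebhookCommit(WriterCommitMessage):
    rows_posted: int

class WebhookDataSource(DataSource):
    @classmethod
    def name(cls):
        return "webhook"

    def writer(self, schema, overwrite):
        return WebhookWriter(self.options)

class WebhookWriter(DataSourceWriter):
    def __init__(self, options):
        self.url = options["url"]

    def write(self, iterator):
        # Runs once per partition on the executors
        payload = [row.asDict() for row in iterator]
        requests.post(self.url, json=payload, timeout=30)
        return WebhookCommit(rows_posted=len(payload))

spark.dataSource.register(WebhookDataSource)
# `df` is any DataFrame whose rows match what the service expects
(df.write.format("webhook")
    .option("url", "https://example.com/hook")
    .mode("append")
    .save())
```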


Getting started

If you want to play with it today:

  1. Spin up DBR 15.4 LTS+ or a Spark 4.0 environment on Databricks (or serverless).
  2. Check out the example connectors repo (REST, CSV, etc.) plus the HuggingFace connector for inspiration.
  3. Use the base classes in pyspark.sql.datasource as a template for your own implementation.

From there, it's "just Python."


Final thought

Most of us have at least one fragile glue job hiding in a repo somewhere, pulling data from "that one system" into Spark. The Python Data Source API is your chance to turn those hacks into reusable, governed, community-shareable connectors.

 

1 REPLY

Raman_Unifeye
Contributor III

I have seen multiple glue jobs pulling data from such systems. This is certainly a solution to simplify them and bring governance to the process. Will look forward to implementing it. #Apache-4


RG #Driving Business Outcomes with Data Intelligence