Python Data Source API

AbhaySingh
Databricks Employee

If you've ever hacked together a one-off script to pull data from some random API into Spark, you're exactly who the new Python Data Source API is for.

Databricks has made this API generally available on Apache Spark™ 4.0 with Databricks Runtime 15.4 LTS+ and serverless environments. It basically turns "that Python script" into a first-class Spark connector, with governance, performance, and SQL support, without you touching JVM internals.


What is the Python Data Source API?

At a high level, it's a way to implement Spark data sources in pure Python:

  • You write a small Python class that knows how to read (and optionally write) from some system: REST API, SaaS app, ML dataset, internal service, etc.
  • Spark then treats it like any other data source: spark.read.format("your_source").option(...).load() or df.write.format("your_source").save().
  • It supports batch and streaming reads, plus batch and streaming writes, so the same connector can power ETL jobs and real-time pipelines.

Under the hood, data moves using Apache Arrow, so you get a performant in-memory path between Python and Spark without hand-rolling serialization tricks.
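
To make that concrete, here is a minimal sketch of a batch-only source. The class, format name, and rows are invented for illustration; the base classes come from pyspark.sql.datasource (Spark 4.0 / DBR 15.4 LTS+).

```python
from pyspark.sql.datasource import DataSource, DataSourceReader

class GreetingDataSource(DataSource):
    """Toy source that emits a few hard-coded rows."""

    @classmethod
    def name(cls):
        # The string you pass to spark.read.format(...)
        return "greeting"

    def schema(self):
        # A DDL string (a StructType works too)
        return "id int, message string"

    def reader(self, schema):
        return GreetingReader()

class GreetingReader(DataSourceReader):
    def read(self, partition):
        # Yield plain tuples that match the declared schema
        yield (1, "hello")
        yield (2, "world")

# Register once per session, then use it like any built-in source
spark.dataSource.register(GreetingDataSource)
spark.read.format("greeting").load().show()
```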


Why should you care?

A few common scenarios where this becomes a game-changer:

1. REST and SaaS APIs

Instead of:

  1. Calling an API with requests
  2. Dumping JSON to disk
  3. Loading it back with spark.read.json

…you can now read directly from the API into a Spark DataFrame via a custom Python data source. The Databricks team even ships example connectors for REST APIs and CSV variants as references.
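
As a rough sketch of that pattern (the endpoint URL, option name, and field names below are hypothetical), the reader can call the API and yield rows directly, so no intermediate JSON files ever touch disk:

```python
import requests
from pyspark.sql.datasource import DataSource, DataSourceReader

class RestDataSource(DataSource):
    """Hypothetical connector that reads JSON records from a REST endpoint."""

    @classmethod
    def name(cls):
        return "rest_json"

    def schema(self):
        return "id int, name string"

    def reader(self, schema):
        return RestReader(self.options)

class RestReader(DataSourceReader):
    def __init__(self, options):
        # Supplied by the caller via .option("url", ...)
        self.url = options["url"]

    def read(self, partition):
        # One request here; a real connector would paginate and spread
        # pages across partitions for parallelism.
        for record in requests.get(self.url, timeout=30).json():
            yield (record["id"], record["name"])

spark.dataSource.register(RestDataSource)
df = (spark.read.format("rest_json")
          .option("url", "https://example.com/api/items")
          .load())
```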

2. MLโ€‘heavy workflows

Need a dataset from HuggingFace or another ML repository? There's already a HuggingFace connector built on this API that lets you pull curated ML datasets straight into a DataFrame and plug them into training pipelines.
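
If you use that connector, a read looks roughly like the snippet below; the format name and dataset id are illustrative, so check the connector's documentation for the exact registration step and options.

```python
# Assumes the Hugging Face connector is installed and registered in this
# session; the dataset id is just an example.
reviews = spark.read.format("huggingface").load("stanfordnlp/imdb")
reviews.show(5)
```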

3. Streaming from weird places

Got a streaming API (like aircraft data from OpenSky Network) that doesn't fit nicely into Kafka, Kinesis, etc.? The API lets you build custom streaming readers in Python, so Spark Structured Streaming can consume those feeds as if they were native sources.
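
A streaming source implements a stream reader that tracks offsets for Spark to checkpoint. Here is an illustrative skeleton that simulates a feed with a simple counter; a real reader would replace the counter and the generated events with calls to the upstream API.

```python
from pyspark.sql.datasource import DataSource, DataSourceStreamReader, InputPartition

class FeedDataSource(DataSource):
    @classmethod
    def name(cls):
        return "feed"

    def schema(self):
        return "event_id int, payload string"

    def streamReader(self, schema):
        return FeedStreamReader()

class FeedStreamReader(DataSourceStreamReader):
    def __init__(self):
        self._position = 0  # driver-side cursor into the simulated feed

    def initialOffset(self):
        # Offsets are plain dicts that Spark checkpoints on your behalf
        return {"position": 0}

    def latestOffset(self):
        # Pretend three new events arrived since the last microbatch
        self._position += 3
        return {"position": self._position}

    def partitions(self, start, end):
        # One partition for the new range; real sources can split it further
        return [InputPartition((start["position"], end["position"]))]

    def read(self, partition):
        lo, hi = partition.value
        for i in range(lo, hi):
            yield (i, f"event-{i}")

    def commit(self, end):
        # Invoked after a microbatch up to `end` completes successfully
        pass

spark.dataSource.register(FeedDataSource)
stream_df = spark.readStream.format("feed").load()
```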


Governance and SQL come for free

Once your connector is wired up, you can:

  • Register the data as Unity Catalog tables, so you get lineage, access control, and auditability on top of these "non-standard" sources.
  • Query everything from Spark SQL, just like Delta or Parquet.

This is a big deal: data from Google Sheets, proprietary APIs, catalogs, ML datasets, or streaming feeds can now live in the same governed world as your warehouse tables.
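
Concretely, a batch read from the custom source can land straight in a governed table and then be queried like anything else (the format, URL, and three-level table name below are placeholders continuing the earlier examples):

```python
# Ingest from the custom source into a Unity Catalog table
(spark.read.format("rest_json")
      .option("url", "https://example.com/api/items")
      .load()
      .write.mode("overwrite")
      .saveAsTable("main.bronze.api_items"))

# From here it behaves like any other governed table
spark.sql("SELECT name, count(*) AS n FROM main.bronze.api_items GROUP BY name").show()
```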


How it fits into pipelines

The Data Source API isn't just a "reader/writer trick." It plugs into the broader Databricks stack:

  • Jobs & ETL – Use your connector in regular batch jobs to ingest from / write to external systems.
  • Streaming – Use it in Structured Streaming queries as a source or sink.
  • Declarative Pipelines – Implement sinks as Python data sources, so you can stream out to external services using the same abstraction.

So the same little Python connector can power both "hourly ingest from API X" and "real-time fan-out to system Y."
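
For example, a source that implements both reader() and streamReader() can serve both modes with the same registration (the format name, checkpoint path, and target table are placeholders continuing the examples above):

```python
# Hourly batch ingest in a scheduled job
batch_df = spark.read.format("feed").load()

# Real-time consumption of the same source with Structured Streaming
(spark.readStream.format("feed")
      .load()
      .writeStream
      .option("checkpointLocation", "/Volumes/main/bronze/checkpoints/feed")
      .toTable("main.bronze.feed_events"))
```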


A tiny mental model

Think of the Python Data Source API as:

"dbt adapter meets Spark connector, written in Python and governed by Unity Catalog."

You define how to:

  • Connect and authenticate
  • Discover schema / partitions
  • Read data in batches or as a stream
  • Optionally write data back out

Spark takes care of the rest: distribution, scalability, schema handling, and making it all accessible via DataFrames and SQL.
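
For the optional write path, the pattern mirrors the reader. Here is a rough sketch of a batch writer that POSTs each partition of rows to an external service; the endpoint, option name, and payload shape are hypothetical.

```python
from dataclasses import dataclass
import requests
from pyspark.sql.datasource import DataSource, DataSourceWriter, WriterCommitMessage

@dataclass
class WebhookCommit(WriterCommitMessage):
    rows_posted: int

class WebhookDataSource(DataSource):
    @classmethod
    def name(cls):
        return "webhook"

    def writer(self, schema, overwrite):
        return WebhookWriter(self.options)

class WebhookWriter(DataSourceWriter):
    def __init__(self, options):
        self.url = options["url"]

    def write(self, iterator):
        # Runs once per partition on the executors
        payload = [row.asDict() for row in iterator]
        requests.post(self.url, json=payload, timeout=30)
        return WebhookCommit(rows_posted=len(payload))

spark.dataSource.register(WebhookDataSource)
# `df` is any DataFrame whose rows match what the service expects
(df.write.format("webhook")
    .option("url", "https://example.com/hook")
    .mode("append")
    .save())
```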


Getting started

If you want to play with it today:

  1. Spin up DBR 15.4 LTS+ or a Spark 4.0 environment on Databricks (or serverless).
  2. Check out the example connectors repo (REST, CSV, etc.) plus the HuggingFace connector for inspiration.
  3. Use the base classes in pyspark.sql.datasource as a template for your own implementation.

From there, it's "just Python."


Final thought

Most of us have at least one fragile glue job hiding in a repo somewhere, pulling data from "that one system" into Spark. The Python Data Source API is your chance to turn those hacks into reusable, governed, community-shareable connectors.

 

1 REPLY

Raman_Unifeye
Contributor III

I have seen multiple glue jobs pulling data from such systems. This is certainly a solution to simplify them and bring governance to the process. Will look forward to implementing it. #Apache-4


RG #Driving Business Outcomes with Data Intelligence