<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Python Data Source API in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/python-data-source-api/m-p/140314#M803</link>
    <description>&lt;P&gt;I have seen multiple Glue jobs pulling data from systems like these. This is certainly a solution to simplify them and bring governance. I look forward to implementing it. #Apache-4&lt;/P&gt;</description>
    <pubDate>Tue, 25 Nov 2025 13:51:44 GMT</pubDate>
    <dc:creator>Raman_Unifeye</dc:creator>
    <dc:date>2025-11-25T13:51:44Z</dc:date>
    <item>
      <title>Python Data Source API</title>
      <link>https://community.databricks.com/t5/community-articles/python-data-source-api/m-p/140312#M802</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;ARTICLE&gt;
&lt;P&gt;If you’ve ever hacked together a one-off script to pull data from some random API into Spark, you’re exactly who the new &lt;STRONG&gt;Python Data Source API&lt;/STRONG&gt; is for.&lt;/P&gt;
&lt;P&gt;Databricks has made this API generally available on &lt;STRONG&gt;Apache Spark™ 4.0 with Databricks Runtime 15.4 LTS+ and serverless environments&lt;/STRONG&gt;. It basically turns “that Python script” into a &lt;STRONG&gt;first‑class Spark connector&lt;/STRONG&gt; – with governance, performance, and SQL support – without you touching JVM internals.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2&gt;What is the Python Data Source API?&lt;/H2&gt;
&lt;P&gt;At a high level, it’s a way to &lt;STRONG&gt;implement Spark data sources in pure Python&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;You write a small Python class that knows how to read (and optionally write) from some system: REST API, SaaS app, ML dataset, internal service, etc.&lt;/LI&gt;
&lt;LI&gt;Spark then treats it like any other data source: &lt;CODE&gt;spark.read.format("your_source").option(...).load()&lt;/CODE&gt; or &lt;CODE&gt;df.write.format("your_source").save()&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI&gt;It supports &lt;STRONG&gt;batch and streaming reads&lt;/STRONG&gt;, plus &lt;STRONG&gt;batch and streaming writes&lt;/STRONG&gt;, so the same connector can power ETL jobs &lt;EM&gt;and&lt;/EM&gt; real‑time pipelines.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Under the hood, data moves using &lt;STRONG&gt;Apache Arrow&lt;/STRONG&gt;, so you get a performant in‑memory path between Python and Spark without hand‑rolling serialization tricks.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2&gt;Why should you care?&lt;/H2&gt;
&lt;P&gt;A few common scenarios where this becomes a game‑changer:&lt;/P&gt;
&lt;H3&gt;1. REST and SaaS APIs&lt;/H3&gt;
&lt;P&gt;Instead of:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Calling an API with &lt;CODE&gt;requests&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;Dumping JSON to disk&lt;/LI&gt;
&lt;LI&gt;Loading it back with &lt;CODE&gt;spark.read.json&lt;/CODE&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;…you can now &lt;STRONG&gt;read directly from the API into a Spark DataFrame&lt;/STRONG&gt; via a custom Python data source. The Databricks team even ships &lt;STRONG&gt;example connectors for REST APIs and CSV variants&lt;/STRONG&gt; as references.&lt;/P&gt;
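&lt;P&gt;The core of such a reader is just a paging loop. Here it is with the HTTP call stubbed out by a dictionary so the cursor logic is visible; the page shape (rows plus a next‑cursor) is a hypothetical API contract:&lt;/P&gt;

```python
def read_all_pages(fetch_page):
    """Drain a cursor-paginated API. fetch_page(cursor) returns
    (rows, next_cursor); a next_cursor of None means we're done.
    Inside a DataSourceReader.read(), you'd 'yield from' this."""
    cursor = None
    while True:
        rows, cursor = fetch_page(cursor)
        yield from rows
        if cursor is None:
            return

# Stub standing in for a real HTTP call (hypothetical page shape):
PAGES = {
    None: ([(1, "a"), (2, "b")], "p2"),  # first page points at "p2"
    "p2": ([(3, "c")], None),            # last page: no next cursor
}
rows = list(read_all_pages(lambda cursor: PAGES[cursor]))
```

&lt;P&gt;Swap the lambda for a real client call and the same loop becomes the body of your reader, with no JSON‑to‑disk detour.&lt;/P&gt;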
&lt;H3&gt;2. ML‑heavy workflows&lt;/H3&gt;
&lt;P&gt;Need a dataset from &lt;STRONG&gt;HuggingFace&lt;/STRONG&gt; or another ML repository? There’s already a &lt;STRONG&gt;HuggingFace connector&lt;/STRONG&gt; built on this API that lets you pull curated ML datasets straight into a DataFrame and plug them into training pipelines.&lt;/P&gt;
&lt;H3&gt;3. Streaming from weird places&lt;/H3&gt;
&lt;P&gt;Got a streaming API (like aircraft data from &lt;STRONG&gt;OpenSky Network&lt;/STRONG&gt;) that doesn’t fit nicely into Kafka, Kinesis, etc.? The API lets you build &lt;STRONG&gt;custom streaming readers&lt;/STRONG&gt; in Python, so Spark Structured Streaming can consume those feeds as if they were native sources.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2&gt;Governance and SQL come for free&lt;/H2&gt;
&lt;P&gt;Once your connector is wired up, you can:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Register the data as &lt;STRONG&gt;Unity Catalog tables&lt;/STRONG&gt;, so you get &lt;STRONG&gt;lineage, access control, and auditability&lt;/STRONG&gt; on top of these “non‑standard” sources.&lt;/LI&gt;
&lt;LI&gt;Query everything from &lt;STRONG&gt;Spark SQL&lt;/STRONG&gt;, just like Delta or Parquet.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This is a big deal: data from Google Sheets, proprietary APIs, catalogs, ML datasets, or streaming feeds can now live in the same governed world as your warehouse tables.&lt;/P&gt;
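&lt;P&gt;Assuming a connector registered under the short name &lt;CODE&gt;fake_api&lt;/CODE&gt; (hypothetical, as are the catalog and table names), exposing it to SQL looks roughly like this:&lt;/P&gt;

```sql
-- Hypothetical names throughout; "fake_api" is a registered Python data source.
CREATE TABLE main.default.api_snapshot USING fake_api;
SELECT * FROM main.default.api_snapshot;
```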
&lt;HR /&gt;
&lt;H2&gt;How it fits into pipelines&lt;/H2&gt;
&lt;P&gt;The Data Source API isn’t just a “reader/writer trick.” It plugs into the broader Databricks stack:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Jobs &amp;amp; ETL&lt;/STRONG&gt; – Use your connector in regular batch jobs to ingest from / write to external systems.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Streaming&lt;/STRONG&gt; – Use it in Structured Streaming queries as a source or sink.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Declarative Pipelines&lt;/STRONG&gt; – Implement &lt;STRONG&gt;sinks as Python data sources&lt;/STRONG&gt;, so you can stream out to external services using the same abstraction.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;So the same little Python connector can power both: “hourly ingest from API X” &lt;EM&gt;and&lt;/EM&gt; “real‑time fan‑out to system Y.”&lt;/P&gt;
&lt;HR /&gt;
&lt;H2&gt;A tiny mental model&lt;/H2&gt;
&lt;P&gt;Think of the Python Data Source API as:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;“&lt;STRONG&gt;dbt adapter meets Spark connector&lt;/STRONG&gt;, written in Python and governed by Unity Catalog.”&lt;/BLOCKQUOTE&gt;
&lt;P&gt;You define how to:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Connect and authenticate&lt;/LI&gt;
&lt;LI&gt;Discover schema / partitions&lt;/LI&gt;
&lt;LI&gt;Read data in batches or as a stream&lt;/LI&gt;
&lt;LI&gt;Optionally write data back out&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Spark takes care of the rest: distribution, scalability, schema handling, and making it all accessible via DataFrames and SQL.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2&gt;Getting started&lt;/H2&gt;
&lt;P&gt;If you want to play with it today:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Spin up &lt;STRONG&gt;DBR 15.4 LTS+ or a Spark 4.0 environment on Databricks (or serverless)&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;Check out the &lt;STRONG&gt;example connectors repo&lt;/STRONG&gt; (REST, CSV, etc.) plus the &lt;STRONG&gt;HuggingFace connector&lt;/STRONG&gt; for inspiration.&lt;/LI&gt;
&lt;LI&gt;Use the base classes in &lt;CODE&gt;pyspark.sql.datasource&lt;/CODE&gt; as a template for your own implementation.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;From there, it’s “just Python.”&lt;/P&gt;
&lt;HR /&gt;
&lt;H2&gt;Final thought&lt;/H2&gt;
&lt;P&gt;Most of us have at least one fragile glue job hiding in a repo somewhere, pulling data from “that one system” into Spark. The Python Data Source API is your chance to turn those hacks into &lt;STRONG&gt;reusable, governed, community‑shareable connectors&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/ARTICLE&gt;</description>
      <pubDate>Tue, 25 Nov 2025 13:20:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/python-data-source-api/m-p/140312#M802</guid>
      <dc:creator>AbhaySingh</dc:creator>
      <dc:date>2025-11-25T13:20:06Z</dc:date>
    </item>
    <item>
      <title>Re: Python Data Source API</title>
      <link>https://community.databricks.com/t5/community-articles/python-data-source-api/m-p/140314#M803</link>
      <description>&lt;P&gt;I have seen multiple Glue jobs pulling data from systems like these. This is certainly a solution to simplify them and bring governance. I look forward to implementing it. #Apache-4&lt;/P&gt;</description>
      <pubDate>Tue, 25 Nov 2025 13:51:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/python-data-source-api/m-p/140314#M803</guid>
      <dc:creator>Raman_Unifeye</dc:creator>
      <dc:date>2025-11-25T13:51:44Z</dc:date>
    </item>
  </channel>
</rss>

