Is Auto Loader open source now in Apache 4.1 SDP?

ChristianRRL — Fri, 16 Jan 2026 15:13:22 GMT

With Spark Declarative Pipelines (SDP) being open source now, does this mean that the Databricks Auto Loader functionality is also open source? Is it called something else? If not, how does the open-source version handle incremental data processing and schema inference/evolution?

Reference: Spark Declarative Pipelines Programming Guide - Spark 4.1.0 Documentation

Re: Is Auto Loader open source now in Apache 4.1 SDP?

szymon_dybczak — Fri, 16 Jan 2026 17:12:04 GMT

Hi @ChristianRRL ,

No, Auto Loader is proprietary to Databricks. It's not open source. The open-source version of SDP uses Spark Structured Streaming for incremental processing.
Keep in mind that Auto Loader is basically just Spark Structured Streaming under the hood, with additional features for event-driven ingestion (and some other things).
Schema evolution is a property of the Delta protocol, which is open source. Also, Spark can natively infer schemas for various sources; this is not unique to Auto Loader.
InferSchema & Schema Enforcement in Spark | by Omkar Patil | Medium
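
To illustrate the point about native schema inference, here is a minimal sketch that needs no Auto Loader: it lets Spark infer a schema from sample files with a one-off batch read, then reuses that schema for a plain Structured Streaming CSV source. The function name and both path parameters are hypothetical, and `spark` is assumed to be an existing SparkSession. Streaming file sources expect an explicit schema by default, which is why the schema is inferred in batch first; setting `spark.sql.streaming.schemaInference` to `true` is the other option.

```python
def stream_csv_with_inferred_schema(spark, sample_path, stream_path):
    """Infer a CSV schema once in batch mode, then stream with it.

    `spark` is assumed to be an existing SparkSession. The function name
    and both paths are hypothetical, used only for illustration.
    """
    # One-off batch read: Spark scans the sample files and infers column types.
    inferred_schema = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(sample_path)
        .schema
    )

    # Plain open-source Structured Streaming file source (no cloudFiles).
    return (
        spark.readStream.format("csv")
        .option("header", "true")
        .schema(inferred_schema)
        .load(stream_path)
    )
```

Note that this only fixes the schema at the time of the batch read; picking up new columns later would need the read to be repeated or the schema to be managed separately.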

Re: Is Auto Loader open source now in Apache 4.1 SDP?

ChristianRRL — Fri, 16 Jan 2026 17:35:19 GMT

This is helpful. Follow-up question: when setting up Databricks pipelines (previously DLT pipelines), is Auto Loader required, or can we set them up to use Spark Structured Streaming instead? I'm mainly asking to see how much our vendor lock-in concerns would be eased if we can use SDP without Auto Loader, should we want to consider that path.

Re: Is Auto Loader open source now in Apache 4.1 SDP?

szymon_dybczak — Fri, 16 Jan 2026 18:10:13 GMT

In Databricks you don't have to use Auto Loader when you're working with SDP. Think of Auto Loader as a very specific Structured Streaming source (the source is called cloudFiles).

So, for instance, you can use the traditional Structured Streaming approach to load CSV files incrementally:

df = (
    spark.readStream.format("csv")
    .option("header", "true")
    .schema(<schema>)
    .load(<path>)
)

Or you can turn on Auto Loader by choosing the "cloudFiles" source:

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .schema(<schema>)  # provide a schema here for the files
    .load(<path>)
)

So you have freedom of choice 🙂 But if you're dealing with files in an S3 bucket or ADLS, I would choose Auto Loader any day 🙂
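
For the vendor lock-in angle, one way to keep that choice in a single place is a small helper that switches between the two sources with a flag. This is only a sketch: the function name and the `use_auto_loader` parameter are hypothetical, `spark` is assumed to be an existing SparkSession, and the "cloudFiles" branch is assumed to run only on Databricks, since that source is not part of open-source Spark.

```python
def read_csv_stream(spark, path, schema, use_auto_loader=False):
    """Return a streaming DataFrame over the CSV files at `path`.

    Hypothetical helper for illustration. With use_auto_loader=False it
    uses the open-source Structured Streaming file source; with True it
    uses the Databricks-only "cloudFiles" (Auto Loader) source.
    """
    reader = spark.readStream
    if use_auto_loader:
        # Auto Loader: file format is passed via the cloudFiles.format option.
        reader = reader.format("cloudFiles").option("cloudFiles.format", "csv")
    else:
        # Plain Structured Streaming file source.
        reader = reader.format("csv")

    # Both sources accept the same header option, schema, and load path.
    return reader.option("header", "true").schema(schema).load(path)
```

Pipeline code that only calls the helper stays the same in both cases, so moving off Databricks would mean flipping the flag rather than rewriting each read.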