Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Is Auto Loader open source now in Apache Spark 4.1 SDP?

ChristianRRL
Honored Contributor

With Spark Declarative Pipelines (SDP) being open source now, does this mean that the Databricks Auto Loader functionality is also open source? Is it called something else? If not, how does the open-source version handle incremental data processing and schema inference/evolution?

Reference: Spark Declarative Pipelines Programming Guide - Spark 4.1.0 Documentation

1 ACCEPTED SOLUTION


szymon_dybczak
Esteemed Contributor III

Hi @ChristianRRL ,

No, Auto Loader is proprietary to Databricks; it has not been open sourced. The open-source version of SDP uses Spark Structured Streaming for incremental processing.
Keep in mind that Auto Loader is basically just Spark Structured Streaming under the hood, with additional features for event-driven ingestion (among other things).
Schema evolution is a feature of the Delta Lake protocol, which is open source. Spark can also natively infer schemas for various sources; this is not something unique to Auto Loader.
InferSchema & Schema Enforcement in Spark | by Omkar Patil | Medium


3 REPLIES


This is helpful. Follow-up question: when setting up Databricks pipelines (previously DLT pipelines), is Auto Loader required, or can they be set to use Spark Structured Streaming instead? Mainly asking to see how much vendor lock-in concern is eased if we can use SDP without having to use Auto Loader, should this be a path we want to consider.

szymon_dybczak
Esteemed Contributor III

In Databricks you don't have to use Auto Loader when you're dealing with SDP. Think of Auto Loader as a very specific Structured Streaming source (that source is called cloudFiles).

So, for instance, you can use the traditional Structured Streaming approach to load CSV files incrementally:

df = spark.readStream.format("csv") \
    .option("header", "true") \
    .schema(<schema>) \
    .load(<path>)

Or you can turn on Auto Loader by choosing the "cloudFiles" source:

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .schema(<schema>)  # provide a schema for the files here
    .load(<path>)
)

So you have freedom of choice 🙂 But if you're dealing with files in an S3 bucket or ADLS, I would choose Auto Loader any day 🙂