Databricks Community

ChristianRRL · ‎10-07-2025

Hi there, I would appreciate some help to compare the runtime performance of two approaches to performing ELT in Databricks: spark.read vs. Autoloader. We already have a process in place to extract highly nested json data into a landing path, and from here it's getting a bit confusing understanding if there are significant pros/cons to either of these approaches.

Approach A. spark.read:

Step 1: Generate a spark dataframe from the latest json data via the following logic:

# Assume `last_read_filetimestamp` pulled from max file_modification_time in the bronze (flattened) delta table

df_raw = spark.read\
    .json(raw_data_path)\
    .select("*", "_metadata.file_path", "_metadata.file_modification_time")\
    .withColumnRenamed("_metadata.file_modification_time", "file_modification_time")\
    .withColumnRenamed("_metadata.file_path", "file_path")

if last_read_filetimestamp:
    df_raw = df_raw.filter(F.col('file_modification_time') > last_read_filetimestamp)

Step 2: Perform relevant data flattening operations on the dataframe
Step 3: Perform appropriate upsert/merge operation into target bronze table

Approach B. Autoloader:

Step 1: Run the following logic to initialize a dataframe via Autoloader

schema_hints = 'elementData.element.data MAP<STRING, STRUCT<dataPoint: MAP<STRING, STRING>, values: ARRAY<MAP<STRING, STRING>>>>'

df = (spark.readStream
    .format("cloudFiles").option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaHints", schema_hints)
    .option("multiLine", "true")
    .option("cloudFiles.schemaLocation", f"{raw_path}/schema")
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load(f"{landing_path}/*/data")
    .select("*", "_metadata")
)

Step 2: Perform relevant data flattening operations on the dataframe
Step 3: Perform appropriate upsert/merge operation into target bronze table

szymon_dybczak · ‎10-07-2025

Hi @ChristianRRL ,

For that kind of ingestion scenario autoloader is a winner . It will scale much better than batch approach - especially if we are talking about large number of files.

If you configure autoloader with file notification mode it can scale to ingest millions of files an hour. The classic spark.read option would have to scan all that files to find new files to load based on your condition.

Also, since autoloader under the hood uses spark structured streaming you get all its benefits. In case of failures, Auto Loader can resume from where it left off by information stored in the checkpoint location. And you get exactly once loading semantic 🙂

ChristianRRL · ‎10-07-2025

Hey @szymon_dybczak , thank you for the quick reply! One thing I forgot to specify, at the moment we currently do not have Unity Catalog enabled, and so we would not be able to leverage the notification mode, so we are locked into using directory listing mode at least for the time being.

If we consider Autoloader using directory listing mode, is there still a significant performance difference, or do both spark.read + Autoloader behave similarly from a performance perspective? For example, wouldn't Autoloader using directory listing mode also need to scan all the files as would the spark.read method?

szymon_dybczak · ‎10-07-2025

Hi @ChristianRRL ,

Good question. So according to databricks they optimized directory listing mode for Auto Loader to discover files in cloud storage more efficiently than other Apache Spark options.

Another thing, you don't need to have Unity Catalog to enable file notification mode. I was using this mode in autoloader way before Unity Catalog 😉

Databricks Community

Performance Comparison: spark.read vs. Autoloader

Join Us as a Local Community Builder!

Join us for another BrickTalk: Vibe-Coding Databricks Apps in Replit with Augusto!

🌟 Community Pulse: Your Weekly Roundup! November 14 – 20, 2025

Celebrating Our First Brickster Champion: Louis Frolio

⭐ Setup Spark with Hadoop Anywhere : A DBR aligned local Spark+HDFS+Hive stack on Docker⭐

Big Book of Data Engineering - Get how-tos, code snippets and real-world examples