Performance Comparison: spark.read vs. Autoloader

ChristianRRL
Valued Contributor III

Hi there, I would appreciate some help comparing the runtime performance of two approaches to performing ELT in Databricks: spark.read vs. Autoloader. We already have a process in place that extracts highly nested JSON data into a landing path, and from there it's getting a bit confusing to understand whether there are significant pros/cons to either of these approaches.

Approach A. spark.read:

  • Step 1: Generate a Spark dataframe from the latest JSON data via the following logic:
# Assume `last_read_filetimestamp` is pulled from the max file_modification_time
# in the bronze (flattened) Delta table
from pyspark.sql import functions as F

df_raw = (spark.read
    .json(raw_data_path)
    # Alias the nested _metadata fields directly: withColumnRenamed with a
    # dotted name such as "_metadata.file_path" only matches top-level column
    # names, so it silently does nothing here.
    .select(
        "*",
        F.col("_metadata.file_path").alias("file_path"),
        F.col("_metadata.file_modification_time").alias("file_modification_time"),
    )
)

# Incremental filter: keep only files newer than the last ingested timestamp
if last_read_filetimestamp:
    df_raw = df_raw.filter(F.col("file_modification_time") > last_read_filetimestamp)
  • Step 2: Perform the relevant data-flattening operations on the dataframe
  • Step 3: Perform the appropriate upsert/merge operation into the target bronze table (a sketch of Steps 2–3 follows this list)
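A minimal sketch of Steps 2–3, assuming the nested structure described by the schema hints in Approach B below; the exploded column names, the merge key `element_id`, and the bronze table name `bronze.flattened` are all illustrative placeholders:

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Step 2 (sketch): explode the nested map into one row per element
df_flat = (df_raw
    .select(
        "file_path",
        "file_modification_time",
        F.explode("elementData.element.data").alias("element_id", "element"),
    )
    .select("file_path", "file_modification_time", "element_id", "element.dataPoint")
)

# Step 3 (sketch): upsert into the bronze table on the placeholder key
bronze = DeltaTable.forName(spark, "bronze.flattened")
(bronze.alias("t")
    .merge(df_flat.alias("s"), "t.element_id = s.element_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())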

Approach B. Autoloader:

  • Step 1: Run the following logic to initialize a dataframe via Auto Loader:
# Schema hints pin the type of the deeply nested element data so inference
# doesn't have to work it out from sampled files
schema_hints = 'elementData.element.data MAP<STRING, STRUCT<dataPoint: MAP<STRING, STRING>, values: ARRAY<MAP<STRING, STRING>>>>'

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaHints", schema_hints)
    .option("multiLine", "true")
    .option("cloudFiles.schemaLocation", f"{raw_path}/schema")  # where the inferred schema is persisted
    .option("cloudFiles.schemaEvolutionMode", "rescue")  # schema is never evolved; unexpected fields land in _rescued_data
    .load(f"{landing_path}/*/data")
    .select("*", "_metadata")
)
  • Step 2: Perform the relevant data-flattening operations on the dataframe
  • Step 3: Perform the appropriate upsert/merge operation into the target bronze table (see the foreachBatch sketch after this list)
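Since MERGE isn't available as a direct streaming sink, Steps 2–3 for Approach B are typically done inside foreachBatch. A minimal sketch, where `flatten_batch` stands in for the same flattening logic as in Approach A, and the table name and merge key are the same placeholders as above:

from delta.tables import DeltaTable

def upsert_to_bronze(batch_df, batch_id):
    # Step 2 (placeholder): apply the same flattening logic as in Approach A
    flat_df = flatten_batch(batch_df)
    # Step 3: upsert this micro-batch into the bronze table
    bronze = DeltaTable.forName(batch_df.sparkSession, "bronze.flattened")
    (bronze.alias("t")
        .merge(flat_df.alias("s"), "t.element_id = s.element_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(df.writeStream
    .foreachBatch(upsert_to_bronze)
    .option("checkpointLocation", f"{raw_path}/checkpoint")  # tracks which files have been processed
    .trigger(availableNow=True)  # process everything pending, then stop (batch-style run)
    .start())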

szymon_dybczak
Esteemed Contributor III

Hi @ChristianRRL ,

For this kind of ingestion scenario, Auto Loader is the winner. It will scale much better than the batch approach, especially if we are talking about a large number of files.

If you configure Auto Loader with file notification mode, it can scale to ingest millions of files an hour. The classic spark.read option has to list and scan all the files in the path just to find the new ones that satisfy your filter condition.

Also, since Auto Loader uses Spark Structured Streaming under the hood, you get all of its benefits. In case of failure, Auto Loader can resume from where it left off using the information stored in the checkpoint location, and you get exactly-once loading semantics 🙂

ChristianRRL
Valued Contributor III

Hey @szymon_dybczak, thank you for the quick reply! One thing I forgot to specify: we currently do not have Unity Catalog enabled, so we would not be able to leverage file notification mode and are locked into directory listing mode, at least for the time being.

If we consider Auto Loader in directory listing mode, is there still a significant performance difference, or do spark.read and Auto Loader behave similarly from a performance perspective? For example, wouldn't Auto Loader in directory listing mode also need to scan all the files, just as the spark.read method does?

szymon_dybczak
Esteemed Contributor III

Hi @ChristianRRL ,

Good question. According to Databricks, directory listing mode is optimized for Auto Loader to discover files in cloud storage more efficiently than other Apache Spark options. Crucially, Auto Loader also records already-ingested files in RocksDB at the checkpoint location, so even though directories are listed on each run, only newly discovered files are actually processed.

[Screenshot attachment: szymon_dybczak_0-1759905695977.png]

One more thing: you don't need Unity Catalog to enable file notification mode. I was using this mode with Auto Loader long before Unity Catalog 😉
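For reference, switching to file notification mode is a single reader option, though Auto Loader needs cloud permissions to create the notification resources (e.g., an SQS queue on AWS or an Event Grid subscription on Azure). A minimal sketch, reusing the reader from Approach B:

# Same reader as in Approach B, with file notification mode enabled.
# Assumes Auto Loader has permissions to set up the notification
# services (queue/subscription) on the landing bucket/container.
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")  # file notification mode
    # ... remaining options as in Approach B ...
    .load(f"{landing_path}/*/data")
)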
