Hi there, I'd appreciate some help comparing the runtime performance of two approaches to performing ELT in Databricks: spark.read vs. Autoloader. We already have a process in place that extracts highly nested JSON data into a landing path, and from there it's hard to tell whether there are significant pros/cons to either approach.
Approach A. spark.read:
- Step 1: Generate a spark dataframe from the latest json data via the following logic:
from pyspark.sql import functions as F

# Assume `last_read_filetimestamp` is pulled from max(file_modification_time) in the bronze (flattened) Delta table
df_raw = (
    spark.read
    .json(raw_data_path)
    .select(
        "*",
        F.col("_metadata.file_path").alias("file_path"),
        F.col("_metadata.file_modification_time").alias("file_modification_time"),
    )
)
if last_read_filetimestamp:
    df_raw = df_raw.filter(F.col("file_modification_time") > last_read_filetimestamp)
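For context, `last_read_filetimestamp` is derived roughly like this (`bronze_table_name` is a placeholder for our actual bronze table):

# Returns None when the bronze table is empty, so the first run picks up all files
last_read_filetimestamp = (
    spark.read.table(bronze_table_name)
    .agg(F.max("file_modification_time"))
    .collect()[0][0]
)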
- Step 2: Perform relevant data flattening operations on the dataframe
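For Step 2, the flattening looks something along these lines, based on the nested elementData.element.data structure (also shown in the schema hints under Approach B; the other column names are illustrative):

df_flat = (
    df_raw
    .select(
        "file_path",
        "file_modification_time",
        # explode the MAP<STRING, STRUCT<...>> into one row per key/value pair
        F.explode("elementData.element.data").alias("element_key", "element_value"),
    )
    .select(
        "file_path",
        "file_modification_time",
        "element_key",
        F.col("element_value.dataPoint").alias("data_point"),
        # one row per entry in the values array; outer keeps rows with empty arrays
        F.explode_outer("element_value.values").alias("value_map"),
    )
)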
- Step 3: Perform appropriate upsert/merge operation into target bronze table
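And for Step 3, a rough sketch of the Delta MERGE (the table name and merge keys are placeholders):

from delta.tables import DeltaTable

bronze = DeltaTable.forName(spark, bronze_table_name)
(
    bronze.alias("t")
    .merge(
        df_flat.alias("s"),
        "t.file_path = s.file_path AND t.element_key = s.element_key",
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)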
Approach B. Autoloader:
- Step 1: Run the following logic to initialize a streaming dataframe via Autoloader:
schema_hints = 'elementData.element.data MAP<STRING, STRUCT<dataPoint: MAP<STRING, STRING>, values: ARRAY<MAP<STRING, STRING>>>>'
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaHints", schema_hints)
    .option("multiLine", "true")
    .option("cloudFiles.schemaLocation", f"{raw_path}/schema")
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load(f"{landing_path}/*/data")
    .select("*", "_metadata")
)
- Step 2: Perform relevant data flattening operations on the dataframe
- Step 3: Perform appropriate upsert/merge operation into the target bronze table (via foreachBatch, since MERGE can't be used directly as a streaming sink)
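A rough sketch of that streaming upsert (checkpoint_path, bronze_table_name, and the merge keys are placeholders):

from delta.tables import DeltaTable

def upsert_to_bronze(microbatch_df, batch_id):
    # Step 2's flattening would be applied to microbatch_df before the merge
    bronze = DeltaTable.forName(spark, bronze_table_name)
    (
        bronze.alias("t")
        .merge(
            microbatch_df.alias("s"),
            "t.file_path = s.file_path AND t.element_key = s.element_key",
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    df.writeStream
    .foreachBatch(upsert_to_bronze)
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)  # incremental batch run; Autoloader's checkpoint tracks processed files
    .start()
)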