Hi everyone,
I'm seeing a dramatic difference in processing time between a batch read and an Auto Loader stream when reading about 250,000 files from S3 in Databricks. My goal is to read the file metadata and register it as a table (and eventually use the Auto Loader backup option). Here's the comparison:
Batch approach (2 minutes for 250k files):
# Batch: list every file with the binaryFile source and keep only the metadata columns
df = (
    spark.read.format("binaryFile")
    .option("recursiveFileLookup", "true")
    .load(source_s3_path_default)
    .select("path", "modificationTime", "length")
)
df.write.saveAsTable("some_table")
Auto Loader streaming approach (2.5 hours for 250k files):
# Auto Loader: same path and columns, discovered through cloudFiles
write_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .load(source_s3_path_default)
    .select("path", "modificationTime", "length")
    .writeStream
    .outputMode("append")  # "overwrite" isn't a valid streaming output mode; append is what I intend
    .option("checkpointLocation", some_checkpoint)
    .trigger(availableNow=True)
    .table(some_table)
)
write_stream.awaitTermination()
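One thing I plan to try next is raising the per-micro-batch limits, since I believe Auto Loader defaults to roughly 1,000 files per batch, which would split 250k files into hundreds of tiny micro-batches. A minimal sketch, assuming cloudFiles.maxFilesPerTrigger / cloudFiles.maxBytesPerTrigger behave as documented (the values below are guesses, not tested):

tuned_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    # Assumption: raising these rate limits lets availableNow process far more
    # files per micro-batch than the default (reportedly 1,000 files)
    .option("cloudFiles.maxFilesPerTrigger", 100000)
    .option("cloudFiles.maxBytesPerTrigger", "50g")
    .load(source_s3_path_default)
    .select("path", "modificationTime", "length")
    .writeStream
    .outputMode("append")
    .option("checkpointLocation", some_checkpoint)
    .trigger(availableNow=True)
    .table(some_table)
)
tuned_stream.awaitTermination()

If that helps, I'd take it as a sign the slowdown is mostly micro-batch overhead rather than the S3 listing itself.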
Why does Auto Loader take so much longer?
Same file count and S3 path
Same basic selection of columns
The only difference is spark.read.format("binaryFile") vs. spark.readStream.format("cloudFiles")
Am I missing something fundamental about how Auto Loader is designed for large initial loads?
Is all this overhead expected, and should I always use batch for historical loads and reserve Auto Loader only for incremental/real-time workflows?
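For concreteness, the split I have in mind looks roughly like this; cloudFiles.includeExistingFiles=false is my assumption for how to make the stream ignore the files already loaded by the batch step (not verified at this scale):

# 1) One-time batch load of the ~250k historical files
(
    spark.read.format("binaryFile")
    .option("recursiveFileLookup", "true")
    .load(source_s3_path_default)
    .select("path", "modificationTime", "length")
    .write.mode("overwrite")
    .saveAsTable(some_table)
)

# 2) Auto Loader for new files only; assumes includeExistingFiles=false makes the
#    first streaming run skip everything already under the path
incremental = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .option("cloudFiles.includeExistingFiles", "false")
    .load(source_s3_path_default)
    .select("path", "modificationTime", "length")
    .writeStream
    .outputMode("append")
    .option("checkpointLocation", some_checkpoint)
    .trigger(availableNow=True)
    .table(some_table)
)
incremental.awaitTermination()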
Thanks in advance for your insights!