Hi Team,
I have 4-5 million files in S3, only about 1.5 GB of data in total with 9 million records. When I use Auto Loader to read the data with readStream and write it to a Delta table, processing takes far too long: it loads at most 1k to 5k rows per batch...
The code is below (input_path is the S3 folder):
df_stream = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", f"{checkpoint_path}/schema/")
        .option("cloudFiles.includeExistingFiles", "true")
        .option("cloudFiles.fetchParallelism", "32")
        .option("cloudFiles.maxFilesPerTrigger", 50000)  # Adjust as needed
        .option("cloudFiles.maxBytesPerTrigger", "10g")  # Adjust as needed
        .load(input_path)
)

# Write to Delta table (append)
stream_query = (
    df_stream.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .trigger(availableNow=True)
        .toTable(delta_table)
)
Any suggestions on what to modify, please?
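One direction I was wondering about is switching file discovery to notification mode, since listing millions of small S3 objects per trigger might be the bottleneck. Below is a minimal sketch of the read side with that change; the cloudFiles.useNotifications and cloudFiles.backfillInterval options are my reading of the Auto Loader docs, and I'm assuming the required SQS/SNS permissions are in place. Is this the right direction?

# Sketch only -- same pipeline, but with file-notification mode for discovery.
# Assumes IAM permissions for Auto Loader to create/read the SQS queue and SNS topic.
df_stream_notify = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", f"{checkpoint_path}/schema/")
        .option("cloudFiles.includeExistingFiles", "true")
        # Discover new files via SQS/SNS notifications instead of listing the bucket
        .option("cloudFiles.useNotifications", "true")
        # Periodic backfill listing to catch any files missed by notifications
        .option("cloudFiles.backfillInterval", "1 day")
        .option("cloudFiles.maxFilesPerTrigger", 50000)  # Adjust as needed
        .load(input_path)
)

The write side would stay the same (writeStream with availableNow=True to the Delta table).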