Data Engineering

Spark Streaming loading only 1k to 5k rows per batch into Delta table

bunny1174
Visitor

Hi Team,

I have 4-5 million files in S3, around 1.5 GB of data in total with 9 million records. When I try to use Auto Loader to read the data with readStream and write to a Delta table, the processing takes too much time - it is loading only 1k to 5k rows max per batch...

The code is like below (input_path is the S3 folder):

df_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"{checkpoint_path}/schema/")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("cloudFiles.fetchParallelism", "32")
    .option("cloudFiles.maxFilesPerTrigger", 50000)   # Adjust as needed
    .option("cloudFiles.maxBytesPerTrigger", "10g")   # Adjust as needed
    .load(input_path)
)


# Write to Delta table (append)
stream_query = (
    df_stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .trigger(availableNow=True)
    .toTable(delta_table)
)
Any suggestions on what to modify, please?
1 REPLY

szymon_dybczak
Esteemed Contributor III

Hi @bunny1174 ,

You have 4-5 million files in S3 and their total size is only 1.5 GB - this clearly indicates a small files problem. You need to compact those files into larger ones. There's no way your pipeline will be performant if you have that many files and each of them is only around 1-2 KB.
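
For illustration (not part of the original reply), a minimal one-off compaction sketch, assuming a Databricks notebook where spark is already defined; compacted_path and the coalesce count are assumptions, sized for roughly 1.5 GB of data:

# Hypothetical one-off batch compaction: read the small JSON files once
# and rewrite them as a handful of larger files before streaming them in.
df_small = spark.read.json(input_path)

(
    df_small
    .coalesce(8)                 # ~1.5 GB total -> roughly 8 files of ~200 MB each
    .write
    .mode("overwrite")
    .json(compacted_path)        # or .format("delta").saveAsTable(...) to land in Delta directly
)

After compaction you can point the existing Auto Loader stream at compacted_path instead of input_path (or skip streaming altogether and write the batch DataFrame straight to the Delta table).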

You can read about this problem in general in the following articles:

Breaking the Big Data Bottleneck: Solving Spark's "Small Files" Problem

Tackling the Small Files Problem in Apache Spark | by Henkel Data & Analytics | Henkel Data & Analyt...

Spark Small Files Problem: Optimizing Data Processing