Autoloader with filenotification

kulkpd
Contributor

I am using DLT with filenotification and DLT job is just fetching 1 notification from SQS queue at a time. My pipeline is expected to process 500K notifications per day but it running hours behind. Any recommendations?

spark.readStream.format("cloudFiles")
.option("cloudFiles.schemaLocation","/mnt/abc/")
.option('cloudFiles.format', 'json')
.option('cloudFiles.inferColumnTypes', 'true')
.option('cloudFiles.useNotifications', True)
.option('skipChangeCommits', 'true')
.option('cloudFiles.backfillInterval', '3 hour')
.option('cloudFiles.maxFilesPerTrigger', 10000)


Logs:
NotificationFileEventFetcher: [queryId =] Fetched 1 messages from cloud queue storage.
NotificationFileEventFetcher: [queryId =] Fetched 1 messages from cloud queue storage.
NotificationFileEventFetcher: [queryId =] Fetched 1 messages from cloud queue storage.

Rdipak
New Contributor II

Can you set this value to higher number and try

cloudFiles.fetchParallelism its 1 by default

ThankscloudFiles.fetchParallelism to 100 definitely helped to read more messages from SQS.

NotificationFileEventFetcher: [queryId = 111] Fetched 100 messages from cloud queue storage

View solution in original post