Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-17-2023 09:46 AM
I am using DLT with filenotification and DLT job is just fetching 1 notification from SQS queue at a time. My pipeline is expected to process 500K notifications per day but it running hours behind. Any recommendations?
spark.readStream.format("cloudFiles")
.option("cloudFiles.schemaLocation","/mnt/abc/")
.option('cloudFiles.format', 'json')
.option('cloudFiles.inferColumnTypes', 'true')
.option('cloudFiles.useNotifications', True)
.option('skipChangeCommits', 'true')
.option('cloudFiles.backfillInterval', '3 hour')
.option('cloudFiles.maxFilesPerTrigger', 10000)
Logs:
NotificationFileEventFetcher: [queryId =] Fetched 1 messages from cloud queue storage.
NotificationFileEventFetcher: [queryId =] Fetched 1 messages from cloud queue storage.
NotificationFileEventFetcher: [queryId =] Fetched 1 messages from cloud queue storage.
Logs:
NotificationFileEventFetcher: [queryId =] Fetched 1 messages from cloud queue storage.
NotificationFileEventFetcher: [queryId =] Fetched 1 messages from cloud queue storage.
NotificationFileEventFetcher: [queryId =] Fetched 1 messages from cloud queue storage.
Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-17-2023 12:14 PM
Can you set this value to higher number and try
cloudFiles.fetchParallelism its 1 by default
Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
11-17-2023 02:48 PM
ThankscloudFiles.fetchParallelism to 100 definitely helped to read more messages from SQS.
NotificationFileEventFetcher: [queryId = 111] Fetched 100 messages from cloud queue storage