Autoingest not working with Unity Catalog in DLT pipeline
12-30-2023 05:50 PM - edited 12-30-2023 05:52 PM
Hey Everyone,
I've built a very simple pipeline with a single DLT table using auto ingest (Auto Loader), and it works, provided I don't specify the output location. When I build the same pipeline but set Unity Catalog (UC) as the output location, it fails when setting up S3 notifications, which is entirely bizarre. I've looked at the logs on the Databricks side and the request logs in AWS, and it looks like Databricks isn't using the instance profile I've set for some reason. Further details below; any help would be greatly appreciated!
Context
- Databricks on AWS
- Deployed one week ago, so it uses all the latest features (the Unity Catalog metastore is the default)
Things I've done
- The instance profile is set in the pipeline settings and appears in both clusters in the JSON settings (see the sketch after this list)
- The same instance profile is used when setting up the pipeline without Unity Catalog, and it correctly creates the SNS/SQS resources without issue, so it's not a permissions problem on the role
- The cluster access mode is set to "Shared"
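For reference, the relevant part of the pipeline JSON settings looks roughly like this (the instance profile ARN is a placeholder):

{
  "clusters": [
    {
      "label": "default",
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/<profile-name>"
      }
    },
    {
      "label": "maintenance",
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/<profile-name>"
      }
    }
  ]
}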
Things I've tried
- I set up a storage credential in the target Unity Catalog metastore (by copying the working instance profile's role) for the bucket, but that didn't change anything (and my understanding is this is only used for accessing data, not for setting up file notification resources)
- I gave the Unity Catalog IAM role full access to S3, no difference
- I rebuilt the pipeline, no effect
Labels: Workflows
01-01-2024 11:59 PM
Hey @Retired_mod ,
Thanks for the response!
I tried the above with no luck, unfortunately:
- I don't have an apply_merge function in my pipeline definition; please find the pipeline definition below
- I'm running DBR 14.2
- I don't think Databricks Connect applies here, as this was all set up in the Databricks UI
- Thanks for the link to that one; I read it a couple of times and implemented all the recommendations with no luck.
DLT definition:
CREATE OR REFRESH STREAMING LIVE TABLE raw_testing
AS SELECT *
FROM cloud_files(
  "s3://bucket-path",
  "csv",
  map(
    "header", "true",
    "sep", "|",
    "cloudFiles.useNotifications", "true",
    "inferSchema", "true"
  )
);
This pipeline works as expected when using the Hive metastore (HMS) as the output location but doesn't work with UC.
Any other thoughts? Is there some way I can escalate this? At this point it feels like a bug.
01-02-2024 03:06 PM
Thanks @Retired_mod, UC can connect to the S3 bucket and read the data but it fails when trying to set up the bucket notifications.
I'll raise a ticket with support and post back here if I find a resolution.
05-02-2024 10:33 AM
@Red1 Were you able to resolve this issue? If yes, what was the fix?
05-02-2024 04:08 PM
Hey @Babu_Krishnan, I was! I had to reach out to my Databricks support engineer directly, and the resolution was to add "cloudFiles.awsAccessKey" and "cloudFiles.awsSecretKey" to the cloud_files options, as in the screenshot below (apologies, I don't know why the screenshot is so grainy). He also mentioned using the Databricks secret store for the credentials themselves.
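Since the screenshot is hard to read, here is a sketch of the resulting DLT definition with those two options added (the key values are placeholders; in practice pull them from Databricks secrets rather than hard-coding them):

CREATE OR REFRESH STREAMING LIVE TABLE raw_testing
AS SELECT *
FROM cloud_files(
  "s3://bucket-path",
  "csv",
  map(
    "header", "true",
    "sep", "|",
    "cloudFiles.useNotifications", "true",
    "inferSchema", "true",
    -- Credentials Auto Loader uses to create and read the SNS/SQS notification resources.
    -- Placeholder values shown; store the real keys in Databricks secrets.
    "cloudFiles.awsAccessKey", "<aws-access-key-id>",
    "cloudFiles.awsSecretKey", "<aws-secret-access-key>"
  )
);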
05-02-2024 06:13 PM
Thanks a lot @Red1, let me try that.
I'm curious to know what the purpose of roleArn is, though. I'm also interested in learning how we can use a secret manager to avoid passing credentials as plain text in a notebook. Thanks in advance.
05-02-2024 08:05 PM
@Red1, it worked. Thanks for the details. I used Databricks secrets to store the credentials.
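For anyone landing here later, a minimal sketch of the secrets side, assuming a secret scope named aws_ingest with keys access_key and secret_key (all names are placeholders, not from this thread). It shows the Python notebook pattern for the same Auto Loader source; the thread doesn't spell out exactly how the values were wired into the SQL pipeline, so treat this as one option rather than the exact fix:

# Read the AWS keys from a Databricks secret scope instead of hard-coding them.
# Scope and key names are placeholders.
access_key = dbutils.secrets.get(scope="aws_ingest", key="access_key")
secret_key = dbutils.secrets.get(scope="aws_ingest", key="secret_key")

# Python equivalent of the cloud_files call above, passing the keys as Auto Loader options.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("sep", "|")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.awsAccessKey", access_key)
    .option("cloudFiles.awsSecretKey", secret_key)
    .load("s3://bucket-path")
)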

