Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Autoingest not working with Unity Catalog in DLT pipeline

Red1
New Contributor III

Hey Everyone,

I've built a very simple pipeline with a single DLT table using auto ingest, and it works as long as I don't specify the output location. When I build the same pipeline but set Unity Catalog (UC) as the output location, it fails while setting up the S3 notifications, which is entirely bizarre. I've looked at the logs on the Databricks side and the request logs in AWS, and it looks like Databricks isn't using the instance profile I've set for some reason. Further details below; any help would be greatly appreciated!

Context

  • Databricks on AWS
  • Deployed one week ago, so it uses all the latest features (the Unity Catalog metastore is the default)

Things I've done

  • The instance profile is set in the pipeline settings and it appears in both clusters in the JSON settings (see the sketch after this list)
  • The same instance profile is used when setting up the pipeline without Unity Catalog, and it correctly creates the SNS/SQS resources without issue, so it's not a permissions problem on the role
  • The cluster access mode is set to "Shared"
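
For reference, this is roughly where the instance profile shows up in the pipeline's JSON settings. It's a trimmed sketch: the ARN is a placeholder, not the real one from my account, and the rest of the settings are omitted:

"clusters": [
  {
    "label": "default",
    "aws_attributes": {
      "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/<profile-name>"
    }
  },
  {
    "label": "maintenance",
    "aws_attributes": {
      "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/<profile-name>"
    }
  }
]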

Things I've tried

  • I set up a storage credential in the target Unity Catalog metastore (by copying the working instance profile's role) for the bucket, but that didn't change anything (and it's my understanding this is only used for accessing data, not for setting up the file notification resources); see the sketch after this list
  • I gave the Unity Catalog IAM role full access to S3, no difference
  • I rebuilt the pipeline, no effect
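
For completeness, the UC side of that experiment looks roughly like the sketch below. The names are placeholders (my_s3_credential is assumed to be a storage credential that already exists for the same IAM role as the instance profile), and as noted above this only covers data access, not the notification setup:

CREATE EXTERNAL LOCATION IF NOT EXISTS landing_bucket
  URL 's3://bucket-path'
  WITH (STORAGE CREDENTIAL my_s3_credential)
  COMMENT 'UC data access to the Auto Loader source bucket';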

6 REPLIES

Red1
New Contributor III

Hey @Retired_mod,

Thanks for the response!

I tried the above, unfortunately with no luck:
- I don't have an apply_merge function in my pipeline definition; please find the pipeline definition below
- I'm running DBR 14.2
- I don't think Databricks Connect applies here, as this was all set up in the Databricks UI
- Thanks for the link to that one; I read it a couple of times and implemented all the recommendations, with no luck.

DLT definition:

 

CREATE OR REFRESH STREAMING LIVE TABLE raw_testing
AS SELECT *
  FROM cloud_files(
    "s3://bucket-path",  -- source bucket/prefix
    "csv",               -- file format
    map(
      "header", "true",
      "sep", "|",
      "cloudFiles.useNotifications", "true",  -- file notification mode; this is the step that creates the SNS/SQS resources
      "inferSchema", "true"
    )
  );

 

This pipeline works as expected when using the Hive metastore (HMS) as the output location, but it doesn't work with UC.

Any other thoughts? Is there some way I can escalate this? At this point it feels like a bug.

Red1
New Contributor III

Thanks @Retired_mod. UC can connect to the S3 bucket and read the data, but it fails when trying to set up the bucket notifications.

I'll raise a ticket with support and post back here if I find a resolution.

@Red1 Were you able to resolve this issue? If yes, what was the fix?

Red1
New Contributor III

Hey @Babu_Krishnan, I was! I had to reach out to my Databricks support engineer directly, and the resolution was to add "cloudFiles.awsAccessKey" and "cloudFiles.awsSecretKey" to the params, as in the screenshot below (apologies, I don't know why the screenshot is so grainy). He also mentioned using the Databricks secret store for the credentials themselves.
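
For anyone who can't make out the screenshot, the change boils down to two extra entries in the cloud_files options map. This is a minimal sketch with placeholder values; keep the real key pair in the Databricks secret store rather than hard-coded in the notebook, as the support engineer suggested:

CREATE OR REFRESH STREAMING LIVE TABLE raw_testing
AS SELECT *
  FROM cloud_files(
    "s3://bucket-path",
    "csv",
    map(
      "header", "true",
      "sep", "|",
      "cloudFiles.useNotifications", "true",
      "cloudFiles.awsAccessKey", "<aws-access-key-id>",     -- placeholder; pull from a secret scope
      "cloudFiles.awsSecretKey", "<aws-secret-access-key>", -- placeholder; pull from a secret scope
      "inferSchema", "true"
    )
  );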

Thanks a lot @Red1. Let me try that.

But I'm curious to know what the purpose of roleARN is. Also, how can we utilize Secret Manager to avoid passing credentials as plain text in a notebook? Thanks in advance.

@Red1, it worked! Thanks for the details. I used Databricks secrets to store the credentials.
