Autoingest not working with Unity Catalog in DLT pipeline

Red1
New Contributor III

Hey Everyone,

I've built a very simple pipeline with a single DLT table using auto ingest (Auto Loader file notifications), and it works as long as I don't specify an output location. When I build the same pipeline but set Unity Catalog (UC) as the output location, it fails while setting up the S3 notifications, which is entirely bizarre. I've looked at the logs on the Databricks side and the request logs in AWS, and it looks like Databricks isn't using the instance profile I've set for some reason. Further details below; any help would be greatly appreciated!

Context

  • Databricks on AWS
  • Workspace deployed a week ago, so it uses all the latest features (the Unity Catalog metastore is the default)

Things I've done

  • The instance profile is set in the pipeline settings and it appears in both clusters in the JSON settings
  • The same instance profile is used when setting up the pipeline without Unity Catalog, and it correctly creates the SNS/SQS resources without issue, so it's not a permissions problem on the role
  • The cluster access mode is set to "Shared"

Things I've tried

  • I set up a storage credential in the target Unity Catalog (by copying the working instance profile) for the bucket, but that didn't change anything (and my understanding is that this is only used for reading data, not for setting up file notification resources) - see the external location sketch after this list
  • I gave the Unity Catalog IAM role full access to S3, no difference
  • I rebuilt the pipeline, no effect
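
For reference, this is roughly what registering the bucket in Unity Catalog looks like in SQL. The location name (landing_bucket), storage credential name (s3_ingest_cred) and grantee (`data_engineers`) are placeholders, not names from this thread:

-- Register the bucket as an external location backed by a storage credential
CREATE EXTERNAL LOCATION IF NOT EXISTS landing_bucket
  URL 's3://bucket-path'
  WITH (STORAGE CREDENTIAL s3_ingest_cred)
  COMMENT 'Landing bucket for the Auto Loader DLT pipeline';

-- Let the pipeline owner read files through the external location
GRANT READ FILES ON EXTERNAL LOCATION landing_bucket TO `data_engineers`;

As the first bullet says, this only governs reading the files through Unity Catalog; it does not by itself create the SNS/SQS notification resources, which is the step that fails here.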

8 REPLIES

Kaniz
Community Manager

Hi @Red1

  • You can solve the problem by specifying the catalog and schema name in the apply_merge function, such as apply_merge("{UNITY_CATALOG_NAME}.{SCHEMA_SILVER}.{SOURCE_SILVER_TABLE} >>>> {GOLD_TABLE}") - see the naming sketch after this list.
  • Error creating an external location in Unity Catalog: You can also resolve the issue by upgrading to the latest Databricks runtime version and checking the firewall settings for the storage account.
  • Databricks-connect 11.3 works with Unity Catalog: This blog post explains how to use Databricks-connect 11.3 with Unity Catalog. It also provides some troubleshooting steps, such as double-checking the credentials and permissions and verifying that the Databricks instance URL, Personal Access Token (PAT), and organization ID are correct.
  • Use Unity Catalog with your Delta Live Tables pipelines: This documentation page shows how to use Unity Catalog with Delta Live Tables pipelines. It also mentions potential issues, such as single-node mode, and how to avoid them by specifying at least one worker when configuring compute settings.
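
For context on the first bullet, "fully qualified" here means the three-level catalog.schema.table naming that Unity Catalog uses. A minimal sketch with hypothetical names (main, silver, source_silver_table - none of these come from this thread):

-- Read a source table by its fully qualified Unity Catalog name
CREATE OR REFRESH LIVE TABLE gold_table
AS SELECT *
  FROM main.silver.source_silver_table;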

I hope this is helpful for you. Please let me know if you have any other questions or feedback. 😊

Red1
New Contributor III

Hey @Kaniz ,

Thanks for the response!

I tried the above, unfortunately with no luck:
- I don't have an apply_merge function in my pipeline definition; please find the pipeline definition below
- I'm running DBR 14.2
- I don't think Databricks Connect applies here, as this was all set up in the Databricks UI
- Thanks for the link to that last one; I read it a couple of times and implemented all the recommendations, with no luck.

DLT definition:

 

CREATE OR REFRESH STREAMING LIVE TABLE raw_testing
AS SELECT *
  FROM cloud_files(
    "s3://bucket-path",
    "csv",
    map(
      "header", "true",
      "sep", "|",
      "cloudFiles.useNotifications", "true",
      "inferSchema", "true"
    )
  );

 

This pipeline works as expected when using the Hive metastore (HMS) as the output location, but doesn't work with UC.

Any other thoughts? Is there some way I can escalate this? At this point it feels like a bug.

Kaniz
Community Manager

Hi @Red1,

  • One possible cause of the issue is that Unity Catalog is not configured properly to access the cloud storage. Make sure you have created an external location in Unity Catalog that points at the correct bucket and path, and check the firewall settings and network connectivity between your Databricks workspace and the cloud storage. You can find more details on how to create an external location in Unity Catalog here.
  • Another possible cause is a mismatch between the instance profile and the role you are using for your pipeline. It would be best to ensure that both your pipeline and cluster settings use the same instance profile and role, and that it has sufficient permissions to access S3.

I hope these suggestions help you fix your issue. If none of them works, you can contact the Databricks support team for further assistance.

 

Please let me know if you have any other questions or feedback. I’m always happy to help 😊

Red1
New Contributor III

Thanks @Kaniz - UC can connect to the S3 bucket and read the data, but the pipeline fails when trying to set up the bucket notifications.

I'll raise a ticket with support and post back here if I find a resolution.

Babu_Krishnan
New Contributor III

@Red1 Were you able to resolve this issue? If yes, what was the fix?

Red1
New Contributor III

Hey @Babu_Krishnan, I was! I had to reach out to my Databricks support engineer directly, and the resolution was to add "cloudFiles.awsAccessKey" and "cloudFiles.awsSecretKey" to the cloud_files options, as in the screenshot below (apologies, I don't know why the screenshot is so grainy). He also mentioned using the Databricks secret store for the credential values themselves.
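
For anyone finding this later, here's a rough sketch of that fix applied to the pipeline above. The ${aws_access_key} / ${aws_secret_key} references are hypothetical pipeline configuration parameters (not from the screenshot); the assumption is that the pipeline configuration supplies them, ideally backed by Databricks secrets rather than plain-text values:

CREATE OR REFRESH STREAMING LIVE TABLE raw_testing
AS SELECT *
  FROM cloud_files(
    "s3://bucket-path",
    "csv",
    map(
      "header", "true",
      "sep", "|",
      "cloudFiles.useNotifications", "true",
      "inferSchema", "true",
      -- hypothetical pipeline parameters; avoid hard-coding credentials here
      "cloudFiles.awsAccessKey", "${aws_access_key}",
      "cloudFiles.awsSecretKey", "${aws_secret_key}"
    )
  );

The credential values themselves are best kept in a Databricks secret scope and fed to the pipeline through its configuration rather than typed into the source, which matches the support engineer's suggestion.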

Babu_Krishnan
New Contributor III

Thanks a lot @Red1. Let me try that.

But I'm curious to know what the purpose of roleARN is. I'm also interested in learning how we can use the secret manager to avoid passing credentials as plain text in a notebook. Thanks in advance.

@Red1, it worked. Thanks for the details. I used Databricks secrets to store the credentials.
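
For the record, on the roleARN question above: Auto Loader's file notification mode also accepts a cloudFiles.roleArn option, which tells the stream to assume an IAM role (for example, one allowed to create and read the SNS/SQS resources) instead of relying only on long-lived access keys. A minimal sketch, with the ARN below as a placeholder rather than anything from this thread:

CREATE OR REFRESH STREAMING LIVE TABLE raw_testing
AS SELECT *
  FROM cloud_files(
    "s3://bucket-path",
    "csv",
    map(
      "cloudFiles.useNotifications", "true",
      -- hypothetical ARN: the role is assumed (via the instance profile or the
      -- access keys) to create and read the SNS/SQS notification resources
      "cloudFiles.roleArn", "arn:aws:iam::123456789012:role/notification-setup-role"
    )
  );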
