Data Engineering

What do you think about continuing to use an instance profile for S3 multipart upload?

Yuki
New Contributor III

My team is currently using an instance profile to upload data to S3, since we only have the Hive metastore.

I like Unity Catalog a lot, but my code uses multipart upload to S3 for efficiency.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html

 

I want to continue using it, but I'm unsure of the best practice now because instance profiles are not recommended anymore.

https://docs.databricks.com/aws/en/admin/workspace-settings/manage-instance-profiles

 

Is it okay to use it in our case?

Or is there any way to do multipart uploads to a Unity Catalog Volume?

 

Thank you.

2 REPLIES

LRALVA
Honored Contributor

Hi @Yuki 

Not currently: Unity Catalog Volumes do not natively support multipart upload via the AWS SDK.
Unity Catalog Volumes are Databricks-managed paths in S3 (or ADLS) that are accessed through Unity Catalog governance.
You can't use low-level AWS SDK multipart APIs (e.g., boto3.client('s3').upload_part(...)) directly against Volumes because:
They don't expose raw bucket paths.
They are addressed through workspace paths like /Volumes/<catalog>/<schema>/<volume>/<path>.
They are intended for managed data access via Spark and file I/O, not for direct S3 multipart SDK operations.
Unity Catalog enforces fine-grained access control, so raw multipart access that bypasses Databricks governance isn't supported. You can still write to a Volume through its path with ordinary file APIs, as in the sketch below.
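
For illustration only, here is a minimal sketch of a governed write to a Volume without any AWS SDK calls. The catalog/schema/volume names and file paths are hypothetical, and it assumes a Unity Catalog-enabled cluster where Volumes are exposed under /Volumes:

import shutil

# Hypothetical names: a file produced locally on the driver is copied into a
# Unity Catalog Volume through its /Volumes path. Unity Catalog handles the
# underlying cloud storage access, so no boto3 or instance profile is involved.
local_file = "/tmp/report.csv"
volume_path = "/Volumes/my_catalog/my_schema/my_volume/landing/report.csv"
shutil.copy(local_file, volume_path)

# In a notebook, dbutils.fs.cp("file:/tmp/report.csv", volume_path) works as well.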

Can You Still Use Instance Profiles for Multipart Uploads?
✔️ Yes, with caution
While Databricks recommends moving away from instance profiles, they are still supported for use cases like yours where low-level AWS SDK access is required (e.g., multipart upload, boto3-based apps); see the boto3 sketch below.
Just be sure to follow least-privilege IAM practices and scope access to only the S3 buckets involved.
Databricks' official docs confirm:
“Instance profiles are still supported but should be used for specific, advanced access cases.”
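
As an illustration only (the bucket, key, and file names are made up), the boto3 transfer manager picks up the cluster's instance-profile credentials automatically and switches to the S3 multipart API above a configurable size threshold:

import boto3
from boto3.s3.transfer import TransferConfig

# Credentials come from the cluster's instance profile; no keys in code.
s3 = boto3.client("s3")

# Files larger than multipart_threshold are uploaded with the S3 multipart API,
# split into multipart_chunksize parts and uploaded concurrently.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=8,
)

s3.upload_file(
    Filename="/tmp/big_export.parquet",     # hypothetical local file
    Bucket="my-raw-bucket",                 # hypothetical bucket allowed by the IAM role
    Key="exports/big_export.parquet",
    Config=config,
)

If you need the low-level calls (create_multipart_upload / upload_part / complete_multipart_upload), they work the same way with instance-profile credentials.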

Better Practice (If Unity Catalog Adoption is Your Goal)
If you want to align more closely with Unity Catalog governance, here's a hybrid approach:

1. Use Spark or file APIs for most Volume-based data writes
Writing a DataFrame to a Unity Catalog Volume path:
df.write.csv("/Volumes/my_catalog/my_schema/my_volume/my_table")

2. For multipart upload, use the instance profile + boto3 in a secured job
Keep a specific job or notebook that:
Uses boto3 with credentials provided by the instance profile
Uploads directly to raw S3 (outside UC)
Registers the output in Unity Catalog afterwards as an external table via CREATE TABLE ... LOCATION (see the sketch below)
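
A minimal sketch of that registration step, assuming an external location (and storage credential) covering the bucket has already been configured in Unity Catalog; every name here is hypothetical:

# Register the boto3-uploaded CSV files as a Unity Catalog external table.
# Requires CREATE EXTERNAL TABLE on an external location covering this path.
# "spark" is the SparkSession predefined in Databricks notebooks/jobs.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_catalog.my_schema.my_table
    USING CSV
    OPTIONS (header = 'true')
    LOCATION 's3://my-raw-bucket/exports/my_table/'
""")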

LR

Yuki
New Contributor III

Hi @LRALVA ,

Thank you for your excellent response. I really appreciate it.

I couldn't find the statement "Instance profiles are still supported but should be used for specific, advanced access cases" in the docs, but I will keep using the instance profile for now, recognizing that my case is a special one.

But I also want to migrate to UC completely, so your best-practice suggestion is helpful for me.

I understand: where we can use Spark and the data format allows it, I will use that approach fully.

I was deeply moved by how thoughtfully you responded.
