<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Best option for configuring Data Storage for Serverless SQL Warehouse in Warehousing &amp; Analytics</title>
    <link>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120739#M2091</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/166865"&gt;@Curious-mind&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You got it. Running COPY INTO is a good fit for the initial load, as it's optimized for bulk loads. Going forward, you'll want to use Auto Loader to incrementally process newly arriving files.&lt;/P&gt;</description>
    <pubDate>Mon, 02 Jun 2025 17:14:43 GMT</pubDate>
    <dc:creator>Shua42</dc:creator>
    <dc:date>2025-06-02T17:14:43Z</dc:date>
    <item>
      <title>Best option for configuring Data Storage for Serverless SQL Warehouse</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120720#M2088</link>
      <description>&lt;P&gt;Hello!&lt;/P&gt;&lt;P&gt;I'm new to Databricks.&lt;/P&gt;&lt;P&gt;Assume I need to migrate a 2 TB Oracle data mart to Databricks on Azure. Serverless SQL Warehouse seems like a valid choice.&lt;/P&gt;&lt;P&gt;What is the better option (cost vs. performance) for storing the data?&lt;/P&gt;&lt;P&gt;Should I upload Oracle extracts to Azure Blob Storage and create external tables?&lt;/P&gt;&lt;P&gt;Or is it better to use COPY INTO ... FROM to create managed tables?&lt;/P&gt;&lt;P&gt;Data size will grow by ~1 TB per year.&lt;/P&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2025 14:04:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120720#M2088</guid>
      <dc:creator>Curious-mind</dc:creator>
      <dc:date>2025-06-02T14:04:15Z</dc:date>
    </item>
    <item>
      <title>Re: Best option for configuring Data Storage for Serverless SQL Warehouse</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120727#M2089</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/166865"&gt;@Curious-mind&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;Welcome to Databricks! For your use case, I think creating managed tables using COPY INTO is going to be more performant, which will lead to better cost efficiency as well.&amp;nbsp;While external tables could initially be a bit cheaper, managed Delta tables offer significant performance and usability benefits that pay off as your data grows.&lt;/P&gt;
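&lt;P&gt;For illustration, a minimal initial-load sketch might look like the following (the catalog, schema, table, and storage path here are placeholders, so substitute your own):&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;-- Bulk-load CSV extracts into a managed Delta table&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;COPY INTO my_catalog.my_schema.sales&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;FROM 'abfss://landing@mystorageacct.dfs.core.windows.net/oracle_extracts/sales'&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;FILEFORMAT = CSV&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;COPY_OPTIONS ('mergeSchema' = 'true');&lt;/EM&gt;&lt;/P&gt;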
&lt;P&gt;Here are a few benefits that managed tables offer over external tables:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P&gt;Faster queries with indexing, caching, and Delta optimizations&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Easier schema enforcement, versioning, and time travel&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Seamless use with Unity Catalog, RBAC, and Serverless SQL&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Better support for optimization (&lt;CODE&gt;OPTIMIZE&lt;/CODE&gt;, &lt;CODE&gt;VACUUM&lt;/CODE&gt;, etc.)&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Mon, 02 Jun 2025 15:38:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120727#M2089</guid>
      <dc:creator>Shua42</dc:creator>
      <dc:date>2025-06-02T15:38:27Z</dc:date>
    </item>
    <item>
      <title>Re: Best option for configuring Data Storage for Serverless SQL Warehouse</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120738#M2090</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/154481"&gt;@Shua42&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thank you for the prompt reply.&lt;/P&gt;&lt;P&gt;For the managed tables' initial load:&lt;/P&gt;&lt;P&gt;Can I simply run a COPY command:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;COPY INTO&amp;nbsp;DELTA_TABLE&amp;nbsp;FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;FILEFORMAT = CSV&lt;/EM&gt;&lt;BR /&gt;&lt;EM&gt;FILES = ('f1.csv', 'f2.csv', ...)&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;or is it better to use Auto Loader?&lt;/P&gt;&lt;P&gt;Some source Oracle tables can have 100M+ rows.&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2025 17:15:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120738#M2090</guid>
      <dc:creator>Curious-mind</dc:creator>
      <dc:date>2025-06-02T17:15:36Z</dc:date>
    </item>
    <item>
      <title>Re: Best option for configuring Data Storage for Serverless SQL Warehouse</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120739#M2091</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/166865"&gt;@Curious-mind&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You got it. Running COPY INTO is a good fit for the initial load, as it's optimized for bulk loads. Going forward, you'll want to use Auto Loader to incrementally process newly arriving files.&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2025 17:14:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120739#M2091</guid>
      <dc:creator>Shua42</dc:creator>
      <dc:date>2025-06-02T17:14:43Z</dc:date>
    </item>
    <item>
      <title>Re: Best option for configuring Data Storage for Serverless SQL Warehouse</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120829#M2098</link>
      <description>&lt;P&gt;If I get it right, we can use default storage for managed data, or&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/connect/unity-catalog/cloud-storage/managed-storage#set-a-managed-storage-location-for-a-catalog" target="_blank" rel="noopener nofollow ugc"&gt;&amp;nbsp;set a managed storage location for a catalog&lt;/A&gt;:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;CREATE CATALOG &amp;lt;catalog-name&amp;gt; MANAGED LOCATION 'abfss://&amp;lt;container-name&amp;gt;@&amp;lt;storage-account&amp;gt;.dfs.core.windows.net/&amp;lt;path&amp;gt;/&amp;lt;directory&amp;gt;';&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;What are the reasons to create our own managed location vs. using the default one?&lt;/P&gt;</description>
      <pubDate>Tue, 03 Jun 2025 14:10:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120829#M2098</guid>
      <dc:creator>Curious-mind</dc:creator>
      <dc:date>2025-06-03T14:10:31Z</dc:date>
    </item>
    <item>
      <title>Re: Best option for configuring Data Storage for Serverless SQL Warehouse</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120847#M2099</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/166865"&gt;@Curious-mind&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;The main benefit of creating your own managed location is better isolation and manageability. It will depend on how large your data is, but if you want data stored in a specific location per catalog, rather than having every new catalog land in the root of the metastore, then specifying a managed location is what you'd want to do.&lt;/P&gt;
      <pubDate>Tue, 03 Jun 2025 18:08:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/best-option-for-configuring-data-storage-for-serverless-sql/m-p/120847#M2099</guid>
      <dc:creator>Shua42</dc:creator>
      <dc:date>2025-06-03T18:08:35Z</dc:date>
    </item>
  </channel>
</rss>

