Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Best option for configuring Data Storage for Serverless SQL Warehouse

Curious-mind
New Contributor II

Hello!

I'm new to Databricks.

Assume I need to migrate a 2 TB Oracle data mart to Databricks on Azure. Serverless SQL Warehouse seems like a valid choice.

What is the better option (cost vs. performance) for storing the data?

Should I upload the Oracle extracts to Azure Blob Storage and create external tables?

Or is it better to use COPY INTO to create managed tables?

Data size will grow by ~1 TB per year.

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions

Shua42
Databricks Employee

Hi @Curious-mind ,

Welcome to using Databricks! For your use case, creating managed tables with COPY INTO is likely to be more performant, which should also scale better on cost. While external tables can be slightly cheaper at first, managed Delta tables offer significant performance and usability benefits that pay off as your data grows.

Here are a few benefits that managed tables offer over external tables:

  • Faster queries with indexing, caching, and Delta optimizations

  • Easier schema enforcement, versioning, and time travel

  • Seamless use with Unity Catalog, RBAC, and Serverless SQL

  • Better support for optimization (OPTIMIZE, VACUUM, etc.)
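To make the COPY INTO path concrete, here is a minimal sketch of a bulk load into a managed table. All names (catalog, schema, table, storage account, container, columns) are placeholders, and the CSV options are assumptions about the extract format:

```sql
-- Hypothetical managed table; adjust the schema to match your Oracle extract.
CREATE TABLE IF NOT EXISTS main.datamart.sales (
  order_id   BIGINT,
  order_date DATE,
  amount     DECIMAL(18, 2)
);

-- Bulk-load the extract files from Azure storage (placeholder path).
COPY INTO main.datamart.sales
FROM 'abfss://extracts@mystorageaccount.dfs.core.windows.net/oracle/sales'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true');

-- Compact small files after the initial bulk load.
OPTIMIZE main.datamart.sales;
```

COPY INTO is idempotent, so re-running it skips files that were already loaded.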


5 REPLIES


Curious-mind
New Contributor II

If I understand correctly, we can either use the default storage for managed data or set a managed storage location for a catalog:

CREATE CATALOG <catalog-name> MANAGED LOCATION 'abfss://<container-name>@<storage-account>.dfs.core.windows.net/<path>/<directory>';

What are the reasons to create our own managed location vs. using the default one?

Shua42
Databricks Employee

Hi @Curious-mind ,

The main benefit of creating your own managed location is better isolation and management. It depends on how large your data is, but if you want data stored in specific locations per catalog, rather than having every new catalog land in the root of the metastore, then specifying a managed location is what you'd want to do.
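As a sketch of that isolation, a managed location can be set at the catalog level and overridden at the schema level. Catalog, schema, container, and account names below are placeholders:

```sql
-- Catalog whose managed tables land in a dedicated container (placeholder path).
CREATE CATALOG finance
MANAGED LOCATION 'abfss://finance@mystorageaccount.dfs.core.windows.net/managed';

-- A schema can narrow that further to its own subdirectory.
CREATE SCHEMA finance.datamart
MANAGED LOCATION 'abfss://finance@mystorageaccount.dfs.core.windows.net/managed/datamart';
```

Tables created without a LOCATION clause in `finance.datamart` are then stored under the schema's managed location instead of the metastore root.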

Curious-mind
New Contributor II

Hi @Shua42,

Thank you for the prompt reply.

For managed tables initial load:

Can I simply run a COPY INTO command:

COPY INTO DELTA_TABLE FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
FILEFORMAT = CSV
FILES = ('f1.csv', 'f2.csv',...)

or is it better to use Auto Loader?

Some source Oracle tables can have 100M+ rows.

Shua42
Databricks Employee

@Curious-mind 

You got it. Running COPY INTO is good for the initial load, as it's optimized for bulk loads. Going forward, you'll want to use Auto Loader to incrementally process new rows.
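For the incremental path, one way to use Auto Loader from Databricks SQL is a streaming table over `read_files`. This is a minimal sketch with placeholder table and path names, assuming new extract files keep arriving in the same directory and share the CSV layout:

```sql
-- Streaming table that picks up newly arriving extract files incrementally.
-- Path, table name, and CSV options are placeholders.
CREATE OR REFRESH STREAMING TABLE main.datamart.sales_incremental
AS SELECT *
FROM STREAM read_files(
  'abfss://extracts@mystorageaccount.dfs.core.windows.net/oracle/sales',
  format => 'csv',
  header => true
);
```

Unlike COPY INTO, which you re-run yourself, the streaming table tracks which files have already been ingested and processes only new ones on each refresh.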
