Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Best option for configuring Data Storage for Serverless SQL Warehouse

Curious-mind
New Contributor II

Hello!

I'm new to Databricks.

Assume I need to migrate a 2 TB Oracle data mart to Databricks on Azure. Serverless SQL Warehouse seems like a valid choice.

What is the better option (cost vs. performance) for storing the data?

Should I upload the Oracle extracts to Azure Blob Storage and create external tables?

Or is it better to use COPY INTO to create managed tables?

Data size will grow by ~1 TB per year.

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions

Shua42
Databricks Employee

Hi @Curious-mind ,

Welcome to using Databricks! For your use case, creating managed tables with COPY INTO is likely to be more performant, which should also scale better on cost. While external tables can be slightly cheaper at first, managed Delta tables offer significant performance and usability benefits that pay off as your data grows.

Here are a few benefits that managed tables offer over external tables:

  • Faster queries with indexing, caching, and Delta optimizations

  • Easier schema enforcement, versioning, and time travel

  • Seamless use with Unity Catalog, RBAC, and Serverless SQL

  • Better support for optimization (OPTIMIZE, VACUUM, etc.)
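To make the COPY INTO path concrete, here is a minimal sketch of a bulk load into a managed table. All names (catalog, schema, table, storage account, container, columns) are placeholders, and the CSV options are assumptions about the extract format:

```sql
-- Hypothetical managed table; adjust the schema to match your Oracle extract.
CREATE TABLE IF NOT EXISTS main.datamart.sales (
  order_id   BIGINT,
  order_date DATE,
  amount     DECIMAL(18, 2)
);

-- Bulk-load the extract files from Azure storage (placeholder path).
COPY INTO main.datamart.sales
FROM 'abfss://extracts@mystorageaccount.dfs.core.windows.net/oracle/sales'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true');

-- Compact small files after the initial bulk load.
OPTIMIZE main.datamart.sales;
```

COPY INTO is idempotent, so re-running it skips files that were already loaded.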


5 REPLIES


Curious-mind
New Contributor II

If I understand correctly, we can either use the default storage for managed data or set a managed storage location for a catalog:

CREATE CATALOG <catalog-name> MANAGED LOCATION 'abfss://<container-name>@<storage-account>.dfs.core.windows.net/<path>/<directory>';

What are the reasons to create our own managed location vs. using the default one?

Shua42
Databricks Employee

Hi @Curious-mind ,

The main benefit of creating your own managed location is better isolation and management. It depends on how large your data is, but if you want data stored in specific locations per catalog, rather than having every new catalog land in the root of the metastore, then specifying a managed location is what you'd want to do.
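As a sketch of that isolation, a managed location can be set at the catalog level and overridden at the schema level. Catalog, schema, container, and account names below are placeholders:

```sql
-- Catalog whose managed tables land in a dedicated container (placeholder path).
CREATE CATALOG finance
MANAGED LOCATION 'abfss://finance@mystorageaccount.dfs.core.windows.net/managed';

-- A schema can narrow that further to its own subdirectory.
CREATE SCHEMA finance.datamart
MANAGED LOCATION 'abfss://finance@mystorageaccount.dfs.core.windows.net/managed/datamart';
```

Tables created without a LOCATION clause in `finance.datamart` are then stored under the schema's managed location instead of the metastore root.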

Curious-mind
New Contributor II

Hi @Shua42,

Thank you for the prompt reply.

For managed tables initial load:

Can I simply run a COPY INTO command:

COPY INTO DELTA_TABLE FROM 'abfss://container@storageAccount.dfs.core.windows.net/base/path'
FILEFORMAT = CSV
FILES = ('f1.csv', 'f2.csv',...)

or is it better to use Auto Loader?

Some source Oracle tables can have 100M+ rows.

Shua42
Databricks Employee

@Curious-mind 

You got it. Running COPY INTO is good for the initial load, as it's optimized for bulk loads. Going forward, you'll want to use Auto Loader to incrementally process new rows.
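For the incremental path, one way to use Auto Loader from Databricks SQL is a streaming table over `read_files`. This is a minimal sketch with placeholder table and path names, assuming new extract files keep arriving in the same directory and share the CSV layout:

```sql
-- Streaming table that picks up newly arriving extract files incrementally.
-- Path, table name, and CSV options are placeholders.
CREATE OR REFRESH STREAMING TABLE main.datamart.sales_incremental
AS SELECT *
FROM STREAM read_files(
  'abfss://extracts@mystorageaccount.dfs.core.windows.net/oracle/sales',
  format => 'csv',
  header => true
);
```

Unlike COPY INTO, which you re-run yourself, the streaming table tracks which files have already been ingested and processes only new ones on each refresh.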
