ETL in Databricks

KKo
Contributor III

I use Azure Databricks for ETL. I read and write data between raw/stage/curated folders, and I write the dataframe to a path (e.g. /mnt/datalake/curated/....). In the final step I read the data from that path into a dataframe and write it to Azure SQL DB / Azure Synapse.
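Roughly, the flow looks like this (a simplified sketch; the paths, table names and connection details below are just placeholders, not my real ones):

# Read from the stage folder (placeholder path)
df = spark.read.format("delta").load("/mnt/datalake/stage/people")

# Write the curated output to a data lake path (placeholder path)
df.write.format("delta").mode("overwrite").save("/mnt/datalake/curated/people")

# Final step: read the curated path back into a dataframe...
curated_df = spark.read.format("delta").load("/mnt/datalake/curated/people")

# ...and push it to Azure SQL over JDBC (placeholder connection details)
(curated_df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
    .option("dbtable", "dbo.People")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())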

I have seen people also create databases/tables within Databricks itself using a script such as CREATE TABLE default.People USING DELTA LOCATION '/tmp/delta/People', and then read/write data from there.

In my case I don't create databases and tables within Databricks. What are the pros and cons of each approach? Where should I use one or the other? Any insight/documentation would be helpful. Thanks.

1 ACCEPTED SOLUTION


-werners-
Esteemed Contributor III

When writing to another DB (a classic RDBMS or a multi-node one like Synapse, Snowflake, ...), you actually have to copy the data from your storage (data lake, etc.) into the database.

After the copy you can work with the database as you wish.

When you create a table in Databricks (Spark), you actually create a semantic view on top of the data in your storage, so no copy is necessary as in the DB scenario.

These 'tables' can then be queried using Spark or a SQL tool like Databricks SQL, Azure Synapse Serverless, Presto, Trino, Dremio, etc.
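For example, something like this (just a rough sketch; the table and path names are made-up examples) registers existing Delta files as a table without moving any data, and then queries it:

# Register existing Delta files as a table -- metadata only, no data is copied
spark.sql("""
    CREATE TABLE IF NOT EXISTS default.people
    USING DELTA
    LOCATION '/mnt/datalake/curated/people'
""")

# Query it with Spark SQL (or Databricks SQL, Synapse Serverless, Trino, ...)
spark.sql("SELECT count(*) FROM default.people").show()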

The pros and cons depend on your use case. If you already have a database and you really need the data in that database, then you go for the copy scenario.

Databases can also be highly optimized (using indexes, materialized views, statistics, caching etc).

BUT the main con is that you have to copy the data, and databases often do not support semi-structured or unstructured data (JSON, etc.).

Also, a database has to be online to run queries and often has to be managed, which can be pricey.

That is the main advantage of using 'big data SQL tools' directly on top of your data lake: no extra data movement is needed, there are tons of options in different price ranges, (almost) no system management, and you pay for what you use.

We use a mix of both: some data has to be copied to a physical database, while other data is served by e.g. Azure Synapse Serverless or Databricks SQL.


3 REPLIES

Hubert-Dudek
Esteemed Contributor III

As @Werner Stinckens said. Additionally, you wrote "I write dataframe to a path": you can simply register that path as a table, so you can always easily preview the data and keep it nicely organized in your Databricks Data section (Hive metastore). All your scripts and data can stay as they are; registration is just, as you mentioned, CREATE TABLE ... USING DELTA LOCATION.
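Roughly like this (example names; use the path you already write to):

# Register the path you already write the dataframe to (example table/path names)
spark.sql(
    "CREATE TABLE IF NOT EXISTS default.people "
    "USING DELTA LOCATION '/mnt/datalake/curated/people'"
)

# The files stay where they are; the table is just metadata in the Hive metastore,
# so you can now preview it in the Data section or query it directly
spark.table("default.people").show(10)
spark.sql("DESCRIBE EXTENDED default.people").show(truncate=False)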

jose_gonzalez
Moderator

Hi @Kris Koirala,

Just checking in to see if you still have any follow-up questions. Please let us know.
