ETL in Databricks

KKo
Contributor III

I use Azure Databricks for ETL. I read and write data between raw/stage/curated folders, and I write the dataframe to a path (e.g. /mnt/datalake/curated/....). In the final step I read the data from that path into a dataframe and write it to Azure SQL DB / Azure Synapse.
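Roughly, the flow looks like this (a simplified sketch; the paths, table names and connection details below are just placeholders, not my real ones):

# Read from the stage folder (placeholder path)
df = spark.read.format("delta").load("/mnt/datalake/stage/people")

# Write the curated output to a data lake path (placeholder path)
df.write.format("delta").mode("overwrite").save("/mnt/datalake/curated/people")

# Final step: read the curated path back into a dataframe...
curated_df = spark.read.format("delta").load("/mnt/datalake/curated/people")

# ...and push it to Azure SQL over JDBC (placeholder connection details)
(curated_df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
    .option("dbtable", "dbo.People")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())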

I have seen people also create databases/tables within Databricks itself using a script such as CREATE TABLE default.People USING DELTA LOCATION '/tmp/delta/People', and then read/write data from there.

In my case I don't create databases and tables within Databricks. What are the pros and cons of each approach? Where should I use one or the other? Any insight/documentation would be helpful. Thanks.

1 ACCEPTED SOLUTION


-werners-
Esteemed Contributor III

When writing to another DB (a classic RDBMS or a multi-node one like Synapse, Snowflake, ...), you actually have to copy the data from your storage (data lake, etc.) into the database.

After the copy you can work with the database as you wish.

When you create a table in Databricks (Spark), you actually create a semantic view on top of the data in your storage, so no copy is necessary as in the DB scenario.

These 'tables' can then be queried using Spark or a SQL tool like Databricks SQL, Azure Synapse Serverless, Presto, Trino, Dremio, etc.
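For example, something like this (just a rough sketch; the table and path names are made-up examples) registers existing Delta files as a table without moving any data, and then queries it:

# Register existing Delta files as a table -- metadata only, no data is copied
spark.sql("""
    CREATE TABLE IF NOT EXISTS default.people
    USING DELTA
    LOCATION '/mnt/datalake/curated/people'
""")

# Query it with Spark SQL (or Databricks SQL, Synapse Serverless, Trino, ...)
spark.sql("SELECT count(*) FROM default.people").show()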

The pros and cons depend on your use case. If you already have a database and you really need the data in that database, then you go for the copy scenario.

Databases can also be highly optimized (using indexes, materialized views, statistics, caching etc).

BUT the main con is that you have to copy the data, and databases often do not support semi-structured or unstructured data (JSON, etc.).

Also, a database has to be online to run queries and often has to be managed, which can be pricey.

That is the main advantage of using 'big data SQL tools' directly on top of your data lake: no extra data movement is needed, there are tons of options in different price ranges, (almost) no system management, and you pay for what you use.

We use a mix of both: some data has to be copied to a physical database, while other data is served by e.g. Azure Synapse Serverless or Databricks SQL.


3 REPLIES

Hubert-Dudek
Esteemed Contributor III

As @Werner Stinckens said. Additionally, you wrote "I write dataframe to a path": you can simply register that path as a table, so you can always easily preview the data and keep it nicely organized in your Databricks Data section (Hive metastore). All your scripts and data can stay as they are; registration is just, as you mentioned, CREATE TABLE ... USING DELTA LOCATION.
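Roughly like this (example names; use the path you already write to):

# Register the path you already write the dataframe to (example table/path names)
spark.sql(
    "CREATE TABLE IF NOT EXISTS default.people "
    "USING DELTA LOCATION '/mnt/datalake/curated/people'"
)

# The files stay where they are; the table is just metadata in the Hive metastore,
# so you can now preview it in the Data section or query it directly
spark.table("default.people").show(10)
spark.sql("DESCRIBE EXTENDED default.people").show(truncate=False)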

jose_gonzalez
Moderator

Hi @Kris Koirala,

Just checking in to see if you still have any follow-up questions. Please let us know.
