topic Re: ETL in Databricks in Data Engineering

ETL in Databricks

KKo — Fri, 25 Feb 2022 21:03:13 GMT

I use Azure Databricks for ETL. I read/write data from and to raw/stage/curate folders. I write dataframe to a path (eg: /mnt/datalake/curated/....). In final step I read data from the path, convert that to dataframe and write it to the Azure SQL DB/Azure Synapse DB.

I have seen people also create databases/tables with in Databricks itself using script something like this (CREATE TABLE default.People USING DELTA LOCATION '/tmp/delta/People) , and read/write data from there.

I my case I don't create databases and tables with in databricks. What are the pros and cons of each approach. Where should I use one or the other? Any insight/documentations would be helpful. Thanks

Re: ETL in Databricks

-werners- — Sat, 26 Feb 2022 09:29:37 GMT

when writing to another DB (a classic RDBMS or a multi node one like Synapse, Snowflake, ...) you actually have to copy the data from your storage (data lake etc) to the database.

After the copy you can work with the database as you wish.

When you create a table in Databricks (Spark), you actually create a semantic view on top of the data in your storage. So there is no copy necessary like with the DB scenario.

These 'tables' can then be queried using spark or some SQL tool like Databricks SQL, Azure Synapse Serverless, Presto, Trino, Dremio etc.

The pros and cons depend on your use case. If you already have DB and you really have to have the data in this db, well then you go for the copy scenario.

Databases can also be highly optimized (using indexes, materialized views, statistics, caching etc).

BUT the main con is that you have to copy the data and that often databases do not support unstructured data (json etc).

Also, a database has to be online to run queries and often has to be managed, which can be pricey

That is the main advantage of using ´big data sql tools' directly on top of your data lake: no extra data movement needed, there are tons of options in different price ranges. (Almost) No system management and you pay for what you use.

We use a mix of both: some data has to be copied to a physical database, other data is served by f.e. Azure Synapse Serverless or Databricks SQL.

Re: ETL in Databricks

Hubert-Dudek — Sat, 26 Feb 2022 17:52:16 GMT

As @Werner Stinckens said. Additionally you wrote ("I write dataframe to a path"), so that mentioned path you can just register as a table so you can always easy preview data and keep it nice organized in your databricks data section (hive metastore). All your script and data can stay as it is, registration is just as you mentioned CREATE TABLE... USING location

Re: ETL in Databricks

jose_gonzalez — Mon, 11 Apr 2022 20:55:55 GMT

Hi @Kris Koirala ,

Just checking if you still have any follow-up questions? please let us know.