Lakehouse Architecture : How notebooks are organized and executed

KuldeepChitraka
New Contributor III

We are implementing a lakehouse architecture and using notebooks to transform data from object storage. Most of the time our source is a database, for which there is one folder per table in object storage.

We have a structure like the one below for the various notebooks:

  • GOLD (Folder)
  • SILVER (Folder)
  • BRONZE (Folder)
    • MasterRawNotebooks.py (Notebook)
      • Bronze_tables (Folder)
        • Table1_notebook.py
        • Table2_notebook.py

The RAW folder contains a notebook for each table, which reads data from object storage and creates a Delta table.
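Each table notebook is, roughly, something like this (simplified sketch; the storage path, file format, timestamp column and table name are illustrative placeholders, not our actual code):

# Table1_notebook.py - simplified illustration
from pyspark.sql import functions as F

# Read the raw files for this table from object storage (path/format are placeholders)
raw_df = (
    spark.read
    .format("parquet")
    .load("/mnt/raw/source_db/table1/")
)

# Add a load timestamp and persist as a Delta table
# (assumes a "bronze" database/schema already exists)
(
    raw_df
    .withColumn("_ingested_at", F.current_timestamp())
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("bronze.table1")
)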

MasterRawNotebook contains one cell for each raw notebook, calling Table1_notebook, Table2_notebook, etc. with %run.

So when we execute MasterRawNotebook, it runs each notebook one by one and creates the tables in a database in Databricks.
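In exported .py form, MasterRawNotebooks is essentially just a sequence of cells like this (simplified; paths follow the folder layout above):

# MasterRawNotebooks.py - one %run cell per table notebook, executed top to bottom

# COMMAND ----------
# MAGIC %run ./Bronze_tables/Table1_notebook

# COMMAND ----------
# MAGIC %run ./Bronze_tables/Table2_notebook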

  • Is this the right approach?
  • We are creating one notebook per table.
  • Or should we execute the BRONZE notebooks in parallel?
  • How have you implemented the notebook pipeline when implementing a Lakehouse architecture?
  • What kind of exception handling have you done in notebooks while loading data from BRONZE to SILVER?
  • If possible, could you share your folder structure and how notebooks are organized for loading, transformation, etc.?
  • Any best practices to refer to?
2 REPLIES

daniel_sahal
Esteemed Contributor

@Kuldeep Chitrakar

First of all - instead of running notebooks one by one through MasterRawNotebook, you could use Workflows -> Jobs (or any other scheduler, e.g. Airflow, ADF) to run them in parallel and save some time.
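If you prefer to stay inside a single driver notebook instead of Workflows, a common alternative is to fan the child notebooks out with dbutils.notebook.run() from a thread pool. A rough sketch (notebook paths and pool size are just examples):

from concurrent.futures import ThreadPoolExecutor

# Paths are examples - point these at your Bronze_tables notebooks
notebook_paths = [
    "./Bronze_tables/Table1_notebook",
    "./Bronze_tables/Table2_notebook",
]

def run_notebook(path):
    # 0 means no timeout; the return value is whatever the child notebook
    # passes to dbutils.notebook.exit()
    return dbutils.notebook.run(path, 0)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_notebook, notebook_paths))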

Creating notebooks for each table - for loading Raw to Bronze it's possible to create one generic notebook that will do the work for you (it depends on the raw file type, but with e.g. Parquet it's doable). Write your code as generically as you can. Anyway, doing one notebook per table is also fine.
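A rough sketch of what such a generic notebook could look like (the widget names, the target schema and the assumption that the raw files are Parquet are mine, adapt as needed):

# Generic Raw -> Bronze loader, parameterized per table
dbutils.widgets.text("table_name", "")
dbutils.widgets.text("source_path", "")

table_name = dbutils.widgets.get("table_name")
source_path = dbutils.widgets.get("source_path")

# Read the raw Parquet files and overwrite the corresponding bronze Delta table
df = spark.read.format("parquet").load(source_path)

(
    df.write
      .format("delta")
      .mode("overwrite")
      .saveAsTable(f"bronze.{table_name}")
)

You can then call it once per table from a Workflows job task (or dbutils.notebook.run), passing table_name and source_path as parameters.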

Folder structure - you need to find your own way of doing things 🙂

Here's what I'm using (it may differ from project to project):

  • Config (Folder) - keeps all notebooks that handle configuration, such as authenticating with external databases/tools, mounting storage, etc.
  • RawToBronze (Folder) - notebooks ingesting data from Raw to Bronze
  • BronzeToSilver (Folder) - notebooks transforming data from Bronze to Silver
  • SilverToGold (Folder) - notebooks transforming data from Silver to Gold
  • GoldToXxx (Folder) - notebooks that handle data transfer between the Lakehouse and any other tool we're using (e.g. Synapse or a SQL Database)
  • Lib.py (File) - notebook that keeps all custom-made functions/classes (see the %run sketch after this list)
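For Lib.py, the pattern is simply to include it at the top of each notebook with %run, e.g. (shown in exported .py form; the helper name is hypothetical):

# COMMAND ----------
# MAGIC %run ../Lib

# COMMAND ----------
# Everything defined in Lib.py is now in scope in this notebook, e.g.:
# cleaned_df = add_audit_columns(raw_df)   # add_audit_columns is a hypothetical helper from Lib.py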

jose_gonzalez
Moderator

Adding @Vidula Khanna and @Kaniz Fatma for visibility to help with your request
