Lakehouse Architecture : How notebooks are organized and executed

KuldeepChitraka
New Contributor III

We are implementing a lakehouse architecture and using notebooks to transform data from object storage. Most of the time our source is a database, for which there is one folder per table in object storage.

We have a structure like the one below for the various notebooks:

  • GOLD (Folder)
  • SILVER (Folder)
  • BRONZE (Folder)
    • MasterRawNotebooks.py (Notebook)
      • Bronze_tables (Folder)
        • Table1_notebook.py
        • Table2_notebook.py

The RAW folder contains a notebook for each table, which reads data from object storage and creates a Delta table.
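Each table notebook is, roughly, something like this (simplified sketch; the storage path, file format, timestamp column and table name are illustrative placeholders, not our actual code):

# Table1_notebook.py - simplified illustration
from pyspark.sql import functions as F

# Read the raw files for this table from object storage (path/format are placeholders)
raw_df = (
    spark.read
    .format("parquet")
    .load("/mnt/raw/source_db/table1/")
)

# Add a load timestamp and persist as a Delta table
# (assumes a "bronze" database/schema already exists)
(
    raw_df
    .withColumn("_ingested_at", F.current_timestamp())
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("bronze.table1")
)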

MasterRawNotebook contains one cell for each raw notebook, calling Table1_notebook, Table2_notebook, etc. with %run.

So when we execute MasterRawNotebook, it runs each notebook one by one and creates the tables in a database in Databricks.
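In exported .py form, MasterRawNotebooks is essentially just a sequence of cells like this (simplified; paths follow the folder layout above):

# MasterRawNotebooks.py - one %run cell per table notebook, executed top to bottom

# COMMAND ----------
# MAGIC %run ./Bronze_tables/Table1_notebook

# COMMAND ----------
# MAGIC %run ./Bronze_tables/Table2_notebook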

  • Is this the right approach?
  • We are creating one notebook per table.
  • Or should we execute the BRONZE notebooks in parallel?
  • How have you implemented the notebook pipeline when implementing a Lakehouse architecture?
  • What kind of exception handling have you done in notebooks while loading data from BRONZE to SILVER?
  • If possible, could you share your folder structure and how notebooks are organized for loading, transformation, etc.?
  • Any best practices to refer to?
2 REPLIES

daniel_sahal
Esteemed Contributor

@Kuldeep Chitrakar

First of all - instead of running notebooks one by one through MasterRawNotebook, you could use Workflows -> Jobs (or any other scheduler, e.g. Airflow, ADF) to run them in parallel and save some time.
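If you prefer to stay inside a single driver notebook instead of Workflows, a common alternative is to fan the child notebooks out with dbutils.notebook.run() from a thread pool. A rough sketch (notebook paths and pool size are just examples):

from concurrent.futures import ThreadPoolExecutor

# Paths are examples - point these at your Bronze_tables notebooks
notebook_paths = [
    "./Bronze_tables/Table1_notebook",
    "./Bronze_tables/Table2_notebook",
]

def run_notebook(path):
    # 0 means no timeout; the return value is whatever the child notebook
    # passes to dbutils.notebook.exit()
    return dbutils.notebook.run(path, 0)

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_notebook, notebook_paths))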

Creating notebooks for each table - for loading Raw to Bronze it's possible to create one generic notebook that will do the work for you (it depends on the raw file type, but with e.g. Parquet it's doable). Write your code as generically as you can. Anyway, doing one notebook per table is also fine.
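A rough sketch of what such a generic notebook could look like (the widget names, the target schema and the assumption that the raw files are Parquet are mine, adapt as needed):

# Generic Raw -> Bronze loader, parameterized per table
dbutils.widgets.text("table_name", "")
dbutils.widgets.text("source_path", "")

table_name = dbutils.widgets.get("table_name")
source_path = dbutils.widgets.get("source_path")

# Read the raw Parquet files and overwrite the corresponding bronze Delta table
df = spark.read.format("parquet").load(source_path)

(
    df.write
      .format("delta")
      .mode("overwrite")
      .saveAsTable(f"bronze.{table_name}")
)

You can then call it once per table from a Workflows job task (or dbutils.notebook.run), passing table_name and source_path as parameters.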

Folder structure - you need to find your own way of doing things 🙂

Here's what I'm using (it may differ from project to project):

  • Config (Folder) - keeps all notebooks that handle configuration, such as authenticating with external databases/tools, mounting storage, etc.
  • RawToBronze (Folder) - notebooks ingesting data from Raw to Bronze
  • BronzeToSilver (Folder) - notebooks transforming data from Bronze to Silver
  • SilverToGold (Folder) - notebooks transforming data from Silver to Gold
  • GoldToXxx (Folder) - notebooks that handle data transfer between the Lakehouse and any other tool we're using (e.g. Synapse or a SQL Database)
  • Lib.py (File) - notebook that keeps all custom-made functions/classes (see the %run sketch after this list)
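For Lib.py, the pattern is simply to include it at the top of each notebook with %run, e.g. (shown in exported .py form; the helper name is hypothetical):

# COMMAND ----------
# MAGIC %run ../Lib

# COMMAND ----------
# Everything defined in Lib.py is now in scope in this notebook, e.g.:
# cleaned_df = add_audit_columns(raw_df)   # add_audit_columns is a hypothetical helper from Lib.py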

jose_gonzalez
Moderator

Adding @Vidula Khanna and @Kaniz Fatma for visibility to help with your request
