
Do Azure Databricks and a Delta Layer Make a Lakehouse?

User16765131552
Contributor III

Even after going through many resources, I have failed to understand what constitutes a lakehouse, hence my question below.

If we have Azure Data Lake Storage Gen2, ADF, and Azure Databricks, with the possibility of converting the incoming CSV files into Delta tables, can that be called a "Lakehouse" architecture, or is it called a "Delta Lake"?

Or is it the "SQL analytics" engine over and above the Delta Lake layer that makes it a "Lakehouse"?

Please clarify.

Accepted Solutions

User16826994223
Honored Contributor III

A lakehouse is a concept defined by the following characteristics:

  1. Data is stored in an open, standard format.
  2. Data is stored in a way that supports data science, ML, and BI workloads.
  3. Delta is just a layer on top of cloud storage that gives you control over the data and prevents the lake from becoming a data swamp. It also improves performance and supports SQL-like queries.
  4. For a lakehouse, three layers are recommended:
  • Bronze: raw data, exactly as it arrives from the source (OLTP) systems.
  • Silver: data in a curated format, with a filter that keeps junk data out of this layer. This layer is best suited for data science and ML.
  • Gold: purely aggregated data that supports BI and can also be used for machine learning.
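To make the three layers concrete, here is a minimal conceptual sketch in plain Python. It is not Databricks or Delta Lake code: in-memory lists stand in for Delta tables, and the CSV feed, its column names, and the function names are all invented for illustration. It only shows how the responsibilities split: bronze keeps everything as received, silver filters and types the rows, gold aggregates.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw CSV feed; the columns are assumptions made for this sketch.
RAW_CSV = """order_id,region,amount
1,EU,10.50
2,US,not_a_number
3,EU,4.50
4,,7.00
5,US,20.00
"""


def load_bronze(text):
    """Bronze: keep every incoming row exactly as received (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def build_silver(bronze_rows):
    """Silver: curated rows only. Junk rows are dropped and types are fixed."""
    silver = []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # unparseable amount: treat as junk, keep it out of silver
        if not row["region"]:
            continue  # missing region: treat as junk
        silver.append(
            {"order_id": int(row["order_id"]), "region": row["region"], "amount": amount}
        )
    return silver


def build_gold(silver_rows):
    """Gold: aggregated view for BI, here the total amount per region."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


bronze = load_bronze(RAW_CSV)
silver = build_silver(bronze)
gold = build_gold(silver)
```

In a real lakehouse each of these three results would presumably be persisted as its own Delta table, so that downstream users can read the layer that matches their workload.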

2 REPLIES

Ryan_Chynoweth
Honored Contributor III

At a high level a Lakehouse must contain the following properties:

  1. Open, direct-access data formats (Apache Parquet, Delta Lake, etc.)
  2. First-class support for machine learning and data science workloads
  3. State-of-the-art performance

Databricks is the first Lakehouse because it meets the above three properties. Specifically, if you are using Databricks with ADLS and converting all your data (JSON, CSV, Parquet, messages, etc.) into Delta tables that are available within Databricks, then you have the makings of a Lakehouse, but it still needs to be built and supported. The Databricks platform satisfies points 2 and 3 above, and Delta Lake satisfies 1 and 3 (performance relies on both the engine and the storage, which is why 3 is mentioned twice).

Using Databricks to access data stored in Delta is a Lakehouse. Adding Databricks SQL (formerly SQL Analytics) allows more users to access and use the Lakehouse. In Databricks SQL, users work with the same compute and data as the data engineers do in Databricks; they just have a different UI that they are familiar with. Additionally, Databricks SQL is optimized for SQL and BI workloads, while the notebook environment is better suited to engineering and data science.

As a fun read, you should check out the Lakehouse whitepaper.
