Databricks Community

User16765131552 · ‎06-18-2021

Even after going through many resources, I have failed to understand what constitutes a lakehouse, hence my question below.

If we have Azure Gen 2 Storage, ADF, and Azure Databricks, and we can convert the incoming CSV files into Delta tables, can that be called a "Lakehouse" architecture, or is it called a "Delta Lake"?

Or is it the "SQL analytics" engine over and above the Delta Lake layer that makes it a "Lakehouse"?

Please clarify.

User16826994223 · ‎06-22-2021

A Lakehouse is a concept defined by the following parameters:

Data is stored in an open standard format.
Data is stored in a way that supports data science, ML, and BI workloads.
Delta is just a layer (or engine) on top of cloud storage that gives you control over the data, prevents it from turning into a data swamp, improves performance, and provides SQL-like query support.
For a Lakehouse, it is always recommended to have three layers:

Bronze - Raw data, as it arrives from the OLTP source systems.
Silver - Data in a curated format, with a filter that keeps junk data out. This layer is best suited for data science and ML.
Gold - Purely aggregated data that supports BI and can be used in machine learning too.
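
To make the three layers a bit more concrete, here is a rough sketch in plain Python. It is not Delta or Spark code: in-memory lists stand in for the tables, and the sample data, column names, and quality rules are all made up for illustration. It only shows the idea that bronze keeps everything, silver filters and types the rows, and gold aggregates them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw CSV, as it might land from a source system.
RAW_CSV = """order_id,region,amount
1,EU,10.50
2,US,abc
3,EU,4.50
4,,7.00
5,US,20.00
"""


def load_bronze(raw_text):
    """Bronze: keep every row exactly as received (all values are strings)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def build_silver(bronze_rows):
    """Silver: keep only rows that pass basic quality checks, with typed columns."""
    silver = []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # junk amount: the row stays in bronze only
        if not row["region"]:
            continue  # missing region: the row stays in bronze only
        silver.append(
            {"order_id": int(row["order_id"]), "region": row["region"], "amount": amount}
        )
    return silver


def build_gold(silver_rows):
    """Gold: aggregate curated rows for BI (total amount per region)."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


bronze = load_bronze(RAW_CSV)
silver = build_silver(bronze)
gold = build_gold(silver)
```

In a real Lakehouse each of these would typically be a Delta table rather than a Python list, and the rejected rows would still be available in bronze for reprocessing.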


Ryan_Chynoweth · ‎06-18-2021

At a high level, a Lakehouse must have the following properties:

1. Open, direct-access data formats (Apache Parquet, Delta Lake, etc.)
2. First-class support for machine learning and data science workloads
3. State-of-the-art performance

Databricks is the first Lakehouse because it meets the above three properties. Specifically, if you are using Databricks with ADLS and converting all your data (JSON, CSV, Parquet, messages, etc.) into Delta tables that are available within Databricks, then you have the makings of a Lakehouse, but it still needs to be built and supported. The Databricks platform satisfies points 2 and 3 above, and Delta Lake satisfies points 1 and 3 (performance depends on both the engine and the storage, which is why 3 is mentioned twice).
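
As a minimal sketch of the CSV-to-Delta step described above, something like the following could run in a Databricks notebook. The container, storage account, and folder names are hypothetical, and the helper functions are illustrative only; the sketch assumes the cluster already has access to the storage account and that `spark` is the SparkSession the notebook provides.

```python
def abfss_path(container, account, path):
    """Build an ADLS Gen2 URI: abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def csv_to_delta(spark, source_path, target_path):
    """Read CSV files that have a header row and append them to a Delta table.

    `spark` is an existing SparkSession (Databricks notebooks provide one
    under the name `spark`). This function is not called in this snippet.
    """
    df = spark.read.option("header", "true").csv(source_path)
    df.write.format("delta").mode("append").save(target_path)
    return df


# Hypothetical names, for illustration only.
source = abfss_path("landing", "mystorageacct", "/incoming/orders/")
target = abfss_path("lake", "mystorageacct", "bronze/orders")

# In a Databricks notebook you would then call:
# csv_to_delta(spark, source, target)
```

Appending is just one choice here; whether you append, overwrite, or merge depends on how the incoming files are delivered.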

Using Databricks to access data stored in Delta is a Lakehouse. Adding Databricks SQL (formerly SQL Analytics) lets more users access and use the Lakehouse. Databricks SQL users work with the same compute and data as data engineers do in Databricks; they just have a different UI that they are familiar with. Additionally, Databricks SQL is optimized for SQL and BI workloads, while the notebook environment is better suited to engineering and data science.

As a fun read, you should check out the Lakehouse whitepaper.

Does Azure Databricks and Delta Layer make it a Lakehouse?
