What's New in Databricks
lara_rachidi
Valued Contributor III

In a nutshell

  • Databricks acquired Tabular, a data management company founded by Ryan Blue, Daniel Weeks, and Jason Reid. This acquisition brings together the original creators of Apache Iceberg and those of Linux Foundation Delta Lake, the two leading open source lakehouse formats. Together, they will lead the way on data compatibility so that you are no longer limited by which lakehouse format your data is in.
  • Lakeflow helps you ingest data from databases, enterprise apps, and cloud sources; transform it in batch and real-time streaming; and confidently deploy and operate pipelines in production.
  • Delta Lake 4.0 is the biggest release to date, with features for reliability, performance, and ease of use.
  • Apache Spark 4.0 brings a lot of exciting new features, including ANSI mode by default, a Python data source API, polymorphic Python UDTFs, string collation support, the new VARIANT data type, a streaming state store data source, structured logging, Java 17 by default, and many more.

Databricks + Tabular

The acquisition unites the original creators of Apache Iceberg with those of Linux Foundation Delta Lake, the two leading open source lakehouse formats.

The idea is to work closely with the Iceberg and Delta Lake communities to bring interoperability to the formats themselves. Ali (Databricks CEO) and Ryan (Tabular) shared their vision of formats being abstracted away from the user: if the formats become similar enough, the format becomes an implementation detail that the engines deal with.
It’s a long journey, for sure, and that’s why Databricks introduced Delta Lake UniForm last year.
UniForm tables provide interoperability across Iceberg, Delta Lake, and Hudi and support the Iceberg REST catalog interface, so companies can use the analytics engines and tools they’re already familiar with across all their data.
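As a minimal sketch of what enabling UniForm looks like in practice (the catalog, table, and column names below are hypothetical, and the exact property names may vary by Databricks Runtime version):

```python
# Minimal sketch: create a Delta table with UniForm enabled so that
# Iceberg clients can read it through the Iceberg REST catalog interface.
# Catalog/table/column names are hypothetical.
spark.sql("""
    CREATE TABLE main.sales.orders (
        order_id BIGINT,
        amount   DOUBLE
    )
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```

Iceberg metadata is then generated asynchronously after each write, so no data is duplicated or rewritten.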

Both companies share a history of championing open source formats.
TL;DR: Both formats will continue to coexist and should slowly converge over time, so that the format becomes an implementation detail and users can focus on solving higher-value problems!

Introducing Databricks LakeFlow: A unified, intelligent solution for data engineering

Lakeflow is a new solution that includes native and highly scalable connectors for databases and enterprise applications.

Lakeflow is a unified solution for data ingestion, transformation and orchestration. Its three components are:

  • Lakeflow Connect: Simple and scalable data ingestion
  • Lakeflow Pipelines: Efficient declarative pipelines (see the sketch after this list)
  • Lakeflow Jobs: Reliable orchestration for every workload
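Lakeflow Pipelines builds on the declarative approach of Delta Live Tables. As a minimal sketch using the Delta Live Tables Python API (the storage path and table names below are hypothetical):

```python
# Minimal sketch of a declarative pipeline in the Delta Live Tables
# Python API, which Lakeflow Pipelines builds on.
# The storage path and table names are hypothetical.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested incrementally from cloud storage")
def raw_orders():
    return (
        spark.readStream.format("cloudFiles")        # Auto Loader
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/sales/landing/orders")
    )

@dlt.table(comment="Orders cleaned for downstream consumption")
def clean_orders():
    return dlt.read_stream("raw_orders").where(col("amount") > 0)
```

You declare the tables and their dependencies; the pipeline engine works out execution order, incremental processing, and retries.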

Learn more about Databricks Lakeflow

Watch the announcement from Data+AI Summit 2024

Delta Lake 4.0: The biggest release to date

One of my favorite announcements is Delta Lake 4.0. This new version is the biggest release yet, with features for reliability, performance, and ease of use.

  • Delta Lake UniForm: allows Delta Lake tables to be read by Iceberg clients, enabling interoperability between the two formats without requiring data duplication or migration. UniForm automatically generates Iceberg metadata asynchronously, ensuring that Iceberg clients can read Delta tables seamlessly, while also supporting other formats like Hudi.
  • When using Delta UniForm, you can take advantage of the most advanced features in Delta Lake, like Liquid Clustering. Together, these two features can provide great performance even when reading from Iceberg or Hudi clients.
  • Variant data type (PARSE_JSON): a new feature designed to handle semi-structured data efficiently. It allows for flexible schema evolution without requiring a predefined schema, making it suitable for data with an unknown or changing schema. The Variant library has been open-sourced so that other formats and engines can share a common implementation! Learn more
  • Delta Connect: enables the use of Delta Lake with the decoupled client-server architecture of Spark Connect, allowing for more flexible and scalable data processing.
  • Safe cross-cloud writes: provides a central coordinator for commits, ensuring reliable and atomic writes across multiple clouds and engines.
  • Type widening: allows you to change the data type of a column in a Delta Lake table to a wider type without rewriting the underlying data files (see the sketch after this list). Learn more
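As a minimal sketch of the Variant type and type widening in practice (table and column names are hypothetical, and type widening assumes the corresponding table property is enabled):

```python
# VARIANT: parse semi-structured JSON without declaring a schema up front.
spark.sql("""
    SELECT PARSE_JSON('{"user": "a", "clicks": 3}') AS payload
""").show(truncate=False)

# Type widening: widen a column in place, without rewriting data files.
# Assumes 'delta.enableTypeWidening' = 'true' on the (hypothetical) table.
spark.sql("ALTER TABLE main.sales.events ALTER COLUMN clicks TYPE BIGINT")
```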

Watch the announcement from Data+AI Summit 2024

Apache Spark 4.0: unified, ultra-fast, scalable, flexible, and powerful

  • Spark Connect is GA in Apache Spark 4.0. It is a new client-server architecture that allows remote connectivity to Spark clusters, using the DataFrame API and unresolved logical plans as the protocol. This decoupled architecture enables Spark and its open ecosystem to be leveraged from anywhere and embedded in modern data applications, IDEs, notebooks, and programming languages (see the sketch after this list).
  • ANSI mode is on by default in 4.0
  • Variant data type for semi-structured data
  • String Collation Support
  • Millisecond Kafka-to-Kafka latency for Spark Structured Streaming!
  • Streaming state store data source, which allows you to inspect the internal state of streaming applications for debugging, profiling, testing, and troubleshooting
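As a minimal sketch of Spark Connect's client-server model (the endpoint below is hypothetical, and the client needs the Spark Connect packages installed):

```python
# The client builds DataFrame operations locally and ships unresolved
# logical plans to a remote Spark cluster over the Spark Connect protocol.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://spark-cluster.example.com:15002")  # hypothetical endpoint
    .getOrCreate()
)

df = spark.range(10).filter("id % 2 = 0")
df.show()  # Executed on the remote cluster; results streamed back
```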

Follow us on LinkedIn: Quentin & Youssef & Lara & Maria & Beatrice