Aashita
Contributor III

In this blog, I would like to introduce you to the Databricks Lakehouse Platform and explain, at a high level, concepts like batch processing, streaming, and Apache Spark, and how it all ties together with Structured Streaming.


What is Databricks?

The Databricks Lakehouse Platform is a unique amalgamation of the tools needed to build a data-driven organization. Let’s consider three personas in a typical organization:

  • Data Engineer: makes sure the data makes its way from source to destination, getting transformed along the way. A DE ensures the data is structured and cleansed so that it is ready to be consumed by Analysts and Scientists to build analytics and models that produce insights.
  • Data Analyst: makes sure the structured data is used to produce valuable insights for the team. A DA slices and dices the structured data according to their needs so that they can view the data from different perspectives.
  • Data Scientist: makes sure the structured data is used to produce machine learning models that help companies make better business decisions and scale according to need.

As you can see, an organization depends on the above personas to create a robust pipeline of success. It’s difficult to find one tool that can satisfy all of these needs, and that’s where Databricks comes into the picture. The platform is designed to fulfill all the data needs of an organization, from engineering to analytics to model building, in one single place.


Source: Databricks Docs

So, Databricks is an all-in-one platform that can handle your data needs, from building, deploying, and sharing to maintaining data solutions at scale, which also means it’s a single platform for your entire data team to collaborate on. Databricks sits on top of your existing cloud, whether that is AWS, Azure, or GCP, or even a multi-cloud combination of these three.

One major component of the platform that I would like to discuss in this post is streaming. But before that, let’s discuss batch processing.

What is Batch Processing?

Batch processing is the process of running repetitive, high-volume data jobs as a group on an ad-hoc or scheduled basis. Simply put, it is the process of collecting, storing, and transforming data at regular intervals. A common scenario is a data warehousing ETL job that runs once every night: it extracts the data from the source application, applies transformation logic to it, and stores it in the destination warehouse. In this scenario, there is a 24-hour interval between two consecutive job runs, which also means there is a one-day latency between our data warehouse and the source application.

Example of Batch Processing:

Let’s look at a concrete example of batch processing. Say we have three grocery stores owned by one retailer, Whole Foods. Whole Foods keeps track of the overall revenue across all three stores. But instead of processing every purchase in real time, Whole Foods processes each store’s daily revenue in a batch at the end of the day.
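To make this more concrete, here is a minimal PySpark sketch of what such a nightly batch job could look like. The file path, column names, and table name are purely illustrative, not part of any real Whole Foods pipeline.

# A minimal, illustrative batch job: read one day's raw transactions,
# aggregate revenue per store, and append the result to a summary table.
# The path, columns, and table name below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DailyRevenueBatch").getOrCreate()

# Yesterday's raw transactions, landed by the source application
daily_sales = spark.read.json("/data/sales/2023-07-28/")

# Aggregate total revenue per store for the day
daily_revenue = (daily_sales
    .groupBy("store_id")
    .agg(F.sum("amount").alias("revenue"))
    .withColumn("sale_date", F.lit("2023-07-28")))

# Append the day's summary to the destination warehouse table
daily_revenue.write.mode("append").saveAsTable("store_daily_revenue")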


Batch Processing

When to use Batch Processing?

  1. When you want to process large amounts of data.
  2. Usually run in a scheduled or ad-hoc manner.
  3. Latency can be minutes, hours, or days.
  4. Batch processing takes longer and is not suitable for time-sensitive data.
  5. Because batch jobs run only occasionally, it is a cost-saving option.

Pretty straightforward, right?

Now let’s move on to streaming.

What is Streaming?

Streaming is the process of syncing the destination data warehouse with the source application while transactions are taking place at the source, usually with less than a minute of latency. So we can say stream processing is real-time processing. For example, when a new row is added to a table at the source, the same row should appear in the destination table within seconds of passing through the entire Extract-Transform-Load pipeline.

When Databricks is coupled with platforms such as Apache Kafka, AWS Kinesis, or Azure Event Hubs, streaming quickly generates key insights that help data teams make quicker decisions.

Streaming Example:

Continuing our Whole Foods example with streaming, data is fed into the system piece by piece as soon as any transaction (grocery sale) takes place. In this scenario, streaming feeds each transaction at the store, or micro-batches of transactions, directly into the analytics platform instead of processing a batch of data every night. This allows Data Analysts and Data Scientists to produce key insights in real time, and it is especially suitable for online grocery orders placed through the app.
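As a rough sketch of what this could look like in code, here is a streaming version of the same revenue computation. The Kafka broker address, topic name, and event schema below are purely hypothetical, and the Kafka source additionally requires the spark-sql-kafka connector package.

# A minimal, illustrative streaming version of the same revenue computation.
# Assumes an existing SparkSession `spark`; broker, topic, and schema are hypothetical.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

sale_schema = StructType([
    StructField("store_id", StringType()),
    StructField("amount", DoubleType()),
])

# Read each grocery sale event as it is published to Kafka
sales_stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "grocery-sales")
    .load())

# Kafka delivers the payload as bytes in the `value` column; parse it as JSON
sales = (sales_stream
    .select(F.from_json(F.col("value").cast("string"), sale_schema).alias("sale"))
    .select("sale.*"))

# Keep a running revenue total per store, updated as new sales arrive
running_revenue = sales.groupBy("store_id").agg(F.sum("amount").alias("revenue"))

query = (running_revenue.writeStream
    .outputMode("complete")
    .format("console")
    .start())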

Streaming

When to use streaming?

  1. When you want to process small amounts of data in real time.
  2. Data is arriving continuously.
  3. Latency must be roughly one minute or less.

Now that we understand batch and streaming better, let’s dig a bit deeper.

What is Apache Spark?


Source: Databricks Docs

Apache Spark is the largest open-source project in data processing. It is a multi-language engine for executing data engineering, data science, and machine learning workloads on single-node or multi-node clusters. It has a built-in, advanced, distributed SQL engine for large-scale data processing. It was originally developed by Matei Zaharia (a Databricks co-founder) in a lab at UC Berkeley.

Here are some of the key features of Apache Spark:

  1. Speed, scalability, and reliability: It offers high data processing speed, about 100x faster in memory and 10x faster on disk than traditional MapReduce jobs.
  2. Unified batch and streaming API: Processes batch and streaming data using your language of choice, such as Python or SQL (see the small sketch after this list).
  3. SQL analytics: Executes ANSI SQL queries very fast.
  4. Data science: Performs Exploratory Data Analysis (EDA) on very large, even petabyte-scale, datasets.
  5. Machine learning: Reuses the same ML code to scale from a single machine to large clusters.
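As a tiny illustration of points 2 and 3, the same aggregation can be expressed with the DataFrame API or as an ANSI SQL query. The file path and column names here are made up for the example.

# Illustrative only: the CSV path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SparkApiSketch").getOrCreate()

df = spark.read.option("header", True).option("inferSchema", True).csv("/data/transactions.csv")

# DataFrame API
df.groupBy("store_id").agg(F.sum("amount").alias("revenue")).show()

# The same computation expressed in ANSI SQL
df.createOrReplaceTempView("transactions")
spark.sql("""
    SELECT store_id, SUM(amount) AS revenue
    FROM transactions
    GROUP BY store_id
""").show()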

Impressive, right?

Well, Databricks is built on top of Spark, so you can only imagine the performance you get while using the platform.

Now, let’s combine all the above concepts of batch processing, streaming, and Spark and move on to our next topic, i.e., Spark Structured Streaming.

What is Spark Structured Streaming?

Structured Streaming is a high-level API for stream processing. It is a near-real-time processing engine that offers end-to-end fault tolerance. It allows you to take the same operations that you perform in batch mode using Spark’s structured APIs and run them in a streaming fashion. This can reduce latency and allow for incremental processing. The Structured Streaming engine performs the computation incrementally and continuously updates the result as streaming data arrives.

A simple example of Spark Structured Streaming:

In this example we will use Structured Streaming to maintain a running word count of text data received from a server on a socket.
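If you want to follow along, the Spark quick example first starts a small local data server with Netcat in a separate terminal; you then type words into that terminal once the streaming query is running:

# In a separate terminal: start a local data server listening on port 9999
nc -lk 9999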

# Import classes and create a local SparkSession

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

Next, let’s create a DataFrame and transform it to calculate word counts.

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()

Next, we start running the query that prints the counts.

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()

After the above lines of code are executed, the streaming application will start in the background. Here, awaitTermination() is used to prevent the process from exiting while the query is active.

You can expect the console output to look something like this:
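Below is an illustrative sketch of what the console sink prints; the actual words and counts depend entirely on what you type into the socket session.

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-----+
|  word|count|
+------+-----+
|apache|    1|
| spark|    1|
+------+-----+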

 

Source: Structured Streaming Programming Guide

I personally started learning Spark only recently, and I was surprised by the number of courses and the amount of material available online these days. Finding the right materials was crucial because I wanted to learn quickly but also get a solid understanding of the concepts. If this blog got you a bit curious and interested in learning more about Databricks and Spark, I have put together a list of resources to get your journey started.

  1. Spark: The Definitive Guide (book) - https://learning.oreilly.com/library/view/spark-the-definitive/9781491912201/
  2. Databricks Academy courses - https://www.databricks.com/learn/training/home
  3. The Spark Streaming Programming Guide - https://spark.apache.org/docs/latest/streaming-programming-guide.html
  4. Free Databricks Lakehouse Fundamentals training - https://www.databricks.com/learn/training/lakehouse-fundamentals

I hope this blog helped you understand what Databricks is as a platform, the different types of data processing methods (batch and streaming), what Apache Spark is along with its key features, and lastly gave you an introduction to Structured Streaming. In the next part of this series, we will look at how Databricks ties these concepts together.

References:

  1. https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#quick-example
  2. https://docs.databricks.com/structured-streaming/examples.html
  3. https://spark.apache.org/

Disclaimer: The opinions and ideas shared in this blog post are my own