Inspired by the approach of Simple Wikipedia, the Databricks Simplified series aims to present various Databricks concepts and products in an easily understandable manner. This series will cover topics such as Notebooks, Unity Catalog, Delta Lake, DBSQL, MLflow, and more.

- - -


Notebooks are at the heart of Databricks, and generally speaking, many of your interactions with Databricks will be through them. They’re flexible, robust, and integrated with all Databricks products.

What are they?

Much like Word documents hold your written content, notebooks keep your code. This code runs within Databricks to do things such as transforming data, training machine learning models, performing ad hoc data analyses, and more.

In contrast to documents, notebooks segment your code into individual commands. Each command can be written in a different programming language (Python, SQL, Scala, or R), and each command can be run independently.

Can I see them in action?

The screenshot below displays a Databricks Notebook featuring two commands.

[Image: simple-1.png]

The first command, written in Python, imports a CSV file from a Databricks Volume (AWS / Azure) and creates a temporary view of the data. The second command uses SQL to query the temporary view and count the number of users per country. When run, each command's output is displayed directly below the command.

[Image: simple-2.png]
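If you'd like to see the code behind the screenshots, here is a minimal sketch of what those two commands might look like. The file path, view name, and column names are illustrative, and the SQL command is wrapped in spark.sql() so the sketch stays self-contained Python; in the notebook itself, it would simply be a SQL command.

```python
# Command 1 (Python): read a CSV file from a Volume and expose it as a
# temporary view. The path and column names below are illustrative.
df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("/Volumes/main/default/landing/users.csv")
)
df.createOrReplaceTempView("users")

# Command 2 (SQL in the notebook; wrapped in spark.sql() here): count
# the number of users per country using the temporary view.
spark.sql("""
    SELECT country, COUNT(*) AS user_count
    FROM users
    GROUP BY country
    ORDER BY user_count DESC
""").show()
```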

Notebooks provide a clean, easy-to-read user interface for all your data-related interactions within Databricks.

So, what makes them so unique?

Multilingual

The first key feature is the ability to accommodate and run code written in various programming languages. On most platforms, you are expected to stick to a single programming language: each language requires its own tooling to interpret and run the code, which makes multilingual applications virtually nonexistent. Databricks Notebooks, however, can handle and run code in multiple languages, introducing flexibility that was once very hard to achieve.

Each programming language has its own strengths and weaknesses. Tasks that are intricate or seemingly impossible in one language may be significantly simpler in another. This flexibility lets users produce code that is more straightforward to write, read, and maintain.
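As a small illustration of playing to each language's strengths (reusing the users view from the earlier example): programmatic, row-level cleanup is often more natural in Python, while the aggregation reads more clearly in SQL. In a notebook, the SQL could live in its own SQL command; it is wrapped in spark.sql() here to keep the sketch self-contained.

```python
from pyspark.sql import functions as F

# Python is convenient for programmatic, column-by-column cleanup...
cleaned = (
    spark.table("users")
    .withColumn("country", F.upper(F.trim("country")))
)
cleaned.createOrReplaceTempView("users_clean")

# ...while the aggregation is arguably clearer in SQL. Temporary views
# act as the bridge that lets one language pick up where another left off.
spark.sql("""
    SELECT country, COUNT(*) AS user_count
    FROM users_clean
    GROUP BY country
""").show()
```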

Open

The second important aspect is the ability to leverage code developed by the open source community, commonly packaged as libraries. Popular programming languages like Python, Scala, and R boast extensive libraries freely available to the community. Users can draw on these libraries to enhance their code, improve efficiency, or simplify complex operations.

Users have the option to install libraries directly within their notebooks. As a result, those libraries can be referenced by the code within that notebook without affecting other users or notebooks on the Databricks platform. This level of isolation proves remarkably useful, especially given the complexities libraries can introduce (e.g., dependency or version conflicts).
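For example, a notebook-scoped library can be installed with the %pip magic command. The library below (faker, a package for generating fake test data) is just an illustrative choice; in practice, the install and the import would sit in separate commands.

```python
# In one command: install a library scoped to this notebook only.
# Other notebooks attached to the same cluster are unaffected.
%pip install faker

# In a following command: the library is now importable by this
# notebook's code.
from faker import Faker

fake = Faker()
print(fake.name())  # prints a randomly generated name
```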

Although libraries are a common aspect of most programming languages, numerous platforms either restrict the installation and use of external libraries or impose significant limitations. Such restrictions compel users to operate within strict constraints, diminishing their efficiency and introducing unnecessary complexity.

Interactive and Scheduled

As mentioned, notebooks contain the code required for various tasks within Databricks. The crucial next question is: How do these notebooks operate?

This is the third important aspect that sets notebooks apart. Users can work with notebooks interactively while developing and testing their code. The code within a notebook can be run directly within Databricks on a cluster (a set of compute resources and configurations on which you run notebooks and jobs). Users can therefore create and run their code within a single pane of glass, with fewer hops between different products and services.

Once development is complete, users can promote the notebook to production (AWS / Azure) and schedule it to run at specified intervals (hourly, daily, weekly, and so on). The notebook the developer used interactively during development is the very same one that runs their production workloads.

[Image: simple-3.png]
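As a rough sketch of what scheduling can look like outside the UI, the Databricks Python SDK can create a job that runs a notebook on a cron schedule. The job name, notebook path, cluster ID, and schedule below are all placeholders; most users would configure the same thing through the Workflows UI.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # authenticates from the environment / CLI config

w.jobs.create(
    name="daily-users-report",  # placeholder job name
    tasks=[
        jobs.Task(
            task_key="run_notebook",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Users/someone@example.com/users_report"
            ),
            existing_cluster_id="<cluster-id>",  # placeholder cluster ID
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 6 * * ?",  # every day at 06:00
        timezone_id="UTC",
    ),
)
```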

Visual

Visualizations bring a whole new perspective to your data that a table simply can't. Databricks Notebooks can represent your data using many built-in visualizations, including bar, line, pie, histogram, scatter, map, and more.

[Image: simple-geo.png]

Visualizations can emphasize outliers, draw attention to problem areas, or help visual learners better comprehend the data. Users can create one or more visualizations for each command's result, and those visualizations are automatically refreshed whenever the command is re-run.
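For instance, rendering a command's result with display() (again assuming the users view from earlier) produces an interactive table, and the built-in visualization editor can then turn that same result into a chart or map without any plotting code.

```python
# display() renders the result as an interactive table in the notebook
# output; built-in visualizations (bar, pie, map, ...) can then be added
# on top of the same result without writing any plotting code.
display(
    spark.sql("""
        SELECT country, COUNT(*) AS user_count
        FROM users
        GROUP BY country
    """)
)
```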

Collaborative

Similar to Google Docs, Databricks Notebooks facilitate real-time collaboration among multiple users. There's no need to share screens, send copies, or interrupt your ongoing work just to seek assistance from a team member.

[Image: simple-4.png]

Users can collaborate by writing or modifying code and leaving comments for each other. Elaborate solutions can be constructed simultaneously, with multiple users collaborating within a single notebook.

Versioned

Every notebook maintains its own version history, ensuring you can keep track of your or your team's changes. This history is easily accessible, and viewing a historical change gives users a side-by-side comparison of the selected version and the current state of the notebook.

[Image: simple-5.png]

Users can, if they wish, restore their notebook to any previous version, ensuring no code is ever lost.

Wrapping it up

Notebooks are central to Databricks, serving as the gateway to much of the platform's functionality.

For many data engineers and data warehousing developers, notebooks offer a step up in usability, functionality, and flexibility, allowing for more interactive and iterative approaches to data manipulation and analysis. This is due to their ability to blend code, visualizations, and narrative text in a single, easily shareable document.

This makes them particularly useful for collaborative projects, as they can be used to document the data exploration and analysis process in a way that is understandable to both technical and non-technical stakeholders. It's a far cry from the simple SQL editor interface most data engineers and data warehousing developers are used to.

You can check out Introduction to Databricks Notebooks (AWS / Azure) for a more comprehensive look into notebooks.