Databricks Community

Sujitha · ‎01-25-2023

Latest Blog Posts

January 13 - 20

Did you get a chance to look at the most recent blog posts?

Here is some happening content from the past week that is worth a read.

What’s New With SQL User-Defined Functions

In this blog, we describe several enhancements we have recently made to make SQL user-defined functions even more user-friendly and powerful, along with examples of how you can use them to encapsulate logic in components suitable for use on your own or for sharing with others. This way, you can keep queries simple while enjoying strong type-safety thanks to the Databricks SQL analyzer. Please read on for more details.

Easy Ingestion to Lakehouse With COPY INTO

This blog focuses on COPY INTO, a simple yet powerful SQL command that allows you to perform batch file ingestion into Delta Lake from cloud object stores. It is idempotent: when executed multiple times, it ingests each file with exactly-once semantics, which supports incremental appends and simple transformations. It can be run once, in an ad hoc manner, or scheduled through Databricks Workflows. In recent Databricks Runtime releases, COPY INTO gained new functionality for data preview, validation, and enhanced error handling, plus a new way to copy into a schemaless Delta Lake table so that users can get started quickly, completing the end-to-end user journey to ingest from cloud object stores. Let's take a look at the popular COPY INTO use cases.
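To make the idempotency claim concrete, here is a minimal toy model in Python. It is only a sketch of the general idea (remember which files were already loaded and skip them on a re-run); it is not how Databricks implements COPY INTO, and `ToyTable` and `copy_into` are hypothetical names made up for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Hypothetical stand-in for a target table; not a Databricks API."""
    rows: list = field(default_factory=list)
    loaded_files: set = field(default_factory=set)  # names of files already ingested


def copy_into(table: ToyTable, source: dict) -> int:
    """Append rows from files not yet ingested; return the number of newly loaded files."""
    new_files = sorted(name for name in source if name not in table.loaded_files)
    for name in new_files:
        table.rows.extend(source[name])
        table.loaded_files.add(name)
    return len(new_files)


# Assumed example data: file name -> rows contained in that file.
source = {"a.csv": [1, 2], "b.csv": [3]}
table = ToyTable()

first = copy_into(table, source)   # first run loads both files
second = copy_into(table, source)  # re-run finds nothing new, so no duplicates
source["c.csv"] = [4]
third = copy_into(table, source)   # incremental append: only the new file is loaded
```

Running the same call repeatedly leaves the table unchanged until a new file shows up in the source, which is the behavior that makes a scheduled ingestion job safe to re-run.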

Streaming in Production: Collected Best Practices

The recommendations in this blog post are written from the Structured Streaming engine perspective, and most of them apply to both DLT and Workflows (although DLT does take care of some of these automatically, like Triggers and Checkpoints). We group the recommendations under the headings "Before Deployment" and "After Deployment" to highlight when these concepts will need to be applied, and we are releasing this blog series split along the same lines. Additional deep-dive content for some of the sections will follow as well. We recommend reading all sections before beginning work to productionalize a streaming pipeline or application, and revisiting these recommendations as you promote it from dev to QA and eventually production.

Best Practices for Super Powering Your dbt Project on Databricks

dbt is a data transformation framework that enables data teams to collaboratively model, test and document data in data warehouses. Getting started with dbt and Databricks SQL is very simple with the native dbt-databricks adapter, support for running dbt in production in Databricks Workflows, and easy connectivity to dbt Cloud through Partner Connect. You can have your first dbt project running in production in no time at all!

However, as you start to deploy more complex dbt projects into production, you will likely need to start using various advanced features like macros and hooks, dbt packages, and third-party tools to help improve your productivity and development workflow. In this blog post, we will share five best practices to supercharge your dbt project on Databricks.

Streaming in Production: Collected Best Practices, Part 2

This is the second article in our two-part blog series titled "Streaming in Production: Collected Best Practices." Here we discuss the "After Deployment" considerations for a Structured Streaming Pipeline. The majority of the suggestions in this post are relevant to both Structured Streaming Jobs and Delta Live Tables (our flagship and fully managed ETL product that supports both batch and streaming pipelines).

The previous part, "Before Deployment," is covered in Collected Best Practices, Part 1. If you haven't read that post yet, we suggest doing so first.

We still recommend reading all of the sections from both posts before beginning work to productionalize a Structured Streaming job, and hope you will revisit these recommendations again as you promote your applications from dev to QA and eventually production.