Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
tetan
New Contributor II

Proposed by Databricks in 2020, the Lakehouse architecture has been increasingly embraced by the industry in recent years. Databricks has since applied generative AI and evolved the Databricks Lakehouse Platform into the Databricks Data Intelligence Platform. Alongside the booming demand for unlocking the power of data and AI, safeguarding sensitive information has become paramount for organizations of all sizes. With security in mind from the ground up, the Data Intelligence Platform enables customers to protect data privacy and confidentiality.

With the increasing frequency and sophistication of cyber threats, the traditional perimeter-based security model is no longer sufficient. Businesses are now turning towards more advanced and integrated solutions to protect their data assets. In November 2023, Databricks announced the general availability of Azure Databricks support for Azure confidential computing (ACC). It allows Azure Databricks customers to run workloads on Azure confidential virtual machines (VMs), which leverage AMD EPYC™ CPUs to protect data in memory in hardware-based trusted execution environments. 

In this blog post, we'll delve into how ACC helps enhance data protection by encrypting data in use. Then we'll briefly introduce how to use ACC in your Azure Databricks workspace and discuss the scenarios where it is a good fit. Lastly, we'll measure the performance impact of using ACC instances with a simple benchmark.

Protecting data in use

tetan_0-1713183829215.png

Authentication (AuthN) and authorization (AuthZ) form the first layer of data protection. Users are first authenticated by the identity provider (Microsoft Entra ID). Then, with the help of Databricks Unity Catalog, a unified data governance solution, users are granted only time-bound, authorized access to the data. In addition to AuthN and AuthZ, encryption is the technique we rely on to ensure data privacy and confidentiality are enforced throughout the data lifecycle.
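As a minimal sketch of the AuthZ layer described above: Unity Catalog privileges are expressed in SQL. The catalog path (`main.sales.orders`) and group name (`data-analysts`) below are hypothetical placeholders, not names from this post.

```python
# Hypothetical example of a Unity Catalog grant; the table path
# (main.sales.orders) and group name (data-analysts) are placeholders.
principal = "data-analysts"
table = "main.sales.orders"
grant_sql = f"GRANT SELECT ON TABLE {table} TO `{principal}`"

# On a Databricks cluster with Unity Catalog enabled, this would be
# executed as: spark.sql(grant_sql)
print(grant_sql)
```

Pairing such grants with time-bound group membership is what gives users only temporary, scoped access to the data.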

In Azure Databricks, encryption is applied to data at rest and in transit. With ACC, the protection now extends to data in use as well. Let's look at how data is protected in each of these three states.

Data at rest

Azure Databricks mainly uses Azure Data Lake Storage (ADLS) as its data store. Upon workspace creation, an ADLS blob container is mounted as the Databricks File System (DBFS) root. The DBFS root hosts many workspace objects, intermediate results, and secrets. By default, Azure Databricks uses platform-managed keys (PMKs) to encrypt the data stored in the DBFS root container, but you can also use customer-managed keys (CMKs) for more control over the keys. As shown in the diagram, clusters in your data plane communicate with Azure Key Vault and securely use your specified key to encrypt the data in the DBFS root.

For data residing in ADLS containers (Delta Lake table files, as well as data files in other formats such as CSV and JSON), Azure also encrypts blobs and files with PMKs by default. You can likewise specify your own CMKs in Azure Key Vault as the encryption keys.

Data in transit

As shown in the diagram, Azure Databricks ensures that all communication channels, between users or BI apps and the control plane, between the control plane and the data plane, and between the data plane and storage, are encrypted using the TLS/SSL protocol.

In addition to encrypting data in transit, you can further enhance data protection by implementing Azure Private Link. It provides private connectivity from Azure VNets and on-premises networks to Azure services without exposing the traffic to the public network.

Data in use

Databricks launches over 10 million VMs on Azure per day to help customers realize the value of their data. By introducing support for ACC, Azure Databricks now brings protection of data in use, namely when data is processed by VMs. Combined with the above-mentioned techniques, we can achieve end-to-end security throughout the data lifecycle.

Powered by the SEV-SNP technology of AMD Infinity Guard, AMD confidential VMs provide this protection via full VM memory encryption while minimizing the performance impact, offering a balance between security and performance. In addition to the full disk encryption (FDE) available for Azure VMs, SEV-SNP brings strong memory integrity protection. It helps prevent malicious hypervisor-based attacks such as data replay and memory re-mapping, creating an isolated execution environment. This helps protect the confidentiality of your data even if a malicious VM finds a way into your VM's memory, or a compromised hypervisor reaches into a guest VM.

When to use ACC

When processing data in the cloud, ACC ensures the data is processed only after the cloud environment has been verified, helping prevent data access by cloud vendors and administrators. ACC is therefore a good solution when you need to process highly sensitive data with confidentiality as the key requirement.

In terms of industries, ACC particularly benefits regulated sectors such as healthcare and financial services. However, other sectors like retail, manufacturing, and energy can also leverage the power of ACC. For example, regardless of the sector, the following kinds of data are good candidates for processing under ACC:

  • Customer/user profiles
  • User payment information
  • Tax and social security information
  • Commercial secrets such as intellectual property
  • Health records

How to use ACC

You can use ACC VM instances either for interactive development or for data pipelines. To use them for development, go to the "All-purpose cluster" tab and click "Create Compute", based on the cluster policies available to you. As a side note, your platform administrators can create new cluster policies or adjust existing ones to enforce the use of confidential VMs. ACC VM instances are now integrated into the VM dropdowns in Azure Databricks: in the "Worker type" dropdown you can select one of the confidential VMs, as shown in the screenshot below, and you can likewise select a confidential VM as the driver. The available Azure VM series are DCasv5 for general-purpose workloads, and ECasv5 and ECadsv5 for memory-intensive workloads. Note that some options may be greyed out, due to either the regional availability of ACC or your cluster policy.


Note: 
At the time of writing, ACC has been enabled in 9 regions: East US, West US, North Europe, West Europe, Southeast Asia, Central India, East Asia, Switzerland North, and Japan East. More will come soon.

If you want to run your data engineering pipeline on ACC clusters, simply edit the "Cluster" field of the job definition, or click "Add new job cluster" (see screenshot). You will then see the same dropdowns for selecting confidential VM types.
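Beyond the UI, the same choice can be expressed programmatically. The sketch below shows a hypothetical cluster specification in the shape of the Databricks Clusters API; the node_type_id values mirror the confidential DCasv5 series, but the exact instance names and runtime version available to you should be checked against your workspace and region.

```python
import json

# Hypothetical cluster spec selecting AMD confidential VMs for both
# driver and workers; field names follow the Databricks Clusters API.
cluster_spec = {
    "cluster_name": "acc-demo",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DC4as_v5",         # confidential worker type
    "driver_node_type_id": "Standard_DC4as_v5",  # confidential driver type
    "num_workers": 2,
}
print(json.dumps(cluster_spec, indent=2))
```

Such a spec can be submitted through the Clusters API or embedded in a job definition as a new job cluster, so pipelines run on confidential VMs without any manual UI steps.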

tetan_1-1713183953231.png

Now we know how to use ACC instances. Before adopting them, there are two additional considerations: price and performance.

tetan_2-1713183953219.png


ACC instances are priced the same as their non-confidential counterparts. For example, DC4as v5 costs the same as D4as v5 in pay-as-you-go mode. When purchasing 1-year or 3-year reserved VMs, however, the discounts are smaller for the ACC series than for the normal series. Check the pricing page for more information.

When using ACC instances, encrypting and decrypting data in memory inevitably brings some overhead. To measure this impact, we developed a simple benchmark, described in the following section.

Benchmarking on ACC instances

We create a test Spark DataFrame with 100 million rows of randomly generated UUIDs, then compute SHA-256 hash values on them. The sample code to generate the test DataFrame and perform the SHA-256 computation is in the Appendix.

We run the benchmark code on 3 groups of single-node Azure Databricks clusters, varying in VM size: small, medium, and large. Each group has 2 single-node clusters, one of a normal instance type and the other of an ACC instance type, with identical capacity, VM generation, and optimization technologies (the only difference being whether the VM is confidential). The details of these VMs are summarized in the following table.

| Group  | Instance type                      | Instance capacity   | Databricks runtime          |
|--------|------------------------------------|---------------------|-----------------------------|
| Small  | D4as_v5 (normal), DC4as_v5 (ACC)   | 4 cores, 16 GB RAM  | 13.3 LTS, Photon turned off |
| Medium | D8as_v5 (normal), DC8as_v5 (ACC)   | 8 cores, 32 GB RAM  | 13.3 LTS, Photon turned off |
| Large  | D16as_v5 (normal), DC16as_v5 (ACC) | 16 cores, 64 GB RAM | 13.3 LTS, Photon turned off |


We run the code 3 times on each cluster and measure the average execution time in seconds. The results are shown in the following diagram. The execution time on ACC instances is longer than on the normal instances in all 3 groups, confirming that encrypting and decrypting data in memory does bring overhead. On average, the program takes 10% to 20% more time when running on ACC instances of the same capacity.
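To make the comparison concrete, the overhead percentage can be computed as in the sketch below. The run times here are made-up illustrative numbers, not the actual measurements from the diagram.

```python
# Illustrative only: these run times (in seconds) are hypothetical,
# not the actual benchmark measurements.
normal_runs = [100.0, 102.0, 98.0]   # 3 runs on the normal instance
acc_runs = [115.0, 117.0, 113.0]     # 3 runs on the ACC instance

normal_avg = sum(normal_runs) / len(normal_runs)
acc_avg = sum(acc_runs) / len(acc_runs)
overhead_pct = (acc_avg - normal_avg) / normal_avg * 100
print(f"ACC overhead: {overhead_pct:.1f}%")  # 15.0% for these made-up numbers
```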

Note that this is just a simple benchmark to give a rough sense of ACC instance types. It does not cover distributed processing on multi-node clusters, which involves data shuffling, nor other types of computation such as joins and aggregations. For a thorough evaluation, please set up a benchmark based on your own workloads.

tetan_3-1713184070138.png

Conclusion

By introducing Azure confidential computing (ACC) instance types, Azure Databricks allows customers to additionally safeguard their data while it is in use. Combined with protections like Unity Catalog, customer-managed keys, and Private Link, you can achieve end-to-end security throughout the data lifecycle.

While encrypting and decrypting data in memory will incur computation overhead, ACC is a great solution when you need to process highly sensitive data, as it offers a good balance between security and performance.

Start today to identify your security-critical workloads, and use Azure Confidential Computing in Azure Databricks to protect your data!

 

Appendix

Sample code generating the test DataFrame and computing SHA-256 hash values on it.

 

from timeit import default_timer as timer
import uuid

import pyspark.sql.functions as F
import pyspark.sql.types as T

def getUUID():
    return uuid.uuid4().hex

udfGetUUID = F.udf(getUUID, T.StringType())

# Generate the test DataFrame: 100 million rows of random UUIDs
df = (
    spark.range(0, 100000000)
    .withColumn('value', udfGetUUID())
)

def computeSHA256(df):
    start_time = timer()
    hashed_df = (
        df
        .withColumn('hashed_value', F.sha2(F.col('value'), 256))
    )
    # trigger the computation
    hashed_df.count()
    end_time = timer()
    return end_time - start_time

sha256_time = computeSHA256(df)