Administration & Architecture

Databricks best practices for azure storage account

Dnirmania
New Contributor III

Hello Everyone

We are currently in the process of building out Azure Databricks and have some questions about best practices for the Azure storage account we will use to store data. Can anyone help me find best practices for the storage account, especially for backup and recovery scenarios?

Thank you!

Dinesh Kumar

Accepted Solutions

Rjdudley
Contributor II

"It depends..."

Unfortunately, Databricks only has some general best practices, and will encourage you to connect with your account team for specific advice.  If you haven't met your account team yet, you should try to find them; they're a great resource.

As a cloud architect and Databricks administrator, here are some specific things I recommend for storage in Azure Databricks:

  • Even in a very large enterprise you can, technically, do everything in a single container in a single storage account.  Some people will even recommend this with a straight face.  Absolutely, positively do not do this.  Your permissions will trip over each other and it will be a management headache, not to mention that it will be very difficult to clean up and archive old data to less expensive storage tiers when the time comes.
  • Your dev environment should behave like your prod environment.  Use the same architecture as much as you can, allowing for differences in work in progress.  Use the same permissions, the same networking setup, everything.  You can thank me when you're not debugging in prod.
  • If you can, use Pulumi or Terraform, and set defaults for the important settings.  This also helps dev behave like prod.
  • On all containers, use infrastructure encryption with a Microsoft managed key (or a customer managed key if your enterprise has that capability, your DevOps would know).
  • Use a different storage account for each medallion layer, so at a minimum have a bronze, a silver, and a gold storage account.  This lets you set access permissions differently at the storage account level, which makes it easier to have data scientists working with cleansed and enriched data in gold without accidentally finding something they shouldn't in bronze.
  • Depending on where your data originates, consider having a fourth layer, called "landing".  I recommend this especially if external partners or applications are pushing data to you.  If you're pulling data using internal applications, you can get by with landing containers in bronze, since you won't need access from the public internet.  Move files out of landing into your bronze storage account, then process from bronze.  This sounds like overkill, but publicly available landing containers have been responsible for many recent data breaches, and the penalties are getting more severe.
  • In bronze, I recommend having separate containers for the raw data from each source.  You could get by with a single container and subfolders, but this gives you cleaner separation, especially if you need to drop a data source.
  • In bronze, I recommend having a landing location for raw data files and a separate location for the DLTs (Delta Live Tables) of your raw data.  Yes, you can query directly from raw files, but there are many advantages to starting with DLTs, which is a separate discussion.  I prefer to use separate containers for raw and dlt, but you could use a single container with a "raw" and a "dlt" folder.
  • The containers in silver and gold require some thought, and this is where the specifics of your enterprise come into play.  Use all the recommended storage configuration.  Silver may have fewer containers than bronze or gold; it all depends on what your first transformation of data looks like.  You can separate silver and gold into specific uses, like "dimensional analysis" or "marketing analytics", or you can separate by entity, or by data domain, whatever is going to work best for you.  It's sometimes easiest to start small, work with a limited but related set of data, create the Azure resources, and grow from there.
  • Besides the individual permissions, I like to use separate containers because with absolutely humongous datasets you can push the I/O of a container to its limits.  Again, this is where the specifics of your enterprise come into play.  Depending on your Azure region (again, your specifics...) containers should easily handle gigabytes per minute of I/O, but if you're repeatedly querying the same few columns of data over and over, you can create partition hotspots, and then you need a different partition strategy.  I've hit this wall in the past.  You get very specific errors, and it's not likely to be an issue from the start unless you are streaming double-digit terabytes.
  • You don't really "back up" cloud data in the old tape-backup sense.  At a bare minimum you want zone-redundant storage (ZRS; see Data redundancy - Azure Storage | Microsoft Learn), which replicates data across availability zones in your primary region.  You can also add geo-redundancy with a secondary region (geo-zone-redundant storage, GZRS), but there are several considerations to this.  First, Databricks accounts and workspaces are region bound, so if you plan on using a secondary region for failover, you'll need to set up Databricks in that region also, making sure to deploy your code in both regions.  Second, you need to make sure your secondary region is compliant with all the same privacy laws and is preferably in the same country as your primary region.  Once again, it depends on your specifics, but you typically can't use a US region as a secondary for a European primary without tripping over GDPR rules.  It is possible to run Databricks multi-region, but you should also keep in mind that Databricks is built for analytics and isn't really meant to be a transactional, operational backend database.  This would usually make Databricks a business continuity level 2 or 3, which doesn't usually require multi-region.  The emerging AI capabilities of Databricks may soon find use in operational systems, so again, evaluate what you're going to do; you could just serve models from regions and run lower overhead.
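
To make the layout above concrete, here is a minimal Python sketch that plans one storage account per medallion layer, with separate containers inside each, and prints the `abfss://` root for every container.  The `acme` prefix, the `dev` environment tag, the container names, and the `plan_layout` helper are all hypothetical; the only things taken from Azure are the ABFS URI shape and the basic naming limits (accounts: 3-24 lowercase letters and digits; containers: 3-63 lowercase letters, digits, and single hyphens).  Treat it as a planning aid, not as the way any particular tool does it.

```python
import re
from dataclasses import dataclass

# Basic Azure naming limits, used here only as sanity checks.
ACCOUNT_RE = re.compile(r"[a-z0-9]{3,24}")
CONTAINER_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


@dataclass(frozen=True)
class Location:
    layer: str      # medallion layer, e.g. "bronze"
    account: str    # storage account dedicated to that layer
    container: str  # container inside the account

    def abfss(self, path: str = "") -> str:
        """Return the ABFS URI for a path inside this container."""
        root = f"abfss://{self.container}@{self.account}.dfs.core.windows.net/"
        return root + path.lstrip("/")


def plan_layout(prefix: str, env: str, layers: dict[str, list[str]]) -> list[Location]:
    """Build one storage account name per layer (hypothetical convention:
    prefix + env + layer) and list the containers planned for it."""
    locations = []
    for layer, containers in layers.items():
        account = f"{prefix}{env}{layer}"
        if not ACCOUNT_RE.fullmatch(account):
            raise ValueError(f"invalid storage account name: {account!r}")
        for container in containers:
            if not (3 <= len(container) <= 63 and CONTAINER_RE.fullmatch(container)):
                raise ValueError(f"invalid container name: {container!r}")
            locations.append(Location(layer, account, container))
    return locations


# Hypothetical example: a landing layer for partner uploads, raw and DLT
# containers per source in bronze, and purpose-based containers in gold.
layout = plan_layout("acme", "dev", {
    "landing": ["partner-drop"],
    "bronze": ["sales-raw", "sales-dlt"],
    "silver": ["sales"],
    "gold": ["marketing-analytics"],
})
for loc in layout:
    print(loc.layer, loc.abfss())
```

Running the same plan with `env="prod"` gives you an identical structure under different account names, which is one cheap way to keep dev behaving like prod.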

For some references:

filipniziol
Contributor III

Hi @Dnirmania ,

It appears that your question is primarily about best practices for Azure Storage accounts, specifically focusing on backup and recovery scenarios, rather than being directly related to Databricks. I recommend reviewing the following two Microsoft articles:

Azure Storage Redundancy: This article details the various redundancy options available in Azure Storage. Understanding these options will help you configure the appropriate level of data replication and resilience for your storage account, ensuring your data is stored securely and is highly available.

Azure Storage Disaster Recovery Guidance: This resource provides comprehensive guidance on planning and implementing a disaster recovery strategy for your Azure Storage account. It covers best practices for backup, recovery, and how to prepare for and execute a storage account failover.

Dnirmania
New Contributor III

Thanks @filipniziol for your suggestion. The articles you shared focus on general best practices and recommendations for storage accounts. What I'm looking for are Databricks-specific recommendations for configuring storage accounts.

szymon_dybczak
Esteemed Contributor III

Hi @Dnirmania ,

Best practice is to configure storage with Unity Catalog:

Connect to cloud object storage and services using Unity Catalog - Azure Databricks | Microsoft Lear...

But in your question you're asking about backup and recovery scenarios at the storage level. Those should be handled with Azure-native capabilities such as Azure Backup. The same applies to DR for Azure storage; you can read more in the documentation entry below.
Databricks just uses the object storage of the given cloud provider to store data. It's your responsibility (or your Azure administrator's) to plan backup and disaster recovery scenarios.
Azure storage disaster recovery planning and failover - Azure Storage | Microsoft Learn

Dnirmania
New Contributor III

Thanks for sharing your knowledge with us. It will definitely help me and other data engineers. Thanks once again  😊 

bhanu_gautam
Contributor

Thanks for sharing @Rjdudley @szymon_dybczak @filipniziol 

Regards
Bhanu Gautam

Kudos are appreciated

To follow up, you can actually back up blobs: Overview of Azure Blobs backup - Azure Backup | Microsoft Learn, including to on-premises.  Obviously on-premises capacity is a large question.  I excluded this because I question what you would accomplish with the cloud backup option that wouldn't be better served by geo-replication, but in the interest of thoroughness I felt I had to mention this option.  Up to your specific needs, though.
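
For the geo-replication side of that comparison, the redundancy options discussed in this thread map onto a small set of storage account SKU names.  The helper below is a hypothetical sketch (the function name and its three flags are my own); the SKU strings are the standard Azure ones, where ZRS means copies across availability zones in the primary region, GRS/GZRS add asynchronous replication to a secondary region, and the RA- variants add read access to that secondary.

```python
def pick_redundancy_sku(zone_redundant: bool,
                        geo_redundant: bool,
                        read_secondary: bool = False) -> str:
    """Map redundancy requirements to an Azure storage account SKU name."""
    if read_secondary and not geo_redundant:
        raise ValueError("read access to a secondary region requires geo-redundancy")
    if geo_redundant:
        base = "GZRS" if zone_redundant else "GRS"
        return f"Standard_{'RA' if read_secondary else ''}{base}"
    return "Standard_ZRS" if zone_redundant else "Standard_LRS"


# Zones only, single region (the "bare minimum" recommended above).
print(pick_redundancy_sku(zone_redundant=True, geo_redundant=False))
# Zones in the primary region plus a replicated secondary region.
print(pick_redundancy_sku(zone_redundant=True, geo_redundant=True))
```

Whatever value you settle on, set it as a default in your Pulumi or Terraform code so every layer's storage account gets the same redundancy unless you deliberately override it.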
