Q&A Recap from 11/30 Office Hours
Q: What is the downside of using Z-ordering and auto optimize? It seems like there could be a tradeoff when writing small files (whereas it is good at reading larger files); is that true?
A: By default, Delta Lake on Databricks collects statistics on the first 32 columns defined in your table schema. It keeps track of simple statistics, such as minimum and maximum values, at a granularity that is correlated with I/O granularity. Collecting statistics on long strings is an expensive operation that can sometimes become a bottleneck.
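As a concrete illustration, here is a minimal sketch for a notebook cell; the `events` table, its columns, and the indexed-column count are hypothetical. It restricts statistics collection to the leading columns that queries actually filter on, then Z-orders by those columns:

```python
# Minimal sketch; table name, column names, and the column count are hypothetical.
# Collect stats only on the first 4 columns so long strings later in the
# schema are not indexed, then compact files and cluster by the filter columns.
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '4')
""")
spark.sql("OPTIMIZE events ZORDER BY (event_date, user_id)")
```

On the small-file side of the tradeoff, the table properties `delta.autoOptimize.optimizeWrite` and `delta.autoOptimize.autoCompact` coalesce small files at write time, at the cost of slightly slower writes.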
Q: Is there a way to run Databricks on-prem? We have some workloads that are not allowed to go to the cloud due to data security requirements.
A: You can utilize our PVC offering (Private Virtual Cloud), where both the control plane and the data plane are deployed within your own environment. That is likely the best approach.
Q: Can you share more info on Photon and how it is being used? We would love to read about it.
A: Here is an in-depth paper on Photon ("Photon: A Fast Query Engine for Lakehouse Systems", SIGMOD 2022). You can get a more general overview here.
Q: How do you organize Databricks from AWS Marketplace when you have multiple VPCs, one for each environment? One AWS VPC for each workspace?
A: Network configuration is managed from the account console. Additionally, the subnets that you specify for a customer-managed VPC must be reserved for one Databricks workspace only; you cannot share these subnets with any other resources, including other Databricks workspaces. Ideally, you can have one workspace per VPC.
Q: I have lots of options for installing stuff on clusters: notebook-scoped libraries, cluster-scoped libraries, init scripts, and custom Docker images. In which case would you recommend each of them? Especially in the case where I have a project with many dependencies and I want to speed up cluster startup.
A: If there are lots of libraries you want to install, I suggest using an init script; managing them will be easy too. One hack: add a 10-20 second sleep in your script before the install command 🙂
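As a sketch of that pattern, the snippet below (the DBFS path, sleep duration, and package pins are all hypothetical) writes a cluster-scoped init script to DBFS from a notebook; you then attach it under the cluster's Advanced Options > Init Scripts:

```python
# Minimal sketch; the DBFS path and the pinned packages are hypothetical.
# Writes an init script that installs all project dependencies once at
# cluster start instead of per-notebook.
dbutils.fs.put(
    "dbfs:/init-scripts/install-deps.sh",
    """#!/bin/bash
sleep 15  # the hack mentioned above: give the node a moment before installing
/databricks/python/bin/pip install --quiet pandas==2.0.3 requests==2.31.0
""",
    True,  # overwrite if the script already exists
)
```

Every node in the cluster then starts with the same environment, which avoids reinstalling notebook-scoped libraries on every run.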
Q: Is there a way to disable the use of the default DBFS storage account by users? We have a cyber policy that does not allow us to use any storage accounts with public IPs (which the default storage account has, and we cannot change that). The issue is that when new users come onboard, that DBFS storage location is the default location when they create new tables or datasets.
A: You need DBFS, but you can ask users not to mount storage there. I am not sure if that fits your use case, but you could also implement a deny rule.
Q: What might be the best approach (responsive and also cost-efficient) to handle a low-volume messaging input (1,000 messages per day)? Assume long periods of no messages, but when they do come in, a sub-second response to receive and process is expected. I am assuming I would need an always-on cluster to handle this even though it would not be very busy, unless there is another way to handle near-real-time messaging with sporadic incoming messages?
A: You can use Auto Loader with the availableNow trigger and run the job at some interval, if that latency is acceptable. This will help you save costs as well.
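A minimal sketch of that setup; the source path, file format, checkpoint location, and target table are all hypothetical:

```python
# Minimal sketch; paths, the source format, and the target table are hypothetical.
# Auto Loader with the availableNow trigger processes every new file and then
# stops, so the job can run on a schedule instead of keeping a cluster up.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/messages")
    .load("/mnt/raw/messages")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/messages")
    .trigger(availableNow=True)  # drain all new files, then shut down
    .toTable("bronze_messages")
)
```

Note that a scheduled run like this trades latency for cost; a truly sub-second response would still require an always-on continuous stream.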