Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hi community, I created a job using a Databricks Asset Bundle, but I'm not sure how to install this dependency the right way, because when I was testing the job, it didn't seem to install the torch library properly.
I tried to do it manually and it works; through the Databricks Asset Bundle it doesn't. In the end I used: dependencies:
- torch==2.5.1
- --index-url https://download.pytorch.org/whl/cpu
It says: Error: file doesn't exi...
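For reference, here is a hedged sketch of how such a dependency list can sit in the bundle's job definition. The field names assume the current serverless environment schema in databricks.yml (job and task names are placeholders), and one possible cause of a "file doesn't exist" error is the bundle treating the bare "--index-url ..." entry as a local file path, so moving the index URL into a requirements file is a sometimes-suggested workaround:

```yaml
# Sketch only - validate against your CLI version with `databricks bundle validate`.
resources:
  jobs:
    my_torch_job:
      tasks:
        - task_key: train
          environment_key: default
      environments:
        - environment_key: default
          spec:
            client: "1"
            dependencies:
              - "torch==2.5.1"
              # If a custom index is needed, one option is to reference a
              # requirements file (placeholder path) that contains the
              # --index-url line instead of listing it here:
              # - "-r /Workspace/Users/me@example.com/requirements.txt"
```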
Hi Team, I am currently working on a project to read CSV files from an AWS S3 bucket using an Azure Databricks notebook. My ultimate goal is to set up Auto Loader in Azure Databricks to read new files from S3 and load the data incrementally. Howe...
Thank you, @Brahmareddy, for your response. I updated the code based on your suggestion, but I'm still encountering the same error message. I even made my S3 bucket public, but no luck. Interestingly, I was able to read a CSV file from the S3 bucket...
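For readers hitting the same setup, the Auto Loader options for a CSV stream can be sketched as a plain dict (the `cloudFiles.*` keys come from the Auto Loader docs; the bucket and schema paths are placeholders). The actual `spark.readStream` call is shown only as a comment, since it needs a cluster:

```python
# Sketch of typical Auto Loader options for reading CSV from S3.
# Paths and bucket names below are hypothetical placeholders.

def autoloader_csv_options(schema_location):
    """Options for an Auto Loader CSV stream (keys from the Auto Loader docs)."""
    return {
        "cloudFiles.format": "csv",                    # source file format
        "cloudFiles.schemaLocation": schema_location,  # where inferred schema is stored
        "header": "true",                              # CSV files have a header row
    }

opts = autoloader_csv_options("s3://my-bucket/_schemas/orders")

# On a cluster you would then do (sketch):
# df = (spark.readStream.format("cloudFiles")
#         .options(**opts)
#         .load("s3://my-bucket/landing/orders/"))
```

Note that reading S3 from Azure Databricks still requires valid AWS credentials on the cluster (e.g. access keys in the Spark config), which is a separate issue from the options above.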
When creating a Materialized View (MV) without a schedule, there seems to be a cost associated with the MV once it is created, even if it is not queried. The question is: once the MV is created, is there already a "hot" compute ready for use in case a...
When a Materialized View (MV) is created in Databricks without a refresh schedule, there is no “hot” compute automatically kept ready for ad-hoc refreshes. However, the MV incurs costs associated with storage (vendor cost) because it physically store...
Hello, the Spark UI Simulator has not been accessible for the past few days. I was able to refer to it last week, at https://www.databricks.training/spark-ui-simulator/index.html. I already have access to Partner Academy (if that is relevant). <Error...
Hello @guest0!
You can refer to this post, which addresses the same issue and outlines a potential workaround. If the issue persists, I recommend raising a ticket with the Databricks Support Team.
Hello Community, I suddenly have an error: when I deploy a new bundle to Databricks after changing the Python script, the cluster continues to point to an old version of the .py script uploaded by the Databricks Asset Bundle. Why is this?
We've added a solution for this problem in v0.245.0. There is an opt-in "dynamic_version: true" flag on the artifact that enables automated wheel patching to break the cache (Example). Once set, "bundle deploy" will transparently patch the version suffix in the ...
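The flag described above lives in the bundle's artifacts block; a minimal sketch (the artifact name, path, and build command are placeholders):

```yaml
# Requires Databricks CLI v0.245.0 or later.
artifacts:
  my_wheel:
    type: whl
    path: ./my_package            # directory containing pyproject.toml / setup.py
    build: python -m build --wheel
    dynamic_version: true         # opt-in: patches the wheel version suffix on deploy
```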
We are running the BladeBridge Analyzer and we are running out of memory. We tried increasing the RAM, but it still gives the same error. We cannot run the analyzer against a subset of the metadata, as that would not generate a comprehensive report with how th...
Hi, I'm trying to create a Terraform script that does the following:
- create a policy where I specify env variables and libraries
- create a cluster that inherits from that policy and uses the env variables specified in the policy
I saw in the docume...
You're correct in observing this discrepancy. When a cluster policy is defined and applied through the Databricks UI, fixed environment variables (`spark_env_vars`) specified in the policy automatically propagate to clusters created under that policy...
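As a hedged illustration of one workaround (resource and field names from the Databricks Terraform provider; the policy JSON mirrors the cluster-policy definition format, and the node type and Spark version are placeholders): since Terraform sends only the attributes you declare, the policy's fixed values can simply be repeated on the cluster resource.

```hcl
resource "databricks_cluster_policy" "with_env" {
  name = "env-policy"
  definition = jsonencode({
    "spark_env_vars.MY_ENV" = { "type" = "fixed", "value" = "prod" }
  })
}

resource "databricks_cluster" "example" {
  cluster_name  = "policy-cluster"
  policy_id     = databricks_cluster_policy.with_env.id
  spark_version = "15.4.x-scala2.12" # placeholder
  node_type_id  = "Standard_DS3_v2"  # placeholder
  num_workers   = 1

  # Repeat the policy's fixed values explicitly on the cluster.
  spark_env_vars = { MY_ENV = "prod" }
}
```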
PyTorch uses shared memory to efficiently share tensors between its dataloader workers and its main process. However, in a Docker container the default size of the shared memory (a tmpfs file system mounted at /dev/shm) is 64MB, which is too small to ...
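A minimal stdlib check for the limit described above (assuming a Linux container): `os.statvfs` reports the size of the tmpfs mounted at /dev/shm, so a 64 MiB result matches Docker's default.

```python
import os

def shm_size_bytes(path="/dev/shm"):
    """Total size in bytes of the filesystem at `path`, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks

size = shm_size_bytes()
if size is not None and size <= 64 * 1024 * 1024:
    print("Shared memory is at Docker's 64 MiB default: raise it with --shm-size "
          "(or set the DataLoader's num_workers=0 as a workaround).")
```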
Getting the below error while running a Python script which connects to an Azure SQL DB: Database connection error: ('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'ODBC Driver 17 for SQL Server' : file not found (0) (SQLDriverConnect)"). Can some on...
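A small stdlib diagnostic for this error: unixODBC resolves driver names through odbcinst.ini, so listing the sections of that file shows which names SQLDriverConnect can actually find (the ini path is the common unixODBC default; adjust if `odbcinst -j` reports another location).

```python
import configparser
import os

def installed_odbc_drivers(ini_path="/etc/odbcinst.ini"):
    """Return driver names registered in odbcinst.ini (empty list if missing)."""
    if not os.path.exists(ini_path):
        return []
    cfg = configparser.ConfigParser()
    cfg.read(ini_path)
    return cfg.sections()
```

If "ODBC Driver 17 for SQL Server" is not in the returned list, the driver itself (Microsoft's msodbcsql17 package) needs to be installed on the cluster, e.g. via an init script.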
Let's say we have an RDD like this: RDD(id: Int, measure: Int, date: LocalDate). Let's say we want to apply some function that compares 2 consecutive measures by date and outputs a number, and we want to get the sum of those numbers by id. The function is b...
Hi @valde, those two approaches give the same result, but they don't work the same way under the hood. Spark SQL uses optimized window functions that handle things like shuffling and memory more efficiently, often making it faster and lighter. On the o...
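The logic both approaches implement can be sketched as a plain-Python reference model (not Spark code): group rows by id, sort each group by date, apply the pairwise function to consecutive measures, and sum. Since the original comparison function is truncated in the post, a simple difference is assumed here as a stand-in.

```python
from collections import defaultdict
from datetime import date

def sum_consecutive(rows, f=lambda prev, cur: cur - prev):
    """rows: iterable of (id, measure, date) tuples.
    Returns {id: sum of f over consecutive measures ordered by date}.
    `f` is a placeholder for the (truncated) comparison function."""
    by_id = defaultdict(list)
    for rid, measure, d in rows:
        by_id[rid].append((d, measure))
    out = {}
    for rid, pairs in by_id.items():
        pairs.sort()  # order by date, like a window's ORDER BY
        measures = [m for _, m in pairs]
        out[rid] = sum(f(a, b) for a, b in zip(measures, measures[1:]))
    return out

rows = [
    (1, 10, date(2024, 1, 1)),
    (1, 13, date(2024, 1, 2)),
    (1, 20, date(2024, 1, 3)),
    (2, 5,  date(2024, 1, 1)),
    (2, 9,  date(2024, 1, 2)),
]
print(sum_consecutive(rows))  # {1: 10, 2: 4}
```

In Spark SQL the same shape is a `lag(measure) OVER (PARTITION BY id ORDER BY date)` followed by a grouped sum; the RDD version has to do the grouping and sorting by hand.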
Troubleshooting and Resolution for java.io.IOException: Invalid PKCS8 data
The error java.io.IOException: Invalid PKCS8 data typically occurs when there is an issue with the private key format or its storage in Databricks secrets. Based on the provid...
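One quick way to narrow down "Invalid PKCS8 data" is to inspect the PEM header of the stored secret: PKCS#1 keys start with "BEGIN RSA PRIVATE KEY" while unencrypted PKCS#8 keys start with "BEGIN PRIVATE KEY". A stdlib sketch of that check:

```python
def pem_key_format(pem_text):
    """Classify a PEM private key by its header line. Consumers expecting
    unencrypted PKCS#8 will reject the other two formats."""
    if "-----BEGIN PRIVATE KEY-----" in pem_text:
        return "pkcs8-unencrypted"
    if "-----BEGIN ENCRYPTED PRIVATE KEY-----" in pem_text:
        return "pkcs8-encrypted"
    if "-----BEGIN RSA PRIVATE KEY-----" in pem_text:
        return "pkcs1"
    return "unknown"
```

If the key turns out to be PKCS#1 or encrypted, `openssl pkcs8 -topk8 -nocrypt` can convert it to unencrypted PKCS#8 before storing it as a secret. Also check that the secret was stored without extra whitespace or lost newlines, a common cause of this error.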
Has anyone ever come across the error above? I am trying to get two tables from Unity Catalog and join them; the join is fairly complex, as it imitates a WHERE NOT EXISTS top-1 SQL query.
Hello @VZLA, recently I have been getting the exact same error. It has a "caused by" as below:
```
Caused by: kafkashaded.org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.
```
Stacktrace - ERROR: Some ...
Hi @eenaagrawal, there isn't a specific built-in integration in Databricks to directly interact with SharePoint. However, you can accomplish this by leveraging libraries like Office365-REST-Python-Client, which enable interaction with SharePoint's RE...
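Under the hood such libraries talk to SharePoint's REST endpoints; the URL shape for downloading a file can be sketched with the stdlib (the tenant and document path below are hypothetical, and authentication headers are omitted entirely):

```python
from urllib.parse import quote

def sharepoint_file_url(site_url, server_relative_path):
    """Build the SharePoint REST endpoint that returns a file's raw bytes.
    Auth (e.g. a bearer token) must be added to the actual request."""
    return (f"{site_url.rstrip('/')}/_api/web/"
            f"GetFileByServerRelativeUrl('{quote(server_relative_path)}')/$value")

url = sharepoint_file_url("https://contoso.sharepoint.com/sites/data",   # hypothetical tenant
                          "/sites/data/Shared Documents/report.csv")
```

Using Office365-REST-Python-Client instead of raw URLs spares you from handling the auth flow and response parsing yourself.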
Hi all! I need to migrate multiple notebooks from one workspace to another. Is there any way to do it without using Git? Since manual import and export is difficult to do for multiple notebooks and folders, I need an alternate solution. Please reply as so...
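Without Git, the Workspace API's export/import endpoints (or the CLI's `databricks workspace export-dir` / `import-dir` commands) can move notebooks between workspaces. A stdlib sketch of the export request URL (the host is a placeholder; endpoint path and format values come from the Workspace API reference):

```python
from urllib.parse import urlencode

def workspace_export_url(host, path, fmt="SOURCE"):
    """GET URL for /api/2.0/workspace/export. `fmt` can be SOURCE, DBC, HTML,
    or JUPYTER; an 'Authorization: Bearer <token>' header must be added to
    the actual request."""
    query = urlencode({"path": path, "format": fmt})
    return f"{host.rstrip('/')}/api/2.0/workspace/export?{query}"

url = workspace_export_url("https://adb-123.azuredatabricks.net",   # placeholder host
                           "/Users/someone@example.com/my_notebook", "DBC")
```

Exporting a folder as DBC and importing it into the target workspace preserves the folder structure, which avoids the per-notebook manual steps.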
Hello, I have written a Python script that uses the Databricks REST APIs. I am trying to clone/update an Azure DevOps repository inside Databricks using an Azure service principal. I am able to retrieve the credential_id for the service principal I am usin...
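For context, registering a Git credential goes through POST /api/2.0/git-credentials; the request body can be sketched as below. The field names and the "azureDevOpsServices" provider value come from the Git credentials API, but whether your workspace accepts an Azure AD access token in place of a PAT for a service principal is an assumption to verify:

```python
import json

def git_credential_payload(token, username="unused"):
    """JSON body for POST /api/2.0/git-credentials (Azure DevOps).
    `token` is a PAT, or possibly an AAD access token for a service
    principal (assumption - verify for your workspace)."""
    return json.dumps({
        "git_provider": "azureDevOpsServices",
        "git_username": username,
        "personal_access_token": token,
    })
```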
@nicole_lu_PM So sorry for coming back to this issue after such a long time. But I looked into it, and it seems this concept of an OBO token applies when we use Databricks with AWS as our cloud provider. In the case of Azure, most of the commen...