Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Hello, I have two workspaces, each pointing to a VPC in AWS. In one of the accounts we need to remove a subnet; after removing it, I get the InvalidSubnetID.NotFound AWS error when starting the cluster. I checked in Manage Account, and the network is poin...
Hi @thiagoawstest, could you please ensure the following:
- The specified subnet IDs exist in the correct VPC and AWS region.
- The subnet IDs are properly formatted as subnet-xxxxxxxxxxxxxxxxx.
- The subnets are not already in use by other resources.
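For the formatting check above, a quick sketch like the following can catch malformed IDs before you query AWS at all (the regex reflects the two common subnet ID lengths; verify existence separately, e.g. with `aws ec2 describe-subnets --subnet-ids <id>`):

```python
import re

# AWS subnet IDs are "subnet-" followed by 8 or 17 lowercase hex characters.
_SUBNET_ID_RE = re.compile(r"^subnet-(?:[0-9a-f]{8}|[0-9a-f]{17})$")

def is_valid_subnet_id(subnet_id: str) -> bool:
    """Return True if the string looks like a well-formed AWS subnet ID."""
    return bool(_SUBNET_ID_RE.match(subnet_id))

# Existence in the right VPC/region still has to be checked against AWS, e.g.:
#   aws ec2 describe-subnets --subnet-ids subnet-0abc1234def567890
```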
Hi, I just explored the serverless feature in Databricks and am wondering how I can track the cost associated with it. Is it stored in system tables? If yes, where can I find it? And also, how can I show that its cost is relatively low compared to classic ...
Hi @Avinash_Narala,
Databricks provides a system table called system.billing.usage (Public Preview) that allows you to monitor the cost of your serverless compute usage. This table includes user and workload attributes related to serverless compute c...
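As a starting point, a query against system.billing.usage might look like the sketch below. The column names (usage_date, sku_name, usage_quantity) and the SERVELESS-style SKU filter are assumptions based on the Public Preview schema; check them against `DESCRIBE system.billing.usage` in your workspace:

```python
def serverless_cost_query(start_date: str) -> str:
    """Build a SQL query over system.billing.usage for serverless usage.

    Column and SKU names are assumptions based on the Public Preview
    schema; adjust them to what DESCRIBE system.billing.usage shows.
    """
    return f"""
        SELECT usage_date,
               sku_name,
               SUM(usage_quantity) AS total_dbus
        FROM system.billing.usage
        WHERE usage_date >= DATE'{start_date}'
          AND sku_name LIKE '%SERVERLESS%'
        GROUP BY usage_date, sku_name
        ORDER BY usage_date
    """

# In a notebook: display(spark.sql(serverless_cost_query("2024-07-01")))
```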
Hi, I recently came across File Trigger in Databricks and it seems mostly similar to Autoloader. My first question is: why use a file trigger when we already have Autoloader? In which scenarios should I go with file triggers versus Autoloader? Can you please differentiate?
Hi @Avinash_Narala, The key differences between File Trigger and Autoloader in Databricks are:
Autoloader
Autoloader is a tool for ingesting files from storage and doing file discovery. It is designed for incremental data ingestion, processing new fil...
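For reference, a minimal Auto Loader read looks like the sketch below. The cloudFiles option keys are standard Auto Loader options; the paths are placeholders for your own storage locations, and the stream itself only runs on a Databricks cluster:

```python
def autoloader_options(source_format: str, schema_location: str) -> dict:
    """Options for an Auto Loader (cloudFiles) stream reader.

    The option keys are standard Auto Loader options; the paths passed in
    are placeholders for your own storage locations.
    """
    return {
        "cloudFiles.format": source_format,           # e.g. "json", "csv", "parquet"
        "cloudFiles.schemaLocation": schema_location, # where the inferred schema is tracked
    }

# In a notebook (sketch, runnable only on Databricks):
#   (spark.readStream.format("cloudFiles")
#        .options(**autoloader_options("json", "/mnt/schemas/events"))
#        .load("/mnt/raw/events"))
```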
I'm attempting to fetch an Oracle NetSuite table in parallel via JDBC using the NetSuite Connect JAR, which is already installed on the cluster and set up correctly. I can do this successfully with a single-threaded approach using the `dbtable` option: table = 'Tran...
@mtajmouati I appreciate your response. This approach resulted in a generic "bad SQL" error in Netsuite: "java.sql.SQLSyntaxErrorException: [NetSuite][SuiteAnalytics Connect JDBC Driver][OpenAccess SDK SQL Engine]Syntax Error in the SQL statement.[10...
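One common workaround when Spark's auto-generated partition SQL trips up a driver is to pass explicit `predicates` to `spark.read.jdbc`, so each partition runs a WHERE clause you wrote yourself. The sketch below builds date-range predicates; the column name is hypothetical, and the `DATE '...'` literal syntax must be checked against what the SuiteAnalytics Connect driver actually accepts:

```python
from datetime import date, timedelta

def date_range_predicates(column: str, start: date, end: date,
                          days_per_chunk: int) -> list[str]:
    """Split [start, end) into WHERE-clause chunks for parallel JDBC reads.

    Each predicate becomes one partition when passed to
    spark.read.jdbc(url, table, predicates=..., properties=...).
    Verify the date-literal syntax against the NetSuite driver's SQL reference.
    """
    predicates = []
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + timedelta(days=days_per_chunk), end)
        predicates.append(
            f"{column} >= DATE '{chunk_start}' AND {column} < DATE '{chunk_end}'"
        )
        chunk_start = chunk_end
    return predicates
```

Because each predicate is plain SQL you control, you can simplify it until the driver stops rejecting it, then let Spark run the chunks in parallel.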
I have a Java application, packaged as a JAR, that will be used as a JAR dbx job. This application needs to:
1. Read an Azure Storage file in YAML format.
2. Get a passphrase and private key stored in dbx, in order to access a Snowflake DB.
My questions are: 1. How to ...
Hi @ShenghaoWu,
To access an Azure Storage file in your Java code, you can use the Azure Storage SDK for Java. This can be done within your Java application packaged as a JAR file that will be used as a dbx job. Here is an example of how to read an ...
I have a Databricks workspace in GCP and I am using a cluster with Runtime 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12). I am trying to set the checkpoint directory location using the following command in a notebook: spark.sparkContext.set...
Hello, I am trying to access an API from a Databricks Python notebook; the API is available only within a restricted network. When I try to access that API, it's not able to find the URL and throws an HTTP error (max retries exceeded). d...
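"Max retries exceeded" from a restricted network often means the request has to go through a proxy the cluster can reach, or that the cluster has no route/DNS to the host at all (a networking question for your admins). If a proxy is the answer, a minimal sketch using only the standard library is below; the proxy host/port is hypothetical, substitute whatever your network team provides:

```python
import os
import urllib.request

# Hypothetical internal proxy -- replace with your organization's proxy address.
os.environ["HTTPS_PROXY"] = "http://proxy.internal.example.com:8080"
os.environ["HTTP_PROXY"] = "http://proxy.internal.example.com:8080"

# urllib (and libraries such as requests) pick these variables up automatically.
proxies = urllib.request.getproxies()
print(proxies.get("https"))
```

Whether this resolves the error depends entirely on your network setup; if no proxy exists, the cluster's VPC/VNet must be given a route and DNS resolution to the API host.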
Hi, I am trying to search a mnt point for any empty folders and remove them. Does anyone know of a way to do this? I have tried dbutils.fs.walk but this does not seem to work. Thanks
Hi @Nathant93,
To find and remove empty folders in a mount point using PySpark, you can follow these steps:
1. List all folders in the mount point. You can use the `dbutils.fs.ls()` function to list all the folders in the mount point:
folders = dbutil...
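A complete sketch of this approach is below. The listing function is injected so the recursion is easy to test; on Databricks you would wrap `dbutils.fs.ls` as shown in the docstring, and the removal call in the final comment is the standard `dbutils.fs.rm`:

```python
def find_empty_dirs(root: str, list_dir) -> list[str]:
    """Recursively collect directories containing no files anywhere below them.

    `list_dir` abstracts dbutils.fs.ls: it takes a path and returns a list of
    (path, is_dir) tuples. On Databricks you would wrap it as, e.g.:
        lambda p: [(f.path, f.isDir()) for f in dbutils.fs.ls(p)]
    """
    empty = []
    has_content = False
    for path, is_dir in list_dir(root):
        if is_dir:
            sub_empty = find_empty_dirs(path, list_dir)
            empty.extend(sub_empty)
            # A subdirectory counts as content only if it is not itself empty.
            if path not in sub_empty:
                has_content = True
        else:
            has_content = True
    if not has_content:
        empty.append(root)
    return empty

# On Databricks, remove each result with: dbutils.fs.rm(path, recurse=True)
```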
I saw this notebook: htmlwidgets-azure - Databricks (microsoft.com). However, it is not reproducible. I got a lot of errors: "there is no package called ‘R.utils’" (this is easy to fix, just install the package "R.utils"); "can not be unloaded". This is not ...
Hi yalei, did you have any luck fixing this issue? I am also trying to replicate the htmlwidgets notebook and am running into the same error. Unfortunately, the suggestions provided by Kaniz_Fatma below did not work.
Hello! I'm trying to do my modeling in DLT pipelines. For bronze, I created 3 streaming views. When I try to join them to create a silver table, I get an error that I can't join a stream with a stream without watermarks. I tried adding them but then I got no...
Hello @ksenija ,
Greetings!
Streaming uses watermarks to control the threshold for how long to continue processing updates for a given state entity. Common examples of state entities include:
- Aggregations over a time window.
- Unique keys in a join b...
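The mechanics can be illustrated with a small conceptual sketch (this is not the Spark API, just the semantics): the watermark is the maximum event time seen so far minus a delay, and events older than it are dropped, which is what lets the engine discard old join/aggregation state. In PySpark, the real mechanism is `df.withWatermark("event_time", "10 minutes")` applied to both streams before the join.

```python
from datetime import datetime, timedelta

def filter_late_events(events, delay: timedelta):
    """Conceptual sketch of watermark semantics (not the Spark API).

    events: iterable of (event_time, payload) pairs in arrival order.
    Events older than (max event time seen so far - delay) are dropped.
    """
    max_seen = datetime.min
    kept = []
    for ts, payload in events:
        max_seen = max(max_seen, ts)
        if ts >= max_seen - delay:
            kept.append((ts, payload))
    return kept
```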
As recently announced at the summit, notebooks, jobs, and workflows will run in serverless mode. How do we track/debug compute cluster metrics in this case, especially when there are performance issues while running jobs/workflows?
Databricks is planning to enable some system tables to capture some of these metrics, and in my view those can be leveraged as a starting point for troubleshooting.
I'm just walking through a simple exercise presented in the Databricks Platform Lab notebook, in which I'm executing a remote notebook from within it using the %run command. The remote notebook resides in the same directory as the Platform Lab notebook,...
The %run command is a specific Jupyter magic command.
The ipykernel used in Databricks examines the initial line of code to determine the appropriate compiler or language for execution.
To minimize the likelihood of encountering errors, it is advisab...
Hi there, it seems there are many different ways to store / manage data in Databricks. This is the Data asset in Databricks. However, data can also be stored (hyperlinks included to relevant pages):
- in a Lakehouse
- in Delta Lake
- on Azure Blob storage
- in the D...
Azure.gov does not have Unity Catalog (as of July 2024). I think previous responses missed the context of government cloud in OP's question. UC has been open sourced since this question was asked, and is a more comprehensive solution in commercial cl...
Hello, recently I tried to upgrade my runtime env to 13.3 LTS ML and found that it breaks my workload during applyInPandas. My job started to hang during the applyInPandas execution. A thread dump shows that it hangs on direct memory allocation: ...
The applyInPandas function may hang on Databricks Runtime 13.3 LTS ML and later versions owing to changes or inefficiencies in how the runtime handles parallel processing. Consider evaluating recent revisions or implementing alternative DataFrame ope...