Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Background - I created a SQL function named schema.function_name that returns a table. In a notebook the function works perfectly; however, I want to execute it via the API using a SQL endpoint. Through the API I get an insufficient privileges error, so gr...
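As a rough sketch of the grants that usually matter here (assuming the function lives in Unity Catalog; catalog, schema, function, and principal names below are placeholders), the principal calling the SQL endpoint typically needs:

```python
# Hypothetical names throughout; run from a notebook or the SQL editor by an owner/admin.
# The caller needs USE CATALOG and USE SCHEMA on the parents, plus EXECUTE on the function.
spark.sql("GRANT USE CATALOG ON CATALOG my_catalog TO `api_user@example.com`")
spark.sql("GRANT USE SCHEMA ON SCHEMA my_catalog.my_schema TO `api_user@example.com`")
spark.sql("GRANT EXECUTE ON FUNCTION my_catalog.my_schema.function_name TO `api_user@example.com`")
```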
Hi, I have a table with the Variant type (preview) that works well on 15.3. When I try to run code that references this Variant type in a DLT pipeline I get: com.databricks.sql.transaction.tahoe.DeltaUnsupportedTableFeatureException: [DELTA_UNSUPPORTED_F...
I can confirm that adding some additional table properties to the @dlt.table decorator in the DLT pipeline definition resolved the earlier issues. Thanks for pointing this out.
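For reference, a minimal sketch of what that can look like, assuming the Variant preview is gated behind a Delta table feature and that the feature name below matches the one named in the error message (table and source names are placeholders):

```python
import dlt

@dlt.table(
    name="variant_target",  # hypothetical table name
    table_properties={
        # Assumed feature name for the Variant preview; check the exact name in the
        # DeltaUnsupportedTableFeatureException message on your runtime.
        "delta.feature.variantType-preview": "supported"
    },
)
def variant_target():
    # Hypothetical source table that contains a Variant column.
    return spark.read.table("source_with_variant")
```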
I tried running this code:
```
import pickle

def save_file(name, obj):
    with open(name, 'wb') as f:
        pickle.dump(obj, f)
```
One file was saved in the local file system, but the second was too large, so I need to save it in the DBFS file system. Unfortunately, I d...
To save a Python object to the Databricks File System (DBFS), you can use the dbutils.fs module to write files to DBFS. Since you are dealing with a Python object and not a DataFrame, you can use the pickle module to serialize the object and then wri...
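A minimal sketch of that pattern, assuming a Databricks notebook (where dbutils is available), a picklable object, and placeholder paths:

```python
import pickle

def save_file(name, obj):
    # Serialize to the driver's local disk first.
    with open(name, "wb") as f:
        pickle.dump(obj, f)

# Hypothetical paths: write locally, then copy the file into DBFS with dbutils.
local_path = "/tmp/model.pkl"
dbfs_path = "dbfs:/tmp/model.pkl"

save_file(local_path, {"example": 123})
dbutils.fs.cp(f"file:{local_path}", dbfs_path)
```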
I'm running a Databricks job involving multiple tasks and would like to run the job with different sets of task parameters. I can achieve that by editing each task and changing the parameter values. However, it gets very manual when I have a lot of tas...
Dear Team, for now I found a workaround: disconnect the bundle source on Databricks and edit the parameters you want to run with. After execution, redeploy your code from the repository again.
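If editing and redeploying gets tedious, an alternative sketch (assuming the databricks-sdk package and tasks that take notebook parameters; the job ID and parameter names are placeholders) is to trigger runs with parameter overrides:

```python
from databricks.sdk import WorkspaceClient

# Auth is picked up from the environment / config profile.
w = WorkspaceClient()

# Hypothetical job ID and parameters; each run can pass a different set of values
# without touching the deployed job definition.
run = w.jobs.run_now(
    job_id=123456789,
    notebook_params={"run_date": "2024-06-01", "env": "dev"},
).result()  # blocks until the run finishes

print(run.state)
```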
I've created a completely fresh project with a completely empty workspace. Locally I have the Databricks CLI version 0.230.0 installed. I run `databricks bundle init default-python`. I have auth set up with a PAT generated by an account which has workspace ad...
Ok, I feel silly. Despite reading the other messages in this thread, I didn't twig to the fact that I had added the subfolder I created the DAB in to my top-level project .gitignore, since I was just playing around and didn't want to comm...
I am trying to read a .xlsx file using ps.read_excel() that has #N/A as a value in string-type columns, but in the DataFrame I am getting null in place of #N/A. Is there any option with which we can read #N/A as a string from the .xlsx file?
Did you get a solution or workaround for this? I am facing the same issue even after using dtype=str, na_filter=False, keep_default_na=False.
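One possible workaround, sketched under the assumption that reading with plain pandas (which needs openpyxl for .xlsx) as an intermediate step is acceptable, and that the path is a placeholder:

```python
import pandas as pd
import pyspark.pandas as ps

# Read with plain pandas, where keep_default_na=False keeps "#N/A" (and similar
# markers) as literal text instead of converting them to NaN.
pdf = pd.read_excel(
    "/dbfs/FileStore/input.xlsx",  # hypothetical path
    dtype=str,
    keep_default_na=False,
    na_values=[],                  # no extra NA markers
)

# Convert to a pandas-on-Spark frame for the rest of the pipeline.
psdf = ps.from_pandas(pdf)
```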
I know that UC-enabled shared access mode clusters do not allow init script usage, and I have tried multiple workarounds to use the required init script on the cluster (pyodbc-install.sh, in my case), including installing the pyodbc package as a workspa...
Hello all, the workaround below worked for me:
1) pyodbc-install.sh is uploaded to a Volume
2) the shared cluster is able to navigate to the Volume to select the init script
3) the Databricks runtime is 15.4 LTS
4) the Allowlist has been updated to allo...
Hey Databricks! I'm trying to use the pyodbc init script in a Volume in UC on a shared compute cluster but receive the error: "[01000] [unixODBC][Driver Manager]Can't open lib 'ODBC Driver 17 for SQL Server' : file not found (0) (SQLDriverConnect)". I fo...
I am working in the AWS Glue service, where we are trying to migrate data from S3 to Salesforce using the Salesforce write target tool (using a Salesforce connection). The expected process is that once the job is done, Salesforce provides the jobId...
Hi community, I am using a PySpark UDF. The function is being imported from a repo (in the Repos section) and registered as a UDF in the notebook. I am getting a PythonException error when the transformation is run. This is coming from the databric...
I faced this issue when I was running data ingestion on a Unity Catalog table where the cluster access mode was shared. I changed it to `Single user` and re-ran it; now it is working.
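For context, a sketch of the setup being described; the repo path, module, function, and table names are hypothetical:

```python
import sys
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Make the repo importable on the driver (placeholder path).
sys.path.append("/Workspace/Repos/<user>/<repo>/src")

from my_package.transforms import normalize_value  # hypothetical helper function

# Register the imported function as a UDF and apply it in a transformation.
normalize_udf = udf(normalize_value, StringType())

df = spark.table("my_catalog.my_schema.raw_events")  # hypothetical table
df = df.withColumn("normalized", normalize_udf("value"))
```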
Hi there, my company is reasonably new to using Databricks, and we're running our first PoCs. Some of the data we have is structured or reasonably structured, so it drops into a bucket, we point a notebook at it, and all is well in Delta. The problem is ari...
Hi Toby, managing diverse, unstructured data can be challenging. At Know2Ledge (ShareArchiver), we specialize in unstructured data management to streamline this process. To handle your scenario efficiently: 1️⃣ Pre-Process Before Ingestion – Use AI-power...
To the Databricks Team (or whoever is pretending to care), let me get this straight. You offer a "Community Edition" to supposedly help people learn, right? Well, congratulations, you've created the most frustrating, useless signup process I've ever s...
Hello @sachin_kanchan!
I understand the frustration, and I appreciate you sharing your experience. The Community Edition is meant to provide a smooth experience, and this shouldn’t be happening.
We usually ask users to drop an email to help@databrick...
I have my own Autoloader repo, which is responsible for ingesting data from the landing layer (ADLS) and loading it into the raw layer in Databricks. In that repo I created a couple of workflows and run them on serverless compute, and I u...
The recommended approach for accessing cloud storage is to create Databricks storage credentials. These storage credentials can refer to Entra service principals, managed identities, etc. After a credential is created, create an external location. Wh...
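A sketch of the SQL side of that, assuming the storage credential already exists and that the location name, URL, and principal below are placeholders:

```python
# Hypothetical names throughout; the storage credential itself (backed by a managed
# identity or Entra service principal) is created beforehand, e.g. in Catalog Explorer.
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS landing_adls
  URL 'abfss://landing@mystorageaccount.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL my_managed_identity_cred)
""")

# Grant the principal that runs the serverless workflow access to the location.
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION landing_adls TO `etl_service_principal`")
```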
So, I am doing 4 spatial join operations on files with the following sizes: Base_road_file, which is 1 GB; Telematics file, which is 1.2 GB; state boundary file, BH road file, client_geofence file and kpmg_geofence_file, which are not too large. My...
We recommend using spatial frameworks such as Databricks Mosaic or Apache Sedona to speed up operations like spatial joins and point-in-polygon lookups. Without these frameworks, many of these operations result in unoptimized, explosive cross joins.
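As an illustration, a sketch of a point-in-polygon join with Apache Sedona, assuming the Sedona packages are installed on the cluster and that the table and column names below are placeholders:

```python
from sedona.spark import SedonaContext

# Register Sedona's SQL functions (ST_Point, ST_Contains, ...) on the session.
sedona = SedonaContext.create(spark)

# Hypothetical tables: telematics points (lon/lat columns) joined against
# geofence polygons stored as WKT. Sedona can optimize this into a spatial join
# instead of a brute-force cross join.
joined = sedona.sql("""
    SELECT p.*, z.zone_id
    FROM telematics p
    JOIN client_geofences z
      ON ST_Contains(ST_GeomFromWKT(z.geometry_wkt), ST_Point(p.lon, p.lat))
""")

joined.write.mode("overwrite").saveAsTable("telematics_with_zones")
```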
If I'm running a scheduled batch Autoloader query that reads from CSV files on S3 and incrementally loads a Delta table, how can I determine whether new rows were added? I'm currently trying to do this from the streaming query's lastProgress as follows. s...
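The original snippet is truncated above; as a rough sketch of that pattern (paths, table names, and options are placeholders), a triggered Autoloader run can be checked through the query's progress objects:

```python
# Hypothetical Autoloader batch run using an availableNow trigger.
query = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
    .load("s3://my-bucket/landing/events/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
    .trigger(availableNow=True)
    .toTable("raw.events")
)
query.awaitTermination()

# lastProgress only reflects the final micro-batch; summing recentProgress
# covers every micro-batch processed in this run.
total_rows = sum(p["numInputRows"] for p in query.recentProgress)
print(f"new rows ingested this run: {total_rows}")
```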