Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
After trying to run spark_udf = mlflow.pyfunc.spark_udf(spark, model_uri=logged_model, env_manager="virtualenv") we get the following error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 145.0 failed 4 times, most re...
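For reference, a minimal sketch of how this UDF is typically built and applied, assuming a hypothetical model URI and an input DataFrame df whose columns match the model signature:

    import mlflow.pyfunc

    # Hypothetical model URI; in practice this comes from the MLflow run that logged the model
    logged_model = "runs:/<run_id>/model"

    # env_manager="virtualenv" rebuilds the model's Python environment on the executors,
    # which is where environment-resolution failures often surface as task errors
    spark_udf = mlflow.pyfunc.spark_udf(spark, model_uri=logged_model, env_manager="virtualenv")

    # Apply the UDF column-wise; the columns passed must match the model's expected inputs
    predictions = df.withColumn("prediction", spark_udf(*df.columns))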
Hi, thanks in advance. I am new to DLT. The scenario is that I need to read data from cloud storage (ADLS) and load it into my bronze table, then read it from the bronze table, do some DQ checks, and load the cleaned data into my silver table. Finally, populat...
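A minimal sketch of that bronze-to-silver flow in a DLT pipeline, assuming a hypothetical ADLS path, file format, and expectation rule:

    import dlt

    @dlt.table(comment="Raw data ingested from ADLS")
    def bronze():
        # Auto Loader incrementally picks up new files from cloud storage
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "json")  # assumed format
                .load("abfss://container@account.dfs.core.windows.net/raw/"))  # hypothetical path

    @dlt.table(comment="Cleaned data with DQ checks applied")
    @dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # hypothetical DQ rule
    def silver():
        # Rows failing the expectation are dropped before landing in silver
        return dlt.read_stream("bronze")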
Hi, we are trying to build upsert logic for a Delta table; for that we are writing a MERGE command between a streaming DataFrame and the Delta table DataFrame. Please find the code below: merge_sql = f"""Merge command comes here""" spark.sql(merg...
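A MERGE cannot be run directly against a streaming DataFrame; the usual pattern is to wrap it in foreachBatch so each micro-batch is merged as a static DataFrame. A minimal sketch, with table names, join key, and checkpoint path all hypothetical:

    from delta.tables import DeltaTable

    def upsert_to_delta(micro_batch_df, batch_id):
        # Each micro-batch arrives as a static DataFrame, so MERGE is allowed here
        target = DeltaTable.forName(spark, "target_table")  # hypothetical target
        (target.alias("t")
         .merge(micro_batch_df.alias("s"), "t.id = s.id")  # hypothetical key
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())

    (streaming_df.writeStream
     .foreachBatch(upsert_to_delta)
     .option("checkpointLocation", "/tmp/checkpoints/upsert")  # hypothetical path
     .start())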
I am using the Databricks VS Code extension to sync my local repository to Databricks workspaces. I have everything configured such that smaller syncs work fine, but a full sync of my repository leads to the following error: Sync Error: Post "https://<...
Hi Team, we have a requirement to keep the metadata (Unity Catalog) in one AWS account and the data storage (Delta tables under data) in another account. Is it possible to do that? Would we face any technical/security issues?
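Cross-account setups like this are generally handled with a storage credential (an IAM role in the storage account that the metastore account can assume) plus an external location pointing at the remote bucket. A sketch of the external-location step, assuming a credential named cross_account_cred has already been created and all names are hypothetical:

    # The storage credential wraps an IAM role in the other AWS account;
    # the external location maps a bucket path in that account to the credential
    spark.sql("""
      CREATE EXTERNAL LOCATION IF NOT EXISTS cross_account_data
      URL 's3://other-account-bucket/delta/'
      WITH (STORAGE CREDENTIAL cross_account_cred)
    """)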
I'm trying to figure out the cost breakdown of Databricks usage for my team. When I go into the Databricks administration console and click Usage, selecting to show the usage by SKU just displays the type of cluster but not its name. ...
Please check the docs below for usage-related information.
The Billable Usage Logs:
https://docs.databricks.com/en/administration-guide/account-settings/usage.html
You can filter them using tags to get the more precise information you are looking for...
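If system tables are enabled in the workspace, the same billing data can also be queried directly with the compute tags attached, which makes a per-cluster breakdown easier. A sketch, assuming access to the system.billing.usage table:

    # custom_tags carries the tags applied to the compute that generated the usage,
    # so tagging clusters by team or name lets you group costs beyond the SKU level
    usage = spark.sql("""
      SELECT usage_date, sku_name, custom_tags, SUM(usage_quantity) AS dbus
      FROM system.billing.usage
      GROUP BY usage_date, sku_name, custom_tags
    """)
    usage.display()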
Dear Community, I am testing PySpark code via pytest using VS Code and Databricks Connect. The SparkSession is initiated from Databricks Connect: from databricks.connect import DatabricksSession; spark = DatabricksSession.builder.getOrCreate(). I am receiving...
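For context, a common way to wire this up is a session-scoped pytest fixture so all tests share one Databricks Connect session; a minimal sketch:

    import pytest
    from databricks.connect import DatabricksSession

    @pytest.fixture(scope="session")
    def spark():
        # One remote session for the whole test run; authentication comes from
        # the Databricks config profile or environment variables
        return DatabricksSession.builder.getOrCreate()

    def test_row_count(spark):
        assert spark.range(10).count() == 10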
Hi, we have a Spark job that writes data to a Delta table for the last 90 date partitions. We have enabled spark.databricks.delta.autoCompact.enabled and delta.autoOptimize.optimizeWrite. The job takes 50 mins to complete; of that, the logic takes 12 mins and optimizewri...
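For reference, the two settings in question can be enabled at the session level or pinned on the table itself; both trade extra write time for fewer small files, which matches the overhead described above. A sketch with a hypothetical table name:

    # Session-level: affects writes from this cluster/session only
    spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

    # Table-level: applies to every writer of the table
    spark.sql("""
      ALTER TABLE events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
      )
    """)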
Is there any way to accomplish this? I have an existing Delta table and a separate Delta Live Tables pipeline, and I would like to merge data from the DLT into my existing Delta table. Is this doable or completely impossible?
Merging data from a Delta Live Table (DLT) into an existing Delta table is possible with careful planning. Have the pipeline publish its output as a table, then move the data into the existing Delta table with a downstream batch job (read, transform, MERGE), making sure the schemas are compatible.
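Concretely, once the DLT pipeline publishes its output table to the catalog, a downstream job can MERGE from it into the pre-existing Delta table; a sketch with hypothetical table names and key:

    from delta.tables import DeltaTable

    # Table published by the DLT pipeline (hypothetical name)
    source = spark.read.table("main.pipeline.dlt_output")

    # Pre-existing Delta table maintained outside the pipeline (hypothetical name)
    target = DeltaTable.forName(spark, "main.default.existing_table")

    (target.alias("t")
     .merge(source.alias("s"), "t.id = s.id")  # hypothetical join key
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())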
Hello, I am having issues saving a Spark DataFrame generated in a Databricks notebook to an S3 bucket. The DataFrame contains approximately 1.1M rows and 5 columns. The error is as follows: org.apache.spark.SparkException: Job aborted due to stage fa...
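For reference, a minimal sketch of the write under discussion; for a DataFrame of this size, evening out partition sizes before the write is a common first step when tasks fail (bucket path, format, and partition count are all assumptions):

    # Repartitioning evens out task sizes, which can reduce per-task memory pressure
    (df.repartition(8)
       .write.format("parquet")
       .mode("overwrite")
       .save("s3://my-bucket/output/"))  # hypothetical bucket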
Is there a way to use Compute Policies to force Delta Live Tables to use specific Databricks Runtime and PySpark versions? While trying to leverage some of the functions in PySpark 3.5.0, I don't seem to be able to get Delta Live Tables to use Databr...
Hi, I am trying to access an Excel file stored in Azure Blob Storage via Databricks. In my understanding, it is not possible to access it using PySpark, so accessing it through pandas is the option. Here is my code: %pip install openpyxl; import pandas as p...
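A sketch of the pandas-based approach, assuming a hypothetical storage account, container, and SAS token; pd.read_excel can read directly from an authenticated HTTPS URL once openpyxl is installed:

    # In a separate notebook cell first: %pip install openpyxl
    import pandas as pd

    # Hypothetical account/container/token
    url = ("https://<account>.blob.core.windows.net/"
           "<container>/report.xlsx?<sas_token>")

    pdf = pd.read_excel(url, engine="openpyxl")

    # Convert to a Spark DataFrame for downstream processing
    df = spark.createDataFrame(pdf)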
The BLOCK_OFFSET_INSIDE_BLOCK and ROW_OFFSET_INSIDE_BLOCK commands are not working in Spark; they run fine in Hive, but when run in Spark the query fails with an invalid-column error.
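Those Hive virtual columns are not exposed by Spark. Where only file-level provenance is needed, Spark's own functions are the usual substitute (a different mechanism, not a drop-in replacement); a sketch with a hypothetical path:

    from pyspark.sql.functions import input_file_name

    # input_file_name() records which file each row came from; Spark has no
    # equivalent of Hive's per-block/per-row offset virtual columns
    df = (spark.read.format("parquet")
          .load("/path/to/data")  # hypothetical path
          .withColumn("source_file", input_file_name()))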
I created a function using a jar file that is present in the cluster location, but when executing the Hive query it shows the error "no handler for udf/udaf/udtf". These queries run fine on HDInsight clusters, but when running in Databricks...
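On Databricks the Hive UDF usually has to be registered explicitly against the jar before Spark SQL can resolve a handler for it; a sketch, with the function name, class, and jar path all hypothetical:

    # Registers a Hive UDF class from a jar so Spark SQL can resolve it
    spark.sql("""
      CREATE OR REPLACE FUNCTION my_udf
      AS 'com.example.hive.MyUDF'
      USING JAR 'dbfs:/FileStore/jars/my_udfs.jar'
    """)

    spark.sql("SELECT my_udf(col) FROM some_table")  # hypothetical usage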
Hello, is it possible to just update parameter values in different workspaces? YAML source code taken from workflow jobs always creates a new job. I'd like to just change/update parameter values when I deploy the bundle to different workspaces/environments...
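With Databricks Asset Bundles, this is typically done with bundle variables: define a default once and override only the values per target, so each deploy updates the bundle-managed job rather than hand-editing its YAML. A sketch of a databricks.yml fragment, with all names and values hypothetical:

    variables:
      batch_size:
        description: Job parameter that differs per environment
        default: "100"

    targets:
      dev:
        variables:
          batch_size: "10"
      prod:
        variables:
          batch_size: "1000"

    # Reference it in the job definition as ${var.batch_size}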