Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

Sangeethagk
by New Contributor
  • 1206 Views
  • 1 replies
  • 0 kudos

TypeError: ColSpec.__init__() got an unexpected keyword argument 'required'

Hi Team, one of my customers is facing the issue below. Has anyone faced this issue before? Any help would be appreciated.
import mlflow
mlflow.set_registry_uri("databricks-uc")
catalog_name = "system"
embed = mlflow.pyfunc.spark_udf(spark, f"models:/system...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Sangeethagk, It looks like you’re encountering a couple of issues related to mlflow.pyfunc.spark_udf() and model dependencies. TypeError: ColSpec.__init__() got an unexpected keyword argument ‘required’: This error occurs when you’re using mlflo...

  • 0 kudos
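A minimal sketch of the version-alignment fix the reply points toward, assuming the root cause is that ColSpec's 'required' field only exists in newer MLflow releases, so the cluster's MLflow must be at least as new as the version that logged the model. The model URI below is a placeholder mirroring the truncated snippet in the post, and spark is the ambient Databricks session.

# If needed, upgrade MLflow in the notebook first, e.g.:
#   %pip install --upgrade mlflow
#   dbutils.library.restartPython()
import mlflow

mlflow.set_registry_uri("databricks-uc")

catalog_name = "system"
# Placeholder model URI; the schema, model name, and version are assumptions.
embed = mlflow.pyfunc.spark_udf(
    spark,
    f"models:/{catalog_name}.ai.bge_large_en_v1_5/1",
)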
ttamas
by New Contributor III
  • 710 Views
  • 2 replies
  • 1 kudos

Get the triggering task's name

Hi, I have tasks that depend on each other. I would like to get variables from task1, which triggers task2. This is how I solved my problem: following the suggestion in https://community.databricks.com/t5/data-engineering/how-to-pass-parameters-to-a-quot-...

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @ttamas, Thank you for sharing your approach! It’s true that handling task dependencies and passing values between tasks in Databricks can sometimes be complex. Databricks now supports dynamic value references in notebooks. Instead of using dbutil...

  • 1 kudos
1 More Replies
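A minimal sketch of the two mechanisms discussed in this thread, task values and dynamic value references; the task names (task1/task2) and the key name are assumptions.

# In task1's notebook: publish a value for downstream tasks.
dbutils.jobs.taskValues.set(key="run_date", value="2024-06-30")

# In task2's notebook: read it back explicitly...
run_date = dbutils.jobs.taskValues.get(taskKey="task1", key="run_date", default="")

# ...or avoid dbutils in task2 altogether by passing it as a task parameter
# with a dynamic value reference in the job configuration:
#   {{tasks.task1.values.run_date}}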
Kjetil
by New Contributor III
  • 930 Views
  • 3 replies
  • 2 kudos

Resolved! Autoloader to concatenate CSV files that update regularly into a single parquet dataframe.

I have multiple large CSV files. One or more of these files changes now and then (a few times a day). The changes in the CSV files are both updates and appends (so both new rows and updates of old ones). I want to combine all CSV files into a datafr...

Latest Reply
jose_gonzalez
Moderator
  • 2 kudos

Hi @Kjetil, Please let us know if you still have an issue or if @-werners-' response could be marked as the best solution. Thank you

  • 2 kudos
2 More Replies
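A minimal Auto Loader sketch for the pattern discussed in this thread; the source path, schema/checkpoint locations, and target table are placeholders. Note that by default Auto Loader tracks files by name and does not reprocess a file that is updated in place, which is why options such as cloudFiles.allowOverwrites (or copying changed files to a landing folder) come up in the replies.

(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/tmp/checkpoints/csv_schema")
    .option("cloudFiles.allowOverwrites", "true")  # assumption: reprocess updated files
    .option("header", "true")
    .load("abfss://landing@myaccount.dfs.core.windows.net/csv/")   # placeholder path
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/csv_ingest")
    .trigger(availableNow=True)
    .toTable("main.bronze.combined_csv")               # placeholder target table
)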
KSI
by New Contributor II
  • 745 Views
  • 1 replies
  • 0 kudos

Variant datatype

I'm checking on the variant datatype and noted that whenever a JSON string is stored as a variant datatype, it needs to be cast in order to filter and extract values, i.e.
SELECT sum(jsondatavar:Value::double)
FROM table
WHERE jsondatavar:customer::int = 1000
Here j...

Latest Reply
Mounika_Tarigop
New Contributor III
  • 0 kudos

Could you please try using SQL functions:
SELECT SUM(CAST(get_json_object(jsondatavar, '$.Value') AS DOUBLE)) AS total_value
FROM table
WHERE CAST(get_json_object(jsondatavar, '$.customer') AS INT) = 1000

  • 0 kudos
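A short sketch contrasting the two styles in this thread: the path/cast syntax from the question versus the get_json_object approach in the reply. The table and column names are placeholders; the path syntax assumes a VARIANT (or JSON string) column on a recent runtime, while get_json_object operates on a JSON string column.

# Path extraction with explicit casts, as in the post:
spark.sql("""
    SELECT SUM(jsondatavar:Value::DOUBLE) AS total_value
    FROM my_table
    WHERE jsondatavar:customer::INT = 1000
""").show()

# Equivalent using get_json_object on a JSON string column, as in the reply:
spark.sql("""
    SELECT SUM(CAST(get_json_object(jsondatavar, '$.Value') AS DOUBLE)) AS total_value
    FROM my_table
    WHERE CAST(get_json_object(jsondatavar, '$.customer') AS INT) = 1000
""").show()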
Sweta
by New Contributor II
  • 332 Views
  • 1 replies
  • 2 kudos

Optimized option to write updates to Aurora PostgresDB from Databricks/spark

Hello All, We want to update our Postgres tables from our Spark Structured Streaming workflow on Databricks. We are using the foreachBatch utility to write to this sink. I want to understand an optimized way to do this at near-real-time latency, avoidi...

Latest Reply
Kaniz_Fatma
Community Manager
  • 2 kudos

Hi @Sweta, Your question about optimizing updates to PostgreSQL tables from a Spark structured streaming workflow is quite relevant, and I’m glad you’re exploring different approaches. Running updates directly on the main table in PostgreSQL using ...

  • 2 kudos
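A hedged foreachBatch sketch of the staging-table pattern for near-real-time updates to Postgres; the source table, JDBC URL, secret scope, and staging table are all placeholders.

def upsert_to_postgres(batch_df, batch_id):
    # Append each micro-batch to a staging table over JDBC; a follow-up
    # MERGE/UPSERT from staging into the main table keeps locks on it short.
    (
        batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://aurora-host:5432/mydb")
        .option("dbtable", "staging.events")
        .option("user", dbutils.secrets.get("my_scope", "pg_user"))
        .option("password", dbutils.secrets.get("my_scope", "pg_password"))
        .mode("append")
        .save()
    )

stream_df = spark.readStream.table("main.silver.events")  # placeholder streaming source

(
    stream_df.writeStream
    .foreachBatch(upsert_to_postgres)
    .option("checkpointLocation", "/tmp/checkpoints/pg_sink")
    .start()
)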
Jiri_Koutny
by New Contributor III
  • 4857 Views
  • 11 replies
  • 3 kudos

Delay in files update on filesystem

Hi, I noticed that there is quite a significant delay (2 - 10s) between making a change to some file in Repos via Databricks file edit window and propagation of such change to the filesystem. Our engineers and scientists use YAML config files. If the...

Latest Reply
Irka
New Contributor II
  • 3 kudos

Is there a solution to this? BTW, the "ls" command trick didn't work for me.

  • 3 kudos
10 More Replies
Prajwal_082
by New Contributor II
  • 643 Views
  • 3 replies
  • 0 kudos

Overwriting a delta table using DLT

Hello, We are trying to ingest a bunch of CSV files that we receive on a daily basis using DLT. We chose a streaming table for this purpose, but since a streaming table is append-only, records keep adding up daily, which will cause multiple rows in downst...

Latest Reply
giuseppegrieco
New Contributor III
  • 0 kudos

In your scenario, if the data loaded on day 2 also includes all the data from day 1, you can still apply a "remove duplicates" logic. For instance, you could compute a hashdiff by hashing all the columns and use this to exclude rows you've already se...

  • 0 kudos
2 More Replies
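A hedged sketch of the hashdiff idea from the reply above, written as a DLT streaming table; the source path and table name are placeholders. Note that dropDuplicates on a stream keeps state for every hash it has seen, which suits this daily-file volume but not unbounded streams.

import dlt
from pyspark.sql import functions as F

@dlt.table(name="daily_csv_bronze")
def daily_csv_bronze():
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("/Volumes/main/landing/daily_csv/")   # placeholder path
    )
    # Fingerprint every row by hashing all columns, then drop rows whose
    # fingerprint has already been seen in an earlier file.
    return (
        df.withColumn("hashdiff", F.sha2(F.concat_ws("||", *df.columns), 256))
          .dropDuplicates(["hashdiff"])
    )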
Kjetil
by New Contributor III
  • 812 Views
  • 1 replies
  • 1 kudos

Resolved! Read and process large CSV files that update regularly

I've got a lot of large CSV files (> 1 GB) that update regularly (stored in Data Lake Gen 2). The task is to concatenate these files into a single dataframe that is written to parquet format. However, since these files update very often, I get a rea...

Latest Reply
daniel_sahal
Esteemed Contributor
  • 1 kudos

@Kjetil Since they are getting updated often, IMO making a copy would make sense. What you could try is to create a Microsoft.Storage.BlobCreated event to replicate the .CSV into the secondary bucket. However, best practice would be to have some kind...

  • 1 kudos
virementz
by New Contributor II
  • 958 Views
  • 4 replies
  • 0 kudos

Cluster Failed to Start - Cluster-scoped init script failed: Script exit status is non-zero

I have been using a cluster-scoped init script for around 1 year already and everything has been working fine. But suddenly, the Databricks cluster has failed to restart since last Thursday (13th June 2024). It returns this error: "Failed to add 2 container...

Latest Reply
Wojciech_BUK
Valued Contributor III
  • 0 kudos

Just maybe there is no outbound connection on DEV from the cluster VNET to the URL you are trying to reach? You can spin up an all-purpose cluster and try testing the connection with the %sh magic command.

  • 0 kudos
3 More Replies
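A quick outbound-connectivity check along the lines Wojciech_BUK suggests, run from an all-purpose cluster; the URL is a placeholder for whatever endpoint the init script downloads from.

import requests

url = "https://repos.example.com/some/package.deb"   # placeholder endpoint
try:
    resp = requests.head(url, timeout=10, allow_redirects=True)
    print(f"Reached {url}: HTTP {resp.status_code}")
except requests.exceptions.RequestException as exc:
    print(f"No outbound connectivity to {url}: {exc}")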
Phani1
by Valued Contributor II
  • 270 Views
  • 1 replies
  • 0 kudos

Databricks Cross platform data access

Hi Team, We have a requirement: data is stored on the S3 platform, while our Databricks is hosted on Azure. Our objective is to access the data from the S3 location. Could you kindly provide us with the most suitable approach for this scenario? e.g. exte...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Phani1, To access data from an S3 location in Azure Databricks, you have a few options: Using AWS Keys and Secret Scopes: Configure Spark properties to set your AWS keys stored in secret scopes as environment variables. Create a secret scope t...

  • 0 kudos
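A hedged sketch of the "AWS keys in a secret scope" option from the reply; the secret scope, key names, and bucket path are placeholders.

access_key = dbutils.secrets.get(scope="aws", key="access_key_id")
secret_key = dbutils.secrets.get(scope="aws", key="secret_access_key")

# Per-notebook Spark configuration for s3a access from Azure Databricks.
spark.conf.set("fs.s3a.access.key", access_key)
spark.conf.set("fs.s3a.secret.key", secret_key)

df = spark.read.parquet("s3a://my-bucket/path/to/data/")   # placeholder bucket
display(df)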
skirock
by New Contributor
  • 519 Views
  • 1 replies
  • 0 kudos

DLT live tables error while reading file from datalake gen2

I am getting the following error while running a cell in Python. The same file runs fine when I upload the JSON file into Databricks and then give that path to the df.read syntax. When I use DLT for the same file in the data lake, it gives me the follo...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @skirock, PySpark provides built-in testing utilities for unit testing. These utilities are standalone and can work with any test framework or CI test pipeline. For simple ad-hoc validation cases, consider using functions like assertDataFram...

  • 0 kudos
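A small sketch of the PySpark testing utility the reply mentions (available in PySpark 3.5+ and recent Databricks runtimes); the DataFrames here are toy examples.

from pyspark.testing import assertDataFrameEqual

expected = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
actual = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

assertDataFrameEqual(actual, expected)  # raises an assertion error if the frames differ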
Mahesh_Yadav
by New Contributor
  • 440 Views
  • 1 replies
  • 0 kudos

How to Export lineage data directly from unity catalog without using system tables

I have been trying to check if there is any direct way to export lineage hierarchy data in Databricks. I have tried to build a workaround solution by accessing system tables as per this link: Monitor usage with system tables - Azure Databricks | Micro...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @Mahesh_Yadav, You can capture lineage data directly from Unity Catalog in Databricks without relying on system tables. Here’s how you can achieve this: Using Unity Catalog: Go to your Databricks landing page. Click “New” in the sidebar and sel...

  • 0 kudos
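A hedged sketch of pulling table lineage through the Databricks lineage-tracking REST API rather than system tables; the workspace URL, secret scope, and table name are placeholders, and the exact parameter names should be checked against the current API documentation.

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace URL
token = dbutils.secrets.get(scope="my_scope", key="pat")

resp = requests.get(
    f"{host}/api/2.0/lineage-tracking/table-lineage",
    headers={"Authorization": f"Bearer {token}"},
    params={"table_name": "main.my_schema.my_table", "include_entity_lineage": "true"},
)
resp.raise_for_status()
print(resp.json())   # upstream/downstream tables and notebooks for the table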
ChingizK
by New Contributor III
  • 351 Views
  • 1 replies
  • 0 kudos

Exclude a job from bundle deployment in PROD

My question is regarding Databricks Asset Bundles. I have defined a databricks.yml file the following way:
bundle:
  name: my_bundle_name
include:
  - resources/jobs/*.yml
targets:
  dev:
    mode: development
    default: true
    workspace: ...

Latest Reply
giuseppegrieco
New Contributor III
  • 0 kudos

Hello, if you want, you can deploy specific jobs only in the development environment. Since you have only two environments, a straightforward approach is to modify your jobs YAML definition as follows:
resources:
  jobs:
    # Define the jobs to be de...

  • 0 kudos
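A hedged YAML sketch of one way to keep a job out of prod, along the lines of the reply above: declare the dev-only job under the dev target's resources instead of at the bundle's top level. All names and paths are placeholders.

bundle:
  name: my_bundle_name

include:
  - resources/jobs/*.yml

targets:
  dev:
    mode: development
    default: true
    resources:
      jobs:
        dev_only_job:              # placeholder dev-only job
          name: dev-only-job
          tasks:
            - task_key: main
              notebook_task:
                notebook_path: ../src/dev_only_notebook.py
  prod:
    mode: production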
HASSAN_UPPAL123
by New Contributor II
  • 2105 Views
  • 1 replies
  • 0 kudos

Resolved! Getting com.databricks.client.jdbc.Driver is not found error while connecting to databricks

Hi Community, I need help regarding a class-not-found issue. I'm trying to connect to Databricks in Python via jaydebeapi, provided the proper class name `com.databricks.client.jdbc.Driver` and jar path `databricks-jdbc-2.6.34.jar`, but I'm getting com.data...

Latest Reply
User16502773013
Contributor II
  • 0 kudos

Hello @HASSAN_UPPAL123, The class name is correct; for the jar, please try downloading the latest from here. This issue may also be a classpath issue where the jar is not exported correctly in your client setup. I see similar issues/suggested solutions ...

  • 0 kudos
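A hedged jaydebeapi sketch with the driver jar passed explicitly, which sidesteps most classpath issues; the server hostname, HTTP path, token, and jar location are placeholders.

import jaydebeapi

conn = jaydebeapi.connect(
    "com.databricks.client.jdbc.Driver",
    "jdbc:databricks://<server-hostname>:443;httpPath=<http-path>;"
    "AuthMech=3;UID=token;PWD=<personal-access-token>",
    jars="/path/to/databricks-jdbc-2.6.34.jar",   # explicit jar on the classpath
)
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
cursor.close()
conn.close()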
Avinash_Narala
by Contributor
  • 5203 Views
  • 4 replies
  • 1 kudos

Resolved! export notebook

Hi, I want to export a notebook programmatically in Python. Is there a way to leverage the Databricks CLI in Python, or any other way to export the notebook to my local PC?

Latest Reply
Pri-databricks
New Contributor II
  • 1 kudos

Is there a way to export a notebook through Terraform? If so, please provide examples. With terraform-provider-databricks.exe we are able to export all the notebooks from the workspace, but not a single notebook. Any suggestions?

  • 1 kudos
3 More Replies
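A hedged sketch of exporting a single notebook with the Databricks SDK for Python, one way to do this programmatically from a local machine; the notebook path and output filename are placeholders, and authentication is assumed to come from environment variables or a configured profile.

import base64

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ExportFormat

w = WorkspaceClient()  # reads host/token from env vars or ~/.databrickscfg

exported = w.workspace.export(
    path="/Users/someone@example.com/my_notebook",   # placeholder workspace path
    format=ExportFormat.SOURCE,
)
with open("my_notebook.py", "wb") as f:
    f.write(base64.b64decode(exported.content))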
