My organization has recently started using Delta Live Tables in Databricks for data modeling. One of the dimensions I am trying to model takes data from 3 existing tables in the data lake and needs to be a slowly changing dimension (SCD Type 1). This a...
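In DLT, Type 1 dimensions are usually expressed with `dlt.apply_changes(..., stored_as_scd_type=1)`, which only runs inside a pipeline. The Type 1 overwrite semantics (latest record wins, no history kept) can be sketched in plain Python; the column names below (`customer_id`, `updated_at`) are hypothetical:

```python
# SCD Type 1 semantics: for each business key, keep only the most recent
# version of the record (updates overwrite, no history is retained).
# Hypothetical field names: "customer_id" is the key, "updated_at" orders changes.

def scd_type1_merge(target, updates, key="customer_id", seq="updated_at"):
    """Apply a batch of change records to a target dict keyed by `key`."""
    merged = dict(target)
    for row in updates:
        current = merged.get(row[key])
        # Overwrite only if the incoming record is newer (or the key is new).
        if current is None or row[seq] >= current[seq]:
            merged[row[key]] = row
    return merged

target = {1: {"customer_id": 1, "city": "Oslo", "updated_at": 10}}
updates = [
    {"customer_id": 1, "city": "Bergen", "updated_at": 20},  # update in place
    {"customer_id": 2, "city": "Tromso", "updated_at": 15},  # new key, insert
]
result = scd_type1_merge(target, updates)
print(result[1]["city"])  # → Bergen
```

In the pipeline itself the equivalent (assuming the same hypothetical columns) would be `dlt.apply_changes(target="dim_customer", source="updates", keys=["customer_id"], sequence_by="updated_at", stored_as_scd_type=1)`.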
Am I able to use gateway.create_route in MLflow for open source LLM models? I'm aware of the syntax for proprietary models like OpenAI: from mlflow import gateway
gateway.create_route(
name=OpenAI_embeddings_route_name...
Hi @MichaelO, Certainly! The MLflow AI Gateway provides a way to manage and deploy models, including both proprietary and open source ones.
Let’s explore how you can create a route for an open source model using the MLflow AI Gateway.
What is the ML...
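One way to structure a route for an open source model is to point the gateway at an MLflow model serving endpoint. The provider name and config keys below are assumptions based on the MLflow AI Gateway documentation for recent 2.x versions; verify them against the docs for your MLflow version before relying on them:

```python
# Sketch of a route definition for an open source model served by MLflow
# model serving. Route name, model name, and URL are hypothetical; the
# provider string "mlflow-model-serving" is an assumption to verify against
# your MLflow version's gateway docs.

open_source_route = {
    "name": "llama2-completions",            # hypothetical route name
    "route_type": "llm/v1/completions",
    "model": {
        "name": "llama2-7b",                 # hypothetical served-model name
        "provider": "mlflow-model-serving",
        "config": {
            "model_server_url": "http://localhost:5000",  # hypothetical URL
        },
    },
}

# In an environment with a running gateway you would then call:
# from mlflow import gateway
# gateway.create_route(**open_source_route)
```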
Hi, I'm implementing a DLT pipeline using Auto Loader to ingest JSON files. The JSON files contain an array called Items that contains records, and two of the fields in the records weren't part of the original schema but have been added later. Auto Loa...
Hi @Magnus , It seems you’re encountering an issue with schema evolution in your DLT pipeline using Auto Loader.
Let’s explore how you can improve your notebook implementation.
Schema Inference and Evolution:
Auto Loader can automatically detect...
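For fields added later inside a nested array, the usual levers are Auto Loader's schema evolution mode and schema hints. The option keys below are real Auto Loader options; the `Items` struct fields are hypothetical placeholders for your actual schema:

```python
# Auto Loader options addressing late-added fields inside a nested array.
# "cloudFiles.schemaEvolutionMode" and "cloudFiles.schemaHints" are real
# Auto Loader options; the struct fields in the hint are hypothetical.

autoloader_options = {
    "cloudFiles.format": "json",
    "cloudFiles.inferColumnTypes": "true",
    # Let the stream restart and add new columns as they appear in the data:
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
    # Or pin the nested schema explicitly so the new fields are read even
    # from files written before they existed (they come back as null):
    "cloudFiles.schemaHints": "Items array<struct<id:string,qty:int>>",
}

# In the pipeline you would then use:
# df = (spark.readStream.format("cloudFiles")
#       .options(**autoloader_options)
#       .load("/path/to/json"))
```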
After trying to run spark_udf = mlflow.pyfunc.spark_udf(spark, model_uri=logged_model, env_manager="virtualenv"), we get the following error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 145.0 failed 4 times, most re...
Hi @coltonflowers , The error you’re encountering seems to be related to a connection issue.
Let’s explore some potential solutions:
Check Network Connectivity:
Ensure that the machine running your Spark job has proper network connectivity. Veri...
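The connectivity check above (assuming the failure is driver/executor egress, e.g. to the artifact store or the package index that `env_manager="virtualenv"` needs when rebuilding the model environment) can be scripted with the stdlib:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, DNS failures
        return False

# Hostname below is a placeholder; substitute the endpoint your cluster
# actually needs (artifact store, PyPI mirror, etc.).
print(can_reach("invalid.invalid", 80))  # reserved TLD never resolves → False
```

Running this on both the driver and (via a simple UDF or init script) the executors narrows down where the connection is being blocked.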
Hi Team, we have a requirement to keep metadata (Unity Catalog) in one AWS account and data storage (Delta tables) in another account. Is it possible to do that? Would we face any technical or security issues?
Hi @Phani1, Let’s address your requirement regarding Unity Catalog metadata and Delta tables storage in separate AWS accounts.
Unity Catalog Accounts:
Unity Catalog (UC) is a fine-grained governance solution for data and AI on the Databricks Lakeho...
Hi, we are trying to build upsert logic for a Delta table; for that we are writing a merge command between a streaming DataFrame and the Delta table DataFrame. Please find the code below: merge_sql = f""" Merge command comes here """ spark.sql(merg...
Hi @Venu_DE1, The error message you’re encountering indicates that you’re trying to execute a query with streaming sources, but you’re missing the necessary .start() method for your streaming DataFrame.
Let’s address this issue step by step:
Stre...
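The usual way around this is `foreachBatch`: inside the batch function the micro-batch is a *static* DataFrame, so MERGE is allowed, and the stream itself must be started with `.start()`. A sketch, with hypothetical table, view, column, and checkpoint names:

```python
def build_merge_sql(target: str, source_view: str, key: str) -> str:
    """Build an upsert MERGE statement (all names are caller-supplied)."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source_view} AS s "
        f"ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET * "
        f"WHEN NOT MATCHED THEN INSERT *"
    )

def upsert_to_delta(batch_df, batch_id):
    # Inside foreachBatch, batch_df is a static DataFrame, so MERGE works.
    batch_df.createOrReplaceTempView("updates")
    batch_df.sparkSession.sql(build_merge_sql("target_table", "updates", "id"))

# The stream must be started, otherwise you get the
# "Queries with streaming sources must be executed with writeStream.start()"
# error:
# (streaming_df.writeStream
#     .foreachBatch(upsert_to_delta)
#     .option("checkpointLocation", "/tmp/checkpoints/upsert")  # hypothetical
#     .start())
```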
Hi @William_Scardua, Certainly! Data quality is a critical aspect in any organization, ensuring that data is accurate, consistent, and reliable.
Here are some key components of a robust data quality framework:
Data Governance: Establish policies,...
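Rule-based checks like these can start very small. A minimal sketch (column names and thresholds are hypothetical; in Databricks/DLT the same idea is usually expressed as expectations such as `@dlt.expect` or `@dlt.expect_or_drop`):

```python
# Minimal rule-based data quality checks over a batch of records.
# "email"/"id" and the 10% threshold are hypothetical examples.

def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / len(rows)

def run_checks(rows):
    return {
        "email_mostly_present": null_rate(rows, "email") <= 0.1,
        "id_always_present": null_rate(rows, "id") == 0.0,
    }

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
]
print(run_checks(rows))  # email null rate is 0.5, so the first check fails
```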
Hi guys, many people use PySpark to develop their pipelines. In your opinion, in which cases is it better to use one or the other (PySpark or Scala)? Or is it better to choose a single language? Thanks
PySpark and Scala are both powerful tools for data processing and pipeline development in the big data ecosystem. Let’s explore their strengths and use cases: PySpark: Python API for Spark: PySpark allows you to harness the simplicity of Python while...
I am using the Databricks Community Edition, but cluster usage is limited to 2 hours and the cluster automatically terminates. So I have to re-attach the cluster every time to run the notebook again. From reading other discussions, I learned it is not something...
Hi @choi_2 ,
I understand the challenges you’re facing with Databricks Community Edition (CE) and the limitations it imposes on cluster usage. While CE provides a micro-cluster and a notebook environment, it does have some restrictions.
Let’s add...
I am running this notebook via the DLT pipeline in preview mode. Everything works up until the predictions table, which should be created with a registered model inferencing the gold table. This is the error: com.databricks.spark.safespark.UDFException...
Hi, I need guidance on connecting Databricks (not VNET-injected) to a storage account with a Private Endpoint. We have a client who created Databricks with a public IP (not VNET-injected). It’s using a managed VNET in the Databricks managed resource g...
I want to express my gratitude for your effort in selecting the most suitable solution. It's great to hear that your query has been successfully resolved. Thank you for your contribution.
We know that Databricks with VNET injection (our own VNET) allows us to connect to blob storage/ADLS Gen2 over private endpoints and peering. This is what we typically do. We have a client who created Databricks with EnableNoPublicIP=No (secure clust...
Hi @jx1226 , Certainly! Let’s break down your requirements and explore the options for connecting your Databricks workspace to blob storage and ADLS Gen2 using private endpoints.
Workspace Configuration:
Your client’s Databricks workspace is set ...
Hi! We are creating a table in a streaming job every micro-batch using the spark.sql('create or replace table ... using delta as ...') command. This query combines data from multiple tables. Sometimes our job fails with the error: py4j.Py4JException: An e...
Hi @deng_dev , The error message you’re encountering, java.util.NoSuchElementException: key not found: Filter (isnotnull(uuid#42326735) AND isnotnull(actor_uuid#42326740)), indicates that there’s an issue with the query execution.
Let’s address thi...
Hi, if you create a shallow clone using the latest LTS and drop the clone using a SQL warehouse (either current or preview), the source table is broken beyond repair. Data reads and writes still work, but vacuum will remain broken forever. I've attac...
Hi all, I have a workflow that runs one single notebook with dbutils.notebook.run() and different parameters in one long loop. At some point, I get random Git errors in the notebook run: com.databricks.WorkflowException: com.databricks.NotebookExecut...
Hi @Michael_Galli, It appears that you’re encountering GitHub-related issues during your notebook runs in Databricks.
Let’s address this step by step:
GitHub API Limit:
Databricks enforces rate limits for all REST API calls, including those rela...
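For transient rate-limit failures like these, a common workaround is to wrap the notebook call in a retry with exponential backoff. A sketch; in the workflow, `fn` would wrap `dbutils.notebook.run(path, timeout, params)` (the stub below only simulates a flaky call):

```python
import time

def run_with_retry(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry fn() with exponential backoff on any exception.

    Transient GitHub rate-limit errors often succeed on a later attempt;
    the final failure is re-raised so the job still surfaces real errors.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo with a stub that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient git error")
    return "ok"

print(run_with_retry(flaky, sleep=lambda s: None))  # → ok
```

In production you would likely catch only the specific exception (e.g. `com.databricks.WorkflowException` surfaced through `dbutils.notebook.run`) rather than bare `Exception`.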