My organization has recently started using Delta Live Tables in Databricks for data modeling. One of the dimensions I am trying to model takes data from 3 existing tables in the data lake and needs to be a slowly changing dimension (SCD Type 1). This a...
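In DLT, Type 1 dimensions are usually expressed with `dlt.apply_changes(..., stored_as_scd_type=1)`, which only runs inside a pipeline. The Type 1 overwrite semantics (latest record wins, no history kept) can be sketched in plain Python; the column names below (`customer_id`, `updated_at`) are hypothetical:

```python
# SCD Type 1 semantics: for each business key, keep only the most recent
# version of the record (updates overwrite, no history is retained).
# Hypothetical field names: "customer_id" is the key, "updated_at" orders changes.

def scd_type1_merge(target, updates, key="customer_id", seq="updated_at"):
    """Apply a batch of change records to a target dict keyed by `key`."""
    merged = dict(target)
    for row in updates:
        current = merged.get(row[key])
        # Overwrite only if the incoming record is newer (or the key is new).
        if current is None or row[seq] >= current[seq]:
            merged[row[key]] = row
    return merged

target = {1: {"customer_id": 1, "city": "Oslo", "updated_at": 10}}
updates = [
    {"customer_id": 1, "city": "Bergen", "updated_at": 20},  # update in place
    {"customer_id": 2, "city": "Tromso", "updated_at": 15},  # new key, insert
]
result = scd_type1_merge(target, updates)
print(result[1]["city"])  # → Bergen
```

In the pipeline itself the equivalent (assuming the same hypothetical columns) would be `dlt.apply_changes(target="dim_customer", source="updates", keys=["customer_id"], sequence_by="updated_at", stored_as_scd_type=1)`.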
Am I able to use gateway.create_route in MLflow for open source LLM models? I'm aware of the syntax for proprietary models like OpenAI: from mlflow import gateway
gateway.create_route(
name=OpenAI_embeddings_route_name...
Hi @MichaelO, Certainly! The MLflow AI Gateway provides a way to manage and deploy models, including both proprietary and open source ones.
Let’s explore how you can create a route for an open source model using the MLflow AI Gateway.
What is the ML...
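One way to structure a route for an open source model is to point the gateway at an MLflow model serving endpoint. The provider name and config keys below are assumptions based on the MLflow AI Gateway documentation for recent 2.x versions; verify them against the docs for your MLflow version before relying on them:

```python
# Sketch of a route definition for an open source model served by MLflow
# model serving. Route name, model name, and URL are hypothetical; the
# provider string "mlflow-model-serving" is an assumption to verify against
# your MLflow version's gateway docs.

open_source_route = {
    "name": "llama2-completions",            # hypothetical route name
    "route_type": "llm/v1/completions",
    "model": {
        "name": "llama2-7b",                 # hypothetical served-model name
        "provider": "mlflow-model-serving",
        "config": {
            "model_server_url": "http://localhost:5000",  # hypothetical URL
        },
    },
}

# In an environment with a running gateway you would then call:
# from mlflow import gateway
# gateway.create_route(**open_source_route)
```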
Hi, I'm implementing a DLT pipeline using Auto Loader to ingest JSON files. The JSON files contain an array called Items that contains records, and two of the fields in the records weren't part of the original schema but have been added later. Auto Loa...
Hi @Magnus , It seems you’re encountering an issue with schema evolution in your DLT pipeline using Auto Loader.
Let’s explore how you can improve your notebook implementation.
Schema Inference and Evolution:
Auto Loader can automatically detect...
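For fields added later inside a nested array, the usual levers are Auto Loader's schema evolution mode and schema hints. The option keys below are real Auto Loader options; the `Items` struct fields are hypothetical placeholders for your actual schema:

```python
# Auto Loader options addressing late-added fields inside a nested array.
# "cloudFiles.schemaEvolutionMode" and "cloudFiles.schemaHints" are real
# Auto Loader options; the struct fields in the hint are hypothetical.

autoloader_options = {
    "cloudFiles.format": "json",
    "cloudFiles.inferColumnTypes": "true",
    # Let the stream restart and add new columns as they appear in the data:
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
    # Or pin the nested schema explicitly so the new fields are read even
    # from files written before they existed (they come back as null):
    "cloudFiles.schemaHints": "Items array<struct<id:string,qty:int>>",
}

# In the pipeline you would then use:
# df = (spark.readStream.format("cloudFiles")
#       .options(**autoloader_options)
#       .load("/path/to/json"))
```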
After trying to run spark_udf = mlflow.pyfunc.spark_udf(spark, model_uri=logged_model, env_manager="virtualenv"), we get the following error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 145.0 failed 4 times, most re...
Hi @coltonflowers , The error you’re encountering seems to be related to a connection issue.
Let’s explore some potential solutions:
Check Network Connectivity:
Ensure that the machine running your Spark job has proper network connectivity. Veri...
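The connectivity check above (assuming the failure is driver/executor egress, e.g. to the artifact store or the package index that `env_manager="virtualenv"` needs when rebuilding the model environment) can be scripted with the stdlib:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, DNS failures
        return False

# Hostname below is a placeholder; substitute the endpoint your cluster
# actually needs (artifact store, PyPI mirror, etc.).
print(can_reach("invalid.invalid", 80))  # reserved TLD never resolves → False
```

Running this on both the driver and (via a simple UDF or init script) the executors narrows down where the connection is being blocked.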
Hi Team, we have a requirement to keep metadata (Unity Catalog) in one AWS account and data storage (Delta tables) in another account. Is it possible to do that? Would we face any technical or security issues?
Hi @Phani1, Let’s address your requirement regarding Unity Catalog metadata and Delta tables storage in separate AWS accounts.
Unity Catalog Accounts:
Unity Catalog (UC) is a fine-grained governance solution for data and AI on the Databricks Lakeho...
Hi, we are trying to build upsert logic for a Delta table; for that we are writing a merge command between a streaming DataFrame and the Delta table DataFrame. Please find the code below: merge_sql = f""" Merge command comes here """ spark.sql(merg...
Hi @Venu_DE1, The error message you’re encountering indicates that you’re trying to execute a query with streaming sources, but you’re missing the necessary .start() method for your streaming DataFrame.
Let’s address this issue step by step:
Stre...
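The usual way around this is `foreachBatch`: inside the batch function the micro-batch is a *static* DataFrame, so MERGE is allowed, and the stream itself must be started with `.start()`. A sketch, with hypothetical table, view, column, and checkpoint names:

```python
def build_merge_sql(target: str, source_view: str, key: str) -> str:
    """Build an upsert MERGE statement (all names are caller-supplied)."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source_view} AS s "
        f"ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET * "
        f"WHEN NOT MATCHED THEN INSERT *"
    )

def upsert_to_delta(batch_df, batch_id):
    # Inside foreachBatch, batch_df is a static DataFrame, so MERGE works.
    batch_df.createOrReplaceTempView("updates")
    batch_df.sparkSession.sql(build_merge_sql("target_table", "updates", "id"))

# The stream must be started, otherwise you get the
# "Queries with streaming sources must be executed with writeStream.start()"
# error:
# (streaming_df.writeStream
#     .foreachBatch(upsert_to_delta)
#     .option("checkpointLocation", "/tmp/checkpoints/upsert")  # hypothetical
#     .start())
```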
Hi @William_Scardua, Certainly! Data quality is a critical aspect in any organization, ensuring that data is accurate, consistent, and reliable.
Here are some key components of a robust data quality framework:
Data Governance: Establish policies,...
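Rule-based checks like these can start very small. A minimal sketch (column names and thresholds are hypothetical; in Databricks/DLT the same idea is usually expressed as expectations such as `@dlt.expect` or `@dlt.expect_or_drop`):

```python
# Minimal rule-based data quality checks over a batch of records.
# "email"/"id" and the 10% threshold are hypothetical examples.

def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / len(rows)

def run_checks(rows):
    return {
        "email_mostly_present": null_rate(rows, "email") <= 0.1,
        "id_always_present": null_rate(rows, "id") == 0.0,
    }

rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
]
print(run_checks(rows))  # email null rate is 0.5, so the first check fails
```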
Hi guys, many people use PySpark to develop their pipelines. In your opinion, in which cases is it better to use one or the other (PySpark or Scala)? Or is it better to choose a single language? Thanks
PySpark and Scala are both powerful tools for data processing and pipeline development in the big data ecosystem. Let’s explore their strengths and use cases: PySpark: Python API for Spark: PySpark allows you to harness the simplicity of Python while...
I am using the Databricks Community Edition, but cluster usage is limited to 2 hours and the cluster automatically terminates. So I have to re-attach the cluster every time to run the notebook again. From reading other discussions, I learned it is not something...
Hi @choi_2 ,
I understand the challenges you’re facing with Databricks Community Edition (CE) and the limitations it imposes on cluster usage. While CE provides a micro-cluster and a notebook environment, it does have some restrictions.
Let’s add...
I am running this notebook via the DLT pipeline in preview mode. Everything works up until the predictions table, which should be created with a registered model inferencing the gold table. This is the error: com.databricks.spark.safespark.UDFException...
Hi, I need guidance on connecting Databricks (not VNET-injected) to a storage account with a Private Endpoint. We have a client who created Databricks with a public IP (not VNET-injected). It’s using a managed VNET in the Databricks managed resource g...
I want to express my gratitude for your effort in selecting the most suitable solution. It's great to hear that your query has been successfully resolved. Thank you for your contribution.
We know that Databricks with VNET injection (our own VNET) allows us to connect to blob storage/ADLS Gen2 over private endpoints and peering. This is what we typically do. We have a client who created Databricks with EnableNoPublicIP=No (secure clust...
Hi @jx1226 , Certainly! Let’s break down your requirements and explore the options for connecting your Databricks workspace to blob storage and ADLS Gen2 using private endpoints.
Workspace Configuration:
Your client’s Databricks workspace is set ...
Hi! We are creating a table in a streaming job every micro-batch using the spark.sql('create or replace table ... using delta as ...') command. This query combines data from multiple tables. Sometimes our job fails with the error: py4j.Py4JException: An e...
Hi @deng_dev , The error message you’re encountering, java.util.NoSuchElementException: key not found: Filter (isnotnull(uuid#42326735) AND isnotnull(actor_uuid#42326740)), indicates that there’s an issue with the query execution.
Let’s address thi...
Hi, if you create a shallow clone using the latest LTS and drop the clone using a SQL warehouse (either current or preview), the source table is broken beyond repair. Data reads and writes still work, but vacuum will remain broken forever. I've attac...
Hi all, I have a workflow that runs one single notebook with dbutils.notebook.run() and different parameters in one long loop. At some point, I get random Git errors in the notebook run: com.databricks.WorkflowException: com.databricks.NotebookExecut...
Hi @Michael_Galli, It appears that you’re encountering GitHub-related issues during your notebook runs in Databricks.
Let’s address this step by step:
GitHub API Limit:
Databricks enforces rate limits for all REST API calls, including those rela...
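For transient rate-limit failures like these, a common workaround is to wrap the notebook call in a retry with exponential backoff. A sketch; in the workflow, `fn` would wrap `dbutils.notebook.run(path, timeout, params)` (the stub below only simulates a flaky call):

```python
import time

def run_with_retry(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry fn() with exponential backoff on any exception.

    Transient GitHub rate-limit errors often succeed on a later attempt;
    the final failure is re-raised so the job still surfaces real errors.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo with a stub that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient git error")
    return "ok"

print(run_with_retry(flaky, sleep=lambda s: None))  # → ok
```

In production you would likely catch only the specific exception (e.g. `com.databricks.WorkflowException` surfaced through `dbutils.notebook.run`) rather than bare `Exception`.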