Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

wkgcls
by New Contributor
  • 78 Views
  • 2 replies
  • 1 kudos

Resolved! DQX usage outside Databricks

Hello, When evaluating data quality frameworks for PySpark pipelines, I came across DQX. I noticed it's available on PyPI (databricks-labs-dqx) and GitHub, which is great for accessibility. However, I'm trying to understand the licensing requirements....

Latest Reply
wkgcls
New Contributor
  • 1 kudos

Thanks a lot for the quick response, @ManojkMohan! This was very helpful. I'll keep this in mind.

1 More Replies
liquibricks
by New Contributor II
  • 69 Views
  • 3 replies
  • 2 kudos

Moving tables between pipelines in production

We are testing an ingestion from Kafka to Databricks using a streaming table. The streaming table was created by a DAB deployed to "production" which runs as a service principal. This means the service principal is the "owner" of the table. We now wan...

Latest Reply
nayan_wylde
Esteemed Contributor
  • 2 kudos

You’ve hit two limitations:
  • Streaming tables don’t allow SET OWNER – ownership cannot be changed.
  • Lakeflow pipeline ID changes require pipeline-level permissions – if you’re not the pipeline owner, you can’t run ALTER STREAMING TABLE ... SET PIPELINE_I...
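For context, a minimal sketch of the two statements under discussion, run from a notebook; the catalog/table name and pipeline ID are placeholders, and the SET PIPELINE_ID syntax is the one quoted in the reply above:

```python
# Ownership change is rejected for streaming tables:
# spark.sql("ALTER TABLE main.bronze.events_st SET OWNER TO `data-eng-team`")
#   -> fails, since streaming tables don't support SET OWNER.

# Re-pointing the table at a different Lakeflow pipeline needs pipeline-level
# permission, so it must be run by the pipeline owner (the DAB's service
# principal) or a principal with CAN_MANAGE on that pipeline:
spark.sql("""
    ALTER STREAMING TABLE main.bronze.events_st
    SET PIPELINE_ID '00000000-0000-0000-0000-000000000000'
""")
```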

2 More Replies
Suheb
by New Contributor II
  • 79 Views
  • 4 replies
  • 3 kudos

When working with large data sets in Databricks, what are best practices to avoid out-of-memory errors?

How can I optimize Databricks to handle large datasets without running into memory or performance problems?

Latest Reply
tarunnagar
New Contributor III
  • 3 kudos

Hey! Great question — I’ve run into this issue quite a few times while working with large datasets in Databricks, and out-of-memory errors can be a real headache. One of the biggest things that helps is making sure your cluster configuration matches ...
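As an illustrative sketch (not from the thread) of a few of those habits in PySpark, with made-up table and column names:

```python
from pyspark.sql import functions as F

# Project and filter as early as possible so less data is shuffled and cached.
df = (
    spark.read.table("main.sales.transactions")      # hypothetical table
    .select("order_id", "order_date", "amount")       # only the columns needed
    .filter(F.col("order_date") >= "2024-01-01")      # push filters down early
)

# Aggregate on the cluster instead of pulling raw rows to the driver.
daily = df.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))

# Persist results as a table rather than calling .collect()/.toPandas(),
# which is the most common cause of driver out-of-memory errors.
daily.write.mode("overwrite").saveAsTable("main.sales.daily_totals")
```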

3 More Replies
Marcus_S
by New Contributor II
  • 2852 Views
  • 2 replies
  • 0 kudos

Change in UNRESOLVED_COLUMN error behavior in Runtime 14.3 LTS

I've noticed a change in how Databricks handles unresolved column references in PySpark when using All-purpose compute (not serverless). In Databricks Runtime 14.3 LTS, referencing a non-existent column like this: df = spark.table('default.example').se...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Databricks has recently changed how unresolved column references are handled in PySpark on All-purpose compute clusters. In earlier Databricks Runtime (DBR) 14.3 LTS builds, referencing a non-existent column—such as df = spark.tabl...
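A minimal reproducer of the behavior being discussed, assuming a table default.example that has no column named does_not_exist; whether the error surfaces eagerly at select() or only at the action is exactly what changed between builds:

```python
from pyspark.errors import AnalysisException  # pyspark.sql.utils on older runtimes

df = spark.table("default.example")
try:
    bad = df.select("does_not_exist")
    bad.show()   # depending on the runtime, the error may only appear here
except AnalysisException as e:
    print("UNRESOLVED_COLUMN raised:", e)
```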

1 More Replies
Asaph
by New Contributor
  • 3915 Views
  • 8 replies
  • 1 kudos

Issue with databricks.sdk - AccountClient Service Principals API

Hi everyone, I’ve been trying to work with the databricks.sdk Python library to manage service principals programmatically. However, I’m running into an issue when attempting to create a service principal using the AccountClient class. Below is the co...

Latest Reply
MarlonFojas
New Contributor
  • 1 kudos

I am using the Python SDK, and to authenticate I am using an SP and a secret. Here is the code that worked for me in an Azure Databricks notebook: from databricks.sdk import AccountClient acct_client = AccountClient( host="https://accounts.azuredatabr...
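A self-contained sketch along the lines of that reply, using OAuth (client ID/secret) authentication for an account-level client; the host, account ID, and credential values are placeholders:

```python
from databricks.sdk import AccountClient

acct_client = AccountClient(
    host="https://accounts.azuredatabricks.net",     # Azure accounts endpoint
    account_id="<databricks-account-id>",
    client_id="<service-principal-application-id>",
    client_secret="<oauth-secret>",
)

# Create an account-level service principal.
sp = acct_client.service_principals.create(
    display_name="demo-automation-sp",
    active=True,
)
print(sp.id, sp.application_id)
```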

7 More Replies
Ramana
by Valued Contributor
  • 773 Views
  • 6 replies
  • 1 kudos

Resolved! Serverless Compute - Spark - Jobs failing with Max iterations (1000) reached for batch Resolution

Hello Community, We have been trying to migrate our jobs from Classic Compute to Serverless Compute. As part of this process, we face several challenges, and this is one of them. When we try to execute the existing jobs with Serverless Compute, if the ...

Latest Reply
Ramana
Valued Contributor
  • 1 kudos

In Serverless Version 4, Databricks fixed this issue.

5 More Replies
akuma643
by New Contributor II
  • 3724 Views
  • 3 replies
  • 1 kudos

The authentication value "ActiveDirectoryManagedIdentity" is not valid.

Hi Team, I am trying to connect to SQL Server hosted in an Azure VM using Entra ID authentication from Databricks ("authentication", "ActiveDirectoryManagedIdentity"). Below is the notebook script I am using: driver = "com.microsoft.sqlserver.jdbc.SQLServe...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

You are encountering an error because the default SQL Server JDBC driver bundled with Databricks may not fully support the authentication value "ActiveDirectoryManagedIdentity"—this option requires at least version 10.2.0 of the Microsoft SQL Server ...
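A hedged sketch of the JDBC read in question, with placeholder server, database, and table names; as noted above, this authentication mode assumes the Microsoft SQL Server JDBC driver 10.2.0 or later is installed on the cluster and a managed identity is attached to the compute:

```python
jdbc_url = "jdbc:sqlserver://<vm-host-or-ip>:1433;databaseName=<mydb>"

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_table")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("authentication", "ActiveDirectoryManagedIdentity")
    .load()
)
display(df)
```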

2 More Replies
cdn_yyz_yul
by New Contributor III
  • 133 Views
  • 4 replies
  • 1 kudos

Delta as streaming source: can the reader read only newly appended rows?

Hello everyone, In our implementation of Medallion Architecture, we want to stream changes with Spark Structured Streaming. I would like some advice on how to use a Delta table as source correctly, and if there is a performance (memory usage) concern in t...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

In your scenario using Medallion Architecture with Delta tables as both streaming source and sink, it is important to understand Spark Structured Streaming behavior and performance characteristics, especially with joins and memory usage. Here is a di...
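As a minimal sketch of the append-only reading pattern (table names and checkpoint path are placeholders): a Delta table used as a streaming source emits only newly appended rows by default, and skipChangeCommits can be set so that upstream update/delete commits are ignored instead of failing the stream.

```python
bronze_stream = (
    spark.readStream.format("delta")
    .option("skipChangeCommits", "true")   # ignore non-append (rewrite) commits
    .table("main.bronze.events")
)

query = (
    bronze_stream.writeStream
    .option("checkpointLocation", "/Volumes/main/chk/events_to_silver")
    .trigger(availableNow=True)
    .toTable("main.silver.events")
)
```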

3 More Replies
Shubhankar_123
by New Contributor
  • 89 Views
  • 1 replies
  • 0 kudos

Internal error 500 on databricks vector search endpoint

We are facing an internal 500 error accessing the vector search endpoint through a Streamlit application; if I refresh the application the error sometimes goes away, but it has now become a regular occurrence. If I try to query the endpoint using...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The intermittent Internal 500 errors you’re experiencing when accessing the vector search endpoint through a Streamlit app on Databricks—while direct console queries work—suggest an issue with the interaction between your Streamlit app’s environment ...

SumitB14
by New Contributor
  • 63 Views
  • 1 replies
  • 0 kudos

Databricks Nested Json Flattening

Hi Databricks Community, I am facing an issue while exploding nested JSON data. In the content column, I have dynamic nested JSON, and I am using the below approach to parse and explode it: from pyspark.sql import SparkSession from pyspark.sql.functions ...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are encountering an AttributeError related to strip, which likely means that some entries for activity.value are not strings (maybe None or dicts) and your code expects all to be strings before calling .strip(). This kind of problem can arise if ...
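A minimal sketch of that guard, assuming the exploded column is activity.value and that a plain string result is acceptable; the names are taken from the question:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(StringType())
def safe_strip(v):
    # Only call .strip() on real strings; anything else (None, dict, list)
    # is passed through as None/stringified instead of raising AttributeError.
    if isinstance(v, str):
        return v.strip()
    return None if v is None else str(v)

# df is the exploded DataFrame from the original question.
df = df.withColumn("activity_value_clean", safe_strip(F.col("activity.value")))
```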

ShivangiB1
by New Contributor III
  • 74 Views
  • 2 replies
  • 0 kudos

DATABRICKS LAKEFLOW SQL SERVER INGESTION PIPELINE ERROR

Hey Team, I am getting the below error while creating a pipeline: com.databricks.pipelines.execution.extensions.managedingestion.errors.ManagedIngestionNonRetryableException: [INGESTION_GATEWAY_DDL_OBJECTS_MISSING] DDL objects missing on table 'coedb.dbo.so...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The error you are seeing means Databricks cannot capture DDL (table definition) changes, even though CDC (Change Data Capture) and CT (Change Tracking) are enabled. You must run the specific DDL support objects script for Databricks ingestion and the...

1 More Replies
shubham007
by New Contributor III
  • 86 Views
  • 2 replies
  • 0 kudos

Urgent: How to do a data migration task using the Databricks Lakebridge tool?

Dear community experts, I have completed the two phases Analyzer & Converter of Databricks Lakebridge but am stuck at migrating data from source to target using Lakebridge. I have watched the BrickBites series on Lakebridge but did not find how to migrate data...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

To migrate tables and views from Snowflake (source) to Databricks (target) using Lakebridge, you must export your data from Snowflake into a supported cloud storage (usually as Parquet files), then import these files into Databricks Delta tables. Lak...
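A minimal sketch of the load step, assuming the Snowflake export has already landed Parquet files in cloud storage; the storage path and target table name are placeholders:

```python
src_path = "abfss://landing@<storageaccount>.dfs.core.windows.net/snowflake_export/orders/"

# Read the exported Parquet files and write them as a managed Delta table.
df = spark.read.parquet(src_path)

(
    df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("main.migrated.orders")
)
```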

1 More Replies
Ajay-Pandey
by Databricks MVP
  • 8174 Views
  • 8 replies
  • 0 kudos

How can we send Databricks logs to Azure Application Insights?

Hi All, I want to send Databricks logs to Azure Application Insights. Is there any way we can do it? Any blog or doc would help me.

Latest Reply
loic
Contributor
  • 0 kudos

Hello, I finally used the AppInsights agent from OpenTelemetry, which is documented in the official Microsoft documentation here: https://learn.microsoft.com/en-us/azure/azure-monitor/app/opentelemetry-enable?tabs=java Below is an adaptation of this "Get ...
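The reply above uses the Java agent; as an alternative, hedged sketch for Python-side notebook logs only, assuming the azure-monitor-opentelemetry package is installed on the cluster and a valid connection string is available:

```python
import logging
from azure.monitor.opentelemetry import configure_azure_monitor

# Route Python logging through OpenTelemetry to Application Insights.
configure_azure_monitor(
    connection_string="InstrumentationKey=<key>;IngestionEndpoint=<endpoint>"
)

logger = logging.getLogger("my_databricks_job")
logger.setLevel(logging.INFO)
logger.info("Job started")   # appears in Application Insights traces
```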

7 More Replies
pooja_bhumandla
by New Contributor III
  • 306 Views
  • 3 replies
  • 1 kudos

When to Use and when Not to Use Liquid Clustering?

Hi everyone, I’m looking for some practical guidance and experiences around when to choose Liquid Clustering versus sticking with traditional partitioning + Z-ordering. From what I’ve gathered so far: for small tables (<10TB), Liquid Clustering gives s...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

Deciding between Liquid Clustering and traditional partitioning with Z-ordering depends on table size, query patterns, number of clustering columns, and file optimization needs. For tables under 10TB with queries consistently filtered on 1–2 columns,...
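For reference, a hedged sketch of opting a table into Liquid Clustering (table and column names are placeholders); clustering keys can later be changed with ALTER TABLE instead of rewriting partitions:

```python
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.sales.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_date  DATE,
        amount      DECIMAL(18, 2)
    )
    CLUSTER BY (order_date, customer_id)
""")

# Incremental reorganization happens through OPTIMIZE (no ZORDER needed).
spark.sql("OPTIMIZE main.sales.orders")
```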

2 More Replies
DatabricksEngi1
by Contributor
  • 190 Views
  • 4 replies
  • 0 kudos

Resolved! MERGE operation not performing data skipping with liquid clustering on key columns

Hi, I need some help understanding a performance issue. I have a table that reads approximately 800K records every 30 minutes in an incremental manner. Let’s say its primary key is: timestamp, x, y. This table is overwritten every 30 minutes and serves ...

Latest Reply
bianca_unifeye
New Contributor II
  • 0 kudos

MERGE is not a pure read-plus-filter operation. Even though Liquid Clustering organizes your data by key ranges and writes min/max stats, the MERGE engine has to identify both matches and non-matches. That means the query planner must: scan all candidate...
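One common mitigation, sketched below with placeholder table and column names, is to add an explicit range predicate on the clustering key to the ON clause so the target side can be pruned rather than fully scanned:

```python
from pyspark.sql import functions as F

# Bounds of the incoming 30-minute batch on the clustering column.
low, high = (
    spark.table("main.staging.events_batch")
    .agg(F.min("timestamp"), F.max("timestamp"))
    .first()
)

spark.sql(f"""
    MERGE INTO main.gold.events AS t
    USING main.staging.events_batch AS s
      ON  t.timestamp = s.timestamp
      AND t.x = s.x
      AND t.y = s.y
      AND t.timestamp BETWEEN '{low}' AND '{high}'  -- lets MERGE skip target files
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```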

3 More Replies
