Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

kanikvijay9
by New Contributor III
  • 19 Views
  • 2 replies
  • 5 kudos

Optimizing Delta Table Writes for Massive Datasets in Databricks

Problem Statement: In one of my recent projects, I faced a significant challenge: writing a huge dataset of 11,582,763,212 rows and 2,068 columns to a Databricks managed Delta table. The initial write operation took 22.4 hours using the following setup:...

Latest Reply
kanikvijay9
New Contributor III

Hey @Louis_Frolio, thank you for the thoughtful feedback and great suggestions! A few clarifications: AQE is already enabled in my setup, and it definitely helped reduce shuffle overhead during the write. Regarding Column Pruning, in this case, the fina...
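For readers following along, a minimal sketch of the kind of write setup under discussion, assuming a standard PySpark job on Databricks (source and target table names are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # AQE (enabled by default on recent DBR) adapts shuffle partitions at runtime
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    # Let Databricks choose optimal file sizes during the Delta write
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

    df = spark.table("source_table")      # hypothetical wide source
    (df.repartition(2048)                 # tune to cluster cores and data volume
       .write.format("delta")
       .mode("overwrite")
       .saveAsTable("target_managed_table"))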

1 More Replies
Coffee77
by Contributor
  • 30 Views
  • 4 replies
  • 1 kudos

Databricks Asset Bundles - High Level Diagrams Flow

Hi guys! I've recently been working on fully understanding (and helping others understand...) Databricks Asset Bundles (DAB), and having fun creating some high-level diagrams of the DAB flow. The first one shows the flow for a simple deployment to PROD, and the second one contains...

[Diagrams: databricks_dab_deployment_prod, databricks_dab_deployment_prod_with_tests]
Latest Reply
Coffee77
Contributor

Updated the first high-level diagram. It now looks like this: DAB High Level Diagram v1.1

3 More Replies
eyalholzmann
by Visitor
  • 27 Views
  • 1 replies
  • 0 kudos

Does VACUUM on Delta Lake also clean Iceberg metadata when using Iceberg Uniform feature?

I'm working with Delta tables using the Iceberg Uniform feature to enable Iceberg-compatible reads. I'm trying to understand how metadata cleanup works in this setup. Specifically, does the VACUUM operation—which removes old Delta Lake metadata based ...

Latest Reply
Louis_Frolio
Databricks Employee

Great question, @eyalholzmann. In Databricks Delta Lake with the Iceberg Uniform feature, VACUUM operations on the Delta table do NOT automatically clean up the corresponding Iceberg metadata. The two metadata layers are managed separately, and unde...
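To make the separation concrete, a quick sketch of the Delta-side cleanup only (table name and retention window are hypothetical):

    # Delta-side cleanup: removes data files unreferenced for the retention window
    spark.sql("VACUUM my_catalog.my_schema.my_uniform_table RETAIN 168 HOURS")
    # Per the reply above, the Iceberg metadata generated by Uniform
    # is a separate layer and is not cleaned up by this command.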

RobsonNLPT
by Contributor III
  • 3712 Views
  • 2 replies
  • 0 kudos

Databricks Rest API Statement Execution - External Links

Hi. I've tested the ADB REST API to execute queries on Databricks SQL Serverless. Using INLINE as the disposition, I get the JSON array with my correct results, but using EXTERNAL_LINKS I get the chunks, yet the external_link (URL starting with http://stor...

Latest Reply
mark_ott
Databricks Employee

The issue you're experiencing with Databricks SQL Serverless REST API in EXTERNAL_LINKS mode—where the external_link URL (http://storage-proxy.databricks.com/...) does not work, but you can access chunks directly via the /api/2.0/sql/statements/{stat...
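For context, a minimal sketch of the EXTERNAL_LINKS flow against the Statement Execution API using plain requests (host, token, and warehouse ID are placeholders; polling for the SUCCEEDED state is omitted for brevity):

    import requests

    HOST = "https://<workspace-host>"              # placeholder
    HEADERS = {"Authorization": "Bearer <token>"}  # placeholder PAT

    resp = requests.post(
        f"{HOST}/api/2.0/sql/statements/",
        headers=HEADERS,
        json={
            "warehouse_id": "<warehouse-id>",
            "statement": "SELECT * FROM samples.nyctaxi.trips LIMIT 100000",
            "disposition": "EXTERNAL_LINKS",
            "format": "JSON_ARRAY",
        },
    ).json()

    # Each chunk exposes a short-lived pre-signed URL to cloud storage.
    # Note: fetch it WITHOUT the Databricks Authorization header.
    for chunk in resp.get("result", {}).get("external_links", []):
        data = requests.get(chunk["external_link"]).json()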

1 More Replies
rbee
by New Contributor II
  • 5004 Views
  • 1 replies
  • 0 kudos

Connect to SQL Server Analysis Services (SSAS) to run DAX queries using Python

Hi, I have a Power BI server which I'm able to connect to through SSMS. I tried using pyodbc to connect to the same from Databricks, but it is throwing the error below. OperationalError: ('HYT00', '[HYT00] [Microsoft][ODBC Driver 17 for SQL Server]Login timeout exp...

Latest Reply
mark_ott
Databricks Employee

Your understanding is correct: Power BI’s data model is stored in an Analysis Services (SSAS) engine, not a traditional SQL Server database. This means that while SSMS may connect to Power BI Premium datasets via XMLA endpoints, attempting to use pyo...
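As an alternative that does run from a Linux-based Databricks cluster, a sketch using the Power BI REST executeQueries endpoint instead of ODBC/XMLA (dataset ID, table name, and Azure AD token acquisition are placeholders, and the caller needs appropriate dataset permissions):

    import requests

    dataset_id = "<dataset-id>"        # placeholder
    token = "<azure-ad-access-token>"  # e.g., acquired via MSAL with a service principal

    resp = requests.post(
        f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/executeQueries",
        headers={"Authorization": f"Bearer {token}"},
        json={"queries": [{"query": "EVALUATE TOPN(10, 'MyTable')"}]},
    )
    rows = resp.json()["results"][0]["tables"][0]["rows"]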

RobsonNLPT
by Contributor III
  • 4490 Views
  • 4 replies
  • 0 kudos

Connection timeout when connecting to MongoDB using MongoDB Connector for Spark 10.x

Hi. I'm testing a Databricks connection to a Mongo cluster V7 (Azure cluster) using the library org.mongodb.spark:mongo-spark-connector_2.13:10.4.1. I can connect using Compass, but I get a timeout error using my ADB notebook. MongoTimeoutException: Timed ...

Latest Reply
mark_ott
Databricks Employee

The error you’re seeing — MongoTimeoutException referencing localhost:27017 — suggests your Databricks cluster is trying to connect to MongoDB using the wrong address or that it cannot properly reach the MongoDB cluster endpoint from the notebook, ev...
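For reference, a minimal connector 10.x read where the URI explicitly points at the cluster (URI, database, and collection are placeholders); if no URI is supplied, the connector falls back to localhost:27017, which matches the error above:

    df = (spark.read.format("mongodb")
          .option("connection.uri", "mongodb+srv://<user>:<password>@<cluster-host>/")
          .option("database", "<db>")
          .option("collection", "<collection>")
          .load())
    df.show(5)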

3 More Replies
kertsman_nm
by New Contributor
  • 3804 Views
  • 1 replies
  • 0 kudos

Trying to use Broadcast to run Presidio distributed

Hello, I am currently evaluating Microsoft's Presidio de-identification libraries for my organization and would like to see if we can take advantage of Spark's broadcast capabilities, but I am getting an error message: "[BROADCAST_VARIABLE_NOT_LOA...

Latest Reply
mark_ott
Databricks Employee

You’re encountering the [BROADCAST_VARIABLE_NOT_LOADED] error because Databricks in shared access mode cannot use broadcast variables with non-serializable Python objects (such as your Presidio engines) due to cluster architecture limitations. The cl...
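A common workaround, sketched here under the assumption that per-row anonymization is the goal (column and table names are hypothetical): skip broadcasting entirely and build the Presidio engines lazily on each executor inside a pandas UDF.

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    _engines = {}  # per-worker cache so the engines are built once per executor process

    def _get_engines():
        if not _engines:
            from presidio_analyzer import AnalyzerEngine
            from presidio_anonymizer import AnonymizerEngine
            _engines["analyzer"] = AnalyzerEngine()
            _engines["anonymizer"] = AnonymizerEngine()
        return _engines["analyzer"], _engines["anonymizer"]

    @F.pandas_udf(StringType())
    def scrub(texts: pd.Series) -> pd.Series:
        analyzer, anonymizer = _get_engines()
        def one(text):
            results = analyzer.analyze(text=text, language="en")
            return anonymizer.anonymize(text=text, analyzer_results=results).text
        return texts.map(one)

    df = spark.table("raw_notes").withColumn("clean", scrub("note_text"))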

6502
by New Contributor III
  • 3223 Views
  • 1 replies
  • 0 kudos

Schema change and OpenSearch

Let me be crystal clear: Schema Change and OpenSearch do not fit well together. However, the data pushed to it is processed and always has the same schema. The problem here is that Spark is reading a CDC feed, which is subject to Schema Change becau...

Latest Reply
mark_ott
Databricks Employee

You are encountering a common issue in Databricks Delta Lake streaming when working with Change Data Capture (CDC) feeds: schema evolution, especially with column mapping enabled, is not fully supported automatically in streaming reads—that includes ...
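On recent DBR versions the usual mitigation, sketched below with hypothetical paths and table names, is to give the stream a schema tracking location so it can follow column-mapping schema changes, then write out via foreachBatch:

    stream = (spark.readStream.format("delta")
              .option("readChangeFeed", "true")
              # Tracks schema evolution for column-mapping tables; typically
              # placed under the query's checkpoint directory
              .option("schemaTrackingLocation", "/chk/cdc_sink/_schema")
              .table("cdc_source"))

    (stream.writeStream
           .option("checkpointLocation", "/chk/cdc_sink")
           .foreachBatch(lambda batch_df, batch_id: None)  # replace with the OpenSearch write
           .start())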

van45678
by New Contributor
  • 3826 Views
  • 2 replies
  • 0 kudos

Getting connection reset issue while connecting to a SQL server

Hello All, I am unable to connect to a SQL Server instance that is installed in an on-premise network from Databricks. I am able to successfully ping the server from the notebook using this command [nc -vz <hostname> <port>], which means I am able to e...

Data Engineering
Databricks
sqlserver
timeout
Latest Reply
mark_ott
Databricks Employee

The error you are encountering, "com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset," even after a successful nc (netcat) connection, is a common but nuanced problem when connecting Databricks to an on-premise SQL Server. Although your...
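For anyone hitting the same symptom, a sketch of the JDBC options most often involved, TLS negotiation in particular (host, database, table, and secret scope names are placeholders):

    jdbc_url = (
        "jdbc:sqlserver://<onprem-host>:1433;databaseName=<db>;"
        # TLS mismatches are a frequent cause of an abrupt "Connection reset";
        # against older servers, try encrypt=true with trustServerCertificate=true
        "encrypt=true;trustServerCertificate=true;loginTimeout=30"
    )

    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", "dbo.<table>")
          .option("user", dbutils.secrets.get("scope", "sql-user"))
          .option("password", dbutils.secrets.get("scope", "sql-password"))
          .load())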

1 More Replies
meghana_tulla
by New Contributor III
  • 2215 Views
  • 1 replies
  • 0 kudos

Issue: UCX Assessment Installation Error in Databricks Automation Script

Hi, I'm experiencing a problem when installing UCX Assessment through an automation script in Databricks. The script fails with this error: 13:38:06 WARNING [databricks.labs.ucx.hive_metastore.tables] {listing_tables_0} failed-table-crawl: listing tabl...

Latest Reply
mark_ott
Databricks Employee

The manual installation of UCX Assessment in Databricks works with the default <ALL> values, but automation scripts that set WORKSPACE_GROUPS="<ALL>" DATABASES="<ALL>" often encounter a SCHEMA_NOT_FOUND error related to 'ALL' not being recognized as ...

GJ2
by New Contributor II
  • 10368 Views
  • 10 replies
  • 1 kudos

Install the ODBC Driver 17 for SQL Server

Hi, I am not a Data Engineer; I want to connect to SSAS. It looks like it can be connected to through pyodbc; however, it looks like I need to install "ODBC Driver 17 for SQL Server" using the following command. How do I install the driver on the cluster an...

Latest Reply
Coffee77
Contributor

Why not use the SQL Server Access Connector directly instead of the ODBC driver? With the first option you don't need to install anything. I'm using that in some of my projects and it works pretty well for connecting to Azure SQL databases.
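To answer the original question as well: the usual route is a cluster-scoped init script that installs the driver at cluster startup. A minimal sketch, assuming an Ubuntu 20.04-based runtime (the volume path is hypothetical); write the script from a notebook, then attach it under the cluster's Advanced Options > Init Scripts and restart:

    script = """#!/bin/bash
    set -e
    curl -s https://packages.microsoft.com/keys/microsoft.asc | apt-key add -
    curl -s https://packages.microsoft.com/config/ubuntu/20.04/prod.list > /etc/apt/sources.list.d/mssql-release.list
    apt-get update
    ACCEPT_EULA=Y apt-get install -y msodbcsql17 unixodbc-dev
    """
    dbutils.fs.put("/Volumes/main/default/scripts/install_odbc17.sh", script, True)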

9 More Replies
aav331
by New Contributor
  • 128 Views
  • 3 replies
  • 1 kudos

Resolved! Unable to install libraries from requirements.txt in a Serverless Job and spark_python_task

I am running into the following error while trying to deploy a serverless job running a spark_python_task with Git as the source for the code. The job was deployed as part of a DAB from a GitHub Actions runner. Run failed with error message: Library i...

Latest Reply
Louis_Frolio
Databricks Employee

@aav331 , if you are happy with the result please "Accept as Solution." This will help others who may be in the same boat. Cheers, Louis.

2 More Replies
carlos_tasayco
by Contributor
  • 2401 Views
  • 1 replies
  • 0 kudos

DATABRICKS CLI SYNC SPECIFIC FILES

Hello, I am struggling with this problem: I need to update a Databricks repo and only sync some files, which according to the documentation is possible: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/cli/sync-commands#only-sync-specific-files. In my work...

Latest Reply
mark_ott
Databricks Employee

You are trying to use the --include-from option with your .gitignore file to only sync specific files with the databricks sync command, but you are observing that all files get synced, not just the expected ones. The key issue is how the include/excl...

TomDeas
by New Contributor
  • 1959 Views
  • 1 replies
  • 0 kudos

Resource Throttling; Large Merge Operation - Recent Engine Change?

Morning all, hope you can help as I've been stumped for weeks. Question: have there been recent changes to the Databricks query engine, or Photon (etc.), which may impact large sort operations? I have a Jobs pipeline that runs a series of notebooks which...

[Screenshots: runhistory, query1, query2, query_peak]
Data Engineering
MERGE
Performance Optimisation
Photon
Query Plan
serverless
Latest Reply
mark_ott
Databricks Employee

There have indeed been recent changes to the Databricks query engine and Photon, especially during the June 2025 platform releases, which may influence how large sort operations and resource allocation are handled in SQL pipelines similar to yours. S...

hgm251
by New Contributor
  • 104 Views
  • 2 replies
  • 2 kudos

badrequest: cannot create online table is being deprecated. creating new online table is not allowed

Hello! This seems so sudden; can we really not create online tables anymore? Is there a workaround to be able to create online tables temporarily, as we need more time to move to synced tables? #online_tables

Latest Reply
nayan_wylde
Esteemed Contributor

Yes, the Databricks online tables (legacy) are being deprecated, and after January 15, 2026, you will no longer be able to access or create them: https://docs.databricks.com/aws/en/machine-learning/feature-store/migrate-from-online-tables. Here are a few ...

1 More Replies
