Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

suchitpathak08
by New Contributor
  • 391 Views
  • 3 replies
  • 0 kudos

Urgent Assistance Needed – Unity Catalog Storage Access Failure & VM SKU Availability (Databricks on

Hi everyone, I’m running into two blocking issues while trying to run a Delta Live Tables (DLT) pipeline on Databricks (Azure). I’m hoping someone can help me understand what’s going wrong. 1. Unity Catalog cannot access underlying ADLS storage. Every DL...

Latest Reply
bianca_unifeye
Contributor
  • 0 kudos

DLT pipelines always spin up job compute, and Azure is strict about SKU availability per region and per subscription. Most common causes: the quota for that VM family is set to 2 vCPUs (Databricks shows “Estimated available: 2” or “QuotaExceeded”). The SKU exists...

2 More Replies
Suheb
by Contributor
  • 192 Views
  • 1 reply
  • 2 kudos

How can I efficiently archive old data in Delta tables without slowing queries?

How can I remove or move older rows from my main Delta table so that queries on recent data are faster, while still keeping access to the historical data if needed?

Latest Reply
Coffee77
Contributor III
  • 2 kudos

Hi Suheb, when using Delta tables with Databricks, as long as you use proper liquid clustering keys or partitions, you should get good performance compared to relational engines when dealing with big data volumes. However, you can also separate tab...

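The archive-then-delete pattern discussed in this thread can be sketched roughly as below. This is a minimal sketch, not the replier's actual code: the table names (`sales`, `sales_archive`), the `event_date` column, and the 365-day retention window are all hypothetical, and `sales_archive` is assumed to already exist with the same schema.

```sql
-- Copy rows older than the retention window into the archive table.
INSERT INTO sales_archive
SELECT * FROM sales
WHERE event_date < current_date() - INTERVAL 365 DAYS;

-- Remove the archived rows from the main table so queries on
-- recent data scan fewer files.
DELETE FROM sales
WHERE event_date < current_date() - INTERVAL 365 DAYS;

-- Reclaim the deleted files once the Delta retention period has passed.
VACUUM sales;
```

Clustering or partitioning both tables on `event_date`, and running OPTIMIZE on the main table afterwards, keeps scans over recent data tight while historical rows stay queryable in the archive table.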
Suheb
by Contributor
  • 418 Views
  • 4 replies
  • 4 kudos

What are common pitfalls when migrating large on-premise ETL workflows to Databricks and how did you

When moving your big data pipelines from local servers to Databricks, what problems usually happen, and how did you fix them?

Latest Reply
tarunnagar
Contributor
  • 4 kudos

Migrating large on-premise ETL workflows to Databricks often goes wrong when teams try to “lift and shift” legacy logic directly into Spark. Poor data layout, tiny files, and inefficient partitioning can quickly hurt performance, so restructuring dat...

3 More Replies
pooja_bhumandla
by New Contributor III
  • 383 Views
  • 2 replies
  • 1 kudos

Error: Executor Memory Issue with Broadcast Joins in Structured Streaming – Unable to Store 69–80 MB

Hi Community, I encountered the following error: Failed to store executor broadcast spark_join_relation_1622863 (size = Some(67141632)) in BlockManager with storageLevel=StorageLevel(memory, deserialized, 1 replicas) in a Structured S...

Latest Reply
Yogesh_Verma_
Contributor II
  • 1 kudos

What Spark does during a broadcast join: Spark identifies the smaller table (say 80 MB). The driver collects this small table to a single JVM. The driver serializes the table into a broadcast variable. The broadcast variable is shipped to all executors. Ex...

1 More Replies
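One common mitigation for broadcast relations that are too large for executor memory (not necessarily the fix for this exact pipeline) is to stop Spark from auto-broadcasting the table. A minimal sketch; the table names `facts` and `dim` and the join columns are placeholders:

```sql
-- Lower (or disable) the automatic broadcast threshold so an ~80 MB
-- table is no longer broadcast; -1 disables auto-broadcast entirely.
SET spark.sql.autoBroadcastJoinThreshold = -1;

-- Or force a sort-merge join for a single query with a hint instead
-- of changing a session-wide setting.
SELECT /*+ MERGE(dim) */ f.*, dim.label
FROM facts f
JOIN dim ON f.dim_id = dim.id;
```

In a Structured Streaming job the configuration has to be set before the streaming query starts, e.g. via `spark.conf.set(...)` in the notebook or job setup.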
mkkao924
by New Contributor II
  • 401 Views
  • 3 replies
  • 1 kudos

Best practice to handle SQL table archives?

Many of our sources are set up in a way that the main table only keeps a small amount of data, and historical data is moved to another archive table with a very similar schema. My goal is to have one table in Databricks, maybe with a flag to indicate if th...

Latest Reply
Coffee77
Contributor III
  • 1 kudos

I would need to dive deeper into your scenario, but it sounds to me like a strategy could be: 1) Create a view in your SQL Server database with "current data" UNION "historical data". You can set an additional boolean field to True in the first query and False ...

2 More Replies
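Step 1 of the strategy in the reply above (a source-side view that unions current and archived rows with a flag) could look roughly like this. A sketch only: the schema, table names, and column are hypothetical, and it assumes both tables share the same column layout.

```sql
-- Hypothetical SQL Server-side view combining current and archived rows,
-- with a flag telling downstream consumers which source a row came from.
CREATE VIEW dbo.orders_unified AS
SELECT o.*, CAST(1 AS BIT) AS is_current
FROM dbo.orders o
UNION ALL
SELECT a.*, CAST(0 AS BIT) AS is_current
FROM dbo.orders_archive a;
```

Ingesting that single view into Databricks then yields one table, with `is_current` marking whether a row came from the main or the archive table.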
Direo
by Contributor II
  • 357 Views
  • 4 replies
  • 0 kudos

[DAB] registered_model aliases not being applied to Unity Catalog despite successful deploy

HiI'm experiencing an issue with Databricks Asset Bundles where model aliases defined in the bundle configuration are not being applied to Unity Catalog, even though the deployment succeeds and the Terraform state shows the aliases are set.Environmen...

Latest Reply
iyashk-DB
Databricks Employee
  • 0 kudos

Can you try explicitly running: databricks model-versions get-by-alias <catalog>.<schema>.<model> staging

3 More Replies
JackR
by New Contributor II
  • 380 Views
  • 1 reply
  • 2 kudos

Resolved! Inconsistent behaviour when using read_files to read UTF-8 BOM encoded csv

I have a simple piece of code to read a csv file from an AWS S3 bucket:

SELECT *
FROM read_files(
    myfile,
    format => 'csv',
    header => true,
    inferSchema => true,
    mode => 'FAILFAST')

It's a large file with over 100 column...

Latest Reply
bianca_unifeye
Contributor
  • 2 kudos

Short version: this is (unfortunately) a Databricks quirk, not you going mad. The SQL read_files path and the PySpark spark.read.csv path do not use the exact same schema inference code, and CSVs with a UTF-8 BOM hit a corner case where read_files fa...

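Since the reply above attributes the inconsistency to schema inference hitting a BOM corner case, one workaround is to avoid relying on inference for the affected columns. A sketch of the poster's query with pinned types; the `schemaHints` contents (`id`, `amount`) are placeholder column names, not from the thread:

```sql
-- Pin the types of problem columns up front so inference's BOM
-- corner case doesn't decide them; 'myfile' is a placeholder path.
SELECT *
FROM read_files(
    'myfile',
    format => 'csv',
    header => true,
    schemaHints => 'id BIGINT, amount DOUBLE',
    mode => 'FAILFAST');
```

Supplying the full schema (or hints for every ambiguous column) makes the SQL `read_files` path and the PySpark `spark.read.csv` path agree, since neither has to infer the disputed types.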
RIDBX
by Contributor
  • 810 Views
  • 6 replies
  • 1 kudos

Pushing data from databricks (cloud) to Oracle (on-prem) instance?

Thanks for reviewing my threads. I found some threads on this subject dated 2022 by @Ajay-Pandey (Databricks to Oracle). We find many...

Latest Reply
iyashk-DB
Databricks Employee
  • 1 kudos

Option 1: Spark JDBC write from Databricks to Oracle (recommended for "push"/ingestion). Use the built-in Spark JDBC writer with Oracle's JDBC driver. It's the most direct path for writing into on-prem Oracle and gives you control over batching, paral...

5 More Replies
ChrisHunt
by New Contributor III
  • 792 Views
  • 9 replies
  • 1 kudos

Resolved! Databricks external table lagging behind source files

I have a databricks external table which is pointed at an S3 bucket which contains an ever-growing number of parquet files (currently around 2000 of them). Each row in the file is timestamped to indicate when it was written. A new parquet file is add...

Latest Reply
ChrisHunt
New Contributor III
  • 1 kudos

Thanks for your answers. I got a solution in the end, but it was more weirdness. A colleague fired the same query at the same database on his machine, and got the latest data! So I rebooted my PC and opened a new Databricks session, and I got the late...

8 More Replies
adriennn
by Valued Contributor
  • 4809 Views
  • 3 replies
  • 5 kudos

Resolved! SQL Warehouse - Table does not support overwrite by expression:

I'm copying data from a foreign catalog using a replace where logic in the target table, this work fine for two other tables. But for a specific one, I keep getting this error:Table does not support overwrite by expression: DeltaTableV2(org.apache.sp...

Latest Reply
aakashnand-kt
New Contributor III
  • 5 kudos

Thank you @adriennn, I encountered the same issue and your post helped me resolve it. I agree that the error message given by Databricks is not very helpful; I wasted a lot of time investigating table properties before I found your post.

2 More Replies
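For reference, the "replace where" copy the thread describes typically looks like the sketch below; the error tends to surface when the target table does not meet the requirements for overwrite-by-expression. All catalog, schema, table, and column names here are hypothetical, not taken from the thread.

```sql
-- Selective overwrite of a date range in the target, sourced from a
-- foreign catalog; only rows matching the predicate are replaced.
INSERT INTO target_catalog.sales.orders
REPLACE WHERE order_date >= '2024-01-01'
SELECT *
FROM foreign_catalog.src.orders
WHERE order_date >= '2024-01-01';
```

When this error appears for just one table among several, comparing that table's type (managed vs. external, Delta vs. non-Delta) and properties against the working ones is usually the fastest way to spot the difference.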
dbernstein_tp
by New Contributor III
  • 393 Views
  • 4 replies
  • 2 kudos

Resolved! Naming question about SQL server database schemas

I have an MS SQL server database that has several schemas we need to ingest data from. Call them "SCHEMA1" tables and "SCHEMA2" tables. Let's call the server S and the database D. In unity catalog I have a catalog called "staging" where the staging (...

Latest Reply
dbernstein_tp
New Contributor III
  • 2 kudos

Thanks for the responses! @K_Anudeep suggestion makes sense in the context of our current lakehouse architecture so I think I will migrate to that.

3 More Replies
dkhodyriev1208
by New Contributor II
  • 441 Views
  • 4 replies
  • 2 kudos

Spark SQL INITCAP not capitalizing letters after periods in abbreviations

Using SELECT INITCAP("text (e.g., text, text, etc.)"), abbreviations with periods like "e.g." are not being fully capitalized. Current behavior: Input: "text (e.g., text, text, etc.)" Output: "Text (e.g., Text, Text, Etc.)" Expected behavior: Output: "Text ...

Latest Reply
iyashk-DB
Databricks Employee
  • 2 kudos

Yes, similar to what @Coffee77 said, you can alternatively create a SQL function with the custom logic using regexp and use it directly:

CREATE OR REPLACE FUNCTION PROPER_WITH_ABBREVIATIONS(input STRING)
RETURNS STRING
RETURN regexp_replace(INI...

3 More Replies
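The function in the reply above is cut off, so here is a self-contained sketch of the same idea: run INITCAP first, then fix up known abbreviations with a regexp. The body and the abbreviation handled are my guess at the approach, not the author's actual code.

```sql
-- After INITCAP, rewrite listed abbreviations to their fully
-- capitalized form; extend the pattern for i.e., etc. as needed.
CREATE OR REPLACE FUNCTION PROPER_WITH_ABBREVIATIONS(input STRING)
RETURNS STRING
RETURN regexp_replace(
  INITCAP(input),
  '(?i)\\be\\.g\\.',
  'E.G.');
```

Because the fix-up runs on INITCAP's output, the rest of the string keeps normal title-casing and only the abbreviations in the pattern are rewritten.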
Swathik
by New Contributor III
  • 396 Views
  • 1 reply
  • 0 kudos

Resolved! Best Practices for implementing DLT, Autoloader in Workflows

I am in the process of designing a Medallion architecture where the data sources include REST API calls, JSON files, SQL Server, and Azure Event Hubs. For the Silver and Gold layers, I plan to leverage Delta Live Tables (DLT). However, I am seeking gu...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

The optimal approach for implementing the Bronze layer in a Medallion architecture with Delta Live Tables (DLT) involves balancing batch and streaming ingestion patterns, especially when combining DLT and Autoloader. The trigger(availableNow=True) op...

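A Bronze ingestion table combining DLT with Autoloader, as discussed above, can be sketched in DLT SQL like this. The table name, storage path, and extra `source_file` column are placeholders, and the syntax assumes a recent DLT release that supports `CREATE OR REFRESH STREAMING TABLE`.

```sql
-- Hypothetical DLT SQL bronze table ingesting JSON files with Auto Loader;
-- the landing path is a placeholder.
CREATE OR REFRESH STREAMING TABLE bronze_events
AS SELECT
  *,
  _metadata.file_path AS source_file
FROM STREAM read_files(
  'abfss://landing@mystorage.dfs.core.windows.net/events/',
  format => 'json');
```

The batch-style `trigger(availableNow=True)` pattern mentioned in the reply applies when you run the same ingestion as a scheduled pipeline update rather than a continuous stream.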
SupunK
by New Contributor II
  • 366 Views
  • 1 reply
  • 2 kudos

Databricks always loads built-in BigQuery connector (0.22.2), can’t override with 0.43.x

I am using Databricks Runtime 15.4 (Spark 3.5 / Scala 2.12) on AWS. My goal is to use the latest Google BigQuery connector because I need the direct write method (BigQuery Storage Write API): option("writeMethod", "direct"). This allows writing directly ...

Latest Reply
mark_ott
Databricks Employee
  • 2 kudos

There is no supported way on Databricks Runtime 15.4 to override or replace the built-in BigQuery connector to use your own version (such as 0.43.x) in order to access the direct write method. Databricks clusters come preloaded with their own managed...

Mathias_Peters
by Contributor II
  • 180 Views
  • 1 reply
  • 0 kudos

Question on how to properly write a dataset of custom objects to MongoDB

Hi, I am implementing a Spark Job in Kotlin (unfortunately a must-have) which reads from and writes to MongoDB. The reason for this is to reuse existing code in a MapFunction. The result of applying that map is a DataSet of type Consumer, a custom ob...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are correct: when you pass a BsonDocument to Spark's MongoDB connector using .write().format("mongodb"), Spark treats unknown types as generic serialized blobs, leading to documents stored as a single binary field (as you observed) rather than as ...
