Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Brahmareddy
by Esteemed Contributor
  • 2594 Views
  • 4 replies
  • 9 kudos

Future of Movie Discovery: How I Built an AI Movie Recommendation Agent on Databricks Free Edition

As a data engineer deeply passionate about how data and AI can come together to create real-world impact, I’m excited to share my project for the Databricks Free Edition Hackathon 2025 — Future of Movie Discovery (FMD). Built entirely on Databricks F...

Latest Reply
AlbertaBode
New Contributor III
  • 9 kudos

Really cool project! The mood-based movie matching and conversational memory make the whole discovery experience feel way more intuitive. It’s interesting because most people still browse platforms manually, like on a streaming app, but your system s...

  • 9 kudos
3 More Replies
isai-ds
by New Contributor
  • 1128 Views
  • 1 reply
  • 0 kudos

Salesforce LakeFlow connect - Deletion Salesforce records

Hello, I am new to Databricks and to data engineering. I am running a POC to sync data between a Salesforce sandbox and Databricks using LakeFlow Connect. I have already made the connection and successfully synced data between Salesforce and databr...

Latest Reply
Saritha_S
Databricks Employee
  • 0 kudos

Hi @isai-ds, could you please refer to the documents below? https://www.databricks.com/blog/introducing-salesforce-connectors-lakehouse-federation-and-lakeflow-connect https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/salesforce-faq

  • 0 kudos
Dom1
by New Contributor III
  • 1046 Views
  • 3 replies
  • 2 kudos

Pull JAR from private Maven repository (Azure Artifactory)

Hi, I am currently struggling with the following task: We want to push our code to a private repository (Azure Artifactory) and then pull it from Databricks when the job runs. It currently works only with wheels inside a PyPI repo in the Artifactory. I found ...

Latest Reply
Prajapathy_NKR
Contributor
  • 2 kudos

Hi @Dom1, one solution I have implemented is to use the API to connect to the artifact repository and download the latest artifact to the driver's storage (when you use curl to download the file, it is saved to the driver's disk), and later move it to the req...

  • 2 kudos
2 More Replies
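
The reply to Dom1 describes downloading the latest artifact to the driver's local disk through an API call. A minimal Python sketch of that download step, assuming a direct file URL and an optional bearer token. The function name `download_artifact`, the URL shape, and the token scheme are hypothetical and would need to match whatever the actual repository expects:

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Optional


def download_artifact(url: str, dest_dir: str, token: Optional[str] = None) -> Path:
    """Download one file to local disk and return the path it was saved to."""
    # Name the local file after the last path segment of the URL.
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    dest.parent.mkdir(parents=True, exist_ok=True)

    request = urllib.request.Request(url)
    if token:
        # Assumption: the repository accepts a bearer token header.
        request.add_header("Authorization", f"Bearer {token}")

    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(request, timeout=60) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

On a cluster, `dest_dir` would be a directory on the driver's local disk. The truncated reply goes on to move the file elsewhere; that step is not shown here.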
Johan_Van_Noten
by New Contributor III
  • 1010 Views
  • 3 replies
  • 2 kudos

Long-running Python http POST hangs

As one of the steps in my data engineering pipeline, I need to perform a POST request to an http (not -s) server. This all works fine, except for the situation described below: it then hangs indefinitely. Environment: Azure Databricks Runtime 13.3 LTS, Pyt...

Latest Reply
siva-anantha
Databricks Partner
  • 2 kudos

Hello, IMHO, having an HTTP-related task in a Spark cluster is an anti-pattern. This kind of code executes on the driver; it will be synchronous and adds overhead. This is one of the reasons DLT (or SDP - Spark Declarative Pipeline) does not have REST...

  • 2 kudos
2 More Replies
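
The cause of the hang in Johan_Van_Noten's thread is not visible in the truncated text, so the following is not a fix for it. It is only a minimal sketch of one general precaution for a POST that might block: passing an explicit timeout so the call raises instead of waiting forever. The function name `post_with_timeout` and the content type are hypothetical, and the sketch uses the standard library rather than whatever HTTP client the original post uses:

```python
import urllib.request


def post_with_timeout(url: str, payload: bytes, timeout_s: float = 30.0) -> bytes:
    """POST `payload` to `url` and return the response body.

    `timeout_s` bounds each blocking socket operation (connect, send, receive),
    not the total duration of the request.
    """
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )
    # Without `timeout`, urlopen uses the global socket default, which is
    # normally "no timeout", so a silent server can block the caller forever.
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read()
```

When a socket operation exceeds the limit, the call raises an exception instead of blocking, so the pipeline step can fail or retry.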
adhi_databricks
by Contributor
  • 7511 Views
  • 2 replies
  • 2 kudos

Resolved! How Are You Using Local IDEs (VS Code / Cursor / Whatever) to Develop & Run Code in Databricks?

Hi everyone, I’m trying to set up a smooth local-development workflow for Databricks and would love to hear how others are doing it. My Current Setup: I do most of my development in Cursor (VS Code-based editor) because the AI agents make coding much fas...

Latest Reply
siva-anantha
Databricks Partner
  • 2 kudos

@adhi_databricks: I want to add my perspective when it comes to pure local development (without Databricks Connect). I wanted to set up a local development environment without connecting to a Databricks workspace or cloud storage; develop PySpark code in VS...

  • 2 kudos
1 More Replies
mdungey
by New Contributor II
  • 1088 Views
  • 3 replies
  • 0 kudos

Deleting Lakeflow pipelines impact on objects within.

I've seen it mentioned in some forums that Databricks is working on a fix so that deleting an LDP pipeline doesn't delete the underlying objects (streaming tables, materialized views, etc.). Can anyone from an official source confirm this and maybe give s...

Latest Reply
Raman_Unifeye
Honored Contributor III
  • 0 kudos

Yes, I would take that with a pinch of salt.

  • 0 kudos
2 More Replies
mafzal669
by New Contributor
  • 343 Views
  • 1 reply
  • 0 kudos

Admin user creation

Hi, I have created an Azure account using my personal email ID. I want to use this email ID as a group ID in the Databricks admin console, but when I add a new user it says a user with this email ID already exists. Could someone please help? I use...

Latest Reply
Raman_Unifeye
Honored Contributor III
  • 0 kudos

Because user IDs and group IDs share the same namespace in Databricks, you cannot create a group with an email address that is already registered as a user in your Databricks account. It is better to rename the group.

  • 0 kudos
suchitpathak08
by New Contributor
  • 1031 Views
  • 3 replies
  • 0 kudos

Urgent Assistance Needed – Unity Catalog Storage Access Failure & VM SKU Availability (Databricks on

Hi everyone, I’m running into two blocking issues while trying to run a Delta Live Tables (DLT) pipeline on Databricks (Azure). I’m hoping someone can help me understand what’s going wrong. 1. Unity Catalog cannot access underlying ADLS storage: Every DL...

Latest Reply
bianca_unifeye
Databricks MVP
  • 0 kudos

DLT pipelines always spin up job compute, and Azure is strict about SKU availability per region and per subscription. Most common causes: Quota for that VM family is set to 2 vCPUs. Databricks shows: “Estimated available: 2”, “QuotaExceeded”. The SKU exists...

  • 0 kudos
2 More Replies
Suheb
by Contributor
  • 471 Views
  • 1 reply
  • 2 kudos

How can I efficiently archive old data in Delta tables without slowing queries?

How can I remove or move older rows from my main Delta table so that queries on recent data are faster, while still keeping access to the historical data if needed?

Latest Reply
Coffee77
Honored Contributor II
  • 2 kudos

Hi Suheb, when using Delta tables with Databricks, as long as you use proper liquid clustering or partitions, you should get good performance with big data volumes compared with relational engines. However, you can also separate tab...

  • 2 kudos
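
Suheb's question asks how to move older rows out of the main Delta table while keeping them accessible. The reply is cut off at "separate tab...", so the following is only a sketch of one copy-then-delete pattern that fits the question, not the approach the reply recommends. The function name `archive_statements` and all table and column names are hypothetical:

```python
from datetime import date


def archive_statements(main_table: str, archive_table: str,
                       date_column: str, cutoff: date) -> list[str]:
    """Build the two SQL statements that move rows older than `cutoff`.

    The copy comes first so that no row is deleted before it is archived.
    """
    predicate = f"{date_column} < DATE '{cutoff.isoformat()}'"
    return [
        # 1) Copy old rows into the archive table (assumed to have the same schema).
        f"INSERT INTO {archive_table} SELECT * FROM {main_table} WHERE {predicate}",
        # 2) Remove the same rows from the main table.
        f"DELETE FROM {main_table} WHERE {predicate}",
    ]
```

In a Databricks notebook, each statement could be passed to `spark.sql(...)` in order. The two statements are not one atomic transaction, so a failure between them leaves rows present in both tables until the delete is rerun.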
Suheb
by Contributor
  • 1175 Views
  • 4 replies
  • 4 kudos

What are common pitfalls when migrating large on-premise ETL workflows to Databricks and how did you

When moving your big data pipelines from local servers to Databricks, what problems usually happen, and how did you fix them?

Latest Reply
tarunnagar
Contributor
  • 4 kudos

Migrating large on-premise ETL workflows to Databricks often goes wrong when teams try to “lift and shift” legacy logic directly into Spark. Poor data layout, tiny files, and inefficient partitioning can quickly hurt performance, so restructuring dat...

  • 4 kudos
3 More Replies
pooja_bhumandla
by Databricks Partner
  • 1546 Views
  • 2 replies
  • 1 kudos

Error: Executor Memory Issue with Broadcast Joins in Structured Streaming – Unable to Store 69–80 MB

Hi Community, I encountered the following error: Failed to store executor broadcast spark_join_relation_1622863 (size = Some(67141632)) in BlockManager with storageLevel=StorageLevel(memory, deserialized, 1 replicas) in a Structured S...

Latest Reply
Yogesh_Verma_
Contributor II
  • 1 kudos

What Spark Does During a Broadcast Join: Spark identifies the smaller table (say 80MB). The driver collects this small table to a single JVM. The driver serializes the table into a broadcast variable. The broadcast variable is shipped to all executors. Ex...

  • 1 kudos
1 More Replies
mkkao924
by New Contributor II
  • 1542 Views
  • 3 replies
  • 1 kudos

Best practice to handle SQL table archives?

Many of our data sources are set up so that the main table keeps only a small amount of data, and historical data is moved to another archive table with a very similar schema. My goal is to have one table in Databricks, maybe with a flag to indicate if th...

Latest Reply
Coffee77
Honored Contributor II
  • 1 kudos

I would need to dive deeper into your scenario, but it sounds to me like a strategy could be: 1) Create a view in your SQL Server database with "current data" UNION "historical data". You can set an additional boolean field with True in the first query and False ...

  • 1 kudos
2 More Replies
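
Step 1 of Coffee77's reply, a view that unions current and historical rows with a flag column, can be illustrated with a small runnable sketch. It uses Python's built-in SQLite as a stand-in for SQL Server, and the table names, columns, and `is_current` flag are hypothetical. It also uses `UNION ALL` rather than the plain `UNION` named in the reply, which skips the deduplication pass; since the flag differs between the two halves, rows from the main and archive tables would not be merged either way:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Main table holds recent rows; the archive holds history, same schema.
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO orders VALUES (3, 30.0);
    INSERT INTO orders_archive VALUES (1, 10.0), (2, 20.0);

    -- One view over both tables, with a flag marking where each row came from.
    CREATE VIEW orders_all AS
        SELECT id, amount, 1 AS is_current FROM orders
        UNION ALL
        SELECT id, amount, 0 AS is_current FROM orders_archive;
""")

rows = conn.execute(
    "SELECT id, amount, is_current FROM orders_all ORDER BY id"
).fetchall()
print(rows)  # [(1, 10.0, 0), (2, 20.0, 0), (3, 30.0, 1)]
```

A single ingestion job can then read `orders_all` as if it were one table and carry the flag into Databricks.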
Direo
by Contributor II
  • 873 Views
  • 4 replies
  • 0 kudos

[DAB] registered_model aliases not being applied to Unity Catalog despite successful deploy

Hi, I'm experiencing an issue with Databricks Asset Bundles where model aliases defined in the bundle configuration are not being applied to Unity Catalog, even though the deployment succeeds and the Terraform state shows the aliases are set. Environmen...

Latest Reply
iyashk-DB
Databricks Employee
  • 0 kudos

Can you try by explicitly adding: databricks model-versions get-by-alias <catalog>.<schema>.<model> staging

  • 0 kudos
3 More Replies
JackR
by New Contributor II
  • 1468 Views
  • 1 reply
  • 2 kudos

Resolved! Inconsistent behaviour when using read_files to read UTF-8 BOM encoded csv

I have a simple piece of code to read a CSV file from an AWS S3 bucket: SELECT * FROM read_files(myfile, format => 'csv', header => true, inferSchema => true, mode => 'FAILFAST'). It's a large file with over 100 column...

Latest Reply
bianca_unifeye
Databricks MVP
  • 2 kudos

Short version: this is (unfortunately) a Databricks quirk, not you going mad. The SQL read_files path and the PySpark spark.read.csv path do not use the exact same schema inference code, and CSVs with a UTF-8 BOM hit a corner case where read_files fa...

  • 2 kudos
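
The thread above concerns CSV files that start with a UTF-8 byte order mark (BOM). The following does not reproduce the `read_files` behaviour described in the reply. It is only a small pure-Python illustration, with made-up sample data and a hypothetical helper `header_of`, of why a BOM can trip a parser: unless the decoder strips it, the BOM becomes part of the first column name.

```python
import csv
import io

# Sample bytes: a UTF-8 BOM (EF BB BF) followed by a two-column CSV.
raw = b"\xef\xbb\xbfid,name\n1,Ada\n"


def header_of(data: bytes, encoding: str) -> list[str]:
    """Decode `data` with `encoding` and return the first CSV row."""
    text = data.decode(encoding)
    return next(csv.reader(io.StringIO(text)))


plain = header_of(raw, "utf-8")         # BOM kept: first name is '\ufeffid'
stripped = header_of(raw, "utf-8-sig")  # BOM removed: first name is 'id'

print(plain)     # ['\ufeffid', 'name']
print(stripped)  # ['id', 'name']
```

A quick check of the first three bytes of a file (`EF BB BF`) shows whether a BOM is present before the file is handed to any reader.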
ChrisHunt
by New Contributor III
  • 1589 Views
  • 9 replies
  • 1 kudos

Resolved! Databricks external table lagging behind source files

I have a databricks external table which is pointed at an S3 bucket which contains an ever-growing number of parquet files (currently around 2000 of them). Each row in the file is timestamped to indicate when it was written. A new parquet file is add...

Latest Reply
ChrisHunt
New Contributor III
  • 1 kudos

Thanks for your answers. I got a solution in the end, but it was more weirdness. A colleague fired the same query at the same database on his machine, and got the latest data! So I rebooted my PC and opened a new Databricks session, and I got the late...

  • 1 kudos
8 More Replies