Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

lecarusin
by Visitor
  • 47 Views
  • 4 replies
  • 1 kudos

Help regarding a Python notebook and S3 file structure

Hello all, I am new to this forum, so please forgive me if I am posting in the wrong location (I'd appreciate it if mods move the post or tell me where to post). I am looking for help optimizing some Python code I have. This python notebook...

Latest Reply
arunpalanoor
New Contributor II
  • 1 kudos

I am not sure if I fully understand how your data pipeline is set up, but have you considered incremental data loading, say using something similar to the "COPY INTO" method, which would only read your incremental load, and then apply a 90-day filter on top...
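The suggestion above can be sketched in Databricks SQL. This is a minimal illustration, assuming a Parquet landing zone; the table name and S3 path are hypothetical:

```sql
-- Incrementally load only new files from S3 into a Delta table;
-- files already ingested are tracked by COPY INTO and skipped.
COPY INTO main.analytics.events          -- hypothetical target table
FROM 's3://my-bucket/events/'            -- hypothetical source path
FILEFORMAT = PARQUET;

-- Then apply the 90-day window downstream instead of rescanning everything:
SELECT *
FROM main.analytics.events
WHERE event_date >= current_date() - INTERVAL 90 DAYS;
```

This way each run reads only the newly arrived files, and the 90-day filter operates on an already-loaded Delta table rather than on raw S3 listings.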

3 More Replies
shubham007
by New Contributor III
  • 35 Views
  • 1 reply
  • 0 kudos

Urgent: How to do a data migration task using the Databricks Lakebridge tool?

Dear community experts, I have completed two phases of Databricks Lakebridge (Analyzer & Converter) but am stuck at migrating data from source to target using Lakebridge. I have watched the BrickBites series on Lakebridge but did not find out how to migrate data...

Latest Reply
bianca_unifeye
New Contributor II
  • 0 kudos

Lakebridge doesn’t copy data. It covers Assessment → Conversion (Analyzer/Converter) → Reconciliation. The fastest way is to use Lakehouse Federation: create a Snowflake connection in Unity Catalog and run federated queries from Databricks. For perman...
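The federation setup described above can be sketched in SQL. Connection, catalog, secret-scope, and table names below are hypothetical placeholders:

```sql
-- One-time setup: register Snowflake in Unity Catalog.
CREATE CONNECTION snowflake_conn TYPE snowflake
OPTIONS (
  host 'myorg-myaccount.snowflakecomputing.com',   -- hypothetical account host
  port '443',
  sfWarehouse 'COMPUTE_WH',
  user 'svc_databricks',
  password secret('snowflake_scope', 'svc_password')
);

-- Expose a Snowflake database as a foreign catalog.
CREATE FOREIGN CATALOG snowflake_cat
USING CONNECTION snowflake_conn
OPTIONS (database 'SALES_DB');

-- Query federated data, or materialize it permanently as a Delta table.
CREATE TABLE main.bronze.orders AS
SELECT * FROM snowflake_cat.public.orders;
```

The final CTAS is one simple way to make the migrated copy permanent on the Databricks side.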

DatabricksEngi1
by New Contributor III
  • 25 Views
  • 1 reply
  • 0 kudos

MERGE operation not performing data skipping with liquid clustering on key columns

Hi, I need some help understanding a performance issue. I have a table that reads approximately 800K records every 30 minutes in an incremental manner. Let’s say its primary key is: timestamp, x, y. This table is overwritten every 30 minutes and serves ...

Latest Reply
bianca_unifeye
New Contributor II
  • 0 kudos

MERGE is not a pure read-plus-filter operation. Even though Liquid Clustering organizes your data by key ranges and writes min/max stats, the MERGE engine has to identify both matches and non-matches. That means the query planner must: scan all candidate...
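A common mitigation, consistent with the explanation above, is to add a redundant range predicate on the clustering key to the ON clause so the planner can use file-level min/max stats to prune the target scan. A sketch (table, column names, and the one-hour window are hypothetical):

```sql
MERGE INTO target t
USING updates s
  ON  t.`timestamp` = s.`timestamp`
  AND t.x = s.x
  AND t.y = s.y
  -- Redundant predicate on the clustering key: lets file skipping
  -- discard target files entirely outside the incremental window.
  AND t.`timestamp` >= current_timestamp() - INTERVAL 1 HOUR
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Without such a predicate the planner has no bound on which target files might contain matches, so it must consider all of them.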

Akshay_Petkar
by Valued Contributor
  • 43 Views
  • 1 reply
  • 0 kudos

Advanced Data Engineering Event and Free Certification Voucher

Hi everyone, In the past couple of years, Databricks has organized an Advanced Data Engineering event where attendees received a 100% free certification voucher under their organization account after attending the session. I wanted to check if this eve...

Latest Reply
bianca_unifeye
New Contributor II
  • 0 kudos

I’m only aware of the Databricks Learning Festival, which typically offers a 50% discount voucher for certification, rather than a full voucher. I couldn’t find any official confirmation of a 100% free voucher for an “Advanced Data Engineering” event ...

cdn_yyz_yul
by New Contributor II
  • 17 Views
  • 1 reply
  • 0 kudos

Delta as streaming source: can the reader read only newly appended rows?

Hello everyone, In our implementation of the Medallion Architecture, we want to stream changes with Spark Structured Streaming. I would like some advice on how to use a Delta table as a source correctly, and whether there is a performance (memory usage) concern in t...

Latest Reply
bianca_unifeye
New Contributor II
  • 0 kudos

First of all, you are using append-only reads, which means that every time your stream triggers, Spark will process the entire Delta snapshot rather than just the changes. That’s why you’re observing the memory usage increase after each run; it’s not ...
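For reference, a minimal checkpointed Structured Streaming read of a Delta source looks like this; after the initial snapshot, each trigger processes only newly committed files recorded in the Delta log. Paths and table names are hypothetical, and `skipChangeCommits` assumes update/delete commits on the source can safely be ignored:

```python
# Stream from a Delta table: the checkpoint tracks the last processed
# table version, so subsequent triggers read only newly appended files.
df = (spark.readStream
      .format("delta")
      .option("skipChangeCommits", "true")   # skip non-append commits
      .load("/mnt/bronze/events"))           # hypothetical source path

(df.writeStream
   .option("checkpointLocation", "/mnt/_chk/silver_events")  # hypothetical
   .trigger(availableNow=True)
   .toTable("silver.events"))
```

If the source is rewritten (overwritten) every 30 minutes rather than appended to, this pattern does not apply as-is, since an overwrite is not an incremental commit.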

vikram_p
by Visitor
  • 16 Views
  • 1 reply
  • 0 kudos

Generate embeddings for 50 million rows in a DataFrame

Hello All, I have a DataFrame with 5 million rows, and before we can set up a vector search endpoint against an index, we want to generate an embeddings column for each of those rows. Please suggest what's an optimal way to do this. We are in the development phase, so w...

Latest Reply
bianca_unifeye
New Contributor II
  • 0 kudos

The easiest and most reliable way to generate embeddings for millions of rows is to let Databricks Vector Search compute them automatically during synchronization from a Delta table. Vector Search can generate embeddings for you, keep them updated whe...
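A sketch of that managed-embeddings path with the `databricks-vectorsearch` client follows. Endpoint, table, column, and model names are hypothetical, and exact arguments may differ by client version, so treat this as an outline to check against the docs rather than a definitive call:

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# Delta Sync index: Databricks computes and refreshes embeddings from the
# source Delta table, so the 50M embedding vectors are never built by hand.
index = client.create_delta_sync_index(
    endpoint_name="vs_endpoint",                      # hypothetical endpoint
    index_name="main.ml.docs_index",                  # hypothetical index
    source_table_name="main.ml.docs",                 # hypothetical table
    pipeline_type="TRIGGERED",                        # sync on demand
    primary_key="id",
    embedding_source_column="text",                   # raw text column
    embedding_model_endpoint_name="databricks-gte-large-en",
)
```

With `pipeline_type="TRIGGERED"` the sync (and thus embedding computation) runs when you invoke it, which suits a development phase where cost control matters.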

Divya_Bhadauria
by New Contributor III
  • 22 Views
  • 1 reply
  • 0 kudos

Does Databricks Runtime 7.3+ include built-in Hadoop S3 connector configurations?

I came across the KB article S3 connection reset error, which mentions not using the following Spark settings for the Hadoop S3 connector for DBR 7.3 and above: spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem spark.hadoop.fs.s3n.impl com.data...

Latest Reply
hasnat_unifeye
  • 0 kudos

No, you don’t need to set those on DBR 7.3 and above. From 7.3+, Databricks already uses the newer Hadoop S3A connector by default, so those com.databricks.s3a.S3AFileSystem settings are not part of the default config and shouldn’t be added. If they are...
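As a quick audit, the legacy overrides can be spotted in a cluster's Spark conf dict with plain Python; the helper name is hypothetical:

```python
# Legacy Databricks S3 connector overrides that should be removed on DBR 7.3+.
LEGACY_S3_KEYS = {
    "spark.hadoop.fs.s3.impl",
    "spark.hadoop.fs.s3n.impl",
    "spark.hadoop.fs.s3a.impl",
}

def find_legacy_s3_overrides(spark_conf: dict) -> list:
    """Return any legacy fs.s3* impl keys present in a cluster conf."""
    return sorted(k for k in spark_conf if k in LEGACY_S3_KEYS)

conf = {
    "spark.hadoop.fs.s3.impl": "com.databricks.s3a.S3AFileSystem",
    "spark.sql.shuffle.partitions": "200",
}
print(find_legacy_s3_overrides(conf))  # ['spark.hadoop.fs.s3.impl']
```

Any keys it reports can simply be deleted from the cluster's Spark config.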

ShivangiB1
by New Contributor III
  • 26 Views
  • 1 reply
  • 0 kudos

Databricks Lakeflow SQL Server ingestion pipeline error

Hey Team, I am getting the below error while creating a pipeline: com.databricks.pipelines.execution.extensions.managedingestion.errors.ManagedIngestionNonRetryableException: [INGESTION_GATEWAY_DDL_OBJECTS_MISSING] DDL objects missing on table 'coedb.dbo.so...

Latest Reply
ShivangiB1
New Contributor III
  • 0 kudos

Hey Team, can anyone help with this?

Surya-Prathap
by New Contributor
  • 80 Views
  • 2 replies
  • 1 kudos

Output Not Displaying in Databricks Notebook on All-Purpose Compute Cluster

Hello All, I’m encountering an issue where output from standard Python commands such as print() or display(df) is not showing up correctly when running notebooks on an All-Purpose Compute cluster. Cluster details: Cluster Type: All-Purpose Compute; Runtime...

Latest Reply
Sahil_Kumar
Databricks Employee
  • 1 kudos

Hi Surya, do you face this issue only with DBR 17.3 all-purpose clusters? Did you try with lower DBRs? If not, please try and let me know. Also, from the Run menu, try “Clear state and outputs,” then re-run the cell on the same cluster to rule out st...

1 More Replies
spd_dat
by New Contributor III
  • 3543 Views
  • 2 replies
  • 0 kudos

Can you default to `execution-count: none` when stripping notebook outputs?

When committing to a git folder, IPYNB outputs are usually stripped, unless allowed by an admin setting and toggled by .databricks/commit_outputs. This sets {"execution-count": 0, ... } within the IPYNB metadata. Is there a way to set it instead to...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

Databricks does not currently allow you to default to "execution_count": null (or "none") when stripping notebook outputs during a commit. The platform sets "execution_count": 0 as the default when outputs are stripped through their Git integration, ...
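Since the platform pins the value to 0, one local workaround is a small post-processing script (hypothetical, not a platform setting) that rewrites the stripped notebook so every code cell's `execution_count` serializes as `null`:

```python
import json

def null_execution_counts(nb: dict) -> dict:
    """Set execution_count to None on every code cell and its outputs."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["execution_count"] = None
            for out in cell.get("outputs", []):
                if "execution_count" in out:
                    out["execution_count"] = None
    return nb

# json.dumps renders Python None as JSON null, which is what nbformat expects.
nb = {"cells": [{"cell_type": "code", "execution_count": 0, "outputs": []}]}
print(json.dumps(null_execution_counts(nb)["cells"][0]["execution_count"]))  # null
```

Run against each .ipynb after commit-time stripping (e.g. in a pre-commit hook on the local clone), it normalizes the files without touching Databricks settings.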

1 More Replies
pooja_bhumandla
by New Contributor III
  • 229 Views
  • 2 replies
  • 1 kudos

When to Use and When Not to Use Liquid Clustering?

Hi everyone, I’m looking for some practical guidance and experiences around when to choose Liquid Clustering versus sticking with traditional partitioning + Z-ordering. From what I’ve gathered so far: for small tables (<10TB), Liquid Clustering gives s...

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

Deciding between Liquid Clustering and traditional partitioning with Z-ordering depends on table size, query patterns, number of clustering columns, and file optimization needs. For tables under 10TB with queries consistently filtered on 1–2 columns,...
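For reference, opting into Liquid Clustering is a per-table DDL choice, and the clustering keys can evolve later without manually rewriting the layout. A sketch with hypothetical table and column names:

```sql
-- New table clustered on the two most common filter columns.
CREATE TABLE main.sales.orders (
  order_id    BIGINT,
  customer_id BIGINT,
  order_date  DATE,
  amount      DECIMAL(18, 2)
) CLUSTER BY (order_date, customer_id);

-- Keys can change as query patterns change; OPTIMIZE then
-- applies the new layout incrementally to existing data.
ALTER TABLE main.sales.orders CLUSTER BY (customer_id);
OPTIMIZE main.sales.orders;
```

This key-evolution step is the main operational difference from static partitioning, where changing the partition column means rewriting the table.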

1 More Replies
ayush667787878
by New Contributor
  • 3301 Views
  • 2 replies
  • 1 kudos

Not able to install a library in the standard workspace, while in Community Edition it works; please help

I am not able to install a library in the normal version, while in Community Edition I am able to add a library using Compute. How do I install libraries in normal Databricks the same way as in Community Edition?

Latest Reply
mark_ott
Databricks Employee
  • 1 kudos

To install libraries in the normal (paid) version of Databricks, use the cluster management interface to add libraries to your compute resources. The process is similar to the Community Edition, but workspace policies and cluster access mode may rest...

1 More Replies
ask005
by New Contributor
  • 2050 Views
  • 1 reply
  • 0 kudos

How to write an ObjectId value using Spark connector 10.2.2

In the PySpark Mongo connector, while updating records, how do I handle _id as ObjectId? Spark 3.2.4, Scala 2.13, Spark Mongo Connector 2.12-10.2.2.

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

To write an ObjectId value using Spark Mongo Connector 10.2.2 in PySpark while updating records, you must convert the ObjectId string into a special format. The Spark Mongo Connector does not automatically recognize a string as an ObjectId; it will o...
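If memory serves, the 10.x connector can parse MongoDB extended-JSON strings into BSON types when the writer's `convertJson` option is enabled; verify this against the connector docs for your version. A pure-Python sketch of the extended-JSON shape for `_id` (the helper name is hypothetical):

```python
import re

def to_extended_json_oid(hex_str: str) -> str:
    """Render a 24-char hex string as a MongoDB extended-JSON ObjectId."""
    if not re.fullmatch(r"[0-9a-fA-F]{24}", hex_str):
        raise ValueError(f"not a valid ObjectId hex string: {hex_str!r}")
    return '{"$oid": "%s"}' % hex_str

print(to_extended_json_oid("64d2f8a1b3c4d5e6f7081920"))
# {"$oid": "64d2f8a1b3c4d5e6f7081920"}
```

In PySpark the `_id` column would then carry this string, with `convertJson` enabled on the writer (again, an assumption to confirm against the 10.2.2 configuration reference) so the connector stores it as a real ObjectId rather than a plain string.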

dndeng
by New Contributor II
  • 254 Views
  • 4 replies
  • 0 kudos

Query to calculate cost of task from each job by day

I am trying to find the cost per task in each job every time it was executed (daily), but I'm currently getting very huge numbers due to duplicates. Can someone help me?   WITH workspace AS ( SELECT account_id, workspace_id, workspace_name,...

Latest Reply
mark_ott
Databricks Employee
  • 0 kudos

You are seeing inflated cost numbers because your query groups by many columns—especially run_id, task_key, usage_start_time, and usage_end_time—without addressing possible duplicate row entries that arise from your joins, especially with the system....
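One common fix, consistent with that diagnosis, is to deduplicate the usage rows before any joins so each (run, task, usage window) is counted once. A sketch; the `usage_metadata` field names follow the query in the post and the system-table schema, so treat them as assumptions to verify:

```sql
-- Collapse duplicate usage rows first, then aggregate; join dimension
-- tables (workspace, job names, prices) only after this step.
WITH usage_dedup AS (
  SELECT *
  FROM system.billing.usage
  QUALIFY ROW_NUMBER() OVER (
    PARTITION BY usage_metadata.job_run_id,
                 usage_metadata.task_key,
                 usage_start_time,
                 usage_end_time,
                 sku_name
    ORDER BY usage_start_time
  ) = 1
)
SELECT
  usage_metadata.job_id,
  usage_metadata.task_key,
  DATE(usage_start_time) AS usage_date,
  SUM(usage_quantity)    AS dbus
FROM usage_dedup
GROUP BY ALL;
```

Joining dimensions after the aggregation (rather than before) is what prevents the fan-out that inflates the totals.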

3 More Replies
