Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Sam500
by New Contributor II
  • 67 Views
  • 3 replies
  • 0 kudos

Databricks Serverless Costs

Our Power BI reports consume real-time data, and the only option for that appears to be Databricks serverless, but serverless is an expensive option. How can we control serverless costs, and are there any alternatives? Thank you.

Latest Reply
kim533
Visitor
  • 0 kudos

If your Power BI reports require near real-time data, Serverless SQL can be convenient but expensive at scale. To reduce costs, optimize queries, use aggregation tables, limit data scanned, enable caching, and avoid overly frequent Power BI refreshes...

  • 0 kudos
2 More Replies
Nidhig631
by Databricks MVP
  • 100 Views
  • 5 replies
  • 0 kudos

DISTINCT is the major bottleneck because of the heavy shuffle.

Need some advice from the community. I am processing around 100 million records using: df.select(required_cols).distinct().write.saveAsTable(...) The source has 1000+ columns, but I'm selecting only 20 columns before applying DISTINCT. I have already ena...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 0 kudos

Hi @Nidhig631, DISTINCT is still an exact global deduplication step, and that means Spark has to shuffle rows so identical values can meet in the same place. So, what you are seeing is normal. Selecting 20 columns instead of 1000 definitely reduces t...
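
To illustrate why an exact DISTINCT needs a shuffle, here is a minimal pure-Python sketch, not Spark's actual implementation. It assumes rows are hashable tuples and models partitions as plain lists; the function names and the sample data are hypothetical. Identical rows can only be collapsed once they have been routed to the same place, which is what the hash-based step stands in for. Real engines add optimizations that this sketch omits.

```python
from collections import defaultdict


def shuffle_by_hash(partitions, num_output_partitions):
    """Route every row to an output partition chosen by the hash of the whole row."""
    shuffled = defaultdict(list)
    for partition in partitions:
        for row in partition:
            # Identical rows always hash to the same output partition.
            shuffled[hash(row) % num_output_partitions].append(row)
    return [shuffled[i] for i in range(num_output_partitions)]


def distinct(partitions, num_output_partitions=4):
    """Exact global dedup: shuffle first, then deduplicate each partition locally."""
    result = []
    for partition in shuffle_by_hash(partitions, num_output_partitions):
        result.extend(set(partition))  # duplicates are now co-located
    return result


# Hypothetical input: the same row appears in three different partitions.
input_partitions = [
    [("a", 1), ("b", 2)],
    [("a", 1), ("c", 3)],
    [("b", 2), ("a", 1)],
]
unique_rows = distinct(input_partitions)
```

Selecting fewer columns shrinks each row that crosses the shuffle, but every row still has to be routed, which matches the behaviour described in the reply.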

  • 0 kudos
4 More Replies
RGSLCA
by New Contributor II
  • 88 Views
  • 1 reply
  • 0 kudos

Selective overwrite on Partition and Liquid clustered tables

Hi, I have created 2 identical tables, but one is partitioned and the other is Liquid Clustered with Auto Clustering. I inserted 30M rows x 2 (60M) for two dates, date 1 = 2026-06-01 and date 2 = 2026-06-02, then I overwrote the date 2026-06-02 with a s...

Latest Reply
balajij8
Contributor III
  • 0 kudos

Hi, the current approach is not optimal. You can follow the points below. The INSERT query ran with mostly 43 tasks, creating 43 output files. Since the Liquid Clustered table has no organization (clusterBy "[]"), dates are randomly scattered across files. Partition table...

  • 0 kudos
Ramana
by Valued Contributor II
  • 2717 Views
  • 6 replies
  • 4 kudos

Resolved! Serverless Compute - pySpark - Any alternative for rdd.getNumPartitions()

Hello Community, We have been trying to migrate our jobs from Classic Compute to Serverless Compute. As part of this process, we face several challenges, and this is one of them. When we read CSV or JSON files with multiLine=true, the load becomes sing...

Latest Reply
Ramana
Valued Contributor II
  • 4 kudos

spark_partition_id is the closest and most performant function available as an alternative, and I migrated to use this function. So far, no issues. https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.spark_p...
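
For readers who want to try this, a minimal sketch of how spark_partition_id could stand in for rdd.getNumPartitions() follows. It assumes PySpark is available at call time (the import is deferred so the pure-Python helper can be used without it); both function names are hypothetical. One caveat: grouping by partition ID only surfaces partitions that contain rows, so the count is a lower bound on what getNumPartitions() would report.

```python
def rows_per_partition(df):
    """Return {partition_id: row_count} for a PySpark DataFrame.

    Assumes PySpark is installed; the import is deferred until the call.
    """
    from pyspark.sql import functions as F

    counts = (
        df.groupBy(F.spark_partition_id().alias("pid"))
        .count()
        .collect()
    )
    return {row["pid"]: row["count"] for row in counts}


def non_empty_partition_count(partition_counts):
    """Count partitions that hold at least one row."""
    return sum(1 for n in partition_counts.values() if n > 0)
```

Usage would look like non_empty_partition_count(rows_per_partition(df)), where df is whatever DataFrame was just loaded.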

  • 4 kudos
5 More Replies
Ramana
by Valued Contributor II
  • 1328 Views
  • 3 replies
  • 0 kudos

Resolved! Serverless Compute - Python - Custom Emails via SMTP (smtplib.SMTP(host_name)) - Any alternative?

Hello Community, We have been trying to migrate our jobs from Classic Compute to Serverless Compute. As part of this process, we face several challenges, and this is one of them. We have several scenarios where we need to send an inline email via Pytho...

Latest Reply
Ramana
Valued Contributor II
  • 0 kudos

The solution we implemented as an alternative for email sending from Serverless is via the Microsoft Graph API. https://learn.microsoft.com/en-us/graph/api/user-sendmail?view=graph-rest-1.0&tabs=python
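
As a rough sketch of what a Graph sendMail call can look like using only the standard library (this is not the poster's actual code): it assumes an OAuth access token with the Mail.Send permission has already been obtained elsewhere, and the function name, addresses, and token placeholder are all hypothetical. The request is built but not sent.

```python
import json
import urllib.request

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def build_send_mail_request(access_token, sender, recipients, subject, html_body):
    """Build (but do not send) a POST request for the Graph user sendMail endpoint."""
    payload = {
        "message": {
            "subject": subject,
            "body": {"contentType": "HTML", "content": html_body},
            "toRecipients": [
                {"emailAddress": {"address": address}} for address in recipients
            ],
        },
        "saveToSentItems": True,
    }
    return urllib.request.Request(
        url=f"{GRAPH_BASE}/users/{sender}/sendMail",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Hypothetical values for illustration only.
request = build_send_mail_request(
    access_token="<token>",
    sender="reports@example.com",
    recipients=["team@example.com"],
    subject="Job finished",
    html_body="<p>The nightly job completed.</p>",
)
# urllib.request.urlopen(request)  # uncomment to actually send the message
```

Keeping request construction separate from sending makes the payload easy to inspect or unit-test before any network call is made.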

  • 0 kudos
2 More Replies
Nick_Hughes
by New Contributor III
  • 17244 Views
  • 4 replies
  • 1 kudos

Best way to generate fake data using underlying schema

Hi, we are trying to generate fake data to run our tests. For example, we have a pipeline that creates a gold layer fact table from 6 underlying source tables in our silver layer. We want to generate the data in a way that recognises the relationships ...

Latest Reply
muhammedrasin
New Contributor
  • 1 kudos

Hi @Nick_Hughes, I am very late to the party, but I was searching the internet for people discussing a related problem, for which I am building a solution, and came across your post from 3 years ago. Times have changed...

  • 1 kudos
3 More Replies
RGSLCA
by New Contributor II
  • 409 Views
  • 7 replies
  • 0 kudos

Sizing Tables and delta logs/CDF

Hi, I need to compare the sizes of my Delta tables. What's the correct approach? The table size reported by the ANALYZE command? And how do I check the Delta log size? If I enable CDF, how do I know the CDF log size (the overhead it adds)? Kind of l...

Latest Reply
Vikram10
New Contributor II
  • 0 kudos

Hi @RGSLCA, DESCRIBE DETAIL is the best starting point if you're comparing Delta table sizes, but it's important to understand what it reports. The sizeInBytes value represents only the latest active snapshot of the table, not the total storage consum...
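
To see the parts that sizeInBytes does not break out, one option is to measure the table directory directly. The sketch below is only that, a sketch: it assumes the table root is reachable as an ordinary filesystem path (often not the case for Unity Catalog managed tables, where a cloud listing API would be needed instead), and it relies on the standard Delta layout in which the transaction log lives under _delta_log and change data files written for CDF live under _change_data. The function names are hypothetical.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all files under path; a missing directory counts as 0."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def delta_storage_breakdown(table_root):
    """Split on-disk bytes into transaction log, change data, and everything else."""
    log_bytes = directory_size_bytes(os.path.join(table_root, "_delta_log"))
    cdf_bytes = directory_size_bytes(os.path.join(table_root, "_change_data"))
    total_bytes = directory_size_bytes(table_root)
    return {
        "delta_log": log_bytes,
        "change_data": cdf_bytes,
        "data_and_other": total_bytes - log_bytes - cdf_bytes,
        "total": total_bytes,
    }
```

Note that "data_and_other" here includes data files that are no longer part of the current snapshot but have not yet been vacuumed, so it can be larger than sizeInBytes.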

  • 0 kudos
6 More Replies
nidhin
by New Contributor III
  • 118 Views
  • 2 replies
  • 1 kudos

Can Lakeflow SDP (DLT) produce external tables, or only UC-managed?

As I understand it, streaming tables and materialized views produced by Lakeflow Spark Declarative Pipelines (DLT) are always Unity Catalog managed tables; there's no LOCATION/path option on create_streaming_table or apply_changes. Is that correct? A...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 1 kudos

Hi @nidhin, What you’re saying is basically correct for a Unity Catalog-enabled Lakeflow Spark Declarative Pipelines setup. In that model, pipelines publish streaming tables and materialized views into the target catalog and schema, the data is store...

  • 1 kudos
1 More Replies
A0s01gy
by New Contributor II
  • 97 Views
  • 0 replies
  • 0 kudos

STTM as a Metadata Contract in Databricks

One pattern I keep seeing in data engineering projects: STTM is treated as documentation. But in reality, STTM can become much more than that. A well-structured Source-to-Target Mapping can act as a metadata contract between business, engineering, QA, a...

A0s01gy
by New Contributor II
  • 303 Views
  • 2 replies
  • 0 kudos

Resolved! From STTM to Databricks Pipelines: Can Metadata Become the Source Code of Data Engineering?

I’ve been exploring a metadata-driven approach to data engineering through a project called Data Engineering Copilot. The idea is to treat Source-to-Target Mapping (STTM) documents as structured metadata rather than static documentation. Instead of man...
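
To make the "STTM as structured metadata" idea concrete, here is a minimal sketch of rendering a SELECT statement from mapping rows. It is not how Data Engineering Copilot works; the mapping keys (source_column, target_column, transform), the function name, and the table name are all hypothetical and chosen purely for illustration.

```python
def build_select(mappings, source_table):
    """Render a SELECT statement from a list of source-to-target mapping rows."""
    columns = []
    for mapping in mappings:
        # Use the transform expression when one is given, else the raw source column.
        expression = mapping.get("transform") or mapping["source_column"]
        columns.append(f"  {expression} AS {mapping['target_column']}")
    return "SELECT\n" + ",\n".join(columns) + f"\nFROM {source_table}"


# Hypothetical mapping rows, e.g. parsed from a spreadsheet or YAML file.
sttm = [
    {"source_column": "cust_id", "target_column": "customer_id"},
    {"source_column": "fname", "target_column": "first_name", "transform": "TRIM(fname)"},
]
sql = build_select(sttm, "silver.customers")
```

Because the mapping is plain data, the same rows could also drive documentation or test generation, which is the appeal of treating STTM as a contract rather than a static document.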

Latest Reply
rdokala
New Contributor III
  • 0 kudos

This is a good discussion topic, but in my experience, teams currently use both metadata-driven approaches and, more often, traditional Excel-based STTMs. A few observations: How most teams manage STTM today: Level 1 (Most Common): STTM in Excel, Word, or Confluence. Engineers m...

  • 0 kudos
1 More Replies
nidhin
by New Contributor III
  • 128 Views
  • 1 reply
  • 1 kudos

SQL Warehouse stuck on "Cluster Start-up Delayed"

Hi everyone, I'm running into an issue with my Starter Warehouse on Databricks and would appreciate any help or pointers. Problem: My SQL Warehouse has been stuck in a Starting state with the following warning: Cluster Start-up Delayed. Please wait whil...

Latest Reply
rdokala
New Contributor III
  • 1 kudos

This typically points to delayed compute provisioning behind the SQL Warehouse, often due to temporary capacity/resource availability or a transient startup issue. A few things I would try: 1. Stop and restart the SQL Warehouse: If it has been stuck for ...

  • 1 kudos
emorgoch
by New Contributor II
  • 142 Views
  • 1 reply
  • 0 kudos

Managing IPYNB cell timestamps in source control

We're in the process of converting our Databricks notebooks from .py files to .ipynb. We have disabled storing notebook output in source control at the workspace level. However, what we're discovering is that every cell in our notebooks has 3 time...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 0 kudos

Hi @emorgoch, Thanks for raising this. This appears to be a regression rather than expected behaviour. Internally, the issue has been identified around .ipynb handling in Git folders, and the intended fix is to stop serialising these execution timest...

  • 0 kudos
MVMZ
by New Contributor
  • 224 Views
  • 1 reply
  • 0 kudos

Resolved! Table history time travel

I have noticed what seems to be unexpected behavior with the history of Unity Catalog managed tables and would like to understand whether this is expected. As a test, I created a table with two versions: Version 0; Version 1 (created approximately 200 ho...

Latest Reply
Ashwin_DSA
Databricks Employee
  • 0 kudos

Hi @MVMZ, What you’re seeing is expected for Unity Catalog managed tables. The key detail is that for Unity Catalog managed tables, Databricks blocks time travel queries when the requested version is older than delta.deletedFileRetentionDuration, whi...
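
The rule described above boils down to comparing a version's age against the retention window. The sketch below illustrates only that arithmetic. It assumes the Delta default of 7 days for delta.deletedFileRetentionDuration; how Databricks measures a version's age internally is not shown here, and the function name and the sample ages are hypothetical.

```python
from datetime import timedelta

# Assumed default for delta.deletedFileRetentionDuration (7 days = 168 hours).
DEFAULT_RETENTION = timedelta(days=7)


def time_travel_allowed(version_age, retention=DEFAULT_RETENTION):
    """Return True when the requested version is no older than the retention window."""
    return version_age <= retention


# Illustrative ages only; 200 hours exceeds the 168-hour default window.
blocked = not time_travel_allowed(timedelta(hours=200))
still_ok = time_travel_allowed(timedelta(hours=100))
ok_with_longer_retention = time_travel_allowed(
    timedelta(hours=200), retention=timedelta(days=30)
)
```

If older versions need to stay queryable, raising the retention property on the table is the lever this comparison points at, with more retained files and storage as the trade-off.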

  • 0 kudos