Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

subhas_hati
by New Contributor
  • 5132 Views
  • 1 replies
  • 1 kudos

Partition Size:

Hi, I have chosen the default partition size of 128 MB. I am reading a 3.8 GB file and checking the number of partitions using df.rdd.getNumPartitions() as given below. I find the partition size is 159 MB. Why does the partition size after reading the file differ?...

Latest Reply
Sidhant07
Databricks Employee
  • 1 kudos

Hi @subhas_hati, The partition size of a 3.8 GB file read into a DataFrame differs from the default partition size of 128 MB, resulting in a partition size of 159 MB, due to the influence of the spark.sql.files.openCostInBytes configuration. • spark....
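For anyone hitting the same question, a minimal sketch (file path hypothetical) of checking the two settings involved and the resulting partition count:

```python
# Minimal sketch (file path hypothetical) of checking the two settings that drive
# file-split sizing and the resulting partition count. Spark packs file splits
# into partitions using spark.sql.files.maxPartitionBytes together with
# spark.sql.files.openCostInBytes and the default parallelism, so the observed
# per-partition size can end up above the 128 MB default.
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))  # typically 128 MB
print(spark.conf.get("spark.sql.files.openCostInBytes"))    # typically 4 MB

df = spark.read.option("header", "true").csv("/Volumes/main/raw/big_file.csv")
num_partitions = df.rdd.getNumPartitions()
approx_mb = (3.8 * 1024) / num_partitions  # 3.8 GB file from the question
print(num_partitions, round(approx_mb, 1), "MB per partition (approx.)")
```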

Avinash_Narala
by Valued Contributor II
  • 618 Views
  • 1 replies
  • 0 kudos

Python udf to pyspark udf conversion

Hi, I want to convert my Python UDF to a PySpark UDF. Is there any guide/article suggesting best practices and how to avoid miscalculations, if any?

Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

Hi, can you please share some more details? Let me know if this helps: https://docs.databricks.com/en/udf/unity-catalog.html
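In the meantime, a minimal sketch (function and column names are just illustrative) of wrapping an existing Python function both as a row-at-a-time PySpark UDF and as a vectorized pandas UDF:

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Plain Python function to be converted (illustrative).
def clean_name(s):
    return s.strip().title() if s else s

# Row-at-a-time PySpark UDF: the simplest conversion, but the slowest.
clean_name_udf = F.udf(clean_name, StringType())

# Vectorized pandas UDF: usually much faster on large DataFrames.
@F.pandas_udf(StringType())
def clean_name_pandas(s: pd.Series) -> pd.Series:
    return s.fillna("").str.strip().str.title()

# df = df.withColumn("name_clean", clean_name_udf("name"))
# df = df.withColumn("name_clean", clean_name_pandas("name"))
```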

norbitek
by New Contributor II
  • 1088 Views
  • 1 replies
  • 0 kudos

Identity column and impact on performance

Hi, I want to define an identity column in a Delta table. Based on the documentation: "Declaring an identity column on a Delta table disables concurrent transactions. Only use identity columns in use cases where concurrent writes to the target table are not ...

Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

The use of an identity column in a Delta table affects the execution of a MERGE statement by disabling concurrent transactions. This constraint means that when performing operations such as upserting or deleting data, the identity column enforces tha...
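A minimal sketch of the DDL in question, run through spark.sql (catalog/schema/table names are hypothetical):

```python
# Minimal sketch of declaring an identity column (names hypothetical). Identity
# columns disable concurrent transactions on the table, so avoid them on targets
# written by parallel jobs or concurrent MERGEs.
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.demo.orders (
        order_id    BIGINT GENERATED ALWAYS AS IDENTITY,
        customer_id STRING,
        amount      DOUBLE
    ) USING DELTA
""")
```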

JulianKrüger
by New Contributor
  • 1584 Views
  • 1 replies
  • 0 kudos

Limited concurrent running DLT's within a pipeline

Hi Champions! We are currently experiencing a few unexplainable limitations when executing pipelines with > 50 DLT tables. It looks like there is some calculation in the background to determine the maximum number of concurrent running ...

Data Engineering
dlt
pipeline
Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

Hi @JulianKrüger, • The "num_task_slots" parameter in Databricks Delta Live Tables (DLT) pipelines is related to the concurrency of tasks within a pipeline. It determines the number of concurrent tasks that can be executed. However, this parameter d...

sparkplug
by New Contributor III
  • 1189 Views
  • 4 replies
  • 1 kudos

Databricks logging of SQL queries to DBFS

Hi, our costs have suddenly spiked due to logging of a lot of SQL query outputs to DBFS. We haven't made any changes to enable this. How can we disable this feature?

Latest Reply
sparkplug
New Contributor III
  • 1 kudos

I don't get any output when running the following; I have the destination set to DBFS. But it was only supposed to be for cluster logs, not for query execution outputs to be stored in DBFS. Any idea if this is expected behavior? spark.conf.get("sp...

3 More Replies
ashraf1395
by Honored Contributor
  • 2382 Views
  • 3 replies
  • 0 kudos

Resolved! Adding if statements or try/catch block in sql based dlt pipelines

We have completely SQL-based DLT pipelines, where bronze tables are read from UC volumes. There can be situations when no new data arrives for some of the endpoints in the UC volume. In that case the SQL code blocks fail, which results in failing the ent...

Latest Reply
Rjdudley
Honored Contributor
  • 0 kudos

Is there a reason why you can't use Autoloader for this?  That would only trigger the pipeline when new files arrive.
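For reference, a minimal Python sketch of an Auto Loader based DLT bronze table (paths and names are hypothetical; an equivalent SQL streaming table using STREAM read_files should behave the same way). Because it is a streaming table, a run with no new files simply processes zero rows instead of failing:

```python
import dlt
from pyspark.sql import functions as F

# Minimal sketch (paths and names hypothetical) of an Auto Loader bronze table
# in a DLT pipeline. A trigger with no new files processes nothing and succeeds.
@dlt.table(name="bronze_orders")
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/raw/orders/")
        .withColumn("_ingested_at", F.current_timestamp())
    )
```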

2 More Replies
Brianben
by New Contributor III
  • 2421 Views
  • 3 replies
  • 0 kudos

Resolved! Personal access token retention period change

Hi all, recently we noticed that the maximum retention period of personal access tokens has changed to 730 days from never. From the official documentation, I couldn't find the effective date of this change; does someone know about it? Besides, we have some Power...

Latest Reply
Sidhant07
Databricks Employee
  • 0 kudos

Hi @Brianben, The maximum retention period for personal access tokens (PATs) in Databricks has indeed changed to 730 days (two years). Unfortunately, the exact effective date of this change is not specified in the official documentation available. R...
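If you need tokens with a predictable lifetime for Power BI refreshes, here is a hedged sketch using the Databricks SDK for Python (assuming the SDK is installed and authentication is configured; the comment and lifetime values are illustrative):

```python
from databricks.sdk import WorkspaceClient

# Hedged sketch: create a PAT with an explicit lifetime via the Databricks SDK,
# so clients such as Power BI can rotate tokens well before the ~730-day cap.
w = WorkspaceClient()
resp = w.tokens.create(
    comment="powerbi-refresh",         # hypothetical comment
    lifetime_seconds=90 * 24 * 3600,   # 90 days
)
# resp.token_value holds the secret and is only returned once, so store it securely.
```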

2 More Replies
turagittech
by Contributor
  • 4742 Views
  • 6 replies
  • 0 kudos

Accessing blob from databricks 403 Error Request Not authorized

Hi, I am trying to access a blob storage container to retrieve files. It's throwing this error: Operation failed: "This request is not authorized to perform this operation using this resource type.", 403, GET. I have tried a SAS key at container and sto...

Latest Reply
turagittech
Contributor
  • 0 kudos

I have tested that with no improvement. I have also tried with a service principal to see if a different error message would help find the issue. I do have a question: must we use the dfs URL to access blob storage with abfss? Is that only enabled when yo...
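For comparison, a hedged sketch of the service-principal configuration for abfss access (storage account, container, secret scope, and tenant values are placeholders); abfss goes through the account's dfs.core.windows.net endpoint rather than blob.core.windows.net:

```python
# Hedged sketch of service-principal (OAuth) access to ADLS Gen2 over abfss.
# Storage account, container, secret scope, and tenant ID are placeholders.
account = "mystorageacct"
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net",
    dbutils.secrets.get("my-scope", "sp-client-id"),
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net",
    dbutils.secrets.get("my-scope", "sp-client-secret"),
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
)

df = spark.read.option("header", "true").csv(
    f"abfss://mycontainer@{account}.dfs.core.windows.net/path/to/files/"
)
```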

5 More Replies
noorbasha534
by Valued Contributor II
  • 4786 Views
  • 2 replies
  • 0 kudos

Lakehouse Monitoring & Expectations

Dears, has anyone successfully used the Lakehouse Monitoring and expectations features together at scale to measure the data quality of tables - for example, to conduct freshness checks, consistency checks, etc.? I would appreciate it if you could share the lessons learn...

Latest Reply
Satyadeepak
Databricks Employee
  • 0 kudos

Not sure if you are still looking for the same. Here is a Medium article - https://piethein.medium.com/data-quality-within-lakehouses-0c9417ce0487 - where you can see the detailed implementation.
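As a complement to monitoring, a minimal sketch of DLT expectations used as inline data-quality checks (table and column names are hypothetical):

```python
import dlt

# Minimal sketch (table and column names hypothetical) of DLT expectations as
# inline data-quality rules, complementing Lakehouse Monitoring metrics.
@dlt.table(name="silver_orders")
@dlt.expect("non_negative_amount", "amount >= 0")                  # log violations, keep rows
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")   # drop violating rows
def silver_orders():
    return spark.read.table("main.demo.bronze_orders")
```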

1 More Replies
RobsonNLPT
by Contributor III
  • 833 Views
  • 2 replies
  • 0 kudos

Spark Configurations with Serverless Compute

I have a few problems converting my notebooks to run with serverless compute. Right now I can't set my Delta userMetadata at session and scope level using Spark or SQL. Setting userMetadata per DataFrame write operation is OK using the option: opti...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @RobsonNLPT, There is an internal feature request for this use case: https://databricks.aha.io/ideas/ideas/DB-I-12401. It is still in the idea stage, with no ETA on its implementation yet.
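As a workaround until then, userMetadata can still be attached per write, which is the path the original post confirms works; a minimal sketch (table name and metadata string are illustrative):

```python
# Minimal sketch (table name and metadata string illustrative): userMetadata
# attached per write still lands in the commit history on serverless compute.
(
    df.write.format("delta")
    .mode("append")
    .option("userMetadata", "batch_id=2025-01-29")
    .saveAsTable("main.demo.orders")
)
# The value then appears in DESCRIBE HISTORY main.demo.orders (userMetadata column).
```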

1 More Replies
ankitmit
by New Contributor III
  • 942 Views
  • 2 replies
  • 0 kudos

Unknown Location of files for tables created using DLT

Hi all, I created a catalog and schema using a managed location, but I don't see any catalogs directory within the S3 bucket path mentioned in the image above. Also, I created a schema with a managed location, and I expected all the tables created within t...

Data Engineering
Databricks
dlt
Storage
Latest Reply
VZLA
Databricks Employee
  • 0 kudos

Hello, thank you for your question. Could you confirm that your issue is regarding the location of files for tables created using Delta Live Tables (DLT) when utilizing managed storage locations at the catalog and schema levels? Specifically, it seem...
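One quick way to see where a table's files actually live (table name hypothetical):

```python
# Quick check of a table's physical storage location (table name hypothetical).
spark.sql("DESCRIBE DETAIL main.demo.my_dlt_table") \
    .select("location") \
    .show(truncate=False)
```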

1 More Replies
Dulce42
by New Contributor II
  • 1410 Views
  • 1 replies
  • 0 kudos

Exports history chats from genie space

Hi community! In the last few days I have been searching for how to export the chat history from my Genie space, but I couldn't find anything. Have some of you done this exercise so you can guide me?

Latest Reply
VZLA
Databricks Employee
  • 0 kudos

Hi, thank you for the question! I haven't done this myself, but for context, are you referring to an AI/BI Genie space? E.g.: https://docs.databricks.com/en/genie/index.html https://learn.microsoft.com/en-us/azure/databricks/genie/ If so, then it doesn't ...

gilt
by New Contributor III
  • 2972 Views
  • 9 replies
  • 2 kudos

Auto Loader ignores data with modifiedBefore

Hello, I am trying to ingest CSV data with Auto Loader from an Azure Data Lake. I want to perform batch ingestion by using a scheduled job and the following trigger:  .trigger(availableNow=True) The CSV files are generated by Azure Synapse Link. If m...

Latest Reply
kostoska
New Contributor II
  • 2 kudos

Databricks should resolve this and introduce two options: soft modifiedBefore and hard modifiedBefore (files that are going to be ignored forever). In addition, this is not explained in the documentation, so it is a bit frustrating as it is not intui...
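For context, a hedged sketch of the batch-style Auto Loader setup discussed here, using a modifiedAfter bound (paths, checkpoint location, and table names are hypothetical); files filtered out at discovery time are not revisited later, which is the behavior the thread describes:

```python
# Hedged sketch of batch-style Auto Loader ingestion with a modifiedAfter bound
# (paths, checkpoint location, and table names are hypothetical).
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("modifiedAfter", "2025-01-01 00:00:00")
    .load("abfss://raw@mystorageacct.dfs.core.windows.net/synapse-link/")
)

(
    df.writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/synapse_link")
    .trigger(availableNow=True)
    .toTable("main.demo.bronze_synapse")
)
```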

8 More Replies
aliacovella
by Contributor
  • 2315 Views
  • 3 replies
  • 1 kudos

Resolved! Custom Checkpointing

The following is my scenario: I need to query on a daily basis from an external table that maintains a row version. I would like to be able to query for all records where the row version is greater than the max previously processed row version. The sour...

Latest Reply
jeremy98
Honored Contributor
  • 1 kudos

Hi, I totally agree with VZLA. Within my internal team we have a similar issue, and we used a table to track the latest version of each table, since we don't have a streaming process on our side. DLT pipelines could be a choice, but it also depends on whether you ...
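A minimal sketch of that watermark-table pattern (all table and column names are hypothetical):

```python
from pyspark.sql import functions as F

# Minimal sketch of the watermark-table pattern described above
# (all table and column names are hypothetical).
last = (
    spark.read.table("main.ops.watermarks")
    .filter("source = 'external_orders'")
    .agg(F.max("last_row_version").alias("v"))
    .collect()[0]["v"]
) or 0

new_rows = spark.read.table("ext_catalog.src.orders").filter(F.col("row_version") > last)
new_rows.write.mode("append").saveAsTable("main.demo.orders_bronze")

new_max = new_rows.agg(F.max("row_version")).collect()[0][0]
if new_max is not None:
    spark.sql(
        f"UPDATE main.ops.watermarks SET last_row_version = {new_max} "
        "WHERE source = 'external_orders'"
    )
```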

2 More Replies
ashraf1395
by Honored Contributor
  • 2434 Views
  • 3 replies
  • 0 kudos

Resolved! Databricks Workflow design

I have 7-8 different DLT pipelines which have to be run at the same time according to their batch type, i.e. hourly and daily. Right now they are triggered effectively according to their batch type. I want to move to the next stage where I want to clu...

Latest Reply
ashraf1395
Honored Contributor
  • 0 kudos

Hi @VZLA, I got the idea. There will be a small change in the way we will use it. Since we don't schedule the workflow in Databricks, we trigger it using the API. So I will pass a job parameter along with the trigger according to the timestamp, wheth...
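A hedged sketch of that trigger using the Databricks SDK for Python (job ID and parameter name are hypothetical):

```python
from databricks.sdk import WorkspaceClient

# Hedged sketch of triggering the job via the API with a batch-type parameter
# (job ID and parameter name are hypothetical).
w = WorkspaceClient()
w.jobs.run_now(
    job_id=123456789,
    job_parameters={"batch_type": "hourly"},  # or "daily", decided by the caller
)
```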

2 More Replies
