Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

pooja_bhumandla
by New Contributor II
  • 296 Views
  • 3 replies
  • 0 kudos

Auto tuning of file size

Why are maxFileSize and minFileSize different from targetFileSize after optimization? What is the significance of targetFileSize? "numRemovedFiles": "2099","numRemovedBytes": "29658974681","p25FileSize": "29701688","numDeletionVectorsRemoved": "0","m...

Latest Reply
loui_wentzel
New Contributor III
  • 0 kudos

There could be several different reasons, but mainly it's because grouping arbitrary data into some target file size is, well... arbitrary. Imagine I gave you a large container of sand and some empty buckets, and asked you to move the sand from the co...
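For context, delta.targetFileSize is only a target that OPTIMIZE packs files toward, which is why the observed min and max sizes can differ from it. A minimal sketch of setting it explicitly and re-optimizing (the table name and size are placeholders, not from this thread):

```python
# Sketch: pin an explicit target file size (value in bytes, here ~128 MB) and re-run OPTIMIZE.
# Table name and size are illustrative.
spark.sql("""
    ALTER TABLE main.default.my_table
    SET TBLPROPERTIES ('delta.targetFileSize' = '134217728')
""")
spark.sql("OPTIMIZE main.default.my_table")
```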

2 More Replies
SreedharVengala
by New Contributor III
  • 27733 Views
  • 11 replies
  • 7 kudos

PGP Encryption / Decryption in Databricks

Is there a way to decrypt / encrypt Blob files in Databricks using a key stored in Key Vault? What libraries need to be used? Any code snippets? Links?

Latest Reply
Junpei_Liang
New Contributor II
  • 7 kudos

Does anyone have an update on this?
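No complete answer is quoted in this thread; one commonly used approach (an assumption, not confirmed here) is the python-gnupg library with the PGP private key kept in a Key Vault-backed secret scope. The scope, key names, and paths below are placeholders, and the cluster needs the gpg binary plus the python-gnupg package installed.

```python
# Hedged sketch: decrypt a PGP-encrypted blob on a mounted path with python-gnupg.
# Secret scope, key names, and file paths are illustrative placeholders.
import gnupg

gpg = gnupg.GPG()  # assumes the gpg binary is available on the cluster

# Private key and passphrase pulled from a Key Vault-backed secret scope.
private_key = dbutils.secrets.get(scope="kv-scope", key="pgp-private-key")
passphrase = dbutils.secrets.get(scope="kv-scope", key="pgp-passphrase")
gpg.import_keys(private_key)

with open("/dbfs/mnt/raw/input.csv.pgp", "rb") as encrypted:
    result = gpg.decrypt_file(encrypted, passphrase=passphrase,
                              output="/dbfs/mnt/raw/input.csv")
print(result.ok, result.status)
```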

10 More Replies
Ramki
by New Contributor
  • 183 Views
  • 1 reply
  • 0 kudos

Lakeflow clarification

Are there options to modify the streaming table after it has been created by the Lakeflow pipeline? In the use case I'm trying to solve, I need to add delta.enableIcebergCompatV2 and delta.universalFormat.enabledFormats to the target streaming table....

Latest Reply
lingareddy_Alva
Honored Contributor II
  • 0 kudos

Hi @Ramki Yes, you can modify a streaming table created by a Lakeflow pipeline, especially when the pipeline is in triggered mode (not running continuously). In your case, you want to add the following Delta table properties: TBLPROPERTIES ( 'delta....
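The reply's property list is cut off; the two property names below come from the original question, and the three-level table name is a placeholder. Whether they are applied via ALTER TABLE (with the pipeline stopped) or via the pipeline's table properties, a sketch looks like this:

```python
# Sketch: add the Iceberg/UniForm properties mentioned in the question while the
# pipeline is not running. The table name is illustrative.
spark.sql("""
    ALTER TABLE catalog.schema.my_streaming_table SET TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```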

michelleliu
by New Contributor III
  • 959 Views
  • 3 replies
  • 2 kudos

Resolved! DLT Performance Issue

I've been seeing patterns in DLT process time in all my pipelines, as in the attached screenshot. Each data point is an "update" that's set to "continuous". The process time keeps increasing until a point and then drops back to the desired level. This w...

Latest Reply
lingareddy_Alva
Honored Contributor II
  • 2 kudos

Hi @michelleliu This sawtooth pattern in DLT processing times is actually quite common and typically indicates one of several underlying issues. Here are the most likely causes and solutions. Common causes: 1. Memory pressure & garbage collection: Process...

2 More Replies
alau131
by New Contributor
  • 301 Views
  • 2 replies
  • 2 kudos

How to dynamically have the parent notebook call on a child notebook?

Hi! Could someone please help me with how to dynamically call one notebook from another in Databricks and have the parent notebook get the dataframe results from the child notebook? Some background info: I have a main Python notebook and multiple SQ...

Latest Reply
jameshughes
New Contributor III
  • 2 kudos

What you are looking to do is really not the intent of notebooks and you cannot pass complex data types between notebooks. You would need to persist your data frame from the child notebook so your parent notebook could retrieve the results after the ...
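A minimal sketch of the pattern this reply describes: the child persists its dataframe and hands back the table name via dbutils.notebook.exit, and the parent runs the child with dbutils.notebook.run and reads that table. The notebook path, table name, and parameters are illustrative assumptions.

```python
# Parent notebook (sketch): run the child, then read back the table it persisted.
result_table = dbutils.notebook.run(
    "/Workspace/Users/me/child_notebook",  # illustrative child notebook path
    600,                                   # timeout in seconds
    {"run_date": "2024-01-01"},            # parameters the child reads via widgets
)
df = spark.table(result_table)

# Child notebook (sketch), at the end of its logic:
#   result_df.write.mode("overwrite").saveAsTable("tmp.child_result")
#   dbutils.notebook.exit("tmp.child_result")
```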

1 More Replies
Abel_Martinez
by Contributor
  • 18095 Views
  • 10 replies
  • 10 kudos

Resolved! Why I'm getting connection timeout when connecting to MongoDB using MongoDB Connector for Spark 10.x from Databricks

I'm able to connect to MongoDB using org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 and this code: df = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", jdbcUrl). It works well, but if I install the latest MongoDB Spark Connector ve...

Latest Reply
ravisharma1024
New Contributor II
  • 10 kudos

I was facing the same issue; it is now resolved, thanks to @Abel_Martinez. I am using code like the below: df = spark.read.format("mongodb") \.option('spark.mongodb.read.connection.uri', "mongodb+srv://*****:*****@******/?retryWrites=true&w=majori...
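For readability, here is the same pattern spelled out in full as a sketch; the connection string, database, and collection names are placeholders (the originals are masked above), and the option names are those of the 10.x connector.

```python
# Sketch: reading with the MongoDB Spark Connector 10.x "mongodb" source.
# Connection string, database, and collection are placeholders.
df = (
    spark.read.format("mongodb")
    .option("spark.mongodb.read.connection.uri",
            "mongodb+srv://<user>:<password>@<cluster>/?retryWrites=true&w=majority")
    .option("database", "my_db")
    .option("collection", "my_collection")
    .load()
)
```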

9 More Replies
vanverne
by New Contributor II
  • 1460 Views
  • 3 replies
  • 1 kudos

Assistance with Capturing Auto-Generated IDs in Databricks SQL

Hello, I am currently working on a project where I need to insert multiple rows into a table and capture the auto-generated IDs for each row. I am using the Databricks SQL connector. Here is a simplified version of my current workflow: I create a temporary...

Latest Reply
vanverne
New Contributor II
  • 1 kudos

Thanks for the reply, Alfonso. I noticed you mentioned "Below are a few alternatives...", but I am not seeing those. Please let me know if I am missing something. Also, do you know if Databricks is working on supporting the RETURNING clause soon...
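The alternatives referenced in the earlier reply are not visible in this thread. One hedged workaround sketch (my assumption, not taken from that reply) is to tag each insert batch with a client-generated UUID and then select the identity values back, using the Databricks SQL connector's named parameters. Hostname, HTTP path, token, table, and column names are placeholders.

```python
# Hedged sketch: capture identity values without RETURNING by tagging the batch.
import uuid
from databricks import sql  # databricks-sql-connector

batch_id = str(uuid.uuid4())
names = ["alice", "bob"]

with sql.connect(server_hostname="<host>", http_path="<http-path>",
                 access_token="<token>") as conn:
    with conn.cursor() as cur:
        for name in names:
            cur.execute(
                "INSERT INTO my_table (name, batch_id) VALUES (:name, :batch_id)",
                {"name": name, "batch_id": batch_id},
            )
        # Read back the auto-generated identity column for just this batch.
        cur.execute(
            "SELECT id, name FROM my_table WHERE batch_id = :batch_id",
            {"batch_id": batch_id},
        )
        print(cur.fetchall())
```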

2 More Replies
Yannic
by New Contributor
  • 373 Views
  • 1 reply
  • 0 kudos

Delete a directory in DBFS recursively from Azure

I have Azure storage mounted to DBFS. I want to recursively delete a directory inside it. I tried both dbutils.fs.rm(f"/mnt/data/to/delete", True) and %fs rm -r /mnt/data/to/delete. In both cases I get the following exception: AzureException: hadoop_azur...

Latest Reply
lingareddy_Alva
Honored Contributor II
  • 0 kudos

Hi @Yannic Azure Blob Storage doesn't have true directories; it simulates them through blob naming conventions, which can cause issues with recursive deletion operations. Try the approach below: delete the files first, then the directory. def delete_directory_recursive(...
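The helper in the reply is truncated; a sketch of that kind of bottom-up delete (leaf files first, then the directories) might look like the following, using the path from the original post. This is illustrative, not the reply's exact code.

```python
# Sketch: delete leaf files first, then remove the (now-empty) directories bottom-up.
def delete_directory_recursive(path):
    for entry in dbutils.fs.ls(path):
        if entry.isDir():
            delete_directory_recursive(entry.path)
        else:
            dbutils.fs.rm(entry.path)
    dbutils.fs.rm(path, True)  # remove the directory itself

delete_directory_recursive("/mnt/data/to/delete")
```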

Sainath368
by New Contributor III
  • 283 Views
  • 1 reply
  • 0 kudos

Data Skipping- Partitioned tables

Hi all, I have a question: how can we modify delta.dataSkippingStatsColumns and compute statistics for a partitioned Delta table in Databricks? I want to understand the process and best practices for changing this setting and ensuring accurate statist...

Latest Reply
paolajara
Databricks Employee
  • 0 kudos

Hi, delta.dataSkippingStatsColumns specifies a comma-separated list of column names for which Delta Lake collects statistics. It can improve performance by enabling data skipping on exactly those columns, since it supersedes the default behavior of analyzing the first...
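A sketch of the two steps usually involved: change the property, then recompute Delta statistics so files written before the change reflect it. Table and column names are placeholders.

```python
# Sketch: limit stats collection to the columns actually used in filters,
# then recompute Delta statistics for existing data. Names are illustrative.
spark.sql("""
    ALTER TABLE catalog.schema.partitioned_table SET TBLPROPERTIES (
        'delta.dataSkippingStatsColumns' = 'event_date,customer_id,amount'
    )
""")
spark.sql("ANALYZE TABLE catalog.schema.partitioned_table COMPUTE DELTA STATISTICS")
```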

GeKo
by Contributor
  • 1048 Views
  • 8 replies
  • 4 kudos

Resolved! How to specify the runtime version for a serverless job

Hello, if I understood correctly, using a serverless cluster always comes with the latest runtime version by default. Now I need to stick to, e.g., runtime version 15.4 for a certain job, which gets deployed via asset bundles. How do I specify/config...

Data Engineering
assetbundle
serverless
Latest Reply
GeKo
Contributor
  • 4 kudos

7 More Replies
JanAkhi919
by New Contributor
  • 1060 Views
  • 1 reply
  • 1 kudos

How agentic AI is different from AI agents

How is agentic AI different from AI agents?

Latest Reply
Renu_
Contributor III
  • 1 kudos

Hi @JanAkhi919, Agentic AI and AI agents are both types of artificial intelligence, but they work differently and are meant for different purposes. Agentic AI is more like a smart helper that can solve problems on its own. It doesn't need step-by-step...

Avinash_Narala
by Valued Contributor II
  • 1739 Views
  • 9 replies
  • 1 kudos

Redshift Stored Procedure Migration to Databricks

Hi, I want to migrate Redshift SQL stored procedures to Databricks. Since Databricks doesn't support the concept of SQL stored procedures, how can I do so?

Latest Reply
nayan_wylde
Valued Contributor III
  • 1 kudos

The Databricks docs show that procedures are in Public Preview and require Runtime 17.0 and above: https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-procedure
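A minimal sketch of what the preview syntax on that page looks like; the catalog, schema, table, procedure name, and body are illustrative assumptions, and this requires a Runtime 17.0+ environment.

```python
# Sketch: create and call a SQL procedure, per the CREATE PROCEDURE preview docs.
# Catalog, schema, table, and procedure names are placeholders.
spark.sql("""
    CREATE OR REPLACE PROCEDURE catalog.schema.log_event(IN p_msg STRING)
    LANGUAGE SQL
    AS BEGIN
        INSERT INTO catalog.schema.event_log (msg, ts) VALUES (p_msg, current_timestamp());
    END
""")
spark.sql("CALL catalog.schema.log_event('migrated from Redshift')")
```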

8 More Replies
JothyGanesan
by New Contributor III
  • 1222 Views
  • 4 replies
  • 1 kudos

Resolved! Streaming data - Merge in Target - DLT

We have streaming inputs coming from streaming tables and also the table from apply_changes. In our target there is only one table, which needs to be merged with all the sources. Each source provides different columns in our target table. Challenge: Ev...

Latest Reply
vd1
New Contributor II
  • 1 kudos

Can this cause concurrent write issues, updating the same table from multiple streams?
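Whether concurrent merges conflict depends on how the writes are issued. Outside DLT, one hedged sketch is a foreachBatch MERGE writer per source; with several sources targeting one table, Delta can raise concurrent-modification errors, so such writers are often kept on disjoint key ranges or run one at a time. Table names, join key, and checkpoint paths below are placeholders.

```python
# Hedged sketch: one foreachBatch MERGE writer per source stream (plain Structured Streaming).
from delta.tables import DeltaTable

def upsert_to_target(batch_df, batch_id):
    target = DeltaTable.forName(spark, "catalog.schema.target_table")
    (target.alias("t")
           .merge(batch_df.alias("s"), "t.id = s.id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream.table("catalog.schema.source_stream_1")
      .writeStream
      .foreachBatch(upsert_to_target)
      .option("checkpointLocation", "/checkpoints/source_stream_1")
      .start())
```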

3 More Replies
RevathiTiger
by New Contributor II
  • 2395 Views
  • 3 replies
  • 1 kudos

Expectations vs Great Expectations with Databricks DLT pipelines

Hi All, We are working on creating a DQ framework on DLT pipelines in Databricks. Databricks DLT pipelines read incoming data from Kafka / file sources. Once data is ingested, data validation must happen on top of the ingested data. The customer is evalu...

Latest Reply
chanukya-pekala
Contributor
  • 1 kudos

If you have decided to use DLT, it handles micro-batching and checkpointing for you. But you can typically take more control if you rewrite the logic using Auto Loader or Structured Streaming, with custom checkpointing of the file directory, and maintain yo...
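A minimal sketch of the Auto Loader route this reply mentions, with an explicit checkpoint and a simple filter standing in for a data-quality expectation; the paths, rule, and target table are illustrative assumptions.

```python
# Sketch: Auto Loader ingestion with an explicit checkpoint and a basic validity filter.
raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/checkpoints/events/schema")
    .load("/landing/events/")
)

valid = raw.filter("event_id IS NOT NULL")  # stand-in for a data-quality expectation

(
    valid.writeStream
    .option("checkpointLocation", "/checkpoints/events/stream")
    .trigger(availableNow=True)
    .toTable("bronze.events")
)
```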

2 More Replies
