Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Tommabip
by Databricks Partner
  • 3102 Views
  • 3 replies
  • 2 kudos

Resolved! Databricks Cluster Policies

Hi, I'm trying to create a Terraform script that does the following: (1) create a policy where I specify env variables and libraries; (2) create a cluster that inherits from that policy and uses the env variables specified in the policy. I saw in the docume...

Latest Reply
Louis_Frolio
Databricks Employee
  • 2 kudos

You're correct in observing this discrepancy. When a cluster policy is defined and applied through the Databricks UI, fixed environment variables (`spark_env_vars`) specified in the policy automatically propagate to clusters created under that policy...
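As an illustration of the propagation the reply describes, here is a minimal sketch (variable names and node type are hypothetical) of a policy definition whose fixed `spark_env_vars` rules are inherited by any cluster created under the policy; in Terraform the same JSON would typically be passed to the `databricks_cluster_policy` resource via `jsonencode()`:

```python
import json

# Sketch of a cluster policy definition that pins environment variables
# with "fixed" rules, so clusters created under the policy inherit them
# automatically. The "spark_env_vars.<NAME>" key form follows the
# Databricks cluster-policy definition schema; values here are made up.
policy_definition = {
    "spark_env_vars.ENVIRONMENT": {"type": "fixed", "value": "dev"},
    "spark_env_vars.LOG_LEVEL": {"type": "fixed", "value": "INFO"},
    "node_type_id": {"type": "allowlist", "values": ["Standard_DS3_v2"]},
}

# Serialize for inspection; Terraform would receive the same structure.
policy_json = json.dumps(policy_definition, indent=2)
print(policy_json)
```

Note that, as the reply points out, this automatic propagation applies to clusters created through the UI under the policy; Terraform-created clusters may need the values restated explicitly.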

2 More Replies
Alex_Persin
by New Contributor III
  • 11057 Views
  • 6 replies
  • 8 kudos

How can the shared memory size (/dev/shm) be increased on Databricks worker nodes with custom Docker images?

PyTorch uses shared memory to efficiently share tensors between its dataloader workers and its main process. However, in a Docker container the default size of the shared memory (a tmpfs file system mounted at /dev/shm) is 64 MB, which is too small to ...

Latest Reply
stevewb
New Contributor III
  • 8 kudos

Bump again... does anyone have a solution for this?
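While the thread has no confirmed fix, one way to verify whether a given container is affected is the stdlib sketch below, which reads the size of the tmpfs mount; the usual remedies (remounting /dev/shm from an init script, or lowering the dataloader's `num_workers`) are assumptions on my part, not answers from this thread:

```python
import os

def shm_size_bytes(path="/dev/shm"):
    """Return the total size of the tmpfs mounted at `path` (0 if absent)."""
    if not os.path.isdir(path):
        return 0
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks

size = shm_size_bytes()
# The 64 MB Docker default is typically too small for PyTorch dataloader
# workers, which exchange tensors through shared memory.
if 0 < size <= 64 * 1024 * 1024:
    print(f"/dev/shm is only {size / 2**20:.0f} MB; dataloaders may crash")
```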

5 More Replies
valde
by New Contributor
  • 1148 Views
  • 1 replies
  • 0 kudos

Window function vs. groupBy + map

Let's say we have an RDD like this: RDD(id: Int, measure: Int, date: LocalDate). Let's say we want to apply some function that compares 2 consecutive measures by date, outputs a number, and we want to get the sum of those numbers by id. The function is b...

Latest Reply
Renu_
Valued Contributor II
  • 0 kudos

Hi @valde, those two approaches give the same result, but they don't work the same way under the hood. Spark SQL uses optimized window functions that handle things like shuffling and memory more efficiently, often making it faster and lighter. On the o...
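To make the comparison concrete, here is a plain-Python sketch of the "groupBy + map" approach from the question (the sample data and the comparison function are made up); in Spark SQL the same result would typically come from a `lag` window function partitioned by id plus a grouped sum:

```python
from itertools import groupby
from datetime import date
from operator import itemgetter

# Per id: sort measures by date, compare consecutive pairs, sum the results.
rows = [
    (1, 10, date(2024, 1, 1)),
    (1, 13, date(2024, 1, 2)),
    (1, 20, date(2024, 1, 3)),
    (2, 5, date(2024, 1, 1)),
    (2, 9, date(2024, 1, 2)),
]

def compare(prev, curr):
    return curr - prev  # placeholder for the real comparison logic

result = {}
for rid, grp in groupby(sorted(rows, key=itemgetter(0, 2)), key=itemgetter(0)):
    measures = [m for _, m, _ in grp]
    result[rid] = sum(compare(a, b) for a, b in zip(measures, measures[1:]))

print(result)  # {1: 10, 2: 4}
```

The groupBy + map version materializes each id's measures in one place, which is the memory cost the reply alludes to; the window-function version lets the engine handle that incrementally.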

Nathant93
by New Contributor III
  • 2880 Views
  • 2 replies
  • 0 kudos

(java.util.concurrent.ExecutionException) Boxed Error

Has anyone ever come across the error above? I am trying to get two tables from Unity Catalog and join them; the join is fairly complex as it is imitating a WHERE NOT EXISTS / TOP 1 SQL query.

Latest Reply
pk13
New Contributor II
  • 0 kudos

Hello @VZLA, recently I am getting the exact same error. It has a "caused by" as below:
```
Caused by: kafkashaded.org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.
```
Stacktrace: ERROR: Some ...

1 More Replies
eenaagrawal
by Databricks Partner
  • 6801 Views
  • 1 reply
  • 0 kudos
Latest Reply
SP_6721
Honored Contributor II
  • 0 kudos

Hi @eenaagrawal, there isn't a specific built-in integration in Databricks to directly interact with SharePoint. However, you can accomplish this by leveraging libraries like Office365-REST-Python-Client, which enable interaction with SharePoint's RE...
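As a minimal, library-free sketch of what that REST interaction can look like, the helper below only builds the documented `GetFileByServerRelativeUrl(...)/$value` download URL (the site URL and file path are hypothetical); authentication, which Office365-REST-Python-Client normally handles for you, is deliberately out of scope here:

```python
from urllib.parse import quote

def sharepoint_file_url(site_url, server_relative_path):
    """Build the SharePoint REST endpoint URL for downloading a file,
    using the GetFileByServerRelativeUrl(...)/$value form."""
    return (
        f"{site_url}/_api/web/"
        f"GetFileByServerRelativeUrl('{quote(server_relative_path)}')/$value"
    )

url = sharepoint_file_url(
    "https://contoso.sharepoint.com/sites/data",   # hypothetical site
    "/sites/data/Shared Documents/report.csv",     # hypothetical file
)
print(url)
```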

rahuja
by Contributor
  • 2831 Views
  • 2 replies
  • 0 kudos

Resolved! Cloning a Git Repository in Databricks via REST API Endpoint using an Azure Service Principal

Hello, I have written a Python script that uses the Databricks REST APIs. I am trying to clone/update an Azure DevOps repository inside Databricks using an Azure service principal. I am able to retrieve the credential_id for the service principal I am usin...

Latest Reply
rahuja
Contributor
  • 0 kudos

@nicole_lu_PM So sorry for coming back to this issue after such a long time. But I looked into it, and it seems this concept of an OBO token applies when we use Databricks with AWS as our cloud provider. In the case of Azure, most of the commen...

1 More Replies
ShashiPrakash
by New Contributor II
  • 4058 Views
  • 2 replies
  • 1 kudos

Resolved! Unity Catalog Table in Databricks Asset Bundle

I am looking to deploy Unity Catalog schemas and tables via Databricks Asset Bundles (DAB). We can do schema evolution of tables via notebooks as well, but we already have 1,000+ notebooks and implementing this via notebooks would be an effort, hence I was look...

Latest Reply
ShashiPrakash
New Contributor II
  • 1 kudos

Thanks for the prompt response @saurabh18cs. Yes, that was the alternative I was considering. I believe it will be the warehouses group command, which I will explore. Will you be able to share any best-practice document to manage the SQL project file, w...

1 More Replies
RobCox
by New Contributor II
  • 1279 Views
  • 2 replies
  • 0 kudos

DAB - Common cluster configs possible?

I've been trying various solutions and am perhaps just thinking about this the wrong way. We're migrating over from Synapse, where we were used to having a defined set of DBX cluster profiles to run our jobs against; these are all job clusters created v...

Latest Reply
saurabh18cs
Honored Contributor III
  • 0 kudos

hi, you can also parametrize your job clusters:
job_clusters:
  - job_cluster_key: Job_cluster
    new_cluster:
      spark_version: ${var.spark_version}
      spark_conf: ${var.spark_configuration}
      azure_attributes:
        ...

1 More Replies
ShivangiB
by New Contributor III
  • 1347 Views
  • 3 replies
  • 0 kudos

Zorder and Liquid Clustering Performance while reading and writing data

When I am writing to a liquid clustering table, it takes more time compared to Z-ORDER.

Latest Reply
ShivangiB
New Contributor III
  • 0 kudos

We are trying to understand the overall behavior of liquid clustering.

2 More Replies
DatabricksQuery
by New Contributor
  • 661 Views
  • 1 reply
  • 0 kudos

Databricks Job Listener Concept for Tracking Personal Jobs

Hello everyone, I want to know if there is any listener mechanism in Databricks that can track the configuration of Databricks jobs deployed through CI/CD. With the help of this listener, we can track our custom jobs that are not part of the CI/CD process. This way,...

Latest Reply
saurabh18cs
Honored Contributor III
  • 0 kudos

Hi, I don't think Databricks provides a built-in listener mechanism to track changes to job configurations directly. However, you can implement a custom solution to monitor and track changes to Databricks jobs deployed through CI/CD pipelines using ...
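One building block for such a custom solution is a plain settings diff between what CI/CD deployed and what the Jobs API currently reports (e.g. from `jobs/get`); everything below is an illustrative sketch with made-up settings:

```python
def diff_job_settings(deployed, actual):
    """Return the keys whose values differ between the job settings captured
    by CI/CD and the settings read back from the Jobs API."""
    keys = set(deployed) | set(actual)
    return {k: (deployed.get(k), actual.get(k)) for k in keys
            if deployed.get(k) != actual.get(k)}

# Hypothetical snapshot comparison: CI/CD-recorded settings vs. live settings.
recorded = {"name": "nightly_etl", "max_concurrent_runs": 1}
live = {"name": "nightly_etl", "max_concurrent_runs": 3, "tags": {"owner": "x"}}
drift = diff_job_settings(recorded, live)
print(drift)
```

Run on a schedule (or from a webhook), a diff like this flags jobs that were created or modified outside the CI/CD process.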

khishore
by Contributor
  • 7042 Views
  • 9 replies
  • 6 kudos

Resolved! I haven't received my certificate or the badge for Databricks Certified Data Engineer Associate

Hi @Lindsay Olson @Kaniz Fatma, I cleared my Databricks Certified Data Engineer Associate on 29 October 2022 but haven't received my badge or certificate yet. Can you please help? Thanks

Latest Reply
gokul2
New Contributor III
  • 6 kudos

Hi @Lindsay Olson @Kaniz Fatma, I cleared my Databricks Certified Data Engineer Associate on 01 December 2024. You shared my certificate to this mail id (927716@congizant.com) on December 2, but my organization has blocked external sites. Ki...

8 More Replies
chethankumar
by New Contributor III
  • 3981 Views
  • 4 replies
  • 1 kudos

How to execute SQL statements using Terraform

Is there a way to execute SQL statements using Terraform? I can see it is possible using the API, as below: https://docs.databricks.com/api/workspace/statementexecution/executestatement. But I want to know if there is a straightforward way to run it like the below code provi...

Latest Reply
KartikeyaJain
New Contributor III
  • 1 kudos

The official Databricks provider in Terraform only allows you to create SQL queries, not execute them. To actually run queries, you can either: use the http provider to make API calls to the Databricks REST API to execute SQL queries; alternatively, if...
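For the http-provider route, the request body for the Statement Execution API (`POST /api/2.0/sql/statements`) can be sketched as follows; the warehouse id below is hypothetical:

```python
import json

def statement_payload(warehouse_id, sql, wait_timeout="30s"):
    """Build the request body for POST /api/2.0/sql/statements
    (Databricks SQL Statement Execution API)."""
    return json.dumps({
        "warehouse_id": warehouse_id,
        "statement": sql,
        "wait_timeout": wait_timeout,
    })

body = statement_payload("abc123", "SELECT 1")  # hypothetical warehouse id
print(body)
```

The same JSON body is what a Terraform `http` data source (or a `local-exec` curl call) would send, with a bearer token in the Authorization header.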

3 More Replies
naga93
by New Contributor
  • 2435 Views
  • 1 reply
  • 0 kudos

How to read Delta Lake table with Spaces/Special Characters in Column Names in Dremio

Hello, I am currently writing a Delta Lake table from Databricks to Unity Catalog using PySpark 3.5.0 (15.4 LTS Databricks runtime). We want the EXTERNAL Delta Lake tables to be readable from both UC and Dremio. Our Dremio build version is 25.0.6. The ...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 0 kudos

Hi naga93, how are you doing today? As per my understanding, you’ve done a great job navigating all the tricky parts of Delta + Unity Catalog + Dremio integration! You're absolutely right to set minReaderVersion to 2 and disable deletion vectors to m...
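A sketch of the table-properties change the reply refers to (the table name is hypothetical); note, as an assumption beyond this thread, that turning off deletion vectors on an existing table may additionally require purging already-written vectors, e.g. with `REORG TABLE ... APPLY (PURGE)`:

```python
def downgrade_reader_props(table):
    """Build the SQL that disables deletion vectors and pins the protocol
    versions, to keep an external Delta table readable by older engines."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "'delta.enableDeletionVectors' = 'false', "
        "'delta.minReaderVersion' = '2', "
        "'delta.minWriterVersion' = '5')"
    )

print(downgrade_reader_props("catalog.schema.events"))  # hypothetical table
```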

surajitDE
by Contributor
  • 1583 Views
  • 1 reply
  • 0 kudos

How can we change the garbage collector to G1GC in serverless?

My DLT jobs are experiencing throttling due to the following error message: [GC (GCLocker Initiated GC) [PSYoungGen: 5431990K->102912K(5643264K)] 9035507K->3742053K(17431552K), 0.1463381 secs] [Times: user=0.29 sys=0.00, real=0.14 secs]. I came across s...

Latest Reply
Brahmareddy
Esteemed Contributor
  • 0 kudos

Hi surajitDE, how are you doing today? As per my understanding, you're absolutely right to look into the GC (garbage collection) behavior. When you're seeing messages like GCLocker Initiated GC and frequent young-gen collections, it usually means your...
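For completeness, on classic (non-serverless) compute the collector can be switched through standard Spark JVM-option properties, sketched below; serverless does not expose these knobs, which is exactly the constraint the question runs into:

```python
# Cluster Spark config entries (standard Spark properties) that request
# G1GC on driver and executors; applicable to classic clusters only.
g1gc_conf = {
    "spark.driver.defaultJavaOptions": "-XX:+UseG1GC",
    "spark.executor.defaultJavaOptions": "-XX:+UseG1GC",
}
for key, value in g1gc_conf.items():
    print(f"{key} {value}")
```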

drag7ter
by Contributor
  • 1855 Views
  • 3 replies
  • 0 kudos

Overwriting a Delta table takes a lot of time

I'm simply trying to overwrite data into a Delta table. The table is not really huge: it has 50 million rows and is 1.9 GB in size. For running this code I use various cluster configurations, starting from a 1-node cluster with 64 GB and 16 vCPUs, and I also tried to s...

Latest Reply
thackman
Databricks Partner
  • 0 kudos

1) You might need to cache the DataFrame so it's not recomputed for the write.
2) What type of cloud storage are you using? We've noticed slow Delta writes as well. We are using Azure standard storage, which is backed by spinning disks. It's limited to...

2 More Replies