Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Asaph
by New Contributor
  • 860 Views
  • 4 replies
  • 0 kudos

Issue with databricks.sdk - AccountClient Service Principals API

Hi everyone, I’ve been trying to use the databricks.sdk Python library to manage service principals programmatically. However, I’m running into an issue when attempting to create a service principal using the AccountClient class. Below is the co...

Latest Reply
nick533
New Contributor III
  • 0 kudos

This may be caused by missing authentication or configuration. When constructing the AccountClient instance, please ensure that the required authentication details are present. Additionally, since this action is account-level, make su...

3 More Replies
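
One way to rule out the missing-configuration case raised in the reply is to check the account-level settings before building the client. The sketch below is illustrative and is not the original poster's code: the helper names are hypothetical, it assumes OAuth machine-to-machine credentials supplied through the `DATABRICKS_*` environment variables, and `databricks-sdk` is imported lazily so the check itself runs without it.

```python
import os

# Settings an account-level client needs when using OAuth M2M credentials
# (assumed auth method; other methods need different settings).
REQUIRED_SETTINGS = (
    "DATABRICKS_HOST",           # account console URL, not a workspace URL
    "DATABRICKS_ACCOUNT_ID",
    "DATABRICKS_CLIENT_ID",
    "DATABRICKS_CLIENT_SECRET",
)


def missing_account_settings(env):
    """Return the names of required settings that are absent or empty."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


def create_service_principal(display_name, env=None):
    """Create an account-level service principal after validating the config."""
    env = os.environ if env is None else env
    missing = missing_account_settings(env)
    if missing:
        raise RuntimeError("Missing account-level settings: " + ", ".join(missing))

    # Imported here so the validation above works without the SDK installed.
    from databricks.sdk import AccountClient

    client = AccountClient(
        host=env["DATABRICKS_HOST"],
        account_id=env["DATABRICKS_ACCOUNT_ID"],
        client_id=env["DATABRICKS_CLIENT_ID"],
        client_secret=env["DATABRICKS_CLIENT_SECRET"],
    )
    return client.service_principals.create(display_name=display_name)
```

Calling `missing_account_settings(os.environ)` first gives a readable list of what is absent, which is usually easier to act on than an authentication error raised deep inside the client.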
drag7ter
by Contributor
  • 807 Views
  • 4 replies
  • 1 kudos

Disable caching in Serverless SQL Warehouse

I have a Serverless SQL Warehouse cluster, and I run my SQL code in the SQL editor. When I run a query for the first time, it takes 30 secs total time, but on every subsequent run the query profile shows that it gets the result set from cache and takes 1-2 secs total...

Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

I am wondering if it is using the remote result cache; in that case the config should work. There are 4 types of cache mentioned here: https://docs.databricks.com/en/sql/user/queries/query-caching.html#types-of-query-caches-in-databricks-sql Local cache: ...

3 More Replies
om_bk_00
by New Contributor III
  • 762 Views
  • 1 replies
  • 0 kudos

How to pass parameters for jobs containing for_each_task

resources:
  jobs:
    X:
      name: X
      tasks:
        - task_key: X
          for_each_task:
            inputs: "{{job.parameters.input}}"
            task:
              task_key: X
              existing_cluster_id: ${var.my_cluster_id}
              ...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

To reference job parameters in the inputs field, use the syntax {{job.parameters.<name>}}. Kindly refer to https://docs.databricks.com/en/jobs/for-each.html
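
As a rough illustration of the reply above, the sketch below builds a run request whose `input` job parameter carries the list to iterate over. It assumes the job declares a parameter named `input` (matching the `{{job.parameters.input}}` reference in the question) and that the list is passed as a JSON-encoded string, since job parameter values are strings. The helper name and the job ID are hypothetical.

```python
import json


def build_run_now_payload(job_id, items, parameter_name="input"):
    """Build a run-now style request body with the items JSON-encoded."""
    if not isinstance(items, (list, tuple)):
        raise TypeError("items must be a list or tuple")
    return {
        "job_id": job_id,
        # Job parameter values are strings, so the array is JSON-encoded.
        "job_parameters": {parameter_name: json.dumps(list(items))},
    }


# Hypothetical job ID and inputs, for illustration only.
payload = build_run_now_payload(123, ["2024-01-01", "2024-01-02"])
print(json.dumps(payload, indent=2))
```

Keeping the encoding in one helper makes it harder to pass a Python list where a string is expected.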

gvvishnu
by New Contributor
  • 731 Views
  • 1 replies
  • 0 kudos

can databricks support murmur hash function

In our current project we use the murmur hash function in Hadoop. We are planning a migration to Databricks. Does Databricks support the murmur hash function?

Latest Reply
brockb
Databricks Employee
  • 0 kudos

Hi @gvvishnu , Thanks for your question. My understanding is that the Apache Spark `hash()` function implements the `org.apache.spark.sql.catalyst.expressions.Murmur3Hash` expression. You can see this in the Spark source code here: https://github.com...

shhhhhh
by New Contributor III
  • 639 Views
  • 5 replies
  • 0 kudos

How to connect from Serverless Plane to On-Prem SQL Server

Has anybody tried connecting Databricks Serverless in the serverless plane to an on-prem SQL Server? We can connect a normal Databricks cluster to on-prem SQL Server using federated queries with external data connections. We can connect Serverless to Azu...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

No, Private Link is for setting up your workspace with no access to the internet. Have you tried allowing the NCC IPs on the on-prem firewall?

4 More Replies
Greg_c
by New Contributor II
  • 276 Views
  • 1 replies
  • 0 kudos

Passing parameters (variables?) in DAGs

Regarding DAGs and the tasks in them: can I pass a parameter/variable in a task? I have the same structure as here: https://github.com/databricks/bundle-examples/blob/main/default_sql/resources/default_sql_sql_job.yml and I want to pass variables to .sq...

Latest Reply
filipniziol
Esteemed Contributor
  • 0 kudos

Hi @Greg_c, in Databricks Asset Bundles you can pass a parameter to a SQL file task. Here is an end-to-end example:
1. My SQL file (with :id parameter):
2. The job YAML:
resources:
  jobs:
    run_sql_file_job:
      name: run_sql_file_job
      ...

priyansh
by New Contributor III
  • 1088 Views
  • 3 replies
  • 1 kudos

What stuff does UCX can not do?

Hey folks! I want to know what the limitations of UCX are. Specifically, which things do we have to do manually during migration? UCX is currently under development, which means it may have some drawbacks too; I want to know what those are.

Latest Reply
monstercop
New Contributor II
  • 1 kudos

I guess you will find some differences between before and after; for example, using a wildcard to point to folders in ADLS2 for external tables is supported in Hive but not in UC catalogs.

2 More Replies
yvishal519
by Contributor
  • 651 Views
  • 1 replies
  • 0 kudos

Identifying Full Refresh vs. Incremental Runs in Delta Live Tables

Hello Community, I am working with a Delta Live Tables (DLT) pipeline that primarily operates in incremental mode. However, there are specific scenarios where I need to perform a full refresh of the pipeline. I am looking for an efficient and reliable...

Latest Reply
Takuya-Omi
Valued Contributor III
  • 0 kudos

Hello, there are two ways to determine whether a DLT pipeline is running in Full Refresh or Incremental mode:
DLT Event Log Schema: The details column in the DLT event log schema includes information on "full_refresh". You can use this to identify whethe...
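
To make the event-log approach a little more concrete, here is a minimal sketch that looks for a `full_refresh` key inside a `details` value once it has been read as a JSON string. The exact nesting of `details` is not shown in the reply, so the function searches recursively instead of assuming a path; the function name and the sample payload are made up for illustration.

```python
import json


def find_full_refresh(details_json):
    """Return the first "full_refresh" value found in a details payload, or None."""

    def walk(node):
        if isinstance(node, dict):
            if "full_refresh" in node:
                return node["full_refresh"]
            children = node.values()
        elif isinstance(node, list):
            children = node
        else:
            return None
        # Depth-first search through nested objects and arrays.
        for child in children:
            found = walk(child)
            if found is not None:
                return found
        return None

    return walk(json.loads(details_json))


# Made-up sample; the real layout of `details` may differ.
sample = '{"create_update": {"full_refresh": true}}'
print(find_full_refresh(sample))
```

A return value of `None` means the event carried no such key, which is different from an explicit `False`.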

Klusener
by New Contributor III
  • 621 Views
  • 7 replies
  • 11 kudos

Resolved! Out of Memory after adding distinct operation

I have a Spark pipeline that reads selected data from table_1 as a view, performs a few aggregations via GROUP BY in the next step, and writes to a target table. table_1 holds a large amount of data, ~30GB of compressed CSV.
Step-1:
create or replace temporary view base_data...

Latest Reply
MadhuB
Contributor III
  • 11 kudos

Hi @Klusener, DISTINCT is a very expensive operation. For your case, I recommend using either of the deduplication strategies below.
Most efficient method:
df_deduped = df.dropDuplicates(subset=['unique_key_columns'])
For complex dedupe process - Partition...

6 More Replies
lauraxyz
by Contributor
  • 560 Views
  • 5 replies
  • 0 kudos

Notebook in path workspace/repos/.internal/**_commits/** was unable to be accessed

I have a workflow job (source is Git) that accesses a notebook and executes it. The job failed with this error:
Py4JJavaError: An error occurred while calling o466.run. : com.databricks.WorkflowException: com.databricks.NotebookExecutionException: FAI...

Latest Reply
lauraxyz
Contributor
  • 0 kudos

Just some clarification: the caller notebook can be found with no issues, whether the task's source is GIT or WORKSPACE. However, the callee notebook, which the caller notebook invokes with dbutils.notebook.run(), cannot be found if the call...

4 More Replies
majo2
by New Contributor II
  • 2468 Views
  • 2 replies
  • 2 kudos

tqdm progressbar in Databricks jobs

Hi, I'm using Databricks workflows to run a training job using `pytorch` + `lightning`. `lightning` has a built-in progress bar based on `tqdm` that tracks the progress. It works OK when I run the notebook outside of a workflow. But when I try to run n...

Latest Reply
ludovicc
New Contributor II
  • 2 kudos

I have found that only progressbar2 can work in both interactive notebooks and workflow notebooks. It's limited, but better than nothing. Tqdm is broken in workflows.

1 More Replies
Kayla
by Valued Contributor II
  • 386 Views
  • 3 replies
  • 0 kudos

GCP Serverless SQL Warehouse Tag Propagation?

I have a serverless SQL Warehouse with a tag on it that is not making it to GCP. We have various job and AP clusters with tags that I can see in GCP; we are trying to have everything tagged for the purpose of monitoring billing/usage centrally. Do serverless ...

Latest Reply
Isi
Contributor
  • 0 kudos

Hey @Kayla, this is because serverless infrastructure is fully managed by Databricks, and you do not have direct control over the underlying resources as you do with standard clusters (non-serverless SQL warehouses). You can track SQL Warehouse usage wi...

2 More Replies
ideal_knee
by New Contributor III
  • 2883 Views
  • 6 replies
  • 8 kudos

Reading an Iceberg table with AWS Glue Data Catalog as metastore

I have created an Iceberg table using AWS Glue, however whenever I try to read it using a Databricks cluster, I get `java.lang.InstantiationException`. I have tried every combination of Spark configs for my Databricks compute cluster that I can think...

Latest Reply
ideal_knee
New Contributor III
  • 8 kudos

In case someone happens upon this in the future, I ended up using Unity Catalog with Hive metastore federation for Glue. The Iceberg support is currently "coming soon in Public Preview."

5 More Replies
kasiviss42
by New Contributor III
  • 1276 Views
  • 10 replies
  • 2 kudos

Unity Credential Scope id not found in thread locals

I am facing this issue: [UNITY_CREDENTIAL_SCOPE_MISSING_SCOPE] Missing Credential Scope. Unity Credential Scope id not found in thread locals. The issue occurs when we try to list files using dbutils.fs.ls, and also at times when we try to write o...

Latest Reply
ashishCh
New Contributor II
  • 2 kudos

Thanks for the reply. It's working in DBR 15.4, but I want to use it with 13.3. Is there a workaround?

9 More Replies
Greg_c
by New Contributor II
  • 773 Views
  • 4 replies
  • 0 kudos

Best practices for ensuring data quality in batch pipelines

Hello everyone, I couldn't find a topic on this: what are your best practices for ensuring data quality in batch pipelines? I've got a big pipeline processing data once per day. We thought about going with either DBT or DLT, but DLT seems more directed f...

Latest Reply
Isi
Contributor
  • 0 kudos

Hey Greg_c, I use DBT daily for batch data ingestion, and I believe it’s a great option. However, it’s important to consider that adopting DBT introduces additional complexity, and the team should carefully evaluate the impact of adding a new tool to t...

3 More Replies
