Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

jeremy98
by Contributor III
  • 1644 Views
  • 22 replies
  • 1 kudos

Wheel package to install in a serverless workflow

Hi guys, what is the way, through Databricks Asset Bundles, to declare a new job definition with serverless compute associated with each task that composes the workflow, and to be able, inside each notebook task definition, to catch the dep...

Latest Reply
jeremy98
Contributor III
  • 1 kudos

Ping @Alberto_Umana 

21 More Replies
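
For anyone landing here with the same question: a commonly used pattern, sketched below under the assumption that the wheel has already been uploaded to a Unity Catalog volume (the path is hypothetical), is to install it at the top of each serverless notebook task with %pip.

    # Hypothetical volume path; replace with wherever the wheel is uploaded.
    %pip install /Volumes/main/default/artifacts/my_package-0.1.0-py3-none-any.whl
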
dbx-user7354
by New Contributor III
  • 3886 Views
  • 7 replies
  • 3 kudos

PySpark DataFrame orderBy only orders within partitions with multiple workers

I came across a PySpark issue when sorting a DataFrame by a column. It seems like PySpark only orders the data within partitions when there are multiple workers, even though it shouldn't. from pyspark.sql import functions as F import matplotlib.pyplot...

(two plot attachments)
Latest Reply
Avinash_Narala
Valued Contributor II
  • 3 kudos

Hi @dbx-user7354, orderBy() should perform a global sort, as shown in plot 2, but per your description it is sorting the data within partitions, which is the behavior of sortWithinPartitions(). To solve this, please try with the latest DBR...

6 More Replies
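
A minimal sketch of the distinction the reply describes (data and partition counts are made up): orderBy() shuffles for a global sort, while sortWithinPartitions() only sorts each partition locally.

    from pyspark.sql import functions as F

    df = spark.range(0, 1000).withColumn("v", F.rand(seed=42)).repartition(8)

    # Global sort: rows are ordered across the whole DataFrame (requires a shuffle).
    global_sorted = df.orderBy("v")

    # Local sort: each of the 8 partitions is sorted independently; no global order.
    local_sorted = df.sortWithinPartitions("v")
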
SwathiChidurala
by New Contributor II
  • 7284 Views
  • 2 replies
  • 3 kudos

Resolved! Delta format

Hi, I am a student learning Databricks. In the code below I tried to write data in Delta format to a gold layer. I authenticated using the service principal method to read, write, and execute data, and I assigned the Storage Blob Contributor role, but...

Latest Reply
Avinash_Narala
Valued Contributor II
  • 3 kudos

Hi @SwathiChidurala, the error is because you don't have the folder trip_zone inside the gold folder. You can try removing trip_zone from the location, or adding the trip_zone folder inside the gold folder in ADLS, and then try again. If th...

1 More Reply
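
For context, a hedged sketch of the write being discussed (container, storage account, and folder names are placeholders): the save path must match a location the service principal can actually access.

    # Placeholder ADLS Gen2 path; align it with the configured external location.
    (df.write
       .format("delta")
       .mode("overwrite")
       .save("abfss://gold@<storage_account>.dfs.core.windows.net/trip_zone"))
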
Abdurrahman
by New Contributor II
  • 489 Views
  • 3 replies
  • 3 kudos

Move files from DBFS to Workspace folders in Databricks

I want to move a zip file from DBFS to a workspace folder. I am using dbutils.fs.cp("dbfs file path", "workspace folder path") in a Databricks notebook and I am seeing the following error: ExecutionError: An error occurred while calling o455.cp. : jav...

Latest Reply
nick533
New Contributor III
  • 3 kudos

Permission denied appears to be the cause of the error message. To read from the DBFS path and write to the workspace folder, please make sure you have the required permissions. The following permissions may be required: the DBFS file path can be read...

2 More Replies
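
For reference, a minimal sketch of the copy with explicit URI schemes (both paths are hypothetical): the file:/ prefix targets workspace files rather than DBFS, and the caller still needs write permission on the destination folder.

    # Copy a zip from DBFS into the workspace file tree (hypothetical paths).
    dbutils.fs.cp(
        "dbfs:/FileStore/archive.zip",
        "file:/Workspace/Users/someone@example.com/archive.zip",
    )
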
nhakobian
by New Contributor
  • 276 Views
  • 1 reply
  • 0 kudos

Python Artifact Installation Error on Runtime 16.1 on Shared Clusters

I've run into an issue with no clear path to resolution. Due to various integrations we have in Unity Catalog, some jobs we have to run in a shared cluster environment in order to authenticate properly to the underlying data resource. When setting up ...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

The 'Enable libraries and init scripts on shared Unity Catalog clusters' setting is deprecated in Databricks Runtime 16.0 and above; please refer to the deprecation documentation. Disabling this feature at the workspace level would pr...

ashraf1395
by Honored Contributor
  • 463 Views
  • 1 reply
  • 2 kudos

Resolved! Connecting Fivetran with Databricks

So, we are migrating a Hive metastore to a UC catalog. We have some Fivetran connections. We are creating all tables as external tables, and we have specified the external locations at the schema level. So when we specify the destination in the Fivetra...

Latest Reply
NandiniN
Databricks Employee
  • 2 kudos

This message is just saying that if you do not provide the {{path}}, it will use the default location, which is on DBFS. When configuring the Fivetran connector, you will be prompted to select the catalog name and schema name, and then specify the externa...

Asaph
by New Contributor
  • 671 Views
  • 4 replies
  • 0 kudos

Issue with databricks.sdk - AccountClient Service Principals API

Hi everyone, I've been trying to work with the databricks.sdk Python library to manage service principals programmatically. However, I'm running into an issue when attempting to create a service principal using the AccountClient class. Below is the co...

Latest Reply
nick533
New Contributor III
  • 0 kudos

This can be an issue with missing authentication or configuration. When constructing the AccountClient instance, please ensure that the required authentication details are present. Additionally, since this action is account-level, make su...

3 More Replies
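
A minimal sketch of the account-level construction the reply points to, with placeholder credentials; the key point is that AccountClient needs account-level auth (the account console host plus an account ID), not workspace-level auth.

    from databricks.sdk import AccountClient

    # Placeholder credentials; use an account admin's OAuth service principal.
    a = AccountClient(
        host="https://accounts.cloud.databricks.com",
        account_id="<account-id>",
        client_id="<client-id>",
        client_secret="<client-secret>",
    )

    sp = a.service_principals.create(display_name="my-service-principal")
    print(sp.id)
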
drag7ter
by Contributor
  • 580 Views
  • 4 replies
  • 1 kudos

Disable caching in Serverless SQL Warehouse

I have a Serverless SQL Warehouse cluster, and I run my SQL code in the SQL editor. When I run a query for the first time, it takes 30 secs total time, but every subsequent time I see in the query profile that it gets the result set from cache and takes 1-2 secs total...

Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

I am wondering if it is using the remote result cache; in that case, the config should work. There are 4 types of cache mentioned here: https://docs.databricks.com/en/sql/user/queries/query-caching.html#types-of-query-caches-in-databricks-sql Local cache: ...

3 More Replies
om_bk_00
by New Contributor III
  • 534 Views
  • 1 reply
  • 0 kudos

How to pass parameters for jobs containing for_each_task

resources:
  jobs:
    X:
      name: X
      tasks:
        - task_key: X
          for_each_task:
            inputs: "{{job.parameters.input}}"
            task:
              task_key: X
              existing_cluster_id: ${var.my_cluster_id}
              ...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

To reference job parameters in the inputs field, use the syntax {{job.parameters.<name>}}. Kindly refer to https://docs.databricks.com/en/jobs/for-each.html

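
On the consuming side, a hedged sketch (the parameter wiring is an assumption, not from the thread): if the inner task is a notebook that receives the per-iteration value as a parameter named input, it can read the value with widgets.

    # Assumes the inner notebook task is passed the iteration value as a
    # parameter named "input"; the name is hypothetical.
    dbutils.widgets.text("input", "")
    value = dbutils.widgets.get("input")
    print(f"Processing iteration value: {value}")
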
gvvishnu
by New Contributor
  • 520 Views
  • 1 reply
  • 0 kudos

Can Databricks support the Murmur hash function?

In our current project we are using the Murmur hash function in Hadoop. We are planning a migration to Databricks. Can Databricks support the Murmur hash function?

Latest Reply
brockb
Databricks Employee
  • 0 kudos

Hi @gvvishnu , Thanks for your question. My understanding is that the Apache Spark `hash()` function implements the `org.apache.spark.sql.catalyst.expressions.Murmur3Hash` expression. You can see this in the Spark source code here: https://github.com...

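
A quick check of that built-in, runnable as-is in a notebook: hash() is Spark's 32-bit Murmur3, and xxhash64() is a 64-bit alternative.

    from pyspark.sql import functions as F

    # hash() implements 32-bit Murmur3; xxhash64() is the 64-bit option.
    df = spark.range(3).select(
        "id",
        F.hash("id").alias("murmur3_32"),
        F.xxhash64("id").alias("xxhash_64"),
    )
    df.show()
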
shhhhhh
by New Contributor III
  • 530 Views
  • 5 replies
  • 0 kudos

How to connect from Serverless Plane to On-Prem SQL Server

So, has anybody tried connecting Databricks serverless in the serverless plane to an on-prem SQL Server? We can connect a normal Databricks cluster with federated queries and external data connections to an on-prem SQL Server. We can connect serverless to Azu...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

No, Private Link is for setting up your workspace with no access to the internet. Have you tried allowing the NCC IPs on the on-prem firewall?

4 More Replies
Greg_c
by New Contributor II
  • 227 Views
  • 1 reply
  • 0 kudos

Passing parameters (variables?) in DAGs

Regarding DAGs and the tasks in them: can I pass a parameter/variable in a task? I have the same structure as here: https://github.com/databricks/bundle-examples/blob/main/default_sql/resources/default_sql_sql_job.yml and I want to pass variables to .sq...

Latest Reply
filipniziol
Esteemed Contributor
  • 0 kudos

Hi @Greg_c, in Databricks Asset Bundles you can pass a parameter to a SQL file task. Here is an end-to-end example: 1. My SQL file (with an :id parameter). 2. The job YAML:
resources:
  jobs:
    run_sql_file_job:
      name: run_sql_file_job
      ...

priyansh
by New Contributor III
  • 889 Views
  • 3 replies
  • 1 kudos

What stuff can UCX not do?

Hey folks! I want to know the limitations of UCX, i.e., what are the things, especially during migration, that we have to do manually? UCX is currently in development, which means it may have some drawbacks too; I want to know what those are.

Latest Reply
monstercop
New Contributor II
  • 1 kudos

I guess you will find some differences before and after; for example, using a wildcard to point to folders in ADLS Gen2 for external tables is supported in Hive but not in UC catalogs.

2 More Replies
yvishal519
by Contributor
  • 511 Views
  • 1 reply
  • 0 kudos

Identifying Full Refresh vs. Incremental Runs in Delta Live Tables

Hello Community, I am working with a Delta Live Tables (DLT) pipeline that primarily operates in incremental mode. However, there are specific scenarios where I need to perform a full refresh of the pipeline. I am looking for an efficient and reliable...

Latest Reply
Takuya-Omi
Valued Contributor II
  • 0 kudos

Hello, there are two ways to determine whether a DLT pipeline is running in full-refresh or incremental mode. 1. DLT event log schema: the details column in the DLT event log schema includes information on "full_refresh". You can use this to identify whethe...

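
A hedged sketch of the event-log approach (the table name is a placeholder, and the exact JSON path under details follows the reply's description and may vary by release):

    # Query the DLT event log and read the full_refresh flag from update events.
    df = spark.sql("""
        SELECT timestamp,
               details:create_update:full_refresh::boolean AS full_refresh
        FROM event_log(TABLE(my_catalog.my_schema.my_table))
        WHERE event_type = 'create_update'
        ORDER BY timestamp DESC
    """)
    df.show(truncate=False)
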
Klusener
by New Contributor III
  • 538 Views
  • 7 replies
  • 11 kudos

Resolved! Out of Memory after adding distinct operation

I have a Spark pipeline which reads selected data from table_1 as a view, performs a few aggregations via group by in the next step, and writes to a target table. table_1 has large data, ~30 GB of compressed CSV. Step 1: create or replace temporary view base_data...

Latest Reply
MadhuB
Contributor III
  • 11 kudos

Hi @Klusener, distinct is a very expensive operation. For your case, I recommend using either of the deduplication strategies below. Most efficient method: df_deduped = df.dropDuplicates(subset=['unique_key_columns']). For a complex dedupe process: Partition...

6 More Replies
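
A minimal sketch of the window-based approach the reply starts to describe (column names are hypothetical): keep the newest row per business key instead of running distinct across every column.

    from pyspark.sql import Window, functions as F

    # Hypothetical columns: dedupe on customer_id, keeping the latest record.
    w = Window.partitionBy("customer_id").orderBy(F.col("updated_at").desc())
    df_deduped = (
        df.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn")
    )
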
