Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

data-grassroots
by New Contributor III
  • 4133 Views
  • 7 replies
  • 1 kudos

Resolved! Ingesting Files - Same file name, modified content

We have a data feed with files whose filenames stay the same but whose contents change over time (brand_a.csv, brand_b.csv, brand_c.csv ....). Copy Into seems to ignore the files when they change. If we set the Force flag to true and run it, we end up w...

Latest Reply
data-grassroots
New Contributor III
  • 1 kudos

Thanks for the validation, Werners! That's the path we've been heading down (copy + merge). I still have some DLT experiments planned but - at least for this situation - copy + merge works just fine.
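A minimal sketch of that copy + merge pattern in a notebook, assuming a staging table; the table, column, and path names (brands_staging, brands, brand_id) are illustrative, not from the thread:

# Re-ingest the changed files into a staging table, then merge into the target.
# All table, column, and path names here are placeholders.
spark.sql("""
    COPY INTO brands_staging
    FROM '/Volumes/main/raw/brand_files/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true')
    COPY_OPTIONS ('force' = 'true')
""")

spark.sql("""
    MERGE INTO brands AS t
    USING brands_staging AS s
    ON t.brand_id = s.brand_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")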

6 More Replies
peter_ticker
by New Contributor II
  • 826 Views
  • 17 replies
  • 2 kudos

XML Auto Loader rescuedDataColumn Doesn't Rescue Array Fields

Hiya! I'm interested in whether anyone has a solution to the following problem. If you load XML using Auto Loader or otherwise, and set the schema such that a single value is assumed for a given xpath but the actual XML contains multiple values (i....

Latest Reply
Witold
Honored Contributor
  • 2 kudos

Let me rephrase it. You can't use Message as the rowTag, because it's the root element. rowTag implies a tag within the root element, which might occur multiple times. Check the docs on reading and writing XML files; there you'll find exa...
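For reference, a minimal batch-read sketch of the rowTag idea, assuming DBR 14.3+ native XML support and a hypothetical <Record> element that repeats under the <Message> root (element names and path are illustrative):

# rowTag must name a repeating element inside the root, not the root itself.
df = (spark.read
      .format("xml")
      .option("rowTag", "Record")
      .option("rescuedDataColumn", "_rescued_data")
      .load("/Volumes/main/raw/xml/"))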

16 More Replies
evangelos
by New Contributor III
  • 276 Views
  • 5 replies
  • 0 kudos

Resolved! Databricks asset bundles: name_prefix doesn't work with presets

Hello! I am deploying a Databricks workflow using bundles and want to attach the prefix "prod_" to the name of my job. My target uses `mode: production` and I follow the instructions in https://learn.microsoft.com/en-us/azure/databricks/dev-tools/b...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

To attach the prefix "prod_" to the name of your job in a Databricks workflow using bundles, you need to ensure that the name_prefix preset is correctly configured in your databricks.yml file.

targets:
  prod:
    mode: production
    pres...

4 More Replies
oakhill
by New Contributor III
  • 439 Views
  • 3 replies
  • 1 kudos

How do we create a job cluster in Databricks Asset Bundles for use across different jobs?

When developing jobs with DABs, we use new_cluster to create a cluster for a particular job. I think it's a lot of lines and YAML when what I really need is a "small cluster" and a "big cluster" to reference for certain kinds of jobs. Tags would be on the...

Latest Reply
filipniziol
Contributor III
  • 1 kudos

Hi @oakhill, you can specify your job cluster configuration in your variables:

variables:
  small_cluster_id:
    description: "The small cluster with 2 workers used by the jobs"
    type: complex
    default:
      spark_version: "15.4.x-scala2.12"
      ...

2 More Replies
saniok
by New Contributor II
  • 199 Views
  • 2 replies
  • 0 kudos

How to Handle Versioning in Databricks Asset Bundles?

Hi everyone, in our organization we are transitioning from defining Databricks jobs in the UI to managing them with asset bundles. Since asset bundles can be deployed across multiple workspaces, each potentially having multiple targets (e.g., stag...

Latest Reply
Alberto_Umana
Databricks Employee
  • 0 kudos

Hi @saniok, in the databricks.yml file you can include version information to manage different versions of your bundles. Example:

bundle:
  name: my-bundle
  version: 1.0.0

resources:
  jobs:
    my-job:
      name: my-job
      ...

1 More Reply
Avinash_Narala
by Valued Contributor II
  • 399 Views
  • 7 replies
  • 3 kudos

Resolved! SQL Server to Databricks Migration

Hi, I want to build a Python function to migrate SQL Server tables to Databricks. Is there any guide or are there best practices on how to do so? It would be really helpful if there were. Regards, Avinash N

Latest Reply
filipniziol
Contributor III
  • 3 kudos

Hi @Avinash_Narala, if it is a lift and shift, then try this:
1. Set up Lakehouse Federation to SQL Server
2. Use CTAS statements to copy each table into Unity Catalog:

CREATE TABLE catalog_name.schema_name.table_name
AS SELECT * FROM sql_server_catalog_...
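A minimal sketch of step 2 as a Python loop, assuming a federated catalog named sqlserver_cat and placeholder target names (main.bronze); adjust to your own catalogs and schemas:

# List tables in the federated SQL Server schema, then CTAS each one
# into a Unity Catalog schema. All names here are placeholders.
tables = [r.tableName for r in spark.sql("SHOW TABLES IN sqlserver_cat.dbo").collect()]
for t in tables:
    spark.sql(
        f"CREATE TABLE IF NOT EXISTS main.bronze.{t} "
        f"AS SELECT * FROM sqlserver_cat.dbo.{t}"
    )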

6 More Replies
jeremy98
by Contributor
  • 1007 Views
  • 22 replies
  • 1 kudos

Wheel package to install in a serverless workflow

Hi guys, what is the way, through Databricks Asset Bundles, to declare a new job definition with serverless compute associated with each task that composes the workflow, such that inside each notebook task definition it is possible to pick up the dep...

Latest Reply
jeremy98
Contributor
  • 1 kudos

Ping @Alberto_Umana 

21 More Replies
dbx-user7354
by New Contributor III
  • 3491 Views
  • 7 replies
  • 3 kudos

PySpark DataFrame orderBy only orders within partitions when using multiple workers

I came across a PySpark issue when sorting a dataframe by a column. It seems like PySpark only orders the data within partitions when there are multiple workers, even though it shouldn't.

from pyspark.sql import functions as F
import matplotlib.pyplot...

(attached plots: dbxuser7354_0-1711014288660.png, dbxuser7354_1-1711014300462.png)
Latest Reply
Avinash_Narala
Valued Contributor II
  • 3 kudos

Hi @dbx-user7354, orderBy() should perform a global sort, as shown in plot-2, but per your description it is sorting the data within partitions, which is the behavior of sortWithinPartitions(). To resolve this, please try the latest DBR...
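A small self-contained sketch of the difference being discussed (illustrative only):

from pyspark.sql import functions as F

df = spark.range(0, 1_000_000).withColumn("v", F.rand(seed=42))

# Global sort: rows are totally ordered across all partitions.
globally_sorted = df.orderBy("v")

# Per-partition sort: each partition is sorted internally, but
# partitions are not ordered relative to each other.
partition_sorted = df.sortWithinPartitions("v")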

6 More Replies
SwathiChidurala
by New Contributor II
  • 7086 Views
  • 2 replies
  • 3 kudos

Resolved! Delta format

Hi, I am a student learning Databricks. In the code below I tried to write data in Delta format to a gold layer. I authenticated using the service principal method to read, write, and execute data, and I assigned the Storage Blob Contributor role, but...

Latest Reply
Avinash_Narala
Valued Contributor II
  • 3 kudos

Hi @SwathiChidurala, the error is because you don't have the folder trip_zone inside the gold folder, so you can try removing trip_zone from the location, or adding the folder trip_zone inside the gold folder in ADLS, and then try again. If th...
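For context, a minimal sketch of the kind of gold-layer write being discussed; the storage account, container, and folder names are placeholders:

# Path layout: abfss://<container>@<storage-account>.dfs.core.windows.net/<folder>
# The service principal must have access to this path.
(df.write
   .format("delta")
   .mode("overwrite")
   .save("abfss://gold@mystorageacct.dfs.core.windows.net/trip_zone/"))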

1 More Reply
Abdurrahman
by New Contributor II
  • 246 Views
  • 3 replies
  • 3 kudos

Move files from DBFS to workspace folders in Databricks

I want to move a zip file from DBFS to a workspace folder. I am using dbutils.fs.cp("dbfs file path", "workspace folder path") in a Databricks notebook, and I am seeing the following error: ExecutionError: An error occurred while calling o455.cp. : jav...

Latest Reply
nick533
New Contributor III
  • 3 kudos

Permission denied appears to be the cause of the error. To read from the DBFS path and write to the workspace folder, please make sure you have the required permissions. The following permissions may be required: the DBFS file path can be read...
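One commonly used form of the copy, assuming workspace files are enabled; both paths are placeholders, and the file:/ prefix addresses the workspace filesystem from a notebook:

# Placeholder paths; adjust to your own file and user folder.
dbutils.fs.cp(
    "dbfs:/FileStore/archives/my_archive.zip",
    "file:/Workspace/Users/someone@example.com/my_archive.zip",
)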

2 More Replies
nhakobian
by New Contributor
  • 162 Views
  • 1 reply
  • 0 kudos

Python Artifact Installation Error on Runtime 16.1 on Shared Clusters

I've run into an issue with no clear path to resolution. Due to various integrations we have in Unity Catalog, some jobs have to run in a Shared Cluster environment in order to authenticate properly to the underlying data resource. When setting up ...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

The Enable libraries and init scripts on shared Unity Catalog clusters setting is deprecated in Databricks Runtime 16.0 and above; please refer to the documentation on the deprecation. Disabling this feature at the workspace level would pr...

ashraf1395
by Valued Contributor II
  • 250 Views
  • 1 reply
  • 2 kudos

Resolved! Connecting Fivetran with databricks

So, we are migrating a Hive metastore to a UC catalog. We have some Fivetran connections. We are creating all tables as external, and we have specified the external locations at the schema level. So when we specify the destination in the Fivetra...

(attached screenshot: ashraf1395_1-1737527775298.png)
Latest Reply
NandiniN
Databricks Employee
  • 2 kudos

This message just means that if you do not provide the {{path}}, it will use the default location, which is on DBFS. When configuring the Fivetran connector, you will be prompted to select the catalog name, schema name, and then specify the externa...

Asaph
by New Contributor
  • 327 Views
  • 4 replies
  • 0 kudos

Issue with databricks.sdk - AccountClient Service Principals API

Hi everyone,I’ve been trying to work with the databricks.sdk Python library to manage service principals programmatically. However, I’m running into an issue when attempting to create a service principal using the AccountClient class. Below is the co...

Latest Reply
nick533
New Contributor III
  • 0 kudos

This can be an issue with missing authentication or configuration. When constructing the AccountClient instance, please ensure that the required authentication details are present. Additionally, since this action is account-level, make su...
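A minimal sketch of account-level authentication with the SDK; all credential values are placeholders:

from databricks.sdk import AccountClient

# Account-level (not workspace-level) credentials are required here.
acct = AccountClient(
    host="https://accounts.cloud.databricks.com",  # accounts console host
    account_id="<account-id>",
    client_id="<oauth-client-id>",
    client_secret="<oauth-client-secret>",
)

sp = acct.service_principals.create(display_name="my-service-principal")
print(sp.id)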

3 More Replies
drag7ter
by Contributor
  • 257 Views
  • 4 replies
  • 1 kudos

Disable caching in Serverless SQL Warehouse

I have a Serverless SQL Warehouse cluster, and I run my SQL code in the SQL editor. When I run a query for the first time, it takes 30 secs total time, but on every subsequent run I see in the query profile that it gets the result set from cache and takes 1-2 secs total...

Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

I am wondering if it is using the remote result cache; in that case, the config should work. There are 4 types of cache mentioned here: https://docs.databricks.com/en/sql/user/queries/query-caching.html#types-of-query-caches-in-databricks-sql Local cache: ...
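For testing, the documented use_cached_result parameter can be turned off for the session; a minimal sketch from a notebook (in the SQL editor you would run the SET statement directly):

# Disable reuse of cached query results for this session.
spark.sql("SET use_cached_result = false")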

3 More Replies
om_bk_00
by New Contributor III
  • 112 Views
  • 1 reply
  • 0 kudos

How to pass parameters for jobs containing for_each_task

resources:
  jobs:
    X:
      name: X
      tasks:
        - task_key: X
          for_each_task:
            inputs: "{{job.parameters.input}}"
            task:
              task_key: X
              existing_cluster_id: ${var.my_cluster_id}
              ...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

To reference job parameters in the inputs field, use the syntax {{job.parameters.<name>}}. Kindly refer to https://docs.databricks.com/en/jobs/for-each.html

