Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

ashraf1395
by Honored Contributor
  • 2842 Views
  • 1 replies
  • 2 kudos

Resolved! Connecting Fivetran with Databricks

So, we are migrating a Hive metastore to a UC catalog. We have some Fivetran connections. We are creating all tables as external locations and we have specified the external locations at the schema level. So when we specify the destination in the fivetra...

Latest Reply
NandiniN
Databricks Employee
  • 2 kudos

This message just says that if you do not provide the {{path}}, it will use the default location, which is on DBFS. When configuring the Fivetran connector, you will be prompted to select the catalog name, schema name, and then specify the externa...

drag7ter
by Contributor
  • 3926 Views
  • 4 replies
  • 1 kudos

Disable caching in Serverless SQL Warehouse

I have a Serverless SQL Warehouse cluster, and I run my SQL code in the SQL editor. When I run a query for the first time, I see it take 30 secs total time, but on every later run I see in the query profile that it gets the result set from cache and takes 1-2 secs total...

Latest Reply
NandiniN
Databricks Employee
  • 1 kudos

I am wondering if it is using the remote result cache; in that case the config should work. There are 4 types of cache mentioned here: https://docs.databricks.com/en/sql/user/queries/query-caching.html#types-of-query-caches-in-databricks-sql Local cache: ...

3 More Replies
om_bk_00
by New Contributor III
  • 4920 Views
  • 1 replies
  • 0 kudos

How to pass parameters for jobs containing for_each_task

resources:
  jobs:
    X:
      name: X
      tasks:
        - task_key: X
          for_each_task:
            inputs: "{{job.parameters.input}}"
            task:
              task_key: X
              existing_cluster_id: ${var.my_cluster_id}
              ...

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

To reference job parameters in the inputs field, use the syntax {{job.parameters.<name>}}. Kindly refer to https://docs.databricks.com/en/jobs/for-each.html
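A minimal sketch of how these pieces might fit together, assuming a job-level parameter named `input` whose default value is a JSON array string. The Python below only builds a settings dictionary in the general shape of a Jobs API payload. The job name, task keys, and notebook path are hypothetical, and the use of `{{input}}` for the current iteration value inside the nested task is an assumption to check against the linked for-each documentation.

```python
import json

# Hypothetical list of values to iterate over.
items = ["a", "b", "c"]

job_settings = {
    "name": "for_each_demo",  # hypothetical job name
    # Job-level parameter; the default is a JSON array encoded as a string.
    "parameters": [{"name": "input", "default": json.dumps(items)}],
    "tasks": [
        {
            "task_key": "loop",
            "for_each_task": {
                # Reference the job parameter by name, as in the reply above.
                "inputs": "{{job.parameters.input}}",
                "task": {
                    "task_key": "loop_iteration",
                    "notebook_task": {
                        "notebook_path": "/Workspace/demo/process_item",  # hypothetical
                        # Assumed reference to the current iteration value.
                        "base_parameters": {"item": "{{input}}"},
                    },
                },
            },
        }
    ],
}

print(json.dumps(job_settings, indent=2))
```

The same structure maps directly onto the YAML in the question: `parameters` sits at the job level next to `tasks`, and `inputs` stays a quoted string so the reference is resolved at run time.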

gvvishnu
by New Contributor
  • 5276 Views
  • 1 replies
  • 0 kudos

Can Databricks support the murmur hash function?

In our current project we are using the murmur hash function in Hadoop. We are planning a migration to Databricks. Can Databricks support the murmur hash function?

Latest Reply
brockb
Databricks Employee
  • 0 kudos

Hi @gvvishnu , Thanks for your question. My understanding is that the Apache Spark `hash()` function implements the `org.apache.spark.sql.catalyst.expressions.Murmur3Hash` expression. You can see this in the Spark source code here: https://github.com...
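For anyone comparing hash values across systems during a migration, here is a rough pure-Python sketch of the generic 32-bit MurmurHash3 algorithm (the x86_32 variant). The helper name `murmur3_32` is made up for this example. It is not Spark's implementation: Spark's `hash()` uses a fixed seed of 42 and its own per-type encoding of column values, so its results should not be assumed to match this function or a Hadoop-side murmur variant without testing on sample data.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Generic MurmurHash3 x86_32 over raw bytes; returns an unsigned 32-bit int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF
    h = seed & mask
    n = len(data)
    rounded = n - (n % 4)

    def mix_k(k: int) -> int:
        # Scramble one block: multiply, rotate left by 15, multiply.
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask
        return (k * c2) & mask

    # Body: process the input in 4-byte little-endian blocks.
    for i in range(0, rounded, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        h ^= mix_k(k)
        h = ((h << 13) | (h >> 19)) & mask
        h = (h * 5 + 0xE6546B64) & mask

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[rounded:]
    if tail:
        h ^= mix_k(int.from_bytes(tail, "little"))

    # Finalization: mix in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


print(hex(murmur3_32(b"hello")))
```

A practical check before migrating is to hash a handful of known keys on both sides and compare, rather than relying on the algorithm name alone.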

shhhhhh
by New Contributor III
  • 3046 Views
  • 5 replies
  • 0 kudos

How to connect from Serverless Plane to On-Prem SQL Server

Has anybody tried connecting Databricks Serverless in the serverless plane to an on-prem SQL Server? We can connect a normal Databricks cluster with federated queries with External Data connections to on-prem SQL Server. We can connect Serverless to Azu...

Latest Reply
Walter_C
Databricks Employee
  • 0 kudos

No, Private Link is for setting up your workspace with no access to the internet. Have you tried allowing the NCC IPs on the on-prem firewall?

4 More Replies
Greg_c
by New Contributor II
  • 1966 Views
  • 1 replies
  • 0 kudos

Passing parameters (variables?) in DAGs

Regarding DAGs and tasks in them - can I pass a parameter/variable in a task? I have the same structure as here: https://github.com/databricks/bundle-examples/blob/main/default_sql/resources/default_sql_sql_job.yml and I want to pass variables to .sq...

Latest Reply
filipniziol
Esteemed Contributor
  • 0 kudos

Hi @Greg_c, in Databricks Asset Bundles you can pass parameters to a SQL file task. Here is an end-to-end example:
1. My SQL file (with :id parameter):
2. The job YAML:
resources:
  jobs:
    run_sql_file_job:
      name: run_sql_file_job
      ...

Klusener
by Contributor
  • 4005 Views
  • 7 replies
  • 11 kudos

Resolved! Out of Memory after adding distinct operation

I have a Spark pipeline which reads selected data from table_1 as a view, performs a few aggregations via GROUP BY in the next step, and writes to a target table. table_1 holds a large amount of data, ~30GB of compressed CSV.
Step-1: create or replace temporary view base_data...

Latest Reply
MadhuB
Valued Contributor
  • 11 kudos

Hi @Klusener, distinct is a very expensive operation. For your case, I recommend using either of the deduplication strategies below.
Most efficient method:
df_deduped = df.dropDuplicates(subset=['unique_key_columns'])
For complex dedupe process - Partition...

6 More Replies
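To make the difference between the two approaches in the reply concrete, here is a small plain-Python sketch (no Spark involved) of the semantics only: a full-row distinct has to compare every column, while a key-based dedupe only looks at the chosen key columns. The helper names `distinct_rows` and `drop_duplicates` and the sample rows are made up for illustration. This sketch keeps the first row seen per key; Spark's `dropDuplicates` does not promise which duplicate survives.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def distinct_rows(rows: Iterable[Row]) -> List[Row]:
    """Keep one copy of each fully identical row (all columns compared)."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(sorted(row.items()))  # whole row is the key
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def drop_duplicates(rows: Iterable[Row], subset: List[str]) -> List[Row]:
    """Keep the first row seen for each combination of the key columns."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[col] for col in subset)  # only key columns compared
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


rows = [
    {"id": 1, "v": "a"},
    {"id": 1, "v": "b"},
    {"id": 2, "v": "c"},
    {"id": 1, "v": "a"},
]
print(len(distinct_rows(rows)))            # rows that differ in any column
print(len(drop_duplicates(rows, ["id"])))  # one row per id
```

The key-based version tracks much smaller keys, which is the intuition behind preferring `dropDuplicates` on a narrow set of key columns over a distinct across wide rows.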
majo2
by New Contributor II
  • 5090 Views
  • 2 replies
  • 2 kudos

tqdm progressbar in Databricks jobs

Hi, I'm using Databricks workflows to run a training job using `pytorch` + `lightning`. `lightning` has a built-in progress bar built on `tqdm` that tracks the progress. It works OK when I run the notebook outside of a workflow. But when I try to run n...

Latest Reply
ludovicc
New Contributor II
  • 2 kudos

I have found that only progressbar2 can work in both interactive notebooks and workflow notebooks. It's limited, but better than nothing. Tqdm is broken in workflows.

1 More Replies
Kayla
by Valued Contributor II
  • 1387 Views
  • 3 replies
  • 0 kudos

GCP Serverless SQL Warehouse Tag Propagation?

I have a serverless SQL Warehouse with a tag on it that is not making it to GCP. We have various job and AP clusters with tags that I can see in GCP. I am trying to have everything tagged for the purpose of monitoring billing/usage centrally. Do serverless ...

Latest Reply
Isi
Honored Contributor III
  • 0 kudos

Hey @Kayla, this is because serverless infrastructure is fully managed by Databricks, and you do not have direct control over the underlying resources as you do with standard clusters (non-serverless SQL warehouse). You can track SQL Warehouse usage wi...

2 More Replies
kasiviss42
by New Contributor III
  • 4496 Views
  • 10 replies
  • 2 kudos

Unity Credential Scope id not found in thread locals

I am facing this issue: [UNITY_CREDENTIAL_SCOPE_MISSING_SCOPE] Missing Credential Scope. Unity Credential Scope id not found in thread locals. The issue occurs when we try to list files using dbutils.fs.ls, and it also occurs at times when we try to write o...

Latest Reply
ashishCh
New Contributor II
  • 2 kudos

Thanks for the reply. It's working in DBR 15.4, but I want to use it with 13.3. Is there a workaround?

9 More Replies
Greg_c
by New Contributor II
  • 11900 Views
  • 4 replies
  • 0 kudos

Best practices for ensuring data quality in batch pipelines

Hello everyone, I couldn't find a topic on this - what are your best practices for ensuring data quality in batch pipelines? I've got a big pipeline processing data once per day. We thought about going with either DBT or DLT, but DLT seems more directed f...

Latest Reply
Isi
Honored Contributor III
  • 0 kudos

Hey Greg_c, I use DBT daily for batch data ingestion, and I believe it’s a great option. However, it’s important to consider that adopting DBT introduces additional complexity, and the team should carefully evaluate the impact of adding a new tool to t...

3 More Replies
Phani1
by Databricks MVP
  • 5043 Views
  • 5 replies
  • 1 kudos

Cluster idle time and usage details

How can we find out the usage details of the Databricks cluster? Specifically, we need to know how many nodes are in use, how long the cluster is idle, the time it takes to start up, and the jobs it is running along with their durations. Is there a q...

Latest Reply
Isi
Honored Contributor III
  • 1 kudos

Hey @hboleto, it’s difficult to accurately estimate the final cost of a Serverless cluster, as it is fully managed by Databricks. In contrast, Classic clusters allow for finer resource tuning since you can define spot instances and other instance type...

4 More Replies
Divya_sreeE
by Databricks Partner
  • 954 Views
  • 1 replies
  • 0 kudos

Unable to pass the task variables from Python Wheel to ForEach task

I understand that task variables are supported in Databricks notebooks, but the client requires us to use a Python wheel package in a Databricks workflow. We are not able to set the task variables using dbutils in the Python wheel file. Kindly s...

Latest Reply
saurabh18cs
Honored Contributor III
  • 0 kudos

Hi @Divya_sreeE, you can pass dynamic variables between tasks using Databricks' job parameters.
1) In your first Python wheel task, generate the dynamic variables and use the Databricks REST API to update the job parameters.
2) In the For Each loop, ret...

jasperputs
by New Contributor III
  • 12520 Views
  • 5 replies
  • 3 kudos

Resolved! Add Identity Column to Existing Table

Hello everyone. I am working with tables that need an identity column. I currently have a view in which I cast the different columns to the data type that I want. Now I want the result of this view to be inserted or merged into a table. The schema of...

Latest Reply
ramankr48
Databricks Partner
  • 3 kudos

Hello @Jasper Puts, how did you solve this issue of adding an identity column to an existing table? I'm also getting the same error as you.

4 More Replies
WYO
by New Contributor II
  • 3954 Views
  • 3 replies
  • 1 kudos

Export data from Databricks to on-prem

Hello everyone, I need to export some data to SQL Server Management Studio on premises. I need to verify that the new data on Databricks is aligned with the older data that we have on premises. Is it possible to export data as an Excel sheet or .csv file? R...

Latest Reply
Avinash_Narala
Databricks Partner
  • 1 kudos

You can compare your Databricks data with on-prem SQL Server data in two ways. First, you have to make a connection between SQL Server and Databricks using volumes. Using volumes, we can mount SQL Server data into Databricks Unity Catalog.
1. we can comp...

2 More Replies