Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

mdelvaux
by New Contributor
  • 780 Views
  • 0 replies
  • 0 kudos

BigQuery as foreign catalog - full object structs

Hi, we have mounted BigQuery, which hosts Google Analytics data, as a foreign catalog. When querying the tables, objects are returned as strings, with all keys obfuscated by "f" or "v", likely to avoid replicating object keys across all records and hence ...

Neli
by New Contributor III
  • 1785 Views
  • 3 replies
  • 0 kudos

Decrease frequency of Databricks Asset Bundle API

We are using DABs for our deployments and to invoke workflows. Behind the scenes, it calls the API below to get the status of the workflow. Currently, it checks every few seconds. Is there a way to decrease this frequency from seconds to minutes? GET /ap...

Latest Reply
Frank_Kennedy
New Contributor II
  • 0 kudos

Hey! If the API is checking the job status too frequently, you might want to consider implementing a custom polling mechanism. Instead of relying on the default frequency, you can build a simple script or function that pauses for a longer interval b...
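
The custom polling idea in this reply could look roughly like the sketch below. It assumes a hypothetical `get_status` callable that wraps whatever status request you already make; that callable, the 120-second default interval, and the set of terminal state names are illustrative assumptions, not part of the DABs CLI or of any Databricks SDK.

```python
import time
from typing import Callable

# Illustrative terminal states (assumption); adjust to what your status call returns.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def poll_until_done(
    get_status: Callable[[], str],
    interval_seconds: float = 120.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() once per interval until a terminal state or the timeout."""
    waited = 0.0
    while True:
        state = get_status()  # hypothetical wrapper around your own status request
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"run still in state {state!r} after {waited:.0f}s")
        sleep(interval_seconds)  # pause for minutes instead of seconds
        waited += interval_seconds
```

This only controls how often your own script asks for the status. The `sleep` parameter is injectable so the loop can be exercised without actually waiting.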

2 More Replies
thiagoawstest
by Contributor
  • 3998 Views
  • 1 replies
  • 0 kudos

Error Sent message larger than max

Hello, I'm receiving a large amount of data in a dataframe. When trying to record or display it, I receive the error below. How can I fix it, or where do I change the setting? SparkConnectGrpcException: <_MultiThreadedRendezvous of RPC that terminated...

Latest Reply
fghedin
Databricks Partner
  • 0 kudos

Hi @Retired_mod, I'm facing the same error. Can you provide the full name of the Spark conf we have to change? Thank you.

costi9992
by Databricks Partner
  • 4536 Views
  • 3 replies
  • 2 kudos

Access Databricks API using IDP token

Hello, we have a Databricks account and workspace, provided by AWS, with SSO enabled. Is there any way to access the Databricks workspace API (jobs, clusters, etc.) using a token retrieved from the identity provider? We can access databricks workspace API with A...

Latest Reply
fpopa
New Contributor II
  • 2 kudos

Hey, Costin and Anonymous user, have you managed to get this working? Do you have examples by any chance? I'm also trying something similar but I haven't been able to make it work. > authenticate and access the Databricks REST API by setting the Autho...

2 More Replies
csmcpherson
by Databricks Partner
  • 2578 Views
  • 2 replies
  • 0 kudos

AWS NAT (Network Address Translation) Automated On-demand Destruct / Create

Hi folks, our company typically uses Databricks during a 12-hour block; however, the AWS NAT for elastic compute is up 24 hours, and I'd rather not pay for those hours. I gather AWS Lambda and CloudWatch can be used to schedule / trigger NAT destruction...

Latest Reply
csmcpherson
Databricks Partner
  • 0 kudos

For interest, this is how I ended up solving the situation, with pointers from AWS support:
<< CREATE NAT >>
import boto3
import logging
from datetime import datetime
ec2 = boto3.client('ec2')
cloudwatch = boto3.client('logs')
def lambda_handler(even...

1 More Replies
NickLee
by New Contributor III
  • 1875 Views
  • 2 replies
  • 1 kudos

How to update num_workers dynamically in a job cluster

I am setting up a workflow with the UI. In the first task, a dynamic value for the next task's num_workers is calculated based on the actual data size. In the subsequent task, I'd like to use this calculated num_workers to update the job cluster's defau...

Latest Reply
NickLee
New Contributor III
  • 1 kudos

wonder if anyone has similar experience? thanks

1 More Replies
tramtran
by Contributor
  • 9007 Views
  • 3 replies
  • 5 kudos

Resolved! Driver: Out of Memory

Hi everyone, I have a streaming job with 29 notebooks that runs continuously. Initially, I allocated 28 GB of memory to the driver, but the job failed with a "Driver Out of Memory" error after 4 hours of execution. To address this, I increased the driv...

Latest Reply
xorbix_rshiva
Databricks MVP
  • 5 kudos

It looks like _source_cdc_time is the timestamp for when the CDC transaction occurred in your source system. This would be a good choice for a timestamp column for your watermark, since you would be deduping values according to the time the transacti...

2 More Replies
alex-syk
by New Contributor II
  • 15006 Views
  • 1 replies
  • 1 kudos

Delta table and AnalysisException: [PATH_NOT_FOUND] Path does not exist

I am performing some tests with delta tables. For each test, I write a delta table to Azure Blob Storage. Then I manually delete the delta table. After deleting the table and running my code again, I get this error:  AnalysisException: [PATH_NOT_FOUN...

Latest Reply
kumar_ravi
New Contributor III
  • 1 kudos

Yes, it is weird. A workaround for this:
files = dbutils.fs.ls("s3 bucket or azure blob path")
file_paths = [file.path for file in files]
if target_path not in file_paths:
    dbutils.fs.mkdirs(target_path)

aschiff
by Contributor II
  • 735062 Views
  • 33 replies
  • 5 kudos

GC Driver Error

I am using a cluster in Databricks to connect to a Tableau workbook through the JDBC connector. My Tableau workbook has been unable to load due to resources not being available through the data connection. I went to look at the driver log for my clus...

Latest Reply
galang123
New Contributor II
  • 5 kudos

yesasd

32 More Replies
KosmaS
by New Contributor III
  • 10226 Views
  • 3 replies
  • 7 kudos

Resolved! Efficient caching/persisting

To cache/persist, an action needs to be triggered. I'm just wondering, will it make any difference if, after persisting some df, I use, for instance, take(5) instead of count()? Will it be a bit more effective, because of sending results from 5 partiti...

Latest Reply
Rishabh-Pandey
Databricks MVP
  • 7 kudos

Yes, take(5) will be more efficient in some ways. When you cache or persist a DataFrame in Spark, you are instructing Spark to store the DataFrame's intermediate data in memory (or on disk, depending on the storage level). This can significantly speed...

2 More Replies