Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Data + AI Summit 2024 - Data Engineering & Streaming

Forum Posts

Raja_Databricks
by New Contributor III
  • 3403 Views
  • 6 replies
  • 7 kudos

Resolved! Liquid Clustering With Merge

Hi there, I'm working with a large Delta table (2TB) and I'm looking for the best way to efficiently update it with new data (10GB). I'm particularly interested in using Liquid Clustering for faster queries, but I'm unsure if it supports updates effic...

Latest Reply
RV-Gokul
New Contributor II
  • 7 kudos

@youssefmrini @erigaud I have a similar issue, and I've pretty much tried the solution mentioned above. However, I'm not noticing any changes when I use a temporary table or persist the table. My main table contains 3.1 terabytes of data with 42 billi...

5 More Replies
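The pattern this thread asks about can be sketched as follows. This is a minimal illustration, not the thread's actual solution; the table names (`target`, `updates`) and clustering column (`id`) are placeholders. Clustering on the merge key helps MERGE skip files, and an incremental `OPTIMIZE` clusters newly written data.

```python
# Sketch: Delta table with Liquid Clustering on the merge key,
# kept compact with OPTIMIZE after each MERGE.
# All table/column names below are placeholders.
ddl = """
CREATE TABLE IF NOT EXISTS target (id BIGINT, payload STRING)
CLUSTER BY (id)
"""

merge_sql = """
MERGE INTO target t
USING updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

# Incrementally clusters files written since the last OPTIMIZE:
optimize_sql = "OPTIMIZE target"

# On Databricks these would be run as, e.g.:
# for stmt in (ddl, merge_sql, optimize_sql):
#     spark.sql(stmt)
```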
robbe
by New Contributor III
  • 439 Views
  • 2 replies
  • 1 kudos

Resolved! Get job ID from Asset Bundles

When using Asset Bundles to deploy jobs, how does one get the job ID of the resources that are created? I would like to deploy some jobs through asset bundles, get the job IDs, and then trigger these jobs programmatically outside the CI/CD pipeline us...

Latest Reply
robbe
New Contributor III
  • 1 kudos

Thanks @mhiltner. I don't need to run jobs, just to get the ID. So I think that solution 2) is the way to go here. I'll accept the solution.

1 More Replies
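One way to look up a deployed job's ID (assuming the bundle gives the job a known name) is the Jobs API `jobs/list` endpoint, which accepts a `name` filter. A minimal sketch; the host, job name, and auth handling are placeholders you would adapt:

```python
# Sketch: resolve a bundle-deployed job's ID by name via the Jobs API 2.1.
import json
import urllib.parse


def jobs_list_url(host: str, job_name: str) -> str:
    # GET /api/2.1/jobs/list?name=<job_name> filters by exact job name.
    query = urllib.parse.urlencode({"name": job_name})
    return f"{host}/api/2.1/jobs/list?{query}"


def extract_job_id(response_body: str, job_name: str):
    # Return the job_id of the first listed job whose settings.name matches.
    for job in json.loads(response_body).get("jobs", []):
        if job.get("settings", {}).get("name") == job_name:
            return job["job_id"]
    return None
```

The returned `job_id` can then be passed to `jobs/run-now` from outside the CI/CD pipeline.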
Eiki
by New Contributor
  • 108 Views
  • 1 replies
  • 0 kudos

How to use the same job cluster across different tasks within one workflow

I created a workflow with notebooks and several tasks, but I would like to use only one job cluster to run all of them, without creating a new job cluster for each task, because I don't want to increase the execution time with each new job cluster ...

Latest Reply
brockb
Valued Contributor
  • 0 kudos

Hi, if I understand correctly, you are hoping to reduce overall job execution time by reducing the Cloud Service Provider instance provisioning time. Is that correct? If so, you may want to consider using a Pool of instances: https://docs.databricks.c...

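If the workflow's steps are tasks within a single job (rather than separately triggered jobs, where the pool suggestion above applies), tasks can share one cluster by defining it once under `job_clusters` and referencing its `job_cluster_key`. A sketch of the job settings; the names, paths, and cluster spec are placeholders:

```python
# Sketch: one job cluster reused by every task in a workflow.
# "shared-cluster", notebook paths, and the cluster spec are placeholders.
job_settings = {
    "name": "my-workflow",
    "job_clusters": [
        {
            "job_cluster_key": "shared-cluster",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "step1",
            "job_cluster_key": "shared-cluster",
            "notebook_task": {"notebook_path": "/Workspace/notebooks/step1"},
        },
        {
            "task_key": "step2",
            "depends_on": [{"task_key": "step1"}],
            "job_cluster_key": "shared-cluster",
            "notebook_task": {"notebook_path": "/Workspace/notebooks/step2"},
        },
    ],
}
```

Because both tasks reference the same `job_cluster_key`, the cluster is provisioned once for the run instead of per task.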
diego_poggioli
by Contributor
  • 277 Views
  • 1 replies
  • 0 kudos

Streaming foreachBatch _jdf jvm attribute not supported

I'm trying to perform a merge inside a streaming foreachBatch using the command: microBatchDF._jdf.sparkSession().sql(self.merge_query). Streaming runs fine if I use a Personal cluster, while if I use a Shared cluster streaming fails with the following ...

Latest Reply
holly
Contributor II
  • 0 kudos

Can you share what runtime your cluster is using? This error doesn't surprise me; Unity Catalog shared clusters have many security limitations, but the list is shrinking over time. https://docs.databricks.com/en/compute/access-mode-limitations.html#s...

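One workaround worth noting: the private `_jdf` JVM handle is what shared clusters block, but PySpark exposes a public `DataFrame.sparkSession` attribute (since 3.3) that avoids it. A hedged sketch; the view name `updates` and the query are placeholders, not from the thread:

```python
# Sketch: run a MERGE inside foreachBatch without touching the private
# _jdf JVM handle, using the public DataFrame.sparkSession attribute.
def merge_batch(micro_batch_df, batch_id, merge_query: str):
    # Expose the micro-batch to SQL under a placeholder view name,
    # then run the MERGE through the batch DataFrame's own session.
    micro_batch_df.createOrReplaceTempView("updates")
    micro_batch_df.sparkSession.sql(merge_query)

# Wired up as, e.g.:
# stream.writeStream.foreachBatch(
#     lambda df, bid: merge_batch(df, bid, merge_query)
# )
```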
nehaa
by New Contributor II
  • 212 Views
  • 2 replies
  • 0 kudos

Databricks dashboard publish

How do I publish a dashboard created from a notebook? I don't see the Publish option within the File menu anymore. The old video I referred to seems to show an option to publish the dashboard.

Latest Reply
Walter_C
Honored Contributor
  • 0 kudos

Can you share the link to the video you are referring to? As per the docs, no publish option is currently available; you can use Present Dashboard to view it. https://docs.databricks.com/en/notebooks/dashboards.html

1 More Replies
jenitjain
by New Contributor
  • 215 Views
  • 2 replies
  • 0 kudos

Certifications questions

What are the timings and days between which we can get certified? Can we purchase a certification at the location or are we supposed to purchase it beforehand?

Latest Reply
Cert-Team
Honored Contributor III
  • 0 kudos

Online exams can be purchased and taken anytime. Is this question related to DAIS?

1 More Replies
manish1987c
by New Contributor III
  • 126 Views
  • 1 replies
  • 0 kudos

Python project || write_micro_batch in Structured Streaming

Hi Team, we are in the process of developing a framework in Python using databricks-connect in VSO. However, when we try to run micro batches in the foreachBatch function, it gives us an error message saying that "few objects are not serializable: ...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @manish1987c,  Ensure that the Python version you’re using locally matches the one on your Databricks cluster. Minor version differences are usually acceptable (e.g., 3.10.11 versus 3.10.10).

swathiG
by New Contributor III
  • 506 Views
  • 7 replies
  • 1 kudos

Databricks IP address

I'm trying to call an API in a Databricks notebook, but the call fails with a "403 Forbidden" error. I think it is an issue with the IP address; can anyone help me find out which Databricks IP needs to be whitelis...

Latest Reply
swathiG
New Contributor III
  • 1 kudos

@jacovangelder can you please let me know where I can get the compute IP address?

6 More Replies
mysteryuser000
by New Contributor
  • 143 Views
  • 1 replies
  • 0 kudos

dlt pipeline will not create live tables

I have created a DLT pipeline based on four SQL notebooks, each containing between 1 and 3 queries. Each query begins with "create or refresh live table ..." yet each one outputs a materialized view. I have tried deleting the materialized views and ru...

Latest Reply
Kaniz_Fatma
Community Manager
  • 0 kudos

Hi @mysteryuser000,  In Databricks SQL, materialized views are Unity Catalog managed tables that allow you to precompute results based on the latest data in source tables. Unlike other implementations, the results returned reflect the state of data w...

laudhon
by New Contributor II
  • 599 Views
  • 6 replies
  • 3 kudos

Why is My MIN MAX Query Still Slow on a 29TB Delta Table After Liquid Clustering and Optimization?

Hello,I have a large Delta table with a size of 29TB. I implemented Liquid Clustering on this table, but running a simple MIN MAX query on the set cluster column is still extremely slow. I have already optimized the table. Am I missing something in m...

Latest Reply
LuisRSanchez
New Contributor III
  • 3 kudos

Hi, this operation should take seconds because it uses the precomputed statistics for the table. A few elements to verify: if the data type is datetime or integer it should work; if it is a string data type then it needs to read all the data. Verify the column ...

5 More Replies
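Building on the reply above: MIN/MAX can only use precomputed file-level statistics if Delta actually collects them for that column, and by default only the first 32 columns are indexed (the `delta.dataSkippingNumIndexedCols` table property). A hedged sketch; the table name `events` and the count are placeholders:

```python
# Sketch: widen the set of columns Delta collects min/max statistics for,
# so a clustered column beyond the default first 32 gets stats.
def widen_stats_sql(table: str, num_indexed_cols: int) -> str:
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.dataSkippingNumIndexedCols' = '{num_indexed_cols}')"
    )

# Statistics are collected at write time, so after widening you may need
# to recompute stats / rewrite existing files for them to take effect:
# spark.sql(widen_stats_sql("events", 40))
```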
dat_77
by New Contributor
  • 196 Views
  • 1 replies
  • 0 kudos

Change Default Parallelism ?

Hi, I attempted to parallelize my Spark read process by setting the default parallelism using spark.conf.set("spark.default.parallelism", "X"). However, despite setting this configuration, when I checked sc.defaultParallelism in my notebook, it display...

Latest Reply
irfan_elahi
New Contributor III
  • 0 kudos

sc.defaultParallelism is based on the number of worker cores in the cluster. It can't be overridden. The reason you are seeing 200 tasks is because of spark.sql.shuffle.partitions (whose default value is 200). This determines the number of shuffle pa...

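The distinction in the reply can be sketched as follows: `sc.defaultParallelism` is derived from the cluster's worker cores and is effectively read-only at runtime, while the 200 tasks come from the shuffle-partition setting, which can be changed per session (the value 64 below is just an example):

```python
# Sketch: tune shuffle task count instead of spark.default.parallelism.
def set_shuffle_partitions(spark, n: int) -> None:
    # Controls the number of partitions (tasks) after a shuffle; default 200.
    spark.conf.set("spark.sql.shuffle.partitions", str(n))


def enable_aqe(spark) -> None:
    # Alternatively, let Adaptive Query Execution coalesce shuffle
    # partitions dynamically based on data size.
    spark.conf.set("spark.sql.adaptive.enabled", "true")

# Usage in a notebook, e.g.:
# set_shuffle_partitions(spark, 64)
```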
Awoke101
by New Contributor III
  • 659 Views
  • 8 replies
  • 0 kudos

Resolved! Pandas_UDF not working on shared access mode but works on personal cluster

 The "dense_vector" column does not output on show(). Instead I get the error below. Any idea why it doesn't work on the shared access mode? Any alternatives? from fastembed import TextEmbedding, SparseTextEmbedding from pyspark.sql.pandas.functions ...

Data Engineering
pandas_udf
shared_access
udf
Latest Reply
jacovangelder
Contributor III
  • 0 kudos

For some reason a moderator is removing my pip freeze? No idea why; maybe too long/spammy for a comment. Anyway, I am using DBR 14.3 LTS with Shared Access Mode. I haven't installed any other version apart from fastembed==0.3.1. Included a screenshot ...

7 More Replies
DavidS1
by New Contributor
  • 154 Views
  • 1 replies
  • 0 kudos

Cost comparison of DLT to custom pipeline

Hello, our company currently has a number of custom pipelines written in Python for ETL, and I want to evaluate DLT to see if it will make things more efficient. A problem is that there is a restriction on using DLT "because it is too e...

Latest Reply
Zume
New Contributor II
  • 0 kudos

DLT is expensive in my opinion. I tried to run a simple notebook that just reads a parquet file into a dataframe and writes it out to cloud storage, and I got an error that I hit my CPU instance limit for my Azure subscription. I just gave up after t...

dbx_687_3__1b3Q
by New Contributor III
  • 4777 Views
  • 5 replies
  • 4 kudos

Resolved! Databricks Asset Bundle (DAB) from a Git repo?

My earlier question was about creating a Databricks Asset Bundle (DAB) from an existing workspace. I was able to get that working but after further consideration and some experimenting, I need to alter my question. My question is now "how do I create...

Latest Reply
nicole_lu_PM
New Contributor III
  • 4 kudos

We are very close to having an end-to-end solution for deploying DABs from a Git folder (Repo) in the Workspace! Check out my talk on DAIS24 here https://github.com/databricks/dais-cow-bff (video link on README). We are waiting for the feature that a...

4 More Replies
ksenija
by Contributor
  • 1512 Views
  • 8 replies
  • 1 kudos

DLT pipeline error key not found: user

When I try to create a DLT pipeline from a foreign catalog (BigQuery), I get this error: java.util.NoSuchElementException: key not found: user.I used the same script to copy Salesforce data and that worked completely fine.

Latest Reply
ksenija
Contributor
  • 1 kudos

Hi @lucasrocha, any luck with this error? I guess it's something with the connection to BigQuery, but I didn't find anything. Best regards, Ksenija

7 More Replies
