Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

laudhon
by New Contributor II
  • 6149 Views
  • 6 replies
  • 3 kudos

Resolved! Why is My MIN MAX Query Still Slow on a 29TB Delta Table After Liquid Clustering and Optimization?

Hello, I have a large Delta table with a size of 29TB. I implemented Liquid Clustering on this table, but running a simple MIN/MAX query on the configured clustering column is still extremely slow. I have already optimized the table. Am I missing something in m...

Latest Reply
LuisRSanchez
New Contributor III
  • 3 kudos

Hi, this operation should take seconds because it uses the precomputed statistics for the table. A few elements to verify: if the data type is datetime or integer it should work; if it is a string data type then it needs to read all the data. Verify the column ...
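A toy, Spark-free sketch of why this can be metadata-only: Delta keeps per-file min/max column statistics in the transaction log, so MIN/MAX on a supported column type can be answered from those stats without scanning the table (file names and values below are made up):

```python
# Hypothetical per-file statistics, as Delta records them in the transaction log.
file_stats = [
    {"file": "part-000.parquet", "min_ts": 1,   "max_ts": 500},
    {"file": "part-001.parquet", "min_ts": 501, "max_ts": 900},
    {"file": "part-002.parquet", "min_ts": 901, "max_ts": 1200},
]

# The global MIN/MAX falls out of the per-file stats; no row-level reads needed.
global_min = min(s["min_ts"] for s in file_stats)
global_max = max(s["max_ts"] for s in file_stats)
print(global_min, global_max)  # 1 1200
```

This is why a string column behaves differently: when the stats cannot be used for the column, the engine falls back to reading the data itself.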

5 More Replies
dat_77
by New Contributor
  • 6760 Views
  • 1 reply
  • 0 kudos

Change Default Parallelism?

Hi, I attempted to parallelize my Spark read process by setting the default parallelism using spark.conf.set("spark.default.parallelism", "X"). However, despite setting this configuration, when I checked sc.defaultParallelism in my notebook, it display...

Latest Reply
irfan_elahi
Databricks Employee
  • 0 kudos

sc.defaultParallelism is based on the number of worker cores in the cluster. It can't be overridden. The reason you are seeing 200 tasks is because of spark.sql.shuffle.partitions (whose default value is 200). This determines the number of shuffle pa...

Awoke101
by New Contributor III
  • 8283 Views
  • 8 replies
  • 0 kudos

Resolved! Pandas_UDF not working on shared access mode but works on personal cluster

 The "dense_vector" column does not output on show(). Instead I get the error below. Any idea why it doesn't work on the shared access mode? Any alternatives? from fastembed import TextEmbedding, SparseTextEmbedding from pyspark.sql.pandas.functions ...

Data Engineering
pandas_udf
shared_access
udf
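For context, a pandas_udf body is just a function from a pandas Series to a pandas Series of the same length, which can be tested standalone before wrapping it with @pandas_udf from pyspark.sql.functions. A minimal Spark-free sketch (the "embedding" below is a made-up stand-in for a fastembed model call, not the poster's actual code):

```python
import pandas as pd

def dense_vector(texts: pd.Series) -> pd.Series:
    # Stand-in "embedding": one float per input string (hypothetical).
    # In the real UDF this would call the fastembed model instead.
    return texts.str.len().astype(float)

out = dense_vector(pd.Series(["spark", "databricks"]))
print(out.tolist())  # [5.0, 10.0]
```

Testing the body this way helps separate model/packaging problems from cluster access-mode restrictions, which are often the actual culprit on shared access mode.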
Latest Reply
jacovangelder
Databricks MVP
  • 0 kudos

For some reason a moderator is removing my pip freeze? No idea why. Maybe too long/spammy for a comment. Anyway, I am using DBR 14.3 LTS with Shared Access Mode. I haven't installed any other version apart from fastembed==0.3.1. Included a screenshot ...

7 More Replies
DavidS1
by New Contributor
  • 1384 Views
  • 1 reply
  • 0 kudos

Cost comparison of DLT to custom pipeline

Hello, our company currently has a number of custom pipelines written in Python for ETL, and I want to do an evaluation of DLT to see if that will make things more efficient. A problem is that there is a restriction on using DLT "because it is too e...

Latest Reply
Zume
New Contributor II
  • 0 kudos

DLT is expensive in my opinion. I tried to run a simple notebook that just reads a parquet file into a dataframe and writes it out to cloud storage, and I got an error that I hit my CPU instance limit for my Azure subscription. I just gave up after t...

ksenija
by Contributor
  • 6410 Views
  • 8 replies
  • 1 kudos

DLT pipeline error key not found: user

When I try to create a DLT pipeline from a foreign catalog (BigQuery), I get this error: java.util.NoSuchElementException: key not found: user. I used the same script to copy Salesforce data and that worked completely fine.

Latest Reply
ksenija
Contributor
  • 1 kudos

Hi @lucasrocha, any luck with this error? I guess it's something with the connection to BigQuery, but I didn't find anything. Best regards, Ksenija

7 More Replies
Chinu
by New Contributor III
  • 4576 Views
  • 3 replies
  • 2 kudos

How do I access DLT advanced configuration from a Python notebook?

Hi Team, I'm trying to get a DLT Advanced Configuration value from the Python DLT notebook. For example, I set "something": "some path" in Advanced configuration in DLT and I want to get the value from my DLT notebook. I tried "dbutils.widgets.get("some...

Latest Reply
Mo
Databricks Employee
  • 2 kudos

Here you can find the documentation on how to use parameters in DLT (SQL and Python): https://docs.databricks.com/en/delta-live-tables/settings.html#parameterize-dataset-declarations-in-python-or-sql

2 More Replies
vanagnostopoulo
by New Contributor III
  • 3304 Views
  • 5 replies
  • 0 kudos

validate bundle does not work on windows 10 PRO x64

I use the Databricks CLI (databricks_cli_0.221.1_windows_amd64-signed), and when I run "databricks bundle validate" in my project I get "Error: no shell found". In git-bash it works, but I have other problems there.

Latest Reply
jacovangelder
Databricks MVP
  • 0 kudos

That's strange. Just checking, are you running it from the right folder (the location of the databricks.yml file)?

4 More Replies
WynanddB
by New Contributor III
  • 9148 Views
  • 4 replies
  • 3 kudos

Invalid characters in column name

I get the following error: com.databricks.sql.transaction.tahoe.DeltaAnalysisException: [DELTA_INVALID_CHARACTERS_IN_COLUMN_NAMES] Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema. It's a new instance of databricks a...

Latest Reply
jacovangelder
Databricks MVP
  • 3 kudos

My guess is you have a newline character (\n) in one of the CSV header columns; they are not easy to spot. Have you checked for that? You can also try .option("header","true") so Spark doesn't treat your header as content. Might also want t...
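One common workaround is to sanitize the header names after reading, replacing every character Delta rejects. A minimal sketch (column names are made up; on a DataFrame you would then apply it with df.toDF(*sanitize(df.columns))):

```python
import re

# The character set comes straight from the DELTA_INVALID_CHARACTERS_IN_COLUMN_NAMES
# error message: space, comma, semicolon, braces, parentheses, newline, tab, equals.
INVALID = re.compile(r"[ ,;{}()\n\t=]+")

def sanitize(columns):
    # Strip stray whitespace (including trailing newlines from a bad header row),
    # then collapse each run of invalid characters into a single underscore.
    return [INVALID.sub("_", c.strip()) for c in columns]

print(sanitize(["Order ID", "Amount (USD)", "Region;Code\n"]))
# ['Order_ID', 'Amount_USD_', 'Region_Code']
```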

3 More Replies
Nastia
by New Contributor III
  • 3586 Views
  • 7 replies
  • 2 kudos

Resolved! I keep getting dataset from spark.table command (instead of dataframe)

I am trying to create a simple dlt pipeline:

@dlt.table
def today_latest_execution():
    return spark.sql("SELECT * FROM LIVE.last_execution")

@on_event_hook
def write_events_to_x(event
  if (
     today_latest_execution().count() == 0
      try: ... ...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

What if you do: return spark.sql("SELECT * FROM LIVE.last_execution").toDF()

6 More Replies
RobinK
by Contributor
  • 17583 Views
  • 12 replies
  • 14 kudos

Resolved! Databricks Jobs do not run on job compute but on shared compute

Hello, since last night none of our ETL jobs in Databricks are running anymore, although we have not made any code changes. The identical jobs (deployed with Databricks asset bundles) run on an all-purpose cluster, but fail on a job cluster. We have no...

Latest Reply
jcap
New Contributor II
  • 14 kudos

I do not believe this is solved; similar to a comment over here: https://community.databricks.com/t5/data-engineering/databrickssession-broken-for-15-1/td-p/70585 We are also seeing this error in 14.3 LTS from a simple example: from pyspark.sql.function...

11 More Replies
Przemk00
by New Contributor II
  • 1201 Views
  • 1 reply
  • 0 kudos

Facilitate if/else condition in conjunction with parameters

The current state: I have a working workflow with 3 tasks and several parameters. The change: I want to modify the workflow to add a 4th task (if/else) so that, based on one of the parameters (call it xyz), the workflow will not proceed after the 1st task. The...

Latest Reply
Przemk00
New Contributor II
  • 0 kudos

The logic should be simple: if the xyz parameter equals 1000, then run the other 2 tasks; otherwise, do not run the rest.
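That gating pattern maps onto the Jobs "If/else condition" task type. A hedged sketch of the relevant fragment of a Jobs API 2.1 payload, shown as a Python dict (task keys are made up; exact field names per the Jobs API condition_task reference):

```python
# The condition task compares a job parameter against a literal value.
condition = {
    "task_key": "check_xyz",
    "condition_task": {
        "op": "EQUAL_TO",
        "left": "{{job.parameters.xyz}}",  # dynamic value reference
        "right": "1000",
    },
}

# Downstream tasks depend on the "true" outcome, so they only run
# when xyz equals 1000; otherwise the run stops after the condition.
downstream = {
    "task_key": "run_rest",
    "depends_on": [{"task_key": "check_xyz", "outcome": "true"}],
}
```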

WWoman
by Contributor
  • 1416 Views
  • 2 replies
  • 1 kudos

Is there a way to create a local CSV file by creating a local external table?

Hello, I have a user who would like to create a CSV file on their local file system by creating an external table (USING CSV) and specifying a local file for the path parameter using SQL. They will be running this command from a local client (DbVisua...

Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

Not sure if this would work, but you could run Unity Catalog locally (possible since last week) and define the CSV file as a table in that local UC, then query it.

1 More Replies
aap_scott
by New Contributor
  • 1209 Views
  • 1 reply
  • 0 kudos

Cannot navigate to workspace directory in multi-node cluster

When I open a terminal on a multi-node cluster, I cannot navigate to the workspace directory. However, on a single-node cluster it works fine. Thanks in advance.

[Screenshots attached: aap_scott_0-1718313043369.png, aap_scott_1-1718313219551.png]
Latest Reply
NateAnth
Databricks Employee
  • 0 kudos

If this cluster is backed by an AWS Graviton instance, there is currently a limitation with the web terminal not being able to interact with the Workspace Filesystem.  Please give it a try in the notebook cell with the %sh magic command or switch to ...
