Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

LavaLiah_85929
by New Contributor II
  • 986 Views
  • 2 replies
  • 0 kudos

"desc history" shows versions older than the default logRetentionDuration of 30 days

I have a CDC-enabled table where no data changes were made since July 28. Then updates started occurring from November 22 onwards. The first checkpoint occurred on Nov 28. Based on the corresponding timestamps of the checkpoint and log files, it looks lik...

Latest Reply
shyam_9
Valued Contributor
  • 0 kudos

Hi @Laval Liahkim, could you please try running VACUUM with a 30-day retention? Please confirm when you last ran the command with the 30-day retention period. Also, when did you create this table, and do you see that old data files were deleted? Also, when disk...

1 More Replies
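
For reference, a minimal sketch of the commands this thread discusses, assuming a placeholder table name my_cdc_table:

```python
# Inspect the table history the poster refers to.
spark.sql("DESCRIBE HISTORY my_cdc_table").show(truncate=False)

# logRetentionDuration controls how long commit history is kept;
# deletedFileRetentionDuration controls how long removed data files survive VACUUM.
spark.sql("""
    ALTER TABLE my_cdc_table SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 30 days'
    )
""")

# VACUUM with an explicit 30-day (720-hour) retention, as suggested in the reply.
spark.sql("VACUUM my_cdc_table RETAIN 720 HOURS")
```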
espenol
by New Contributor III
  • 2219 Views
  • 3 replies
  • 0 kudos

How to debug Workflow Jobs timing out and DLT pipelines running forever?

So I'm the designated data engineer for a proof of concept we're running. I'm working with one infrastructure guy who's setting up everything in Terraform (company policy). He's got the setup down for Databricks so we can configure clusters and run n...

Latest Reply
shan_chandra
Esteemed Contributor
  • 0 kudos

@Espen Solvang - just thought of checking with you: could you please let us know if you require further assistance on this?

2 More Replies
nimble
by New Contributor
  • 2036 Views
  • 2 replies
  • 0 kudos

How can I run a streaming query on a new table with the table property change data feed enabled?

In Databricks on AWS, I am trying to run a streaming query (trigger=Once) with delta.enableChangeDataFeed=true in the table definition as instructed, but this always fails with: ERROR: Some streams terminated before this command could finish! com.d...

Latest Reply
swethaNandan
New Contributor III
  • 0 kudos

Hi @daniel e, can you try running a select on the table changes from version 0 and see if you get output?
SELECT * FROM table_changes('tableName', 0)
Also, please share the streaming query that you are running.

1 More Replies
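
A sketch of the checks suggested above; the table name, sink name, and checkpoint path are placeholders:

```python
# 1. Verify the change data feed actually has content from version 0.
spark.sql("SELECT * FROM table_changes('tableName', 0)").show()

# 2. A trigger-once streaming read of the change feed, written to a sink table.
(spark.readStream
     .format("delta")
     .option("readChangeFeed", "true")
     .option("startingVersion", 0)
     .table("tableName")
     .writeStream
     .format("delta")
     .option("checkpointLocation", "/tmp/checkpoints/tableName")
     .trigger(once=True)
     .toTable("tableName_sink"))
```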
Raghu_Bindingan
by New Contributor III
  • 2913 Views
  • 4 replies
  • 2 kudos

Truncate delta live table and try to repopulate it in the pipeline

Has anyone attempted to truncate a Delta Live gold-level table that gets populated via a pipeline and then tried to repopulate it by starting the pipeline? I have this situation wherein I need to reprocess all data in my gold table, so I stopped the ...

Latest Reply
Rajeev45
New Contributor III
  • 2 kudos

Can you please confirm whether the job is still failing with the same error even after the “FULL REFRESH ALL” option? If so, please share the full stack trace. Is it failing in any of the below steps?
  • Creating update
  • Waiting for resources
  • Initializing
  • Resetting...

3 More Replies
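
For anyone who wants to trigger the same thing programmatically, a sketch of starting a full-refresh update through the Pipelines REST API; the workspace URL, token, and pipeline ID are placeholders:

```python
import requests

host = "https://<workspace-url>"        # placeholder
token = "<personal-access-token>"       # placeholder
pipeline_id = "<pipeline-id>"           # placeholder

# Start an update with full_refresh=True: resets the tables and reprocesses
# all data -- the API counterpart of "FULL REFRESH ALL" in the UI.
resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    json={"full_refresh": True},
)
resp.raise_for_status()
print(resp.json())  # contains the update_id of the triggered run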
DevOps88
by New Contributor II
  • 1589 Views
  • 2 replies
  • 3 kudos

Is there a way to run jobs with integration tests from the Databricks interface?

Currently, Nutter can be run inside a common CI/CD pipeline from GitLab, but we need the ability to run jobs with integration tests from the Databricks interface. How can Nutter be used directly from Databricks? Are there any integration test examples a...

Latest Reply
jose_gonzalez
Moderator
  • 3 kudos

Hi @Dmitrii Kalashnikov, you can find examples and more details here: https://github.com/alexott/databricks-nutter-repos-demo

1 More Replies
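
For a sense of what running Nutter from inside the workspace looks like, a minimal sketch following the microsoft/nutter conventions (the table name is a placeholder); a Databricks job pointed at this notebook then gives you integration tests runnable from the Databricks interface:

```python
from runtime.nutterfixture import NutterFixture

class DemoTestFixture(NutterFixture):
    def run_row_count(self):
        # "run_" methods execute the code under test.
        spark.sql("CREATE TABLE IF NOT EXISTS demo_table AS SELECT 1 AS id")

    def assertion_row_count(self):
        # "assertion_" methods verify the outcome.
        count = spark.sql("SELECT COUNT(*) AS c FROM demo_table").first().c
        assert count >= 1

# Run directly in a notebook cell and print the results.
result = DemoTestFixture().execute_tests()
print(result.to_string())
```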
Trodenn
by New Contributor III
  • 2987 Views
  • 5 replies
  • 1 kudos

Resolved! approxQuantile does not seem to be working with Delta Live Tables (DLT)

Hi, I am trying to use the approxQuantile() function to populate a list that I made, yet somehow, whenever I try to run the code it's as if the list is empty and there are no values in it. The code is written as below: @dlt.table(name = "customer_order_silv...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

Maybe try to use (and first test in a separate notebook) the standard df = spark.read.table("customer_order_silver") to calculate approxQuantile. Of course, you need to ensure that customer_order_silver has a target location in the catalog, so read us...

4 More Replies
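
To illustrate the suggestion above, a minimal sketch (the column name order_amount is a placeholder) of reading the published table and computing the quantile outside the @dlt.table function:

```python
# Requires the DLT pipeline to publish to a target schema/catalog so the
# table is readable as a regular Delta table.
df = spark.read.table("customer_order_silver")

# Median of a numeric column with 1% relative error.
quantiles = df.approxQuantile("order_amount", [0.5], 0.01)
print(quantiles)
```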
guru1
by New Contributor II
  • 3434 Views
  • 2 replies
  • 0 kudos

Resolved! Facing the issue below when connecting Event Hubs with Databricks; followed an earlier discussion on this but found no solution

ERROR: Query termination received for [id=37bada03-131b-4fbb-8992-a427263fef2c, runId=cf3d7c18-780e-43ae-aed0-9daf2939b823], with exception: java.lang.IllegalArgumentException: Input byte array has wrong 4-byte ending unit at java.util.Base64$Decoder...

Latest Reply
Annapurna_Hiriy
New Contributor III
  • 0 kudos

The issue could be due to a mismatch between the Event Hubs jar and the dependencies added; also, not all the required dependencies may have been added. Suggestions: use the azure_eventhubs_spark_2_12_.jar Event Hubs Spark jar along with the following dependencies...

1 More Replies
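
Beyond the dependency alignment suggested above, another commonly reported trigger for this exact Base64 error is passing the Event Hubs connection string unencrypted; the azure-event-hubs-spark connector expects it wrapped with EventHubsUtils.encrypt. A sketch, assuming that connector is attached and its version matches your Scala/Spark version:

```python
# Placeholder connection string; substitute your namespace and key.
conn_str = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."

# The connector expects an encrypted connection string; passing plain text
# can surface Base64 decoding errors like the one in this post.
ehConf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str)
}

df = (spark.readStream
          .format("eventhubs")
          .options(**ehConf)
          .load())
```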
ravinchi
by New Contributor III
  • 3188 Views
  • 5 replies
  • 9 kudos

I'd like to ingest data into my ADLS from SQL Server in an incremental manner using Delta Live Tables.

I'd like to ingest data into my ADLS from SQL Server in an incremental manner using Delta Live Tables. I do not want to use any staging tables. I was using CDC; when I call dlt.apply_changes, it asks me to specify source and target. Since source ...

Latest Reply
Sandeep
Contributor III
  • 9 kudos

If you have a CDC feed, it looks like you can use this: https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-cdc.html

4 More Replies
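
For reference, a minimal sketch of the apply_changes pattern from the linked docs; all table, column, and path names here are placeholders, not the poster's actual schema:

```python
import dlt
from pyspark.sql.functions import col

@dlt.view
def cdc_source():
    # Streaming view over the landed CDC records (path is a placeholder).
    return (spark.readStream
                 .format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("abfss://<container>@<account>.dfs.core.windows.net/cdc/"))

# The target table apply_changes writes into.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",           # table created above
    source="cdc_source",          # the streaming view
    keys=["customer_id"],         # primary key column(s)
    sequence_by=col("event_ts"),  # ordering column for out-of-order events
    apply_as_deletes=col("operation") == "DELETE",
)
```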
nagini_sitarama
by New Contributor III
  • 1976 Views
  • 3 replies
  • 2 kudos

Error while optimizing the table: failure of InSet.sql for UTF8String collection

Count of the table: 1,125,089 rows for October data, so I am optimizing the table: optimize table where batchday >= "2022-10-01" and batchday <= "2022-10-31". I am getting an error like: GC overhead limit exceeded at org.apache.spark.unsafe.types.UTF8St...

Latest Reply
Priyanka_Biswas
Valued Contributor
  • 2 kudos

Hi @Nagini Sitaraman, to understand the issue better I would like to get some more information. Does the error occur on the driver side or the executor side? Can you please share the full error stack trace? You may need to check the Spark UI to find wher...

2 More Replies
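
While gathering the details requested above, one general mitigation worth trying (a sketch, not a confirmed fix; my_table is a placeholder and batchday is assumed to be a partition column) is breaking the OPTIMIZE into narrower ranges so each command compacts less at once:

```python
import datetime

# Optimize one day at a time instead of the whole month in a single command.
start = datetime.date(2022, 10, 1)
for offset in range(31):
    day = start + datetime.timedelta(days=offset)
    spark.sql(f"OPTIMIZE my_table WHERE batchday = '{day}'")
```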
Aviral-Bhardwaj
by Esteemed Contributor III
  • 9945 Views
  • 2 replies
  • 13 kudos

Understanding Rename in Databricks

There are multiple ways to rename Spark DataFrame columns or expressions. We can rename columns or expressions using alias as part of select. We can add or rename columns or expressions using withColumn on top of t...

Latest Reply
Ajay-Pandey
Esteemed Contributor III
  • 13 kudos

Very informative, thanks for sharing!

1 More Replies
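
A quick sketch of the rename patterns the post describes, on a toy DataFrame:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "a")], ["id", "val"])

# 1. alias as part of select
df1 = df.select(df.id.alias("customer_id"), "val")

# 2. withColumn adds (or replaces) a column from an expression
df2 = df.withColumn("val_upper", F.upper("val"))

# 3. withColumnRenamed renames an existing column in place
df3 = df.withColumnRenamed("val", "value")
```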
AlexDavies
by Contributor
  • 2229 Views
  • 2 replies
  • 2 kudos

Issue connecting to SQL warehouse spark thrift server

We have a library that allows .NET applications to talk to Databricks clusters (https://github.com/clearbank/SparkSqlClient). It communicates with the clusters over the Spark Thrift Server. Although this works great for clusters in the "data scienc...

Latest Reply
AlexDavies
Contributor
  • 2 kudos

I have tried those connection details; however, they give me 400 errors when trying to connect directly using the Hive Thrift Server contract (https://github.com/apache/hive/blob/master/service-rpc/if/TCLIService.thrift). I do not get the issues whe...

1 More Replies
cristianc
by Contributor
  • 1321 Views
  • 2 replies
  • 1 kudos

Unexpected workspace setup dialog in the account

Greetings! Recently we were doing cleanup in AWS and removed some Databricks-related resources that were used only once for setting up our workspace and have not been used since. Since there is no plan to create any other workspaces, the decision was t...

Latest Reply
cristianc
Contributor
  • 1 kudos

The resources that were cleaned up were just the ones used for the initial setup of the workspace. Everything else important for day-to-day operation is in place, and we are actively using the workspace; therefore there is no plan to de...

1 More Replies
ftc
by New Contributor II
  • 844 Views
  • 1 reply
  • 2 kudos

Can Databricks Certified Data Engineer Professional exam questions be short and easy to understand?

Most questions on the Databricks Certified Data Engineer Professional exam are too long for those with English as a second language. There is not enough time to read through the questions, and they are sometimes hard to comprehend.

Latest Reply
eimis_pacheco
Contributor
  • 2 kudos

I strongly agree with you. There is no Spanish version of this exam. These exams are long even for native speakers; just imagine for people with English as a second language. For instance, since Amazon does not have a Spanish version, they took this...

BF
by New Contributor II
  • 4790 Views
  • 3 replies
  • 2 kudos

Resolved! Pyspark - How do I convert date/timestamp of format like /Date(1593786688000+0200)/ in pyspark?

Hi all, I have a dataframe with a CreateDate column in this format:
CreateDate
/Date(1593786688000+0200)/
/Date(1446032157000+0100)/
/Date(1533904635000+0200)/
/Date(1447839805000+0100)/
/Date(1589451249000+0200)/
and I want to convert that format to date/tim...

Latest Reply
Chaitanya_Raju
Honored Contributor
  • 2 kudos

Hi @Bruno Franco, can you please try the below code? Hope it works for you.
from pyspark.sql.functions import from_unixtime
from pyspark.sql import functions as F
final_df = df_src.withColumn("Final_Timestamp", from_unixtime((F.regexp_extract(col("Cr...

2 More Replies
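
For completeness, a self-contained sketch of the approach in the (truncated) reply above: extract the epoch milliseconds from the /Date(1593786688000+0200)/ wrapper and convert with from_unixtime. Note it drops the embedded UTC offset and renders in the session time zone:

```python
from pyspark.sql import functions as F

# Toy input reproducing the posted format.
df_src = spark.createDataFrame([("/Date(1593786688000+0200)/",)], ["CreateDate"])

# Pull the millisecond epoch out of the wrapper, convert to seconds,
# then to a timestamp.
millis = F.regexp_extract(F.col("CreateDate"), r"/Date\((\d+)[+-]", 1).cast("long")
final_df = df_src.withColumn(
    "Final_Timestamp",
    F.from_unixtime((millis / 1000).cast("long")).cast("timestamp"),
)
final_df.show(truncate=False)
```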
whh99
by New Contributor II
  • 1594 Views
  • 3 replies
  • 1 kudos

Given user id, what API can we use to find out which cluster the user is connected to?

I want to know which cluster a user is connected to in Databricks. It would be great if we could also get the duration for which the user has been connected.

Latest Reply
Kaniz_Fatma
Community Manager
  • 1 kudos

Hi @Hui Hui Wong, we haven't heard from you since the last response from @Daniel Sahal, and I was checking back to see if his suggestions helped you. Otherwise, if you have a solution, please share it with the community, as...

2 More Replies
