Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

brickster_2018
by Databricks Employee
  • 8277 Views
  • 6 replies
  • 2 kudos
Latest Reply
VasuBajaj
New Contributor II
  • 2 kudos

A .CRC file (Cyclic Redundancy Check) is an internal checksum file used by Spark (and Hadoop) to ensure data integrity when reading and writing files. Data Integrity Check – .CRC files store checksums of actual data files. When reading a file, Spark/H...

5 More Replies
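
To illustrate the checksum idea the reply describes, here is a minimal Python sketch of an incremental CRC32 computation. This is only a conceptual analogue; Hadoop's actual .crc sidecar files use their own chunked format.

import zlib

def crc32_of_file(path, chunk_size=1 << 20):
    # Fold the file into a CRC32 checksum one chunk at a time, the same
    # family of integrity check Hadoop stores in its .crc sidecar files.
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF  # normalize to an unsigned 32-bit value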
kolangareth
by New Contributor III
  • 6215 Views
  • 11 replies
  • 3 kudos

Resolved! to_date not functioning as expected after introduction of arbitrary replaceWhere in Databricks 9.1 LTS

I am trying to do a dynamic partition overwrite on a Delta table using the replaceWhere option. This was working fine until I upgraded the DB runtime to 9.1 LTS from 8.3.x. I am concatenating 'year', 'month' and 'day' columns and then using the to_date functio...

Latest Reply
ltreweek
New Contributor II
  • 3 kudos

SELECT TO_DATE('20250217','YYYYMMDD'); gives the error: PARSE_SYNTAX_ERROR syntax error at or near 'select'. sqlstate: 42601. In DataGrip, it works no problem and displays the date.

10 More Replies
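
The reply's pattern is likely the real problem: Spark's to_date follows Java DateTimeFormatter patterns, where lowercase 'yyyy' and 'dd' mean year and day-of-month, while 'YYYY'/'DD' mean week-based year and day-of-year; databases that DataGrip often connects to accept 'YYYYMMDD'. A minimal sketch of the Spark-compatible pattern plus the replaceWhere overwrite from the original question; the path and event_date predicate are hypothetical placeholders.

# Spark accepts Java DateTimeFormatter patterns, so use lowercase yyyyMMdd.
spark.sql("SELECT to_date('20250217', 'yyyyMMdd') AS d").show()

# Dynamic-partition-style overwrite with replaceWhere.
(df.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "event_date >= '2025-02-01' AND event_date < '2025-03-01'")
    .save("/mnt/delta/events"))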
diguid
by New Contributor III
  • 3565 Views
  • 3 replies
  • 13 kudos

Using foreachBatch within Delta Live Tables framework

Hey there! I was wondering if there's any way of declaring a Delta Live Table where we use foreachBatch to process the output of a streaming query. Here's a simplification of my code: def join_data(df_1, df_2): df_joined = ( df_1 ...

Latest Reply
cgrant
Databricks Employee
  • 13 kudos

foreachBatch support in DLT is coming soon, and you now have the ability to write to non-DLT sinks as well.

2 More Replies
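
Outside DLT, foreachBatch already works in plain Structured Streaming. A minimal sketch of the usual pattern, assuming hypothetical table names, key column, and checkpoint path, that merges each micro-batch into a Delta target:

from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Merge each micro-batch into the target; `id` is a hypothetical key.
    target = DeltaTable.forName(spark, "target_table")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.table("source_table")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/tmp/checkpoints/upsert")  # hypothetical path
    .start())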
boskicl
by New Contributor III
  • 29865 Views
  • 6 replies
  • 10 kudos

Resolved! Table write command stuck "Filtering files for query."

Hello all, Background: I am having an issue today with Databricks using pyspark-sql and writing a Delta table. The dataframe is made by doing an inner join between two tables, and that is the table which I am trying to write to a Delta table. The table ...

Latest Reply
timo199
New Contributor II
  • 10 kudos

Even if I vacuum and optimize, it keeps getting stuck. The cluster type is r6gd.xlarge (min: 4, max: 6) and the driver type is r6gd.2xlarge.

5 More Replies
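
"Filtering files for query" is the phase where Delta scans table metadata to find the files matching the write's predicate, so the usual mitigation is reducing the number of candidate files. A hedged sketch, assuming a hypothetical table and join key:

# Compact small files and cluster by the key used in the join/filter,
# so the file-filtering phase can skip most files.
spark.sql("OPTIMIZE my_db.my_table ZORDER BY (join_key)")

# Clean up files no longer referenced by the log (default retention is 7 days).
spark.sql("VACUUM my_db.my_table RETAIN 168 HOURS")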
tinai_long
by New Contributor III
  • 10427 Views
  • 12 replies
  • 6 kudos

Resolved! How to refresh a single table in Delta Live Tables?

Suppose I have a Delta Live Tables framework with 2 tables: Table 1 ingests from a JSON source, Table 2 reads from Table 1 and runs some transformation. In other words, the data flow is JSON source -> Table 1 -> Table 2. Now if I find some bugs in the...

Latest Reply
cpayne_vax
New Contributor III
  • 6 kudos

Answering my own question: nowadays (February 2024) this can all be done via the UI. When viewing your DLT pipeline there is a "Select tables for refresh" button in the header. If you click this, you can select individual tables, and then in the botto...

11 More Replies
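
Besides the UI route in the reply, the Pipelines REST API exposes per-table refresh on the updates endpoint. A sketch with placeholder host, token, and pipeline ID; the field names follow the public API but are worth verifying against current docs:

import requests

resp = requests.post(
    "https://<workspace-host>/api/2.0/pipelines/<pipeline-id>/updates",
    headers={"Authorization": "Bearer <token>"},
    # full_refresh_selection fully recomputes only the named tables;
    # use refresh_selection instead for an incremental refresh.
    json={"full_refresh_selection": ["table_1"]},
)
print(resp.json())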
Tahseen0354
by Valued Contributor
  • 24308 Views
  • 9 replies
  • 5 kudos

Resolved! Getting "Job aborted due to stage failure" SparkException when trying to download full result

I have generated a result using SQL. But whenever I try to download the full result (1 million rows), it throws a SparkException. I can download the preview result but not the full result. Why? What happens under the hood when I try to download ...

Latest Reply
ac567
New Contributor III
  • 5 kudos

Job aborted due to stage failure: Task 6506 in stage 46.0 failed 4 times, most recent failure: Lost task 6506.3 in stage 46.0 (TID 12896) (10.**.***.*** executor 12): java.lang.OutOfMemoryError: Cannot reserve 4194304 bytes of direct buffer memory (a...

8 More Replies
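
The UI download funnels the whole result through the driver, which is where the direct-buffer OOM in the reply comes from. A common workaround is to let Spark write the result to storage instead; a sketch with a hypothetical query and output path:

df = spark.sql("SELECT ...")  # the original query (elided here)

# Writing from the executors avoids collecting 1M rows on the driver;
# coalesce(1) yields a single CSV file at the cost of one final task.
(df.coalesce(1)
    .write.mode("overwrite")
    .option("header", "true")
    .csv("/mnt/exports/full_result"))  # hypothetical path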
brendanc19
by New Contributor III
  • 4389 Views
  • 6 replies
  • 2 kudos

Resolved! Does cancelling a job run rollback any actions performed by query plan?

If I were to stop a rather large job run, say halfway through execution, will any actions performed on our Delta tables persist, or will they be rolled back? Are there any other risks that I need to be aware of in terms of cancelling a job run half way t...

Latest Reply
fabian_r
New Contributor II
  • 2 kudos

Hi, is there any way (as of 2024) to ensure transactional control across tables in the Delta protocol for failing jobs?

5 More Replies
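
Delta commits are atomic per table, so cancelling a job mid-run leaves already-committed transactions in place and simply drops the in-flight one; there is no cross-table rollback. One manual recovery option is RESTORE, sketched here with a hypothetical table name and version number:

# Find the last good version in the table's commit history.
spark.sql("DESCRIBE HISTORY my_table").show()

# Roll this one table back to that version (per-table only; Delta has
# no multi-table transactions as of this writing).
spark.sql("RESTORE TABLE my_table TO VERSION AS OF 42")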
labromb
by Contributor
  • 13271 Views
  • 10 replies
  • 4 kudos

How to pass configuration values to a Delta Live Tables job through the Delta Live Tables API

Hi Community, I have successfully run a job through the API but would need to be able to pass parameters (configuration) to the DLT workflow via the API. I have tried passing JSON in this format: { "full_refresh": "true", "configuration": [ ...

Latest Reply
Edthehead
Contributor III
  • 4 kudos

You cannot pass parameters from a Databricks job to a DLT pipeline. At least not yet. You can see from the DLT REST API that there is no option for it to accept any parameters. But there is a workaround. With the assumption tha...

9 More Replies
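
The reply's workaround is truncated above; one version of it that matches the public Pipelines API is to PUT an edited pipeline spec (whose configuration map the DLT code can read via spark.conf) and then start an update. Host, token, pipeline ID, and the my_param key are placeholders:

import requests

base = "https://<workspace-host>/api/2.0/pipelines/<pipeline-id>"
headers = {"Authorization": "Bearer <token>"}

# Merge the new parameter into the pipeline spec's configuration map.
spec = requests.get(base, headers=headers).json()["spec"]
spec["configuration"] = {**spec.get("configuration", {}), "my_param": "2024-01-01"}
requests.put(base, headers=headers, json=spec)

# Then trigger the run; full_refresh is a real field on the updates endpoint.
requests.post(base + "/updates", headers=headers, json={"full_refresh": True})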
Data_Engineer3
by Contributor III
  • 3469 Views
  • 5 replies
  • 0 kudos

Default maximum Spark streaming chunk size in Delta files in each batch?

Working with Delta files in Spark Structured Streaming, what is the default maximum chunk size in each batch? How do I identify this type of Spark configuration in Databricks? #[Databricks SQL] #[Spark streaming] #[Spark structured streaming] #Spark

Latest Reply
NandiniN
Databricks Employee
  • 0 kudos

Doc: https://docs.databricks.com/en/structured-streaming/delta-lake.html  Also, what is the challenge while using foreachBatch?

4 More Replies
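
Per the page linked in the reply, the Delta streaming source sizes micro-batches with maxFilesPerTrigger (default 1000 files) and maxBytesPerTrigger (a soft byte cap). A sketch with a hypothetical source table:

stream = (spark.readStream.format("delta")
    .option("maxFilesPerTrigger", 500)    # files per micro-batch (default 1000)
    .option("maxBytesPerTrigger", "1g")   # soft cap; a batch may slightly exceed it
    .table("source_table"))               # hypothetical table name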
SRK
by Contributor III
  • 3776 Views
  • 5 replies
  • 7 kudos

How to handle schema validation for JSON files using Databricks Autoloader?

Following are the details of the requirement: 1. I am using a Databricks notebook to read data from a Kafka topic and write into an ADLS Gen2 container, i.e., my landing layer. 2. I am using Spark code to read data from Kafka and write into landing...

Latest Reply
maddy08
New Contributor II
  • 7 kudos

Just to clarify, are you reading from Kafka and writing into ADLS as JSON files? That is, does each message from Kafka become one JSON file in ADLS?

4 More Replies
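
For the schema-validation requirement itself, Auto Loader can track and enforce a schema for JSON landing data. A minimal sketch, assuming hypothetical paths; schemaEvolutionMode "rescue" keeps unexpected fields in _rescued_data instead of failing the stream:

raw = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/schemas/landing")  # hypothetical
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("/mnt/landing/"))  # hypothetical source path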
AsfandQ
by New Contributor III
  • 17671 Views
  • 7 replies
  • 6 kudos

Resolved! Delta tables: Cannot set default column mapping mode to "name" in Python for delta tables

Hello, I am trying to write Delta files for some CSV data. When I do csv_dataframe.write.format("delta").save("/path/to/table.delta") I get: AnalysisException: Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema. Having look...

Latest Reply
Personal1
New Contributor II
  • 6 kudos

I still get the error when I try any method. The column names with spaces are throwing the error [DELTA_INVALID_CHARACTERS_IN_COLUMN_NAMES] Found invalid character(s) among ' ,;{}()\n\t=' in the column names of your schema. df1.write.format("delta") \ .mo...

6 More Replies
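
Until column mapping is enabled on the table (ALTER TABLE ... SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name'), which also requires reader/writer protocol versions 2/5), the simpler fix is renaming the columns before the write. A sketch; the underscore substitution is an arbitrary choice:

import re

def sanitize_columns(df):
    # Replace every character Delta rejects (' ,;{}()\n\t=') with an underscore.
    for c in df.columns:
        df = df.withColumnRenamed(c, re.sub(r"[ ,;{}()\n\t=]", "_", c))
    return df

sanitize_columns(csv_dataframe).write.format("delta").save("/path/to/table")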
DJey
by New Contributor III
  • 15266 Views
  • 6 replies
  • 2 kudos

Resolved! MergeSchema Not Working

Hi All, I have a scenario where my existing Delta table looks like below: Now I have incremental data with an additional column, i.e., owner: Dataframe Name --> scdDF. Below is the code snippet to merge the incremental dataframe into the target table, but the new...

Latest Reply
Amin112
New Contributor II
  • 2 kudos

In Databricks Runtime 15.2 and above, you can specify schema evolution in a merge statement using SQL or Delta table APIs:
MERGE WITH SCHEMA EVOLUTION INTO target
USING source
ON source.key = target.key
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN I...

5 More Replies
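
For runtimes older than 15.2, the equivalent of MERGE WITH SCHEMA EVOLUTION is a session conf plus an ordinary merge. A sketch reusing the thread's scdDF name; the target table and key column are hypothetical:

from delta.tables import DeltaTable

# Let MERGE add columns that exist only on the source side.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

target = DeltaTable.forName(spark, "target_table")
(target.alias("t")
    .merge(scdDF.alias("s"), "t.key = s.key")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())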
Valentin1
by New Contributor III
  • 8542 Views
  • 6 replies
  • 3 kudos

Delta Live Tables Incremental Batch Loads & Failure Recovery

Hello Databricks community, I'm working on a pipeline and would like to implement a common use case using Delta Live Tables. The pipeline should include the following steps: 1. Incrementally load data from Table A as a batch. 2. If the pipeline has previously...

Latest Reply
lprevost
Contributor II
  • 3 kudos

I totally agree that this is a gap in the Databricks solution. This gap exists between a static read and real-time streaming. My problem (and I suspect there are many use cases) is that I have slowly changing data coming into structured folders via ...

5 More Replies
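
The middle ground the reply asks for exists outside DLT as the availableNow trigger: a streaming query that processes everything new since the last checkpoint, then stops like a batch job. A sketch with hypothetical table names and checkpoint path:

(spark.readStream.table("table_a")
    .writeStream
    .trigger(availableNow=True)  # drain all available input, then stop
    .option("checkpointLocation", "/mnt/checkpoints/table_a")  # hypothetical
    .toTable("table_b"))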
bgerhardi
by New Contributor III
  • 11126 Views
  • 12 replies
  • 13 kudos

Surrogate Keys with Delta Live

We are considering moving to Delta Live Tables from a traditional SQL-based data warehouse. What worries me is this FAQ on identity columns (Delta Live Tables frequently asked questions | Databricks on AWS); it seems to suggest that we basically can't cre...

Latest Reply
Pelle123
New Contributor II
  • 13 kudos

I am wondering if there is an answer to this question now.

11 More Replies
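
Outside the DLT restrictions in that FAQ, Delta itself supports identity columns for surrogate keys. A sketch of the SQL, with hypothetical table and column names:

spark.sql("""
CREATE TABLE dim_customer (
  customer_sk BIGINT GENERATED ALWAYS AS IDENTITY,
  customer_id STRING,
  name        STRING
) USING DELTA
""")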
YFL
by New Contributor III
  • 7326 Views
  • 11 replies
  • 6 kudos

Resolved! When delta is a streaming source, how can we get the consumer lag?

Hi, I want to keep track of the streaming lag from the source table, which is a Delta table. I see that in query progress logs there is some information about the last version and the last file in the version for the end offset, but this doesn't give ...

Latest Reply
Anonymous
Not applicable
  • 6 kudos

Hey @Yerachmiel Feltzman, I hope all is well. Just wanted to check in if you were able to resolve your issue or do you need more help? We'd love to hear from you. Thanks!

10 More Replies
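
One rough way to estimate the lag the question asks about is to compare the source table's latest version with the reservoirVersion recorded in the stream's last end offset. A sketch, assuming `query` is the running StreamingQuery and the Delta table is the stream's first source:

import json

latest = spark.sql("DESCRIBE HISTORY source_table LIMIT 1").first()["version"]

progress = query.lastProgress
end_offset = progress["sources"][0]["endOffset"]
if isinstance(end_offset, str):  # the offset may arrive as a JSON string
    end_offset = json.loads(end_offset)

# Versions the stream still has to catch up on (approximate lag).
print("versions behind:", latest - end_offset["reservoirVersion"])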