I have a simple job scheduled every 5 min. Basically it listens for cloudFiles on a storage account and writes them into a Delta table, extremely simple. The code is something like this:

df = (spark
  .readStream
  .format("cloudFiles")
  .option('cloudFil...
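A minimal runnable sketch of this Auto Loader pattern, assuming JSON input; the paths, schema location, and table name below are hypothetical placeholders:

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")                          # format of the incoming files
      .option("cloudFiles.schemaLocation", "/mnt/_schemas/events")  # where Auto Loader persists the inferred schema
      .load("abfss://landing@myaccount.dfs.core.windows.net/events/"))

(df.writeStream
   .option("checkpointLocation", "/mnt/_checkpoints/events")  # required for exactly-once progress tracking
   .trigger(availableNow=True)                                # drain all pending files, then stop; fits a 5-minute scheduled job
   .toTable("events_bronze"))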
I have a Delta table that is partitioned by Year, Date and Month. I'm trying to merge data into it on all three partition columns plus an extra column (an ID). My merge statement is below:

MERGE INTO delta.<path of delta table> oldData
USING df newData ...
Isn't the suggested idea only filtering the input dataframe (resulting in a smaller amount of data to match across the whole Delta table), rather than pruning the Delta table so that only the relevant partitions are scanned?
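That's the usual reading: filtering the source shrinks the probe side but doesn't by itself prune the target. The common workaround is to put the partition predicates, as literals derived from the incoming batch, into the MERGE ON clause so the optimizer can skip unrelated target partitions. A sketch with placeholder path, view, and column names:

newData.createOrReplaceTempView("newData")

# Collect the distinct Year values present in this batch and inline them as
# literals, so the optimizer can skip every other partition of the target.
years = [str(r["Year"]) for r in newData.select("Year").distinct().collect()]

spark.sql(f"""
  MERGE INTO delta.`/mnt/delta/my_table` AS oldData
  USING newData
  ON  oldData.Year IN ({', '.join(years)})
  AND oldData.Year  = newData.Year
  AND oldData.Month = newData.Month
  AND oldData.Date  = newData.Date
  AND oldData.id    = newData.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")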
I'm reading data into a dataframe with

df = spark.read.json("s3://somepath/")

I've tried first creating a Delta table using the DeltaTable API with:

DeltaTable.createIfNotExists(spark)\
  .location(target_path)\
  .addColumns(df.sche...
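For reference, a complete version of that builder pattern under the same assumptions (target_path is whatever location is being written to):

from delta.tables import DeltaTable

df = spark.read.json("s3://somepath/")

# Create the table with the schema inferred from the JSON read, then append.
(DeltaTable.createIfNotExists(spark)
    .location(target_path)
    .addColumns(df.schema)   # pass the full StructType rather than column-by-column
    .execute())

df.write.format("delta").mode("append").save(target_path)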
Hi Everyone, I am using the SQL query below to generate the days in order in Hive, and it works fine. The table was migrated to Delta and my query is now failing. I would appreciate it if someone could help me figure out the issue.

SQL query: with ex...
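The query above is cut off, so the exact failure isn't visible, but for comparison this is one way to generate an ordered run of days natively in Spark SQL on Databricks (the date range is illustrative):

spark.sql("""
  SELECT explode(sequence(to_date('2024-01-01'),
                          to_date('2024-01-31'),
                          interval 1 day)) AS day
  ORDER BY day
""").show()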
Hello all,

Background: I am having an issue today with Databricks, using pyspark-sql and writing a Delta table. The dataframe is produced by an inner join between two tables, and that result is what I am trying to write to a Delta table. The table ...
I have a PySpark streaming pipeline which reads data from a Kafka topic; the data goes through various transformations and finally gets merged into a Databricks Delta table. In the beginning we were loading data into the Delta table by using the merge ...
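A sketch of the foreachBatch-plus-MERGE shape such pipelines usually take; the topic, servers, target table, and key column are placeholders:

from delta.tables import DeltaTable

def upsert_to_delta(micro_batch_df, batch_id):
    # MERGE each micro-batch into the target table on the business key.
    target = DeltaTable.forName(spark, "events_silver")
    (target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    # ...transformations would go here...
    .writeStream
    .foreachBatch(upsert_to_delta)
    .option("checkpointLocation", "/mnt/_checkpoints/events_silver")
    .start())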
@bobbysidhartha: When merging data into a partitioned Delta table in parallel, it is important to ensure that each job only accesses and modifies the files in its own partition, to avoid concurrency issues. One way to achieve this is to use partition...
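As an illustration of that idea (the path, partitioning scheme, and key column are hypothetical), each parallel job can pin its own partition with a literal in the merge condition, so concurrent jobs never touch the same files:

from delta.tables import DeltaTable

def merge_one_partition(updates_df, year):
    target = DeltaTable.forPath(spark, "/mnt/delta/events")
    (target.alias("t")
        .merge(updates_df.alias("s"),
               # the literal predicate confines this job to a single partition
               f"t.Year = {year} AND t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())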
The previous two answers did not work for me (DBX 15.4). I found a hacky way using the Delta log: find the latest (group of) checkpoint (parquet) file(s) in the _delta_log and use it as the source prefix `000000000000xxxxxxx.checkpoint`:

SELECT partition_column_1,...
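A hedged reconstruction of that hack (the table path, checkpoint file name, and partition column are placeholders, and the checkpoint layout is an internal detail that can change between Delta versions):

spark.sql("""
  SELECT DISTINCT add.partitionValues['partition_column_1'] AS partition_column_1
  FROM parquet.`/mnt/delta/events/_delta_log/00000000000000001000.checkpoint.parquet`
  WHERE add IS NOT NULL
""").show()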
A deltaTable.dropDuplicates(columns) would be a very nice feature, simplifying the complex procedures that are suggested online. Or am I missing an existing procedure that can be done without merge operations or similar?
I created a feature request in the Delta project: [Feature Request] data deduplication on existing delta table · Issue #1767 · delta-io/delta (github.com)
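Until something like that exists, one workaround that avoids MERGE entirely is to rewrite the table from a deduplicated read; Delta's snapshot isolation is what allows reading and overwriting the same table in one job. A sketch with hypothetical table and key names (it rewrites the whole table, so test before using at scale):

(spark.read.table("events")
    .dropDuplicates(["id", "event_date"])   # the candidate key columns
    .write.format("delta")
    .mode("overwrite")
    .saveAsTable("events"))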
I was creating a Delta table from an ADLS JSON input file, but the job was running a long time while creating the Delta table from the JSON. Below is my cluster configuration. Is the issue related to the cluster config? Do I need to upgrade the cluster config? The cluster ...
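The configuration details are cut off above, but one common cause of slow Delta creation from JSON, independent of cluster size, is schema inference scanning the input before the real read. Supplying an explicit schema (hypothetical fields and path below) avoids that extra pass:

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("id", LongType()),
    StructField("payload", StringType()),
])

df = spark.read.schema(schema).json("abfss://input@myaccount.dfs.core.windows.net/raw/")
df.write.format("delta").save("/mnt/delta/target")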
I am trying to import a table from Oracle which has around 1.3 million rows, and one of the columns is a BLOB; the total size of the data in Oracle is around 250+ GB. Reading and saving to S3 as a Delta table is taking around 60 min. I tried with parallel (200 thread...
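For comparison, Spark's built-in JDBC partitioning usually parallelizes an import like this better than client-side threading; the connection details, bounds, and split column below are placeholders:

df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
      .option("dbtable", "SOURCE_TABLE")
      .option("user", "etl_user")
      .option("password", dbutils.secrets.get("scope", "oracle-pw"))
      .option("partitionColumn", "ID")    # numeric key to split the read on
      .option("lowerBound", "1")
      .option("upperBound", "1300000")    # roughly the max ID
      .option("numPartitions", "64")      # 64 concurrent JDBC connections
      .option("fetchsize", "1000")        # rows per round trip; matters with LOB columns
      .load())

df.write.format("delta").save("s3://my-bucket/delta/source_table/")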
Hello experts. We are trying to clarify how to clean up the large number of files that are accumulating in the _delta_log folder (json, crc and checkpoint files). We went through the related posts in the forum and followed the below:

SET spark.da...
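The SET command above is cut off; for reference, Delta cleans up old log entries automatically when a checkpoint is written, and the retention horizon is controlled by table properties like these (the interval values are illustrative):

spark.sql("""
  ALTER TABLE delta.`/mnt/delta/events`
  SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 7 days',        -- how long commit JSON files are kept
    'delta.checkpointRetentionDuration' = 'interval 2 days'  -- how long old checkpoint files are kept
  )
""")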
Hi All, I have a scenario where my existing Delta table looks like below. Now I have incremental data with an additional column, i.e. owner. Dataframe name --> scdDF. Below is the code snippet to merge the incremental dataframe into targetTable, but the new...
In Databricks Runtime 15.2 and above, you can specify schema evolution in a merge statement using SQL or the Delta table APIs:

MERGE WITH SCHEMA EVOLUTION INTO target
USING source
ON source.key = target.key
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
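The equivalent with the Delta Python API (available with Delta Lake 3.2 on DBR 15.2+; the table, dataframe, and key names below are placeholders matching the scenario above):

from delta.tables import DeltaTable

(DeltaTable.forName(spark, "targetTable").alias("t")
    .merge(scdDF.alias("s"), "t.key = s.key")
    .withSchemaEvolution()          # lets the merge add the new 'owner' column
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())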
I am trying to save a dataframe, after a series of data manipulations using UDF functions, to a Delta table. I tried using this code:

(df
  .write
  .format('delta')
  .mode('overwrite')
  .option('overwriteSchema', 'true')
  .saveAsTable('output_table'))

but this...
You should also look at the SQL plan to check whether the writing phase is really the part that is taking the time. Since Spark works on lazy evaluation, some other phase (for example, an expensive UDF) might be what is actually slow.
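One quick way to separate the two costs, assuming the data fits in cache: force the transformations (including the UDFs) to run once before the write, then time the write on materialized data:

df.cache()
df.count()   # forces the UDF-heavy transformations to execute here
(df.write
   .format('delta')
   .mode('overwrite')
   .option('overwriteSchema', 'true')
   .saveAsTable('output_table'))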
Hi, I want to keep track of the streaming lag from the source table, which is a Delta table. I see that in the query progress logs there is some information about the last version and the last file in the version for the end offset, but this doesn't give ...
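One hedged approach, which relies on the internal (and version-dependent) shape of the Delta source offset in the progress event: compare its reservoirVersion with the source table's latest commit version. The query and table names below are placeholders:

import json

progress = query.lastProgress               # 'query' is the running StreamingQuery
end_offset = progress["sources"][0]["endOffset"]
if isinstance(end_offset, str):              # the offset may arrive as a JSON string
    end_offset = json.loads(end_offset)

latest = spark.sql("DESCRIBE HISTORY source_table LIMIT 1").collect()[0]["version"]
print(f"versions behind: {latest - end_offset['reservoirVersion']}")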
Hey @Yerachmiel Feltzman, I hope all is well. Just wanted to check in: were you able to resolve your issue, or do you need more help? We'd love to hear from you. Thanks!
Does Databricks have support for writing to the same Delta table from multiple clusters concurrently? I am specifically interested to know if there is any solution for https://github.com/delta-io/delta/issues/41 implemented in Databricks, or if you have a...
Please note, the issue noted above ([Storage System] Support for AWS S3 (multiple clusters/drivers/JVMs)) is for Delta Lake OSS. As noted in that issue, as well as in Issue 324, as of this writing S3 lacks putIfAbsent transactional consistency. For Del...
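On the Delta Lake OSS side, multi-cluster S3 writes are what the DynamoDB-backed LogStore addresses; roughly, the session would be configured like this (the DynamoDB table name and region are placeholders, and none of this is needed on Databricks, which handles S3 commit coordination with its own commit service):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.delta.logStore.s3a.impl",
            "io.delta.storage.S3DynamoDBLogStore")   # from the delta-storage-s3-dynamodb artifact
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
    .getOrCreate())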