Data Engineering

Forum Posts

by vr • Contributor

11-26-2022 4:26:24 PM

11527 Views
11 replies
9 kudos

Why is execution too fast?

I have a table whose full scan takes ~20 minutes on my cluster. The table has a "Time" TIMESTAMP column and a "day" DATE column. The latter is computed (manually) as "Time" truncated to the day and is used for partitioning. I query the table using predicate ...

Latest Reply

UmaMahesh1
Honored Contributor III

11-27-2022 6:40:45 AM

9 kudos

Hi @Vladimir Ryabtsev, because you are creating a Delta table, I think you are seeing a performance improvement because of dynamic partition pruning. According to the documentation, "Partition pruning can take place at query compilation time wh...

10 More Replies

by Rahul_Samant • Contributor

03-14-2022 3:55:28 AM

12641 Views
4 replies
4 kudos

Resolved! Bucketing on Delta Tables

I am getting the error below while creating buckets on a Delta table. Error in SQL statement: AnalysisException: Delta bucketed tables are not supported. I have had to fall back to a Parquet table because of this for some use cases. Is there any alternative for this? I have...

Latest Reply

Anonymous
Not applicable

05-10-2022 5:57:58 AM

4 kudos

Hi @Rahul Samant, we checked internally on this. Due to certain limitations, bucketing is not supported on Delta tables; the only alternative to bucketing is to leverage Z-ordering. Below is the link for reference: https://docs.databricks.com/de...

3 More Replies

by Erik • Valued Contributor III

10-15-2021 3:19:04 AM

6058 Views
6 replies
7 kudos

Databricks query performance when filtering on a column correlated to the partition-column

(This is a copy of a question I asked on Stack Overflow here, but maybe this community is a better fit for the question.) Setting: Delta Lake, Databricks SQL compute used by Power BI. I am wondering about the following scenario: We have a column `timest...

Latest Reply

Hubert-Dudek
Esteemed Contributor III

10-15-2021 6:54:39 AM

7 kudos

In the query I would filter first by date (generated from the timestamp we want to query) and then by the exact timestamp, so that the query benefits from partitioning.
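
A minimal sketch of this idea, assuming hypothetical column names `ts` (the timestamp) and `event_date` (the partition column derived from it); neither name comes from the thread. It only builds the WHERE clause text: the date bounds are derived from the timestamp bounds, so the date filter never excludes a row that the timestamp filter would keep.

```python
from datetime import datetime


def partition_aware_predicate(start: datetime, end: datetime,
                              ts_col: str = "ts",
                              day_col: str = "event_date") -> str:
    """Build a WHERE clause that filters on the partition column first.

    ts_col and day_col are hypothetical names; replace them with the
    table's real timestamp and partition columns.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    # Coarse filter on the partition column, derived from the timestamps.
    day_filter = (f"{day_col} BETWEEN DATE'{start.date().isoformat()}' "
                  f"AND DATE'{end.date().isoformat()}'")
    # Exact filter on the timestamp itself (half-open interval).
    ts_filter = (f"{ts_col} >= TIMESTAMP'{start.isoformat(sep=' ')}' "
                 f"AND {ts_col} < TIMESTAMP'{end.isoformat(sep=' ')}'")
    return f"{day_filter} AND {ts_filter}"


if __name__ == "__main__":
    print(partition_aware_predicate(datetime(2022, 11, 1, 8, 30),
                                    datetime(2022, 11, 2, 17, 0)))
```

The date range is deliberately a superset of the timestamp range, so it can only reduce the number of partitions read, never change the result.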

5 More Replies

by soundari • New Contributor

10-06-2021 2:29:07 AM

2561 Views
1 reply
1 kudos

Resolved! Identify the partitionValues written yesterday from delta

We have streaming data written into Delta. We do not write all the partitions every day. Hence I am thinking of running a compaction Spark job only on the partitions that were modified yesterday. Is it possible to query the partitionsValues wr...

Latest Reply

Deepak_Bhutada
Contributor III

10-11-2021 9:51:12 AM

1 kudos

Hi @Gnanasoundari Soundarajan, based on the details you provided, you are not overwriting all the partitions every day, which means you might be using append mode while writing the data on day 1. On day 2, you want to access those partition values and...
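
One possible approach, not necessarily the one this reply goes on to describe, is to read the `add` actions from the JSON commit files under `_delta_log/`. Each `add` action carries `partitionValues` and a `modificationTime` in milliseconds since the epoch. The sketch below assumes the commit lines have already been read into memory, treats "yesterday" as a UTC calendar day, and ignores Parquet checkpoint files; the function name is made up for illustration.

```python
import json
from datetime import date, datetime, timedelta, timezone


def partitions_modified_on(log_lines, day):
    """Return distinct partitionValues of files added on the given UTC date.

    log_lines: iterable of JSON strings, one action per line, as found in
    the commit files of a Delta transaction log. Only `add` actions count.
    """
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    start_ms = int(start.timestamp() * 1000)
    end_ms = int((start + timedelta(days=1)).timestamp() * 1000)

    seen = {}  # keyed by sorted (column, value) pairs, keeps first-seen order
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        add = json.loads(line).get("add")
        if add is None:
            continue  # commitInfo, remove, metaData, ... are skipped
        if start_ms <= add["modificationTime"] < end_ms:
            values = add["partitionValues"]
            seen.setdefault(tuple(sorted(values.items())), values)
    return list(seen.values())


if __name__ == "__main__":
    yesterday = date.today() - timedelta(days=1)
    print(partitions_modified_on([], yesterday))  # empty log, empty result
```

The returned partition values could then be turned into a partition filter for the compaction job, so that only the touched partitions are rewritten.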

by narek_margaryan • New Contributor II

10-06-2021 12:51:06 PM

3014 Views
1 reply
3 kudos

Resolved! Do Spark nodes read data from storage in a sequence?

I'm new to Spark and trying to understand how some of its components work. I understand that once the data is loaded into the memory of separate nodes, they process partitions in parallel, within their own memory (RAM). But I'm wondering whether the in...

Latest Reply

-werners-
Esteemed Contributor III

10-08-2021 12:11:36 AM

3 kudos

@Narek Margaryan, normally the reading is done in parallel because the underlying file system is already distributed (if you use HDFS-based storage or something similar, a data lake for example). The number of partitions in the file itself also matters. This l...

by irfanaziz • Contributor II

08-11-2021 12:59:10 AM

2058 Views
2 replies
3 kudos

Does anyone know why the optimize does not complete?

I feel there is some issue with a few partitions of the Delta table. The optimize runs fine and completes within a few minutes for other partitions, but for this particular partition the optimize keeps running forever. OPTIMIZE delta.`/mnt/prod-abc/Ini...

Latest Reply

Anonymous
Not applicable

09-09-2021 10:00:50 AM

3 kudos

@nafri A - Thank you for letting us know.

1 More Replies

by User16790091296 • Contributor II

05-28-2021 12:22:29 PM

4005 Views
1 reply
0 kudos

What's the difference between Z-Ordering and Partitioning?

Latest Reply

sajith_appukutt
Honored Contributor II

06-24-2021 3:02:47 PM

0 kudos

Partitioning is a way of distributing the data by keys so that you can restrict the amount of data scanned by each query and improve performance / avoid conflicts. General rules of thumb for choosing the right partition columns: Cardinality of a colu...
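
For contrast with partitioning by exact key values, here is a rough illustration of the general idea behind a Z-order curve: the bits of two column values are interleaved into one sort key, so rows that are close in both columns tend to land close together. This is a textbook Morton-code sketch for two non-negative integers, not a description of how Delta implements Z-ordering; the function name and the 16-bit limit are assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key.

    Bit i of x goes to position 2*i, bit i of y to position 2*i + 1.
    """
    limit = 1 << bits
    if not (0 <= x < limit and 0 <= y < limit):
        raise ValueError(f"x and y must be in [0, {limit})")
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


if __name__ == "__main__":
    # Sorting by the interleaved key keeps neighbours in (x, y) near
    # each other, which is what makes per-file min/max statistics useful.
    points = [(x, y) for x in range(4) for y in range(4)]
    for point in sorted(points, key=lambda p: interleave_bits(*p)):
        print(point, interleave_bits(*point))
```

A partition column, by comparison, splits the data into one directory per distinct value, which is why low-cardinality columns suit partitioning better.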

by brickster_2018 • Databricks Employee

06-22-2021 4:16:50 PM

2371 Views
1 reply
0 kudos

Resolved! Z-order or Partitioning? Which is better for Data skipping?

For Delta tables, which of Z-order and partitioning is the recommended technique for efficient data skipping?

Latest Reply

brickster_2018
Databricks Employee

06-22-2021 4:19:13 PM

0 kudos

Partition pruning is the most efficient way to ensure data skipping. However, choosing the right column for partitioning is very important. It's common to see the wrong partitioning column cause a large number of small file problems ...

by xxMathieuxxZara • New Contributor

07-22-2015 1:15:47 PM

6906 Views
6 replies
0 kudos

Parquet file merging or other optimisation tips

Hi, I need some guidelines for a performance issue with Parquet files: I am loading a set of Parquet files using: `df = sqlContext.parquetFile(folder_path)`. My Parquet folder has 6 subdivision keys. It was initially OK with a first sample of data...

Latest Reply

User16301467532
New Contributor II

07-24-2015 10:28:19 AM

0 kudos

Having a large number of small files or folders can significantly degrade the performance of loading the data. The best approach is to keep the folders/files merged so that each file is around 64 MB in size. There are different ways to achieve this: your writ...
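
To make the ~64 MB target concrete, here is a small planning sketch that groups existing files into merge batches of at most that size. It only decides which files would be merged together; the actual rewrite is left to whatever job writes the data. The function name, the `(name, size_in_bytes)` input shape, and the simple in-order grouping are assumptions for illustration, not something taken from the reply.

```python
TARGET_BYTES = 64 * 1024 * 1024  # ~64 MB per merged file, as suggested above


def plan_merges(files, target_bytes=TARGET_BYTES):
    """Group (name, size_in_bytes) pairs, in order, into merge batches.

    A batch is closed as soon as adding the next file would exceed
    target_bytes. A single file larger than the target gets its own batch.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    mb = 1024 * 1024
    listing = [("part-0", 30 * mb), ("part-1", 30 * mb), ("part-2", 10 * mb)]
    print(plan_merges(listing))
```

Grouping in listing order keeps the sketch simple; sorting by size first would pack batches more tightly at the cost of mixing files from different folders.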

5 More Replies
