I have a Delta table that is partitioned by Year, Date and Month. I'm trying to merge data into it on all three partition columns plus an extra column (an ID). My merge statement is below:

MERGE INTO delta.<path of delta table> oldData
USING df newData ...
Isn't the suggested idea only filtering the input DataFrame (resulting in a smaller amount of data to match across the whole Delta table), rather than pruning the Delta table so that only the relevant partitions are scanned?
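For reference, a common pattern here is to put literal predicates on the partition columns into the ON clause so the target table itself gets pruned. A minimal sketch, assuming a hypothetical table path, view name, and partition values:

# Register the incoming DataFrame so SQL can see it
df.createOrReplaceTempView("newData")

# Literal predicates on a partition column (Year here) let Delta prune the
# target to the relevant partitions instead of scanning the whole table.
spark.sql("""
  MERGE INTO delta.`/mnt/delta/my_table` oldData
  USING newData
  ON  oldData.Year  = newData.Year
  AND oldData.Month = newData.Month
  AND oldData.Date  = newData.Date
  AND oldData.Id    = newData.Id
  AND oldData.Year IN (2022, 2023)  -- hypothetical literal prune values
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")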
Hi, I know filtering a Delta table on a partition column is a very powerful time-saving approach, but what if this column appears inside a CONCAT in the WHERE clause? Let me explain my case: I have a Delta table with only one partition column, say called co...
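To illustrate the general issue, a sketch with hypothetical table, column, and values: wrapping the partition column in an expression such as CONCAT usually prevents partition pruning, while comparing it directly to a literal keeps pruning intact.

# Pruning is typically lost: the partition column is hidden inside CONCAT
slow = spark.sql("SELECT * FROM my_table WHERE CONCAT(part_col, '-x') = '2023-01-x'")

# Pruning works: the partition column is compared directly to a literal
fast = spark.sql("SELECT * FROM my_table WHERE part_col = '2023-01'")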
Hi everybody, I have 20 years of data, 600m rows. I have partitioned them on year and month to generate a file size which seems reasonable (128 MB). All data is queried by timestamp, as all queries need to filter on the exact hours. So my requirement...
Hi guys, thanks for your advice. I found a solution. We upgraded the Databricks Runtime to 12.2 and now the pushdown of the partition filter works. The documentation said that 10.4 would be adequate, but obviously it wasn't enough.
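For anyone stuck on an older runtime, a workaround sketch (hypothetical table and column names): add explicit predicates on the partition columns next to the timestamp filter, so pruning no longer depends on the runtime's filter pushdown.

# Redundant year/month predicates derived from the timestamp range make the
# partition filter explicit for the planner.
spark.sql("""
  SELECT * FROM readings
  WHERE ts >= '2021-06-01 00:00:00' AND ts < '2021-06-02 00:00:00'
    AND year = 2021 AND month = 6
""")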
When I use SQL code like "create table myTable (column1 string, column2 string) using csv options('delimiter' = ',', 'header' = 'true') location 'pathToCsv'" to create a table from a single CSV file stored in a folder within an Azure Data Lake contai...
Hi @andrew li, when you specify a path with the LOCATION keyword, Spark will consider that to be an EXTERNAL table. So when you drop the table, the underlying data, if any, will not be cleared. So in your case, as this is an external table, your folder s...
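A minimal sketch of that behavior (hypothetical Azure path): dropping an external table removes only the metastore entry, so the files under LOCATION survive and must be deleted explicitly if you want them gone.

spark.sql("""
  CREATE TABLE myTable (column1 STRING, column2 STRING)
  USING CSV OPTIONS ('delimiter' = ',', 'header' = 'true')
  LOCATION 'abfss://container@account.dfs.core.windows.net/pathToCsv'
""")

spark.sql("DROP TABLE myTable")  # removes metadata only; the CSV files remain

# On Databricks, the leftover files can then be removed explicitly:
dbutils.fs.rm("abfss://container@account.dfs.core.windows.net/pathToCsv", recurse=True)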
Here are the simple steps to reproduce it. Note that the columns "foo" and "bar" are just redundant columns to make sure the DataFrame doesn't fit into a single partition.

// generate a random df
import spark.implicits._
val rand = new scala.util.Random
// illustrative completion of the truncated snippet: a random id plus the two redundant columns
val df = (1 to 3000).map(i => (rand.nextInt, "foo" * 100, "bar" * 100)).toDF("id", "foo", "bar")
Hi, maybe someone can help me. I want to run a very narrow query:

SELECT *
FROM my_table
WHERE snapshot_date IN ('2023-01-06', '2023-01-07')

-- part of the physical plan:
-- Location: PreparedDeltaFileIndex [dbfs:/...]
-- PartitionFilters: [cast(snaps...
No hints on partition pruning, afaik. The reason the partitions were not pruned is that the second query generates a completely different plan. To be able to filter the partitions, a join first has to happen, and in this case it means the table has...
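A common workaround sketch (assuming the filter values live in a small hypothetical table called dates_to_load): collect them to the driver and inline them as literals, so the planner can prune partitions statically instead of waiting for the join.

from pyspark.sql import functions as F

# Collect the (small) set of dates and turn the join into a static IN-list
dates = [r["snapshot_date"]
         for r in spark.table("dates_to_load").select("snapshot_date").distinct().collect()]
df = spark.table("my_table").where(F.col("snapshot_date").isin(dates))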
We are reading 520 GB of partitioned CSV files, and when we write them out as a single CSV using repartition(1) it takes 25+ hours. Please let us know an optimized way to create a single CSV file so that our process can complete within 5 hours.
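One approach worth trying, sketched here with hypothetical paths: let Spark write many CSV parts in parallel, then concatenate the parts into one file in a single sequential pass on the driver. This moves the bottleneck from one giant single-task write to plain file IO.

import glob, shutil

parts_dir = "/mnt/out/parts"              # hypothetical staging directory
final_file = "/dbfs/mnt/out/final.csv"    # hypothetical final file (via the /dbfs fuse mount)

# 1) Parallel write of many part files
df.write.mode("overwrite").option("header", "true").csv(parts_dir)

# 2) Single sequential concatenation, keeping only the first part's header
part_files = sorted(glob.glob("/dbfs" + parts_dir + "/part-*.csv"))
with open(final_file, "wb") as out:
    for i, p in enumerate(part_files):
        with open(p, "rb") as f:
            if i > 0:
                f.readline()  # skip the repeated header line
            shutil.copyfileobj(f, out)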
I'm trying to create a Delta Live Table on top of JSON files placed in Azure Blob. The JSON files contain white spaces in column names; instead of renaming, I tried the `columnMapping` table property, which let me create the table with spaces, but the column ...
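An alternative sketch (hypothetical source path and table name): normalize the column names inside the DLT definition itself, so downstream consumers never see the spaces and column mapping isn't needed.

import dlt
from pyspark.sql import functions as F

@dlt.table(name="clean_events")
def clean_events():
    df = spark.read.json("/mnt/blob/events/")  # hypothetical source path
    # replace spaces in every column name; backticks escape the original names
    renamed = [F.col(f"`{c}`").alias(c.replace(" ", "_")) for c in df.columns]
    return df.select(*renamed)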
I have a large table which contains a date_time column. The table contains 2 generated columns, year and month, which are extracted from the date_time values and are used for partitioning. I have the following question: if I run the query SELECT * FROM tab...
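For context, Delta can often derive partition filters automatically when the partition columns are generated from the filtered column using expressions it recognizes (such as YEAR and MONTH). A sketch with hypothetical names:

spark.sql("""
  CREATE TABLE IF NOT EXISTS events (
    id BIGINT,
    date_time TIMESTAMP,
    year INT GENERATED ALWAYS AS (YEAR(date_time)),
    month INT GENERATED ALWAYS AS (MONTH(date_time))
  ) USING DELTA
  PARTITIONED BY (year, month)
""")

# A filter on date_time alone should surface matching PartitionFilters on
# year/month in the plan, so only the relevant partitions are scanned.
spark.sql("""
  SELECT * FROM events
  WHERE date_time >= '2023-01-01' AND date_time < '2023-02-01'
""").explain()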
Recommendations for performance tuning best practices on Databricks

We also recommend checking out this article from my colleague @Franco Patano on best practices for performance tuning on Databricks. Performance tuning your workloads is an important...
Is there any way to overwrite a partition in a Delta table without specifying each and every partition in replaceWhere? For non-dated partitions, this is really a mess with Delta tables. Most of my DE teams don't want to adopt Delta because of these gl...
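One option to check, sketched with a hypothetical path (dynamic partition overwrite for Delta needs a reasonably recent runtime): with partitionOverwriteMode set to dynamic, only the partitions present in the incoming DataFrame are replaced, with no replaceWhere clause at all.

(df.write.format("delta")
   .mode("overwrite")
   .option("partitionOverwriteMode", "dynamic")  # replace only the partitions present in df
   .save("/mnt/delta/my_table"))                 # hypothetical table path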
Hi All, I am trying to partition a Delta file by a column in PySpark using the command: df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").partitionBy("Partition Column").save("Partition file path") -- it doesn't seem to w...
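For comparison, a minimal working sketch (hypothetical column name and path): partitionBy must receive the actual column name(s) present in the DataFrame, and the save path should be the table's root directory.

(df.write.format("delta")
   .mode("overwrite")
   .option("overwriteSchema", "true")
   .partitionBy("country")      # hypothetical partition column present in df
   .save("/mnt/delta/sales"))   # hypothetical table root path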
Situation: we have one partition per date, and it just so happens that each partition ends up (after OPTIMIZE) as *a single* 128 MB file. We partition on date and Z-order on userid, and our query is something like "find max value of column A where useri...
Z-Ordering will make sure that, in case you need to read multiple files, the related data is co-located across those files. For a single file this does not matter, as a single file is always local to itself. If you are certain that your Spark program will only read a single file,...
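For reference, a short sketch (hypothetical table name) of how Z-ordering is applied; it pays off for data skipping when a userid predicate would otherwise touch many files.

spark.sql("OPTIMIZE my_table ZORDER BY (userid)")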
I am in the process of building my data pipeline, but I am unsure of how to choose which fields in my data I should use for partitioning. What should I be considering when choosing a partitioning strategy?
The important factors when deciding on partition columns are:
- Even distribution of data.
- Choose the column that is commonly or widely accessed or queried.
- Do not create multiple levels of partitioning, as you can end up with a large number of small files.
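A quick sanity check for a candidate column, sketched with hypothetical names: low cardinality and balanced group sizes are what you want to see before committing to a partition column.

from pyspark.sql import functions as F

candidate = "country"  # hypothetical candidate partition column
stats = (df.groupBy(candidate).count()
           .agg(F.countDistinct(candidate).alias("distinct_values"),
                F.min("count").alias("smallest_group"),
                F.max("count").alias("largest_group")))
stats.show()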