cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Forum Posts

darioAnt
by New Contributor II
  • 1739 Views
  • 1 replies
  • 2 kudos

Filtering delta table by CONCAT of a partition column and a non-partition one

Hi,I know how filtering a delta table on a partition column is a very powerful time-saving approach, but what if this column appears as a CONCAT in the where-clause?I explain my case: I have a delta table with only one partition column, say called co...

  • 1739 Views
  • 1 replies
  • 2 kudos
Latest Reply
darioAnt
New Contributor II
  • 2 kudos

I did myself a test and the answer is no:with a Concat filter, spark sql does not know I am using a partition-based column, so it scan all the table.

  • 2 kudos
Stokholm
by New Contributor III
  • 14291 Views
  • 9 replies
  • 1 kudos

Pushdown of datetime filter to date partition.

Hi Everybody,I have 20 years of data, 600m rows.I have partitioned them on year and month to generated a files size which seems reasonable.(128Mb)All data is queried using timestamp, as all queries needs to filter on the exact hours.So my requirement...

  • 14291 Views
  • 9 replies
  • 1 kudos
Latest Reply
Stokholm
New Contributor III
  • 1 kudos

Hi Guys, thanks for your advices. I found a solution. We upgrade the Databricks Runtime to 12.2 and now the pushdown of the partitionfilter works. The documentation said that 10.4 would be adequate, but obviously it wasn't enough.

  • 1 kudos
8 More Replies
andrew0117
by Contributor
  • 5009 Views
  • 4 replies
  • 0 kudos

Resolved! partition on a csv file

When I use SQL code like "create table myTable (column1 string, column2 string) using csv options('delimiter' = ',', 'header' = 'true') location 'pathToCsv'" to create a table from a single CSV file stored in a folder within an Azure Data Lake contai...

  • 5009 Views
  • 4 replies
  • 0 kudos
Latest Reply
pvignesh92
Honored Contributor
  • 0 kudos

Hi @andrew li​, When you specify a path with LOCATION keyword, Spark will consider that to be an EXTERNAL table. So when you dropped the table, you underlying data if any will not be cleared. So in you case, as this is an external table, you folder s...

  • 0 kudos
3 More Replies
jerry-xu-sa
by New Contributor II
  • 2585 Views
  • 2 replies
  • 1 kudos

Order of a dataframe is not perserved after calling cache() and limit()

Here are the simple steps to reproduce it. Note that col "foo" and "bar" are just redundant cols to make sure the dataframe doesn't fit into a single partition. // generate a random df val rand = new scala.util.Random val df = (1 to 3000).map(i => (r...

  • 2585 Views
  • 2 replies
  • 1 kudos
Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @Jerry Xu​ Thank you for your question! To assist you better, please take a moment to review the answer and let me know if it best fits your needs.Please help us select the best solution by clicking on "Select As Best" if it does.Your feedback wil...

  • 1 kudos
1 More Replies
ghofigjong
by New Contributor
  • 6353 Views
  • 2 replies
  • 1 kudos

Resolved! How does partition pruning work on a merge into statement?

I have a delta table that is partitioned by Year, Date and month. I'm trying to merge data to this on all three partition columns + an extra column (an ID). My merge statement is below:MERGE INTO delta.<path of delta table> oldData using df newData ...

  • 6353 Views
  • 2 replies
  • 1 kudos
Latest Reply
Umesh_S
New Contributor II
  • 1 kudos

Isn't the suggested idea only filtering the input dataframe (resulting in a smaller amount of data to match across the whole delta table) rather than prune the delta table for relevant partitions to scan?

  • 1 kudos
1 More Replies
pasiasty2077
by New Contributor
  • 7671 Views
  • 1 replies
  • 1 kudos

Partition filter is skipped when table is used in where condition, why?

Hi,maybe someone can help me i do want to run very narrow query SELECT * FROM my_table WHERE snapshot_date IN('2023-01-06', '2023-01-07')   -- part of the physical plan: -- Location: PreparedDeltaFileIndex [dbfs:/...] -- PartitionFilters: [cast(snaps...

  • 7671 Views
  • 1 replies
  • 1 kudos
Latest Reply
-werners-
Esteemed Contributor III
  • 1 kudos

No hints on partition pruning afaik.The reason the partitions were not pruned is because the second query generates a completely different plan.To be able to filter the partitions, a join first has to happen. And in this case it means the table has...

  • 1 kudos
Mohit_Kumar_Sut
by New Contributor III
  • 4858 Views
  • 5 replies
  • 1 kudos

Write in Single CSV file

We are reading 520GB partitions files from CSV and when we write in a Single CSV using repartition(1) it is taking 25+ hours. please let us know an optimized way to create a single CSV file so that our process could complete within 5 hours.

  • 4858 Views
  • 5 replies
  • 1 kudos
Latest Reply
Anonymous
Not applicable
  • 1 kudos

Hi @mohit kumar suthar​ Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you...

  • 1 kudos
4 More Replies
Vickyster
by New Contributor II
  • 1241 Views
  • 0 replies
  • 0 kudos

Column partitioning is not working in delta live table when `columnMapping` table property is enabled.

I'm trying to create delta live table on top of json files placed in azure blob. The json files contains white spaces in column names instead of renaming I tried `columnMapping` table property which let me create the table with spaces but the column ...

  • 1241 Views
  • 0 replies
  • 0 kudos
andrej
by New Contributor II
  • 2302 Views
  • 4 replies
  • 1 kudos

Partition pruning with generated columns

I have a large table which contains a date_time column.The table contains 2 generated columns year, and month which are extracted from the date_time values and are used for partitioning.I have the following question.If I run the querySELECT *FROM tab...

  • 2302 Views
  • 4 replies
  • 1 kudos
Latest Reply
Vidula
Honored Contributor
  • 1 kudos

Hi @Andrej Znidarsic​ Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. We'd love to hear from you.T...

  • 1 kudos
3 More Replies
isaac_gritz
by Databricks Employee
  • 7396 Views
  • 4 replies
  • 2 kudos

Performance Tuning Best Practices

Recommendations for performance tuning best practices on DatabricksWe recommend also checking out this article from my colleague @Franco Patano​ on best practices for performance tuning on Databricks.​Performance tuning your workloads is an important...

Performance Tuning Framework.png
  • 7396 Views
  • 4 replies
  • 2 kudos
Latest Reply
isaac_gritz
Databricks Employee
  • 2 kudos

Let us know in the comments if you have any other performance tuning tips & tricks

  • 2 kudos
3 More Replies
Maverick1
by Valued Contributor II
  • 5136 Views
  • 3 replies
  • 6 kudos

Is there any way to overwrite a partition in delta table without specifying each and every partition in replace where? For non dated partitions, this is really a mess with delta tables.

Is there any way to overwrite a partition in delta table without specifying each and every partition in replace where. For non dated partitions, this is really a mess with delta tables.Most of my DE teams don't want to adopt delta because of these gl...

  • 5136 Views
  • 3 replies
  • 6 kudos
Latest Reply
Anonymous
Not applicable
  • 6 kudos

Hi @Saurabh Verma​ following up did you get a chance to check @Hubert Dudek​ previous comments ?

  • 6 kudos
2 More Replies
HarshaK
by New Contributor III
  • 12597 Views
  • 4 replies
  • 6 kudos

Resolved! Partition By () on Delta Files

Hi All,I am trying to Partition By () on Delta file in pyspark language and using command:df.write.format("delta").mode("overwrite").option("overwriteSchema","true").partitionBy("Partition Column").save("Partition file path") -- It doesnt seems to w...

  • 12597 Views
  • 4 replies
  • 6 kudos
Latest Reply
Anonymous
Not applicable
  • 6 kudos

Hey @Harsha kriplani​ Hope you are well. Thank you for posting in here. It is awesome that you found a solution. Would you like to mark Hubert's answer as best?  It would be really helpful for the other members too.Cheers!

  • 6 kudos
3 More Replies
Erik
by Valued Contributor III
  • 3423 Views
  • 4 replies
  • 2 kudos

Resolved! Does Z-ordering speed up reading of a single file?

Situation: we have one partion per date, and it just so happens that each partition ends up (after optimize) as *a single* 128mb file. We partition on date, and zorder on userid, and our query is something like "find max value of column A where useri...

  • 3423 Views
  • 4 replies
  • 2 kudos
Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

Z-Order will make sure that in case you need to read multiple files, these files are co-located.For a single file this does not matter as a single file is always local to itself.If you are certain that your spark program will only read a single file,...

  • 2 kudos
3 More Replies
User16826992666
by Valued Contributor
  • 3829 Views
  • 1 replies
  • 0 kudos

How do I choose which column to partition by?

I am in the process of building my data pipeline, but I am unsure of how to choose which fields in my data I should use for partitioning. What should I be considering when choosing a partitioning strategy?

  • 3829 Views
  • 1 replies
  • 0 kudos
Latest Reply
brickster_2018
Databricks Employee
  • 0 kudos

The important factors deciding partition columns are:Even distribution of data. Choose the column that is commonly or widely accessed or queried. Do not create multiple levels of partition, as you can end up with a large number of small files.

  • 0 kudos
Labels