Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

pshah83
by New Contributor II
  • 926 Views
  • 2 replies
  • 1 kudos

Use output of SHOW PARTITION commands in Sub-Query/CTE/Function

I am using SHOW PARTITIONS <<table_name>> to get all the partitions of a table. I want to use max() on the output of this command to get the latest partition for the table. However, I am not able to use SHOW PARTITIONS <<table_name>> in a CTE/sub-quer...

Latest Reply
Kaniz
Community Manager
  • 1 kudos

Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...

1 More Replies
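
One possible workaround, sketched here under assumptions rather than taken from the thread: run SHOW PARTITIONS through spark.sql in a notebook, collect the rows, and compute the maximum in Python. The helper below assumes Hive-style partition strings such as "date=2023-01-31"; the actual output shape depends on the table format, and the table name in the comment is hypothetical.

```python
from typing import Iterable


def latest_partition(partition_specs: Iterable[str], column: str) -> str:
    """Return the largest value of `column` found in Hive-style partition specs.

    Each spec is assumed to look like "date=2023-01-31" or
    "date=2023-01-31/country=US" (an assumption about the output shape).
    """
    values = []
    for spec in partition_specs:
        for part in spec.split("/"):
            key, _, value = part.partition("=")
            if key == column:
                values.append(value)
    if not values:
        raise ValueError(f"no partition values found for column {column!r}")
    # Plain string comparison orders ISO dates (YYYY-MM-DD) chronologically.
    return max(values)


# In a notebook with an active SparkSession named `spark`, the specs might be
# obtained like this (hypothetical table name):
#   rows = spark.sql("SHOW PARTITIONS my_db.my_table").collect()
#   specs = [row[0] for row in rows]
specs = ["date=2023-01-30", "date=2023-02-01", "date=2023-01-31"]
print(latest_partition(specs, "date"))
```

The string comparison only gives the latest partition when the values sort lexicographically in the intended order, as zero-padded ISO dates do.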
SamCallister
by New Contributor II
  • 13882 Views
  • 8 replies
  • 3 kudos

Dynamic Partition Overwrite for Delta Tables

Spark supports dynamic partition overwrite for parquet tables by setting the config: spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") before writing to a partitioned table. With delta tables it appears you need to manually specif...

Latest Reply
alijen
New Contributor II
  • 3 kudos

@SamCallister wrote: Spark supports dynamic partition overwrite for parquet tables by setting the config: spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") before writing to a partitioned table. With delta tables it appears you need ...

7 More Replies
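
For context, Delta's DataFrame writer accepts a replaceWhere option that limits an overwrite to rows matching a predicate. The sketch below is not from the thread; it shows one way to build such a predicate from the partition values present in the data being written. The helper name, column name, and path are hypothetical.

```python
from typing import Iterable


def replace_where_predicate(column: str, values: Iterable[str]) -> str:
    """Build a SQL predicate covering only the partitions being rewritten."""
    unique = sorted(set(values))
    if not unique:
        raise ValueError("at least one partition value is required")
    # Escape single quotes so the values are safe inside SQL string literals.
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in unique)
    return f"{column} IN ({quoted})"


predicate = replace_where_predicate(
    "date", ["2021-01-02", "2021-01-01", "2021-01-02"]
)
print(predicate)

# With a DataFrame `df` holding only rows for those dates, the write might be:
#   (df.write.format("delta")
#      .mode("overwrite")
#      .option("replaceWhere", predicate)
#      .save("/hypothetical/path/to/table"))
```

Newer Delta Lake releases also accept a partitionOverwriteMode write option set to "dynamic"; whether that is available depends on the runtime version in use.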
thushar
by Contributor
  • 1789 Views
  • 4 replies
  • 0 kudos

Delta file partitions

We have a function that creates partitioned files; the partitions are created based on metadata (getPartitionColumns) that we maintain. In a table we have two columns that are marked as partition columns, say 'Team' and 'Speciality'. Wh...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Thushar R, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we ...

3 More Replies
Bartek
by Contributor
  • 2979 Views
  • 3 replies
  • 7 kudos

Resolved! Number of partitions in Spark UI Simulator experiment

I am learning how to optimize Spark applications with experiments from Spark UI Simulator. There is experiment #1596 about data skew, and in command 2 there is a comment about how many partitions will be set as default: // Factor of 8 cores and greater ...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 7 kudos

Hi @Bartosz Maciejewski, generally we arrive at the number of shuffle partitions using the following method.
Input data size: 100 GB
Ideal partition target size: 128 MB
Cores: 8
Ideal number of partitions = (100 * 1024) / 128 = 800
To utilize the...

2 More Replies
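
The sizing method in the reply can be sketched as a small calculation, taking 1 GB as 1024 MB. Rounding the result up to a multiple of the core count is an assumption added here for illustration, and the function name is hypothetical.

```python
import math


def shuffle_partitions(input_gb: float, target_mb: int = 128, cores: int = 8) -> int:
    """Estimate a shuffle partition count from data size and target partition size."""
    by_size = math.ceil(input_gb * 1024 / target_mb)
    # Assumption: round up to a multiple of the core count so that every wave
    # of tasks can keep all cores busy.
    return math.ceil(by_size / cores) * cores


n = shuffle_partitions(100)  # 100 GB input, 128 MB target, 8 cores
print(n)

# In a Spark session the value could then be applied with:
#   spark.conf.set("spark.sql.shuffle.partitions", n)
```

This is a starting estimate only; the size of the shuffled data can differ considerably from the input size.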
Personal1
by New Contributor
  • 2321 Views
  • 3 replies
  • 2 kudos

Resolved! Understanding Partitions in Spark Local Mode

I have a few fundamental questions about Spark 3 while running a simple Spark app on my local Mac (with 6 cores in total). Please help. local[*] runs my Spark application in local mode with all the cores present on my Mac, correct? It also means tha...

Latest Reply
Kaniz
Community Manager
  • 2 kudos

Hi @Abhishek Pradhan, just a friendly follow-up. Do you still need help, or did @Werner Stinckens's response help you find the solution? Please let us know.

2 More Replies
Soma
by Valued Contributor
  • 1947 Views
  • 6 replies
  • 3 kudos

Resolved! Dynamically supplying partitions to autoloader

We have a streaming use case and we see a lot of time spent listing files from Azure. Is it possible to supply partitions to Auto Loader dynamically on the fly?

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@somanath Sankaran​ - Thank you for posting your solution. Would you be happy to mark your answer as best so that other members may find it more quickly?

5 More Replies
User16826992666
by Valued Contributor
  • 1098 Views
  • 1 replies
  • 0 kudos

Can I move some partitions of a Delta table to a different location?

I am partitioning my Delta table by date. Older data is rarely accessed, so I am wondering if I can move some of the files off to colder storage options. What would happen if I did this? Is this a supported pattern or would it break the table?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You could look at S3 Intelligent-Tiering - https://aws.amazon.com/about-aws/whats-new/2018/11/s3-intelligent-tiering/

aladda
by Honored Contributor II
  • 1099 Views
  • 1 replies
  • 0 kudos

Resolved! I read that Delta supports concurrent writes to separate partitions of the table but I'm getting an error when doing so

I’m running 3 separate dbt processes in parallel. All of them are reading data from different Databricks databases and creating different staging tables by using dbt alias, but they all at the end update/insert to the same target table. The 3 processes r...

Latest Reply
aladda
Honored Contributor II
  • 0 kudos

You’re likely running into the issue described here and a solution to it as well. While Delta does support concurrent writers to separate partitions of a table, depending on your query structure join/filter/where in particular, there may still be a n...

aladda
by Honored Contributor II
  • 937 Views
  • 1 replies
  • 0 kudos
Latest Reply
aladda
Honored Contributor II
  • 0 kudos

Coalesce essentially groups multiple partitions into larger partitions. So use coalesce when you want to reduce the number of partitions (and also tasks) without impacting sort order. For example, when you want to write out a single CSV file output instea...

User16826992666
by Valued Contributor
  • 726 Views
  • 1 replies
  • 0 kudos

Resolved! I know my partitions are skewed, is there anything I can do to help my performance?

I know the skew in my dataset has the potential to cause issues with my job performance, so just wondering if there is anything I can do to help my performance other than repartitioning the whole dataset.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

For scenarios like this, it is recommended to use a cluster with Databricks Runtime 7.3 LTS or above where AQE is enabled. AQE dynamically handles skew in sort merge join and shuffle hash join by splitting (and replicating if needed) skewed tasks into ...
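
As a rough illustration of how AQE decides that a partition is skewed, the sketch below applies the rule described in the Spark configuration docs: a partition is flagged when it is larger than a factor times the median partition size and also larger than a byte threshold. The defaults used here (factor 5, threshold 256 MB) are assumed to match spark.sql.adaptive.skewJoin.skewedPartitionFactor and spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes; the function itself is hypothetical and not Spark's implementation.

```python
from statistics import median
from typing import List

MB = 1024 * 1024


def skewed_partitions(sizes: List[int], factor: float = 5.0,
                      threshold_bytes: int = 256 * MB) -> List[int]:
    """Return indexes of partitions that the documented AQE rule would flag."""
    mid = median(sizes)
    return [i for i, size in enumerate(sizes)
            if size > factor * mid and size > threshold_bytes]


sizes = [64 * MB, 70 * MB, 60 * MB, 900 * MB, 66 * MB]
print(skewed_partitions(sizes))

# On runtimes where AQE is not already on, it might be enabled with:
#   spark.conf.set("spark.sql.adaptive.enabled", "true")
#   spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

Lowering the factor or the threshold makes the rule flag more partitions, which can help when the skew is moderate.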

Anonymous
by Not applicable
  • 6559 Views
  • 1 replies
  • 0 kudos

Resolved! Ideal number and size of partitions

Spark by default uses 200 partitions when doing transformations that shuffle data. 200 partitions might be too many if a user is working with small data, which can slow down the query. Conversely, 200 partitions might be too few if the data is big. So ho...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You could tweak the default value of 200 by changing the spark.sql.shuffle.partitions configuration to match your data volume. Here is sample Python code for calculating the value. However, if you have multiple workloads with different data volumes, instead ...

1stcommander
by New Contributor II
  • 6772 Views
  • 2 replies
  • 0 kudos

Parquet partitionBy - date column to nested folders

Hi, when writing a DataFrame to parquet using partitionBy(<date column>), the resulting folder structure looks like this:
root
|----------------- day1
|----------------- day2
|----------------- day3
Is it possible to create a structure like to foll...

Latest Reply
Saphira
New Contributor II
  • 0 kudos

Hey @1stcommander, you'll have to create those columns yourself. If it's something you will have to do often you could always write a function. In any case, imho it's not that much work. I'm not sure what your problem is with the partition pruning. It...

1 More Replies
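
The suggestion to create the columns yourself could look roughly like the sketch below. It derives year, month, and day values from a date and shows the nested folder path that partitioning by those three columns would be expected to produce. The function names and the column names are assumptions made for illustration.

```python
from datetime import date
from typing import Dict


def date_partition_columns(d: date) -> Dict[str, int]:
    """Split a date into the three values used as partition columns."""
    return {"year": d.year, "month": d.month, "day": d.day}


def partition_path(d: date) -> str:
    """Render the Hive-style nested folder path for a date."""
    cols = date_partition_columns(d)
    return "/".join(f"{name}={value}" for name, value in cols.items())


print(partition_path(date(2020, 3, 7)))

# A PySpark equivalent, assuming a DataFrame `df` with a date column "date":
#   from pyspark.sql.functions import year, month, dayofmonth
#   (df.withColumn("year", year("date"))
#      .withColumn("month", month("date"))
#      .withColumn("day", dayofmonth("date"))
#      .write.partitionBy("year", "month", "day")
#      .parquet("/hypothetical/output/path"))
```

Filters on the derived year, month, and day columns can then be used for partition pruning; a filter on the original date column alone would not necessarily prune these folders.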