I am using SHOW PARTITIONS <<table_name>> to get all the partitions of a table. I want to use max() on the output of this command to get the latest partition for the table. However, I am not able to use SHOW PARTITIONS <<table_name>> in a CTE/sub-quer...
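A minimal sketch of one common workaround, assuming PySpark: SHOW PARTITIONS cannot be embedded in a CTE/sub-query, but spark.sql() returns its output as a DataFrame that can be aggregated. The table name my_db.my_table and the partition column name are assumptions for illustration.

from pyspark.sql import functions as F

# "spark" is the active SparkSession (available by default in Databricks notebooks).
# The result of SHOW PARTITIONS has a single string column named "partition",
# e.g. "date=2023-01-31" for a table with one partition column.
partitions_df = spark.sql("SHOW PARTITIONS my_db.my_table")

latest = (
    partitions_df
    .withColumn("part_value", F.split("partition", "=").getItem(1))   # keep the value part
    .agg(F.max("part_value").alias("latest_partition"))
    .collect()[0]["latest_partition"]
)
print(latest)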
Thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question? This...
Spark supports dynamic partition overwrite for parquet tables by setting the config:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
before writing to a partitioned table. With Delta tables it appears you need to manually specif...
@SamCallister wrote: Spark supports dynamic partition overwrite for parquet tables by setting the config: spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic") before writing to a partitioned table. With Delta tables it appears you need ...
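Both posts above are truncated. As a hedged sketch of the two approaches being discussed: dynamic partition overwrite for a Parquet table, and restricting a Delta overwrite to specific partitions with the replaceWhere option. The partition column, predicate, and paths below are assumptions.

# Parquet: with dynamic mode, overwrite only touches partitions present in this batch.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(df.write
   .mode("overwrite")
   .partitionBy("date")                    # partition column is an assumption
   .parquet("/mnt/data/events_parquet"))   # path is an assumption

# Delta: one way to overwrite only selected partitions is the replaceWhere option.
(df.write
   .format("delta")
   .mode("overwrite")
   .option("replaceWhere", "date = '2023-01-31'")   # predicate is an assumption
   .save("/mnt/data/events_delta"))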
We have one function to create files with partitions; in it, the partitions are created based on metadata (getPartitionColumns) that we are keeping. In a table we have two columns that are marked as partition columns, say 'Team' and 'Speciality'. Wh...
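A minimal sketch of writing with partition columns taken from metadata, assuming a hypothetical getPartitionColumns() helper that returns a list such as ['Team', 'Speciality']; the table name and path are made up.

def getPartitionColumns(table_name: str) -> list:
    # Hypothetical metadata lookup; the real implementation lives elsewhere.
    return ["Team", "Speciality"]

partition_cols = getPartitionColumns("my_table")

# "df" is the DataFrame being written.
(df.write
   .mode("append")
   .partitionBy(*partition_cols)   # unpack the metadata-driven column list
   .format("delta")
   .save("/mnt/data/my_table"))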
Hi @Thushar R Hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we ...
I am learning how to optimize Spark applications with experiments from the Spark UI Simulator. There is experiment #1596 about data skew, and in command 2 there is a comment about how many partitions will be set as default: // Factor of 8 cores and greater ...
Hi @Bartosz Maciejewski Generally we arrive at the number of shuffle partitions using the following method:
Input data size: 100 GB
Ideal partition target size: 128 MB
Cores: 8
Ideal number of partitions = (100 * 1024) / 128 = 800
To utilize the...
I have a few fundamental questions in Spark 3 while running a simple Spark app on my local Mac machine (with 6 cores in total). Please help. local[*] runs my Spark application in local mode with all the cores present on my Mac, correct? It also means tha...
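A quick way to check what local[*] actually gives you, assuming a plain PySpark session on the laptop:

from pyspark.sql import SparkSession

# local[*] asks Spark to use as many worker threads as there are logical cores.
spark = SparkSession.builder.master("local[*]").appName("core-check").getOrCreate()

# defaultParallelism reflects the number of cores the local master was given,
# and is the default partition count for operations like sc.parallelize().
print(spark.sparkContext.defaultParallelism)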
Hi @Abhishek Pradhan, just a friendly follow-up. Do you still need help, or did @Werner Stinckens's response help you find the solution? Please let us know.
We have a streaming use case and we see a lot of time spent in listing from Azure. Is it possible to supply partitions to Auto Loader dynamically, on the fly?
@somanath Sankaran - Thank you for posting your solution. Would you be happy to mark your answer as best so that other members may find it more quickly?
I am partitioning my Delta table by date. Older data is rarely accessed, so I am wondering if I can move some of the files off to colder storage options. What would happen if I did this? Is this a supported pattern or would it break the table?
I’m running 3 separate dbt processes in parallel. All of them read data from different Databricks databases and create different staging tables by using a dbt alias, but they all update/insert into the same target table at the end. The 3 processes r...
You’re likely running into the issue described here, and a solution to it as well. While Delta does support concurrent writers to separate partitions of a table, depending on your query structure (join/filter/where in particular) there may still be a n...
Coalesce essentially groups multiple partitions into larger partitions. So use coalesce when you want to reduce the number of partitions (and also tasks) without impacting sort order. Ex: when you want to write out a single CSV file instea...
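A minimal sketch of the single-file CSV write mentioned above; the output path is an assumption.

# coalesce(1) merges the existing partitions into one without a full shuffle,
# so the job writes a single CSV part file instead of one file per partition.
(df.coalesce(1)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv("/mnt/output/single_file_csv"))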
I know the skew in my dataset has the potential to cause issues with my job performance, so just wondering if there is anything I can do to help my performance other than repartitioning the whole dataset.
For scenarios like this, it is recommended to use a cluster with Databricks Runtime 7.3 LTS or above, where AQE is enabled. AQE dynamically handles skew in sort merge join and shuffle hash join by splitting (and replicating if needed) skewed tasks into ...
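If you need to verify or toggle the behavior, the skew-join handling sits behind these configurations; a sketch, with the thresholds shown at their upstream Spark defaults.

# Adaptive Query Execution (AQE) master switch.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Skew-join handling: AQE splits oversized shuffle partitions on the skewed side.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Thresholds that decide when a partition counts as skewed (upstream defaults).
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")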
Spark by default uses 200 shuffle partitions when doing wide transformations. 200 partitions might be too many if a user is working with small data, which can slow down the query. Conversely, 200 partitions might be too few if the data is big. So ho...
You could tweak the default value of 200 by changing the spark.sql.shuffle.partitions configuration to match your data volume. Here is a sample Python code for calculating the value. However, if you have multiple workloads with different data volumes, instead ...
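The sample code referenced above is not shown in the post; a minimal sketch of the calculation, assuming the 100 GB input and 128 MB target partition size used earlier in the thread.

# Rough sizing of spark.sql.shuffle.partitions from the shuffle input volume.
input_size_mb = 100 * 1024          # 100 GB expressed in MB (assumption)
target_partition_size_mb = 128      # target partition size (assumption)

num_partitions = max(1, round(input_size_mb / target_partition_size_mb))
spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))
print(num_partitions)               # 800 for a 100 GB shuffle at 128 MB per partition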
Hi,
when writing a DataFrame to parquet using partitionBy(<date column>), the resulting folder structure looks like this:
root
|----------------- day1
|----------------- day2
|----------------- day3
Is it possible to create a structure like the foll...
Hey @1stcommander You'll have to create those columns yourself. If it's something you will have to do often, you could always write a function. In any case, imho it's not that much work. I'm not sure what your problem is with the partition pruning. It...
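A hedged sketch of "creating those columns yourself", assuming the goal is a nested year/month/day layout derived from the date column; the column name event_date and the output path are made up.

from pyspark.sql import functions as F

# Derive the extra partition columns from the existing date column, then partition
# by them to get a nested folder structure: root/year=.../month=.../day=...
nested = (df
    .withColumn("year",  F.year("event_date"))
    .withColumn("month", F.month("event_date"))
    .withColumn("day",   F.dayofmonth("event_date")))

(nested.write
   .mode("overwrite")
   .partitionBy("year", "month", "day")
   .parquet("/mnt/output/events"))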