Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Soma
by Valued Contributor
  • 3573 Views
  • 6 replies
  • 3 kudos

Resolved! Dynamically supplying partitions to autoloader

We have a streaming use case and we see a lot of time spent in file listing from Azure. Is it possible to supply partitions to Auto Loader dynamically, on the fly?

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@somanath Sankaran - Thank you for posting your solution. Would you be happy to mark your answer as best so that other members may find it more quickly?

5 More Replies
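
The accepted solution is truncated above; one common way to cut Azure listing time is to restrict the paths Auto Loader lists by building the input glob at job start. A minimal sketch, assuming a date-partitioned ADLS layout; the path and format are made up, and spark is the ambient Databricks session:

    from datetime import date, timedelta

    # Hypothetical ADLS path; replace with your own container and table.
    base = "abfss://data@myaccount.dfs.core.windows.net/events"

    # Build a glob covering only the partitions of interest (here the
    # last three days) so the stream never lists the whole table.
    days = [(date.today() - timedelta(days=d)).isoformat() for d in range(3)]
    recent = "{" + ",".join(days) + "}"

    df = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .load(f"{base}/date={recent}/"))

Note the glob is fixed once the stream starts, so truly dynamic partitions require restarting the stream with a new path.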
SamCallister
by New Contributor II
  • 18173 Views
  • 8 replies
  • 3 kudos

Dynamic Partition Overwrite for Delta Tables

Spark supports dynamic partition overwrite for Parquet tables by setting the config spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") before writing to a partitioned table. With Delta tables it appears you need to manually specif...

Latest Reply
alijen
New Contributor II
  • 3 kudos

@SamCallister wrote: Spark supports dynamic partition overwrite for Parquet tables by setting the config spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic") before writing to a partitioned table. With Delta tables it appears you need ...

7 More Replies
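
For reference, Delta offers two mechanisms here. replaceWhere overwrites only the rows matching a predicate, and newer Delta Lake / Databricks Runtime versions also accept partitionOverwriteMode=dynamic as a write option. A sketch with a made-up table path and column name; df is an existing DataFrame:

    # Overwrite only the partitions matched by an explicit predicate.
    (df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", "event_date = '2023-01-15'")
        .save("/mnt/delta/events"))

    # On newer Delta Lake versions, dynamic partition overwrite replaces
    # exactly the partitions present in df, mirroring the Parquet behavior.
    (df.write.format("delta")
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")
        .save("/mnt/delta/events"))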
thushar
by Contributor
  • 3475 Views
  • 4 replies
  • 0 kudos

Delta file partitions

We have one function that creates files with partitions; the partitions are created based on metadata (getPartitionColumns) that we keep. In a table we have two columns that are specified as partition columns, say 'Team' and 'Speciality'. Wh...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

Hi @Thushar R, hope everything is going great. Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we ...

3 More Replies
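
A sketch of the pattern described in the question: read the partition columns from your own metadata and unpack them into partitionBy. get_partition_columns below is a hypothetical stand-in for the poster's getPartitionColumns, and df is an existing DataFrame:

    def get_partition_columns(table_name):
        # Hypothetical metadata lookup standing in for getPartitionColumns.
        metadata = {"player_stats": ["Team", "Speciality"]}
        return metadata[table_name]

    cols = get_partition_columns("player_stats")

    (df.write.format("delta")
        .mode("overwrite")
        .partitionBy(*cols)   # unpack the list of partition columns
        .save("/mnt/delta/player_stats"))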
Bartek
by Contributor
  • 6287 Views
  • 3 replies
  • 9 kudos

Resolved! Number of partitions in Spark UI Simulator experiment

I am learning how to optimize Spark applications with experiments from the Spark UI Simulator. There is experiment #1596 about data skew, and in command 2 there is a comment about how many partitions will be set as default: // Factor of 8 cores and greater ...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 9 kudos

Hi @Bartosz Maciejewski. Generally we arrive at the number of shuffle partitions using the following method. Input data size: 100 GB. Ideal partition target size: 128 MB. Cores: 8. Ideal number of partitions = (100 * 1024) / 128 = 800. To utilize the...

2 More Replies
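
The arithmetic from the reply, written out (128 MB per partition is the usual target; rounding up to a multiple of the core count keeps every executor wave full):

    import math

    input_size_mb = 100 * 1024   # 100 GB of input data, in MB
    target_mb = 128              # ideal partition size
    cores = 8

    ideal = math.ceil(input_size_mb / target_mb)       # 800
    # Round up to a multiple of the core count so no wave runs partially idle.
    partitions = math.ceil(ideal / cores) * cores      # 800

    spark.conf.set("spark.sql.shuffle.partitions", partitions)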
pshah83
by New Contributor II
  • 1843 Views
  • 0 replies
  • 2 kudos

Use output of SHOW PARTITION commands in Sub-Query/CTE/Function

I am using SHOW PARTITIONS <<table_name>> to get all the partitions of a table, and I want to use max() on the output of this command to get the latest partition for the table. However, I am not able to use SHOW PARTITIONS <<table_name>> in a CTE/sub-quer...

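
One workaround, sketched with hypothetical database and table names: spark.sql returns the SHOW PARTITIONS output as a DataFrame, which you can aggregate directly or register as a temp view for later SQL:

    from pyspark.sql import functions as F

    # SHOW PARTITIONS cannot be embedded in a CTE/sub-query, but its
    # output is an ordinary DataFrame with a single `partition` column.
    parts = spark.sql("SHOW PARTITIONS my_db.my_table")
    latest = parts.agg(F.max("partition").alias("latest")).first()["latest"]

    # Or expose it back to SQL:
    parts.createOrReplaceTempView("table_partitions")
    spark.sql("SELECT max(partition) AS latest FROM table_partitions").show()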
Personal1
by New Contributor II
  • 3864 Views
  • 2 replies
  • 2 kudos

Resolved! Understanding Partitions in Spark Local Mode

I have a few fundamental questions about Spark 3 while running a simple Spark app on my local Mac machine (6 cores in total). Please help. local[*] runs my Spark application in local mode with all the cores present on my Mac, correct? It also means tha...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

That is a lot of questions in one topic. Let's give it a try: [1] This all depends on the values of the relevant parameters and the program you run (think joins, unions, repartition, etc.). [2] spark.default.parallelism is by default the number of cores *...

1 More Reply
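
A quick way to verify these values yourself (a sketch; run on any machine with PySpark installed):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")   # one worker thread per available core
             .appName("partition-check")
             .getOrCreate())

    print(spark.sparkContext.defaultParallelism)           # number of cores
    print(spark.conf.get("spark.sql.shuffle.partitions"))  # 200 by default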
User16826992666
by Valued Contributor
  • 1703 Views
  • 1 reply
  • 0 kudos

Can I move some partitions of a Delta table to a different location?

I am partitioning my Delta table by date. Older data is rarely accessed, so I am wondering if I can move some of the files off to colder storage options. What would happen if I did this? Is this a supported pattern or would it break the table?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You could look at S3 Intelligent-Tiering - https://aws.amazon.com/about-aws/whats-new/2018/11/s3-intelligent-tiering/

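
Because Intelligent-Tiering changes an object's storage class without changing its key, the Delta table keeps pointing at the same paths. A hedged boto3 sketch of a lifecycle rule scoped to old partitions; the bucket name and prefix are hypothetical:

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-data-lake",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "tier-old-delta-partitions",
                "Status": "Enabled",
                # Only objects under the old date partitions are affected.
                "Filter": {"Prefix": "delta/events/date=2020"},
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }]
        },
    )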
aladda
by Databricks Employee
  • 2422 Views
  • 1 reply
  • 0 kudos

Resolved! I read that Delta supports concurrent writes to separate partitions of the table but I'm getting an error when doing so

I’m running 3 separate dbt processes in parallel. All of them are reading data from different Databricks databases and creating different staging tables by using dbt alias, but they all update/insert to the same target table at the end. The 3 processes r...

Latest Reply
aladda
Databricks Employee
  • 0 kudos

You’re likely running into the issue described here, and a solution to it as well. While Delta does support concurrent writers to separate partitions of a table, depending on your query structure (join/filter/where in particular) there may still be a n...

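
The usual fix, per the Delta isolation-levels guidance, is to make each writer's predicate explicit so the transactions provably touch disjoint partitions. A sketch with hypothetical names, one per concurrent job:

    # Each of the three parallel jobs pins its write to its own partition
    # value, so Delta can verify the transactions do not conflict.
    (df_region_a.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", "region = 'A'")
        .save("/mnt/delta/target"))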
aladda
by Databricks Employee
  • 1752 Views
  • 1 reply
  • 0 kudos
Latest Reply
aladda
Databricks Employee
  • 0 kudos

Coalesce essentially groups multiple partitions into larger partitions. So use coalesce when you want to reduce the number of partitions (and also tasks) without impacting sort order. Ex: when you want to write out a single CSV file output instea...

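
The single-CSV example from the reply, sketched with a hypothetical output path:

    # coalesce(1) merges all partitions into one without a full shuffle,
    # so exactly one CSV part file is written (suitable for small outputs).
    (df.coalesce(1)
        .write.mode("overwrite")
        .option("header", True)
        .csv("/mnt/exports/report"))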
User16826992666
by Valued Contributor
  • 1311 Views
  • 1 reply
  • 0 kudos

Resolved! I know my partitions are skewed, is there anything I can do to help my performance?

I know the skew in my dataset has the potential to cause issues with my job performance, so just wondering if there is anything I can do to help my performance other than repartitioning the whole dataset.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

For scenarios like this, it is recommended to use a cluster with Databricks Runtime 7.3 LTS or above, where AQE is enabled. AQE dynamically handles skew in sort-merge join and shuffle hash join by splitting (and replicating if needed) skewed tasks into ...

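
The relevant settings, shown explicitly as a sketch (they are on by default in recent Databricks Runtime versions):

    # AQE detects skewed partitions at runtime and splits them into
    # smaller tasks during sort-merge joins.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")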
Anonymous
by Not applicable
  • 10689 Views
  • 1 reply
  • 0 kudos

Resolved! Ideal number and size of partitions

Spark by default uses 200 partitions when doing transformations. 200 partitions might be too many if a user is working with small data, which can slow down the query. Conversely, 200 partitions might be too few if the data is big. So ho...

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

You could tweak the default value of 200 by changing the spark.sql.shuffle.partitions configuration to match your data volume. Here is a sample Python code for calculating the value. However, if you have multiple workloads with different data volumes, instead ...

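
The reply's sample code is not shown above; a hedged reconstruction of the idea, with the 128 MB target partition size as an assumption:

    import math

    def shuffle_partitions(shuffle_size_gb, target_mb=128, cores=8):
        """Estimate spark.sql.shuffle.partitions for a given shuffle volume."""
        ideal = math.ceil(shuffle_size_gb * 1024 / target_mb)
        # Never go below one partition per core; round up to a full wave.
        return max(cores, math.ceil(ideal / cores) * cores)

    spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions(10))

With AQE on recent runtimes, spark.sql.adaptive.coalescePartitions.enabled can size post-shuffle partitions automatically instead.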
1stcommander
by New Contributor II
  • 8514 Views
  • 2 replies
  • 0 kudos

Parquet partitionBy - date column to nested folders

Hi, when writing a DataFrame to parquet using partitionBy(<date column>), the resulting folder structure looks like this:

root
|----------------- day1
|----------------- day2
|----------------- day3

Is it possible to create a structure like the foll...

Latest Reply
Saphira
New Contributor II
  • 0 kudos

Hey @1stcommander, you'll have to create those columns yourself. If it's something you will have to do often, you could always write a function. In any case, IMHO it's not that much work. I'm not sure what your problem is with the partition pruning. It...

1 More Reply
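
The reply's approach, sketched with assumed column names: derive year, month, and day columns from the date and pass all three to partitionBy:

    from pyspark.sql import functions as F

    nested = (df
        .withColumn("year",  F.year("event_date"))
        .withColumn("month", F.month("event_date"))
        .withColumn("day",   F.dayofmonth("event_date")))

    # Produces root/year=2023/month=1/day=15/... style nested folders;
    # use F.date_format("event_date", "MM") for zero-padded names instead.
    nested.write.partitionBy("year", "month", "day").parquet("/mnt/out/events")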