Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reading a partitioned table in Spark Structured Streaming

Maatari
New Contributor III

Does the pre-partitioning of a Delta table have an influence on the number of "default" partitions of a DataFrame when reading the data?

Put differently, when reading from a Delta table with Spark Structured Streaming, is the number of DataFrame partitions created somehow derived from the partitioning of the Delta table? The analogy here would be what happens when reading from a Kafka source, where there is a one-to-one mapping between the topic partitions and the DataFrame partitions.

1 REPLY

mark_ott
Databricks Employee

Pre-partitioning of a Delta Table does not strictly determine the number of "default" DataFrame partitions when reading data with Spark Structured Streaming. Unlike Kafka, where each DataFrame partition maps one-to-one to a Kafka partition, Delta Lake tables do not enforce this same direct mapping.

Delta Table Partitioning vs. DataFrame Partitions

  • When a Delta table is partitioned (for example, by a column like "date"), this organization affects how the data is physically laid out and can improve read performance (by pruning files that fall outside the query's filters), but it does not dictate the number of Spark DataFrame partitions produced by a read operation.

  • When you use spark.readStream.format("delta").load(...), Spark will read Delta table files in parallel and determine DataFrame partitions independently of Delta's partition columns and structure.

  • By default, the number of DataFrame partitions is managed internally by Spark, generally based on the number and size of the underlying files, the current Spark configuration (like spark.sql.files.maxPartitionBytes), and available cluster resources, not the number of Delta partitions (see the sketch after this list).
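
As a minimal sketch (the table path /mnt/data/events is hypothetical; substitute your own), the following reads a Delta table as a stream and logs the partition count of each micro-batch. The count tracks file sizes and settings such as spark.sql.files.maxPartitionBytes, not the table's partition columns:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-stream-partitions").getOrCreate()

    # File-scan knob that influences how many partitions a file-based read
    # produces (128 MB is Spark's default; set here only to make it explicit).
    spark.conf.set("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)

    # Hypothetical Delta table path; substitute your own.
    stream_df = spark.readStream.format("delta").load("/mnt/data/events")

    def log_partitions(batch_df, batch_id):
        # Reflects file count/size and Spark settings, not Delta partition columns.
        print(f"batch {batch_id}: {batch_df.rdd.getNumPartitions()} partitions")

    query = stream_df.writeStream.foreachBatch(log_partitions).start()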

Kafka Source Analogy

  • With Kafka, each topic partition maps to exactly one DataFrame partition in Spark Structured Streaming, resulting in a direct one-to-one mapping between Kafka partitions and DataFrame partitions (see the sketch after this list).

  • With Delta, this one-to-one correspondence does not exist. Instead, Spark parallelizes file reads, balancing workload based on file distribution and cluster configuration.
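
Running the same check against a Kafka source makes the contrast visible. Continuing the sketch above (the broker address broker:9092 and the topic name events are hypothetical), each micro-batch carries one DataFrame partition per topic partition under the default options (in particular, minPartitions unset):

    # Hypothetical broker and topic; substitute your own.
    kafka_df = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "events")
                .load())

    def log_kafka_partitions(batch_df, batch_id):
        # With defaults (no minPartitions set), this matches the number of
        # partitions of the subscribed topic.
        print(f"batch {batch_id}: {batch_df.rdd.getNumPartitions()} partitions")

    kafka_df.writeStream.foreachBatch(log_kafka_partitions).start()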

Practical Considerations

  • To control the number of DataFrame partitions when reading a Delta table, adjust Spark's file-scan parameters before reading, or call repartition() after loading the DataFrame, if a specific partitioning is desired (see the sketch after this list).

  • Delta partition columns drive partition pruning on queries, which helps minimize data processing, but Spark decides partition counts at runtime unless explicitly set.
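
Continuing the sketch above, both options look roughly like this (the partition count 64, the column name "date", and the paths are illustrative choices, not recommendations):

    # Option 1: tune the file-scan knob before loading the stream.
    spark.conf.set("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)
    stream_df = spark.readStream.format("delta").load("/mnt/data/events")

    # Option 2: repartition explicitly, here per micro-batch via foreachBatch.
    def process(batch_df, batch_id):
        (batch_df.repartition(64, "date")        # illustrative count and column
                 .write.format("delta").mode("append")
                 .save("/mnt/data/events_out"))  # hypothetical sink path

    (stream_df.writeStream
              .foreachBatch(process)
              .option("checkpointLocation", "/mnt/chk/events_out")  # hypothetical
              .start())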

Source        Partition mapping to DataFrame
Kafka         One-to-one (each topic partition = one DataFrame partition)
Delta table   No direct mapping (depends on file layout and Spark settings)

In summary, Delta Table partitioning affects query efficiency but does not set the default number of DataFrame partitions in the way Kafka source partitions do in Spark Structured Streaming.