Women in Data & AI

Spark Concepts A to Z

Yogic24
Contributor III

Hello everyone,

Spark Concepts A to Z!!

A - Adaptive Query Execution (AQE):
Re-optimizes query plans at runtime using statistics collected during execution, for example by coalescing shuffle partitions, switching join strategies, and splitting skewed join partitions.

B - Broadcast Join:
Sends small datasets to all nodes for local joins, reducing shuffle overhead and enhancing performance.
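
To make the idea concrete, here is a minimal plain-Python sketch of a broadcast (map-side) hash join. It is only an illustration under simplifying assumptions, not Spark's implementation: the function name `broadcast_join` and the sample data are hypothetical, and partitions are modelled as ordinary lists. In PySpark itself, the hint is given with `pyspark.sql.functions.broadcast(df)`.

```python
# Plain-Python sketch of a broadcast hash join (inner join); not Spark code.

def broadcast_join(large_partitions, small_rows):
    """Join each partition of the large side against a full copy of the small side."""
    # "Broadcast": build one lookup table from the small side.
    # In a cluster, every node would receive its own copy of this table.
    lookup = {}
    for key, value in small_rows:
        lookup.setdefault(key, []).append(value)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition joins locally, so rows of the large side never
        # move between partitions (no shuffle of the large dataset).
        joined = [
            (key, left_value, right_value)
            for key, left_value in partition
            for right_value in lookup.get(key, [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Hypothetical data: orders spread over two partitions, plus a small country table.
orders = [[(1, "order-a"), (2, "order-b")], [(1, "order-c"), (3, "order-d")]]
countries = [(1, "IN"), (2, "US")]
result = broadcast_join(orders, countries)
```

The output keeps the partitioning of the large side, which is why this strategy tends to pay off only when the other side is small enough to copy everywhere.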

C - Coalesce:
Reduces the number of partitions in a DataFrame/RDD by merging existing partitions, avoiding the full shuffle that repartition performs.
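
As a rough illustration, the sketch below merges whole partitions into fewer groups without redistributing individual rows. It is a plain-Python toy: the `coalesce` function here is a hypothetical stand-in, and the contiguous grouping rule is an assumption chosen for simplicity, not the placement logic Spark actually uses.

```python
# Plain-Python sketch of coalescing partitions; not Spark code.

def coalesce(partitions, num_partitions):
    """Merge whole partitions into at most num_partitions groups."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if num_partitions >= len(partitions):
        # This sketch never increases the partition count.
        return [list(p) for p in partitions]

    merged = [[] for _ in range(num_partitions)]
    for index, partition in enumerate(partitions):
        # Assumed rule: neighbouring input partitions land in the same output.
        target = index * num_partitions // len(partitions)
        merged[target].extend(partition)
    return merged


parts = [[1, 2], [3], [4, 5], [6]]
coalesced = coalesce(parts, 2)
```

Because partitions are only concatenated, the resulting sizes can be uneven; that is the usual trade-off against a full repartition.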

D - DataFrame:
Distributed collection of data organized into named columns, akin to a relational table for high-level data manipulation.

E - Executor:
Process launched on a worker node that runs the tasks assigned by the driver and holds cached data in memory or on disk.

F - Fault Tolerance:
Spark's ability to recover from node failures using lineage information to recompute lost data partitions.

G - GroupBy:
Aggregates data based on keys, enabling operations like sum, average, and count on grouped data.

H - HiveContext:
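Pre-2.0 entry point for running Hive queries and UDFs and accessing the Hive metastore from Spark SQL. Since Spark 2.0 it is deprecated in favor of SparkSession with enableHiveSupport().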

I - In-Memory Computing:
Stores data in RAM for faster iterative algorithms and processing tasks.

J - Join:
Combines rows from multiple DataFrames/RDDs based on related columns for complex data transformations.

K - Kryo Serialization:
Serialization library that is typically faster and more compact than default Java serialization, reducing network and disk I/O when data is shuffled or cached.

L - Lineage:
Tracks transformations on RDDs/DataFrames, ensuring fault tolerance by recomputing lost data.
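
The recovery idea can be sketched in a few lines of plain Python. This is a toy model, not Spark's RDD machinery: the class `LineageDataset` and its methods are hypothetical names, and a "lost" partition is simulated by deleting an entry from a dictionary.

```python
# Plain-Python sketch of lineage-based recovery; not Spark code.

class LineageDataset:
    """Toy dataset that remembers how its partitions are derived."""

    def __init__(self, source_partitions=None, parent=None, func=None):
        self.source_partitions = source_partitions  # only set on the root dataset
        self.parent = parent                        # lineage: where the data came from
        self.func = func                            # lineage: how it was transformed

    def map(self, func):
        # Record the transformation instead of running it immediately.
        return LineageDataset(parent=self, func=func)

    def compute_partition(self, index):
        # Walk the lineage back to the source and replay it for one partition.
        if self.parent is None:
            return list(self.source_partitions[index])
        return [self.func(x) for x in self.parent.compute_partition(index)]


base = LineageDataset(source_partitions=[[1, 2], [3, 4]])
derived = base.map(lambda x: x * 10).map(lambda x: x + 1)

cached = {i: derived.compute_partition(i) for i in range(2)}
del cached[1]                             # simulate losing partition 1 with a failed node
cached[1] = derived.compute_partition(1)  # rebuild only that partition from lineage
```

Only the missing partition is recomputed; partition 0 is left untouched, which is the property that makes lineage cheaper than replicating every intermediate result.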

M - Map:
Applies a function to each element of an RDD or Dataset, producing a new dataset with exactly one output element per input element.

N - Narrow Dependency:
Occurs when each parent partition is used by at most one child partition (as in map or filter), so the work can be pipelined without a shuffle.

O - Optimization:
Improves query performance through techniques like predicate pushdown and cost-based optimization in Spark.

P - Partition:
Logical division of data in RDDs/DataFrames enabling parallel processing across Spark nodes.
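
A small plain-Python sketch of the same idea: split the data into partitions, process each one independently, then combine the partial results. The helper names and the round-robin split are assumptions for illustration only, and a thread pool stands in for the executors of a cluster.

```python
# Plain-Python sketch of partitioned, parallel processing; not Spark code.
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Distribute rows round-robin into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        partitions[i % num_partitions].append(row)
    return partitions


def sum_partition(partition):
    """Work done independently on one partition."""
    return sum(partition)


partitions = split_into_partitions(range(1, 11), 3)

# Each partition is handled by its own worker; results come back in partition order.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_sums = list(pool.map(sum_partition, partitions))

total = sum(partial_sums)
```

The number of partitions sets the upper bound on parallelism here, just as it bounds the number of tasks that can run at once for a stage.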

Q - Query Execution Plan:
Outlines the steps Spark takes to execute a query: the parsed and analyzed logical plans, the optimized logical plan, and finally the physical plan that actually runs.

R - Resilient Distributed Dataset (RDD):
Immutable, distributed collection of objects in Spark.

S - Spark SQL:
Module for structured data processing in Spark.

T - Transformations:
Lazily evaluated operations that create new RDDs/DataFrames from existing ones, like map, filter, and reduceByKey; nothing runs until an action such as count or collect is called.

U - Union:
Merges RDDs/DataFrames into a single dataset, combining their elements.

V - Vectorized Query Execution:
Processes multiple rows together using CPU caches and SIMD instructions for enhanced query performance.

W - Wide Dependency:
Occurs when child partitions depend on multiple parent partitions, often requiring shuffling in Spark processing.
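
The contrast with a narrow dependency can be sketched in plain Python. This is an illustrative toy, not Spark's shuffle: `map_partitions` and `reduce_by_key` are hypothetical helpers, and integer keys are used so that `hash(key)` is deterministic.

```python
# Plain-Python sketch of narrow vs. wide dependencies; not Spark code.
from collections import defaultdict
from functools import reduce


def map_partitions(partitions, func):
    """Narrow: output partition i is built from input partition i only."""
    return [[func(row) for row in partition] for partition in partitions]


def reduce_by_key(partitions, func, num_partitions):
    """Wide: each output partition may need rows from every input partition."""
    # Shuffle step: route every (key, value) pair to a partition chosen by its key.
    buckets = [defaultdict(list) for _ in range(num_partitions)]
    for partition in partitions:
        for key, value in partition:
            buckets[hash(key) % num_partitions][key].append(value)
    # Reduce step: combine the values gathered for each key.
    return [
        [(key, reduce(func, values)) for key, values in bucket.items()]
        for bucket in buckets
    ]


pairs = [[(1, 1), (2, 1)], [(1, 1), (3, 1)]]
doubled = map_partitions(pairs, lambda kv: (kv[0], kv[1] * 2))   # no data movement
counts = reduce_by_key(pairs, lambda a, b: a + b, 2)             # requires a shuffle
```

Key 1 appears in both input partitions, so its count can only be produced after rows from both have been brought together; that regrouping is the shuffle.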

X - XML Data Source:
Allows Spark to read and write XML files, supporting parsing and querying of XML data.

Y - Yarn:
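YARN (Yet Another Resource Negotiator) is Hadoop's cluster resource manager; it allocates the containers in which Spark's driver and executors run across the cluster's nodes.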

Z - Z-Ordering:
Delta Lake technique that co-locates rows with similar values in the chosen columns in the same files, so queries filtering on those columns can skip more files.
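
The underlying ordering is a Z-order (Morton) curve, which can be sketched by interleaving the bits of two column values. The snippet below is a generic plain-Python illustration with a hypothetical `z_value` helper and small non-negative integers; it does not reproduce how Delta Lake computes or applies the ordering (there it is requested with `OPTIMIZE ... ZORDER BY`).

```python
# Plain-Python sketch of Z-order (Morton) encoding for two integer columns.

def z_value(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one sortable integer."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z


# Sort a small grid of (x, y) points along the Z-order curve.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))
```

Points that are close in both dimensions end up near each other in the sorted order, which is what lets a filter on either column touch fewer files than a sort on a single column would.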


Rishabh_Tiwari
Databricks Employee

Hi @Yogic24 ,

Thank you for sharing this comprehensive overview of Spark Concepts! It is a fantastic resource for anyone looking to deepen their understanding of Spark, and I appreciate your effort in putting this together. 

Thanks
Rishabh

Yogic24
Contributor III

Thanks @Rishabh_Tiwari
