Hi @Abhishek Pradhan, the Apache Spark Associate Developer exam applies to both the Data Engineer and Data Scientist learning paths. It assesses your understanding of Spark architecture and your ability to use the Spark DataFrame API to manipulate data.
What is covered by the exam?
Although the exam covers data manipulation, the SQL language is not assessed. Every data manipulation question has to be solved with the Spark DataFrame API. Spark Streaming is another topic that the exam doesn't cover.
The exam questions are distributed into three categories:
Spark Architecture - Conceptual
- Cluster architecture: nodes, drivers, workers, executors, slots, etc.
- Spark execution hierarchy: applications, jobs, stages, tasks, etc.
- Shuffling
- Partitioning
- Lazy evaluation
- Transformations vs Actions
- Narrow vs Wide transformations
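To make lazy evaluation and the transformation/action split above more concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: the `LazyDataset` class and the `is_even` helper are hypothetical names invented for illustration. The idea it tries to show is that transformations only record a plan, and nothing runs until an action asks for a result.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not a Spark class)."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # recorded transformations, not yet executed

    # Transformations: return a new dataset with a longer plan and do no work.
    def filter(self, predicate):
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def map(self, func):
        return LazyDataset(self._rows, self._plan + (("map", func),))

    # Action: executes the recorded plan and returns a concrete result.
    def collect(self):
        result = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                result.append(row)
        return result


calls = []  # records every time the predicate is actually evaluated


def is_even(x):
    calls.append(x)
    return x % 2 == 0


# Two chained transformations: only the plan is built here.
ds = LazyDataset([1, 2, 3, 4]).filter(is_even).map(lambda x: x * 10)
calls_before_action = len(calls)

# The action triggers the actual computation.
result = ds.collect()
```

In this toy, `is_even` is not called while the chain is being built; it only runs once `collect()` is invoked. Real Spark goes further (it optimizes the plan and distributes the work across executors), which this sketch does not attempt to model.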
Spark Architecture - Applied
- Execution deployment modes
- Stability
- Storage levels
- Repartitioning
- Coalescing
- Broadcasting
- DataFrames
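For the repartitioning and coalescing items above, a rough way to picture the difference is sketched below in plain Python. This is a simplified toy, not Spark's implementation: partitions are modelled as lists of rows, and the `coalesce` and `repartition` functions here are hypothetical helpers that only borrow the names. The assumptions are that coalescing merges whole existing partitions without moving rows between them individually, while repartitioning redistributes every row (a full shuffle), modelled here as round-robin.

```python
def coalesce(partitions, n):
    """Merge existing partitions into at most n, never splitting a partition."""
    n = min(n, len(partitions))  # coalescing cannot increase the partition count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Adjacent input partitions are glued together as whole units.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions, n):
    """Redistribute every row across n new partitions (toy round-robin shuffle)."""
    shuffled = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for row in part:
            shuffled[i % n].append(row)
            i += 1
    return shuffled


# Four skewed partitions: one large, three tiny.
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]

coalesced = coalesce(parts, 2)        # cheap, but the skew survives
repartitioned = repartition(parts, 2)  # moves every row, but evens things out

coalesced_sizes = [len(p) for p in coalesced]
repartitioned_sizes = [len(p) for p in repartitioned]
```

With this input the coalesced partitions stay uneven while the repartitioned ones come out balanced, which is the usual trade-off to keep in mind: coalescing avoids a shuffle when reducing the partition count, and repartitioning pays for a shuffle to get an even distribution.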
Spark DataFrame API
- Subsetting DataFrames (select, filter, etc.)
- Column manipulation (casting, creating columns, manipulating existing columns, complex column types)
- String manipulation (splitting strings, regex)
- Performance-based operations (repartitioning, shuffle partitions, caching)
- Combining DataFrames (joins, broadcasting, unions, etc.)
- Reading/writing DataFrames (schemas, overwriting)
- Working with dates (extraction, formatting, etc.)
- Aggregations
- Miscellaneous (sorting, missing values, typed UDFs, value extraction, sampling)