
PySpark optimizations and best practices

KVNARK
Honored Contributor II

What can we implement to attain the best optimization, and what are the best practices when using PySpark end to end?


Ajay-Pandey
Esteemed Contributor II

The most popular Spark optimization techniques are listed below:

1. Data Serialization

Here, an in-memory object is converted into another format that can be stored in a file or sent over a network. This improves the performance of distributed applications.

Serialization improves the performance of any distributed application. By default, Spark uses the Java serializer on the JVM platform. Spark can instead use the Kryo serializer, which provides better performance and a more compact format than Java serialization.
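As a minimal sketch, Kryo can be enabled through the spark.serializer configuration key when building the session (the app name below is a placeholder):

```python
from pyspark.sql import SparkSession

# Enable Kryo serialization instead of the default Java serializer.
spark = (
    SparkSession.builder
    .appName("kryo-demo")  # placeholder app name
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Optionally fail fast when a class has not been registered with Kryo:
    # .config("spark.kryo.registrationRequired", "true")
    .getOrCreate()
)
```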

2. Caching

This is an efficient technique for data that is needed repeatedly. The cache() and persist() methods store the computed contents of an RDD, DataFrame, or Dataset so that subsequent actions reuse the result instead of recomputing the full lineage. cache() always stores the data at the default storage level (memory only for RDDs, memory and disk for DataFrames), while persist() lets you choose a user-defined storage level, for example keeping part of the data in memory and spilling the rest to disk.

These methods reduce cost and save time whenever the same computation is reused. They are most useful for small or medium-sized datasets that are referenced frequently in your program.
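A minimal sketch of caching a frequently used DataFrame (the input path /data/events and the column id are hypothetical):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-demo").getOrCreate()

# Hypothetical input path; replace with a real dataset.
df = spark.read.parquet("/data/events")

# Keep the hot dataset around: memory first, spill to disk if it doesn't fit.
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()                        # first action materializes and caches the data
df.groupBy("id").count().show()  # reuses the cached blocks instead of re-reading

df.unpersist()                    # release the cached blocks when done
```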

3. Data Structure Tuning

We can reduce memory consumption in Spark by tweaking certain Java features that add overhead. This is possible in the following ways:

  • Use enumerated objects or numeric IDs in place of strings for keys.
  • Avoid using a lot of objects and complicated nested structures.
  • Set the JVM flag -XX:+UseCompressedOops if the heap size is less than 32 GB.
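A minimal sketch of passing the compressed-oops flag to the executor JVMs via Spark configuration (the app name is a placeholder; for the driver JVM, the flag usually has to be set at launch time instead, e.g. via --driver-java-options or spark-defaults.conf, because the driver is already running by the time this code executes):

```python
from pyspark.sql import SparkSession

# Pass the compressed-oops flag to the executor JVMs.
spark = (
    SparkSession.builder
    .appName("compressed-oops-demo")  # placeholder app name
    .config("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
    .getOrCreate()
)
```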

4. Garbage collection optimization

For garbage collection optimization, the G1 garbage collector (G1GC) should be used for running Spark applications, as it manages growing heaps well. GC tuning, guided by the generated GC logs, is essential to control unexpected application behavior; but before tuning the collector, you should first optimize the program's logic and code.

G1GC helps decrease job execution time by shortening GC pause times.

It is one of the best optimization techniques in Spark when garbage collection overhead is high.
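A minimal sketch of enabling G1GC on the executors, together with GC logging for the tuning mentioned above (the app name is a placeholder; the logging flags shown are the JDK 8 style):

```python
from pyspark.sql import SparkSession

# Enable G1GC on the executors and emit GC logs for later tuning.
# (JDK 9+ replaces the Print* flags with -Xlog:gc*.)
spark = (
    SparkSession.builder
    .appName("g1gc-demo")  # placeholder app name
    .config(
        "spark.executor.extraJavaOptions",
        "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
    )
    .getOrCreate()
)
```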

5. Memory Management

The memory used for computation, such as joins, shuffles, sorts, and aggregations, is called execution memory. Storage memory is used for caching and for propagating internal data across the cluster. Both draw from a unified region, M.

When the execution memory is not in use, the storage memory can use the space. Similarly, when storage memory is idle, execution memory can utilize the space. This is one of the most efficient Spark optimization techniques.
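As a sketch, the size of the unified region M and the storage share protected within it are controlled by two configuration keys; the values shown below are Spark's documented defaults, included only for illustration:

```python
from pyspark.sql import SparkSession

# Unified memory tuning:
#   spark.memory.fraction        - fraction of the heap used for execution + storage (region M)
#   spark.memory.storageFraction - share of M protected from eviction for storage
spark = (
    SparkSession.builder
    .appName("memory-demo")  # placeholder app name
    .config("spark.memory.fraction", "0.6")
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)
```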

daniel_sahal
Honored Contributor III

@KVNARK

This video is cool.

https://www.youtube.com/watch?v=daXEp4HmS-E
