Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

pyspark optimizations and best practices

KVNARK
Honored Contributor II

What can we implement to achieve the best optimization, and what are the best practices for using PySpark end to end?


Ajay-Pandey
Esteemed Contributor III

The most popular Spark optimization techniques are listed below:

1. Data Serialization

Serialization converts an in-memory object into a format that can be stored in a file or sent over the network, and efficient serialization improves the performance of any distributed application.

By default, Spark uses the Java serializer on the JVM. Spark can also use the Kryo serializer instead, which is faster and more compact than Java serialization.
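As a quick illustration, here is a minimal sketch of enabling Kryo through the session builder (the app name is arbitrary; the config key and class name are standard Spark settings):

```python
from pyspark.sql import SparkSession

# Switch Spark from the default Java serializer to Kryo.
spark = (
    SparkSession.builder
    .appName("kryo-serialization")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)
```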

2. Caching

This is an effective technique when the same data is needed repeatedly. cache() and persist() are the methods used: both store the computed result of an RDD, Dataset, or DataFrame, but cache() stores it in memory at the default storage level, while persist() stores it at a user-defined storage level.

These methods reduce cost and save time whenever computations are reused.

cache() and persist() keep a dataset available once it has been materialized, and they are most useful for a small dataset that your program uses frequently. RDD.cache() always stores the data in memory, while RDD.persist() can keep part of the data in memory and spill the rest to disk.
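A minimal sketch of both methods on a DataFrame (the parquet path and the status column are hypothetical placeholders):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching-demo").getOrCreate()

# Hypothetical input path; replace with a real dataset.
df = spark.read.parquet("/data/events")

# cache() uses the default storage level (MEMORY_AND_DISK for DataFrames).
df.cache()

# persist() accepts an explicit storage level, e.g. memory with disk spill.
active = df.filter(df["status"] == "active").persist(StorageLevel.MEMORY_AND_DISK)

active.count()  # first action materializes the cache
active.count()  # later actions reuse it instead of recomputing

active.unpersist()  # free the storage when done
```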

3. Data Structure Tuning

We can reduce memory consumption in Spark by avoiding Java features that add overhead, in the following ways:

  • Use enumerated objects or numeric IDs in place of strings for keys.
  • Avoid creating many small objects and complicated nested structures.
  • Set the JVM flag -XX:+UseCompressedOops if the heap size is less than 32 GB (a sketch follows this list).
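One way to pass that flag to executor JVMs is through Spark config; this is a sketch, assuming you set it before the session starts:

```python
from pyspark.sql import SparkSession

# Pass the compressed-oops flag to executor JVMs. Driver JVM options are
# usually set in spark-defaults.conf or via spark-submit instead, since the
# driver JVM is already running by the time this code executes.
spark = (
    SparkSession.builder
    .appName("compressed-oops")
    .config("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
    .getOrCreate()
)
```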

4. Garbage collection optimization

For garbage collection optimization, the G1 collector (G1GC) should be used when running Spark applications; it handles growing heaps well. GC tuning, guided by the generated GC logs, is essential to keep application behavior predictable. Before tuning the collector, though, you should first optimize the program's logic and code.

G1GC helps decrease job execution time by shortening the pause times between garbage collection cycles.

It is one of the best optimization techniques in Spark when garbage collection overhead is high.
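A minimal sketch of enabling G1GC plus GC logging on executors so pause times show up in the executor logs (the logging flags shown apply to Java 8; on Java 9+ use -Xlog:gc* instead):

```python
from pyspark.sql import SparkSession

# Enable G1GC and verbose GC logging on executor JVMs.
gc_options = "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails"

spark = (
    SparkSession.builder
    .appName("g1gc-tuning")
    .config("spark.executor.extraJavaOptions", gc_options)
    .getOrCreate()
)
```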

5. Memory Management

Execution memory is used for computation in joins, shuffles, sorts, and aggregations, while storage memory is used for caching and for handling data stored across the cluster. Both share a unified region, M.

When execution memory is not in use, storage can occupy the free space, and when storage memory is idle, execution can utilize it. This unified model is one of Spark's most efficient memory optimization mechanisms.
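The boundaries of region M can be tuned through two standard config keys; this sketch shows the documented defaults, which are illustrative rather than recommendations:

```python
from pyspark.sql import SparkSession

# spark.memory.fraction: share of the JVM heap given to region M
# (execution + storage). spark.memory.storageFraction: the part of M
# shielded from eviction for storage.
spark = (
    SparkSession.builder
    .appName("unified-memory")
    .config("spark.memory.fraction", "0.6")
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)
```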

Ajay Kumar Pandey

daniel_sahal
Esteemed Contributor

@KVNARK

This video is cool.

https://www.youtube.com/watch?v=daXEp4HmS-E
