Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How do I handle a task not serializable exception?

cfregly
Contributor
 
8 REPLIES

cfregly
Contributor

If you see this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: ...

The above error can be triggered when you initialize a variable on the driver (master) but then try to use it on one of the workers. In that case, Spark will try to serialize the object to send it over to the worker, and fail if the object is not serializable. Consider the following code snippet:

NotSerializable notSerializable = new NotSerializable();
JavaRDD<String> rdd = sc.textFile("/tmp/myfile");

rdd.map(s -> notSerializable.doSomething(s)).collect();

This will trigger that error. Here are some ideas to fix this error:

  • Make the class Serializable (see the first sketch after this list).
  • Declare the instance only within the lambda function passed to map (see the second sketch after this list).
  • Make the NotSerializable object static so that it is created only once per machine.
  • Call rdd.foreachPartition and create the NotSerializable object in there, like this:
rdd.foreachPartition(iter -> {
  NotSerializable notSerializable = new NotSerializable();

  // ...Now process iter
});
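For reference, here is a rough Scala sketch of the first two fixes (the helper class names and the doSomething method are placeholders standing in for your own non-serializable class, not part of any Spark API):

// Fix 1: make the helper class Serializable so the driver-side instance can be shipped to executors.
class SerializableHelper extends Serializable {
  def doSomething(s: String): String = s.toUpperCase
}
val helper = new SerializableHelper()
sc.textFile("/tmp/myfile").map(s => helper.doSomething(s)).collect()

// Fix 2: keep the class non-serializable, but construct it inside the lambda so that
// no driver-side instance is captured by the closure.
class PerRecordHelper {
  def doSomething(s: String): String = s.toUpperCase
}
sc.textFile("/tmp/myfile").map(s => new PerRecordHelper().doSomething(s)).collect()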

enjoyear
New Contributor II

    I cannot make the class serializable, and I don't want to create the instance in the lambda function over and over again. So:

    1. How do I make the NotSerializable object static so that it is created once per machine?

    2. If I call rdd.foreachPartition, how can I get return values back?

    vida
    Databricks Employee

    Also note that in Databricks Cloud, variables in a cell may be broadcast so they can be accessed from the worker nodes. If you don't ever need to use that variable in a transformation on a worker node, another fix is to declare the variable @transient in a Scala notebook:

    @transient val myNonSerializableObjectThatIDoNotUseInATransformation = ....

    vida
    Databricks Employee

    Hi,

    You can use the Singleton pattern to create an object once per machine. This is explained very well on Wikipedia:

    https://en.wikipedia.org/wiki/Singleton_pattern

    If you want to return values, you can use the mapPartitions transformation instead of the foreachPartition action.
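    For instance, here is a rough sketch of a per-machine singleton combined with mapPartitions (the HeavyResource object name is just illustrative; NotSerializable and doSomething refer to the example class from the answer above):

    // A Scala object's lazy val is initialized once per JVM (so once per executor)
    // and is never shipped from the driver, which sidesteps the serialization issue.
    object HeavyResource {
      lazy val instance = new NotSerializable()
    }

    // mapPartitions is a transformation, so the results come back as a new RDD.
    val results = rdd.mapPartitions { iter =>
      iter.map(s => HeavyResource.instance.doSomething(s))
    }
    results.collect()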

    cfregly
    Contributor

    Said differently/functionally:

    mapPartitions() returns a value and does not have side effects.

    foreachPartition() does not return a value, but (typically) does have side effects.
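    Roughly, with the same rdd as above:

    // mapPartitions: transforms each partition and returns the results as a new RDD.
    val lengths = rdd.mapPartitions(iter => iter.map(_.length))

    // foreachPartition: an action used purely for side effects (e.g. writing out); it returns Unit.
    rdd.foreachPartition(iter => iter.foreach(println))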

    NickStudenski
    New Contributor III

    @cfregly @Vida Ha

    I'm having trouble with the same "task not serializable" error when calling foreachPartition.

    My code looks like:

    import java.sql.DriverManager
    import org.apache.spark.sql.Row

    myDF
    .foreachPartition { (rddpartition: Iterator[Row]) =>
      val url = "jdbc:sqlserver://<myurl>"
      val un = dbutils.secrets.get("my-secret-scope", "my-username")
      val pw = dbutils.secrets.get("my-secret-scope", "my-password")
      val connection = DriverManager.getConnection(url, un, pw)
      val statement = connection.createStatement()
      rddpartition.foreach { (row: Row) =>
        statement.addBatch(s"INSERT INTO dbo.Table(Field1, Field2) VALUES (${row.get(0)}, ${row.get(1)})")
      }
      statement.executeBatch()
      connection.close()
    }

    The above code results in the error "org.apache.spark.SparkException: Task not serializable"

    When I modify the code to use the username and password as plain strings instead, as shown below, it works just fine. Are you aware of a way to get around the serializability issue with dbutils.secrets.get?

    DriverManager.getConnection(url, "<username string>", "<password string>")

    vida
    Databricks Employee

    @Nick Studenski, can you try declaring the un and pw variables outside the scope of foreachPartition? Do it before, so that you are just passing plain string values into that function rather than the dbutils object.
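    Untested, but the idea is roughly this (same illustrative URL and secret scope/key names as in your snippet):

    import java.sql.DriverManager
    import org.apache.spark.sql.Row

    // Fetch the secrets once on the driver; the closure then captures only plain
    // strings (which are serializable) instead of the dbutils object.
    val url = "jdbc:sqlserver://<myurl>"
    val un = dbutils.secrets.get("my-secret-scope", "my-username")
    val pw = dbutils.secrets.get("my-secret-scope", "my-password")

    myDF.foreachPartition { (rddpartition: Iterator[Row]) =>
      val connection = DriverManager.getConnection(url, un, pw)
      val statement = connection.createStatement()
      rddpartition.foreach { (row: Row) =>
        statement.addBatch(s"INSERT INTO dbo.Table(Field1, Field2) VALUES (${row.get(0)}, ${row.get(1)})")
      }
      statement.executeBatch()
      connection.close()
    }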

    RajatS
    New Contributor II

    Hi @Nick Studenski, could you share how you solved your problem?
