How do I handle a task not serializable exception?

cfregly
Contributor
 
9 REPLIES

cfregly
Contributor

If you see this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: ...

The above error can be triggered when you initialize a variable on the driver (master) and then try to use it on one of the workers. In that case, Spark will try to serialize the object to send it over to the worker, and will fail if the object is not serializable. Consider the following code snippet:

NotSerializable notSerializable = new NotSerializable();
JavaRDD<String> rdd = sc.textFile("/tmp/myfile");
rdd.map(s -> notSerializable.doSomething(s)).collect();

This will trigger that error. Here are some ideas to fix it:

  • Make the class Serializable.
  • Declare the instance only within the lambda function passed in map (see the Scala sketch after this reply).
  • Make the NotSerializable object static and create it once per machine.
  • Call rdd.foreachPartition and create the NotSerializable object in there, like this:

rdd.foreachPartition(iter -> {
    NotSerializable notSerializable = new NotSerializable();
    // ...Now process iter
});
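For the second option, here is a minimal Scala sketch of the same idea, assuming a SparkContext sc (as in a notebook) and a hypothetical Scala stand-in for the NotSerializable class:

// Hypothetical stand-in for the non-serializable class in the Java example above.
class NotSerializable {
  def doSomething(s: String): String = s.toUpperCase
}

val rdd = sc.textFile("/tmp/myfile")

// The instance is constructed inside the closure, so nothing non-serializable is
// captured from the driver; the trade-off is one new instance per element.
rdd.map { s =>
  val helper = new NotSerializable()
  helper.doSomething(s)
}.collect()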

enjoyear
New Contributor II

    I cannot make the class serializable, and I don't want to create the instance in the lambda function again and again. So:

    1. How do I make the NotSerializable object static and create it once per machine?

    2. If I call rdd.foreachPartition, how can I return values?

    vida
    Contributor II

    Also note that in Databricks Cloud, variables in a cell may be broadcast so that they can be accessed from the worker nodes. If you never need to use a variable in a transformation on a worker node, another fix is to declare it @transient in a Scala notebook:

    @transient val myNonSerializableObjectThatIDoNotUseInATransformation = ....
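    For a concrete picture, here is a minimal sketch of that pattern in a notebook cell, using java.io.PrintWriter as a stand-in for any non-serializable, driver-only object (the paths are made up):

    import java.io.PrintWriter

    // Driver-only handle; @transient keeps it from being serialized along with the
    // notebook's wrapper object if a closure elsewhere happens to pull that wrapper in.
    @transient val driverLog = new PrintWriter("/tmp/driver-only.log")
    driverLog.println("runs on the driver only")

    // A transformation that never references driverLog serializes without trouble.
    val lineLengths = sc.textFile("/tmp/myfile").map(_.length).collect()

    driverLog.close()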

    vida
    Contributor II

    Hi,

    You can use the Singleton Pattern to create an object once per machine. This is explained very well on Wikipedia:

    https://en.wikipedia.org/wiki/Singleton_pattern
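    As a rough Scala sketch of that idea (reusing the hypothetical NotSerializable class and the rdd from the sketch above; the holder object name is made up), an object's lazy val is initialized once per JVM, i.e. once on each executor:

    // Created lazily on first use in each executor JVM (and on the driver if used there).
    object NotSerializableHolder {
      lazy val instance = new NotSerializable()
    }

    rdd.map(s => NotSerializableHolder.instance.doSomething(s)).collect()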

    If you want to return values, you can use the mapPartitions transformation instead of the foreachPartition action.
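    To get values back, here is a sketch that pairs the singleton above with mapPartitions:

    // mapPartitions returns an iterator per partition, so the results come back as a new RDD.
    val results = rdd.mapPartitions { iter =>
      val helper = NotSerializableHolder.instance  // resolved once per executor JVM
      iter.map(s => helper.doSomething(s))
    }
    results.collect()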

    cfregly
    Contributor

    Said differently/functionally:

    mapPartitions() returns a value and does not have side effects.
    foreachPartition() does not return a value, but (typically) does have side effects.

    NickStudenski
    New Contributor III

    @cfregly @Vida Ha

    I'm having trouble with the same "task not serializable" error when calling foreachPartition.

    My code looks like:

    myDF
    .foreachPartition { (rddpartition: Iterator[Row]) =>
      val url = "jdbc:sqlserver://<myurl>"
      val un = dbutils.secrets.get("my-secret-scope", "my-username")
      val pw = dbutils.secrets.get("my-secret-scope", "my-password")
      val connection = DriverManager.getConnection(url, un, pw)
      val statement = connection.createStatement()
      rddpartition.foreach { (row: Row) =>
        statement.addBatch(s"INSERT INTO dbo.Table(Field1, Field2) VALUES (${row.get(0)}, ${row.get(1)})")
      }
      statement.executeBatch()
      connection.close()
    }

    The above code results in the error "org.apache.spark.SparkException: Task not serializable"

    When I modify the code to use the username and password as plain strings instead, as shown below, it works just fine. Are you aware of a way to get around the serialization issue with dbutils.secrets.get?

    DriverManager.getConnection(url, "<username string>", "<password string>")

    @Nick Studenski, can you try declaring the un and pw variables outside the scope of foreachPartition? Do it before, so that you are just passing plain variables into that function rather than the dbutils object.
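    A minimal sketch of that suggestion, keeping the same secret scope, keys, and table as above; the secrets are read once on the driver, so the closure only captures two plain strings:

    import java.sql.DriverManager
    import org.apache.spark.sql.Row

    // Read the secrets on the driver; only the resulting strings are captured by the closure.
    val url = "jdbc:sqlserver://<myurl>"
    val un = dbutils.secrets.get("my-secret-scope", "my-username")
    val pw = dbutils.secrets.get("my-secret-scope", "my-password")

    myDF.foreachPartition { (rddpartition: Iterator[Row]) =>
      val connection = DriverManager.getConnection(url, un, pw)
      val statement = connection.createStatement()
      rddpartition.foreach { (row: Row) =>
        statement.addBatch(s"INSERT INTO dbo.Table(Field1, Field2) VALUES (${row.get(0)}, ${row.get(1)})")
      }
      statement.executeBatch()
      connection.close()
    }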

    Kaniz
    Community Manager

    Hi @Nick Studenski, just a friendly follow-up. Do you still need help, or did the above responses help you find the solution? Please let us know.

    RajatS
    New Contributor II

    Hi @Nick Studenski, could you share how you solved your problem?
