03-08-2015 04:13 PM
If you see this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: ...
This error is triggered when you initialize a variable on the driver (master) but then try to use it on one of the workers. In that case, Spark will try to serialize the object to send it over to the worker, and will fail if the object is not serializable. Consider the following code snippet:
NotSerializable notSerializable = new NotSerializable();
JavaRDD<String> rdd = sc.textFile("/tmp/myfile");
rdd.map(s -> notSerializable.doSomething(s)).collect();
This will trigger the error. Here are some ideas for fixing it:
1. Make the class serializable.
2. Declare the instance only within the lambda function passed to map.
3. Make the NotSerializable object static and create it once per machine.
4. Call rdd.forEachPartition and create the NotSerializable object there, like this:

rdd.forEachPartition(iter -> {
  NotSerializable notSerializable = new NotSerializable();
  // ...Now process iter
});
10-31-2015 10:45 AM
I cannot make the class serializable, and I don't want to create the instance in the lambda function again and again. So:
1. How do I make the NotSerializable object static and create it once per machine?
2. If I call rdd.forEachPartition, how can I return values?
05-22-2015 03:55 PM
Also note that in Databricks Cloud, variables in a cell may be broadcast so they can be accessed from the worker nodes. If you never need to use that variable in a transformation on a worker node, another fix is to declare the variable @transient in a Scala notebook:
@transient val myNonSerializableObjectThatIDoNotUseInATransformation = ....
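For instance, a minimal sketch of that driver-only pattern (NonSerializableAdminClient and its method are purely hypothetical stand-ins for any non-serializable object):

// Created and used only on the driver; @transient keeps it out of the
// serialized cell/closure state, so Spark never tries to ship it to workers.
@transient val adminClient = new NonSerializableAdminClient()
adminClient.ensureTableExists("events")   // driver-side call only

// Transformations below capture only serializable values, so they still
// run on the workers even though adminClient is declared in the same cell.
val lineLengths = sc.textFile("/tmp/myfile").map(_.length).collect()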
11-02-2015 11:14 AM
Hi,
You can use the Singleton Pattern to create an object once per machine. This is explained very well on Wikipedia:
https://en.wikipedia.org/wiki/Singleton_pattern
If you want to return values, you can use the mapPartitions transformation instead of the forEachPartition action.
11-03-2015 02:19 PM
Said differently/functionally: mapPartitions() returns a value and does not have side effects, while forEachPartition() does not return a value but (typically) does have side effects.
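Putting the two answers together, here is a minimal Scala sketch (the NotSerializable class and its doSomething method are the hypothetical ones from the snippet above, and rdd is assumed to be an RDD[String]):

// Singleton holder: `instance` is initialized lazily, once per executor JVM,
// the first time a task on that JVM touches it. Only a reference to this
// object ends up in the closure, so nothing non-serializable is shipped.
object NotSerializableHolder {
  lazy val instance = new NotSerializable()
}

// mapPartitions reuses the per-JVM instance for a whole partition and,
// unlike forEachPartition, returns the transformed values as a new RDD.
val results = rdd.mapPartitions { iter =>
  iter.map(s => NotSerializableHolder.instance.doSomething(s))
}.collect()

(In a notebook, defining the holder object in its own cell helps keep unrelated cell state out of the closure.)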
05-10-2019 12:35 PM
@cfregly @Vida Ha
I'm having trouble with the same "task not serializable" error when calling foreachPartition.
My code looks like:
import java.sql.DriverManager
import org.apache.spark.sql.Row

myDF
  .foreachPartition { (rddpartition: Iterator[Row]) =>
    val url = "jdbc:sqlserver://<myurl>"
    val un = dbutils.secrets.get("my-secret-scope", "my-username")
    val pw = dbutils.secrets.get("my-secret-scope", "my-password")
    val connection = DriverManager.getConnection(url, un, pw)
    val statement = connection.createStatement()
    rddpartition.foreach { (row: Row) =>
      statement.addBatch(s"INSERT INTO dbo.Table(Field1, Field2) VALUES (${row.get(0)}, ${row.get(1)})")
    }
    statement.executeBatch()
    connection.close()
  }
The above code results in the error "org.apache.spark.SparkException: Task not serializable"
When I modify the code to pass the username and password as plain strings instead, as shown below, it works just fine. Are you aware of a way to get around the serializability of dbutils.secrets.get?
DriverManager.getConnection(url, "<username string>", "<password string>")
09-29-2021 01:15 PM
@Nick Studenski, can you try declaring the un and pw variables outside the scope of foreachPartition? Do it before, so that you are just passing string variables into the closure rather than referencing the dbutils object.
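A minimal sketch of that rearrangement (same hypothetical URL, table, and secret names as in the snippet above):

import java.sql.DriverManager
import org.apache.spark.sql.Row

// Fetch the secrets on the driver. un and pw are plain (serializable) strings,
// so only they are captured by the closure below, not the dbutils object.
val url = "jdbc:sqlserver://<myurl>"
val un = dbutils.secrets.get("my-secret-scope", "my-username")
val pw = dbutils.secrets.get("my-secret-scope", "my-password")

myDF.foreachPartition { (rddpartition: Iterator[Row]) =>
  // Everything created here lives only on the executor running this partition.
  val connection = DriverManager.getConnection(url, un, pw)
  val statement = connection.createStatement()
  rddpartition.foreach { (row: Row) =>
    statement.addBatch(s"INSERT INTO dbo.Table(Field1, Field2) VALUES (${row.get(0)}, ${row.get(1)})")
  }
  statement.executeBatch()
  connection.close()
}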
06-01-2022 09:03 PM
Hi @Nick Studenski, could you share how you solved your problem?